Production System • 500+ Deployments

Zabbix Automation Modernisation

Consolidated fragmented Zabbix onboarding automation into a maintainable, production-ready tool. Automates agent deployment across hundreds of customer servers, supporting modern Linux distributions and cloud platforms.

Ansible • Bash • Zabbix API • Linux • Automation

The Problem

Manual Zabbix agent onboarding was time-consuming and error-prone. The previous workflow required:

  • SSH to customer server
  • Manually install Zabbix agent (different steps per OS)
  • Configure agent (edit config files, manage firewall rules)
  • Log into Zabbix UI and manually create host record
  • Assign monitoring templates (different templates per service type)
  • Verify connectivity and test monitoring

Time investment: 1-2 hours per customer server. With hundreds of customers, this was a scalability bottleneck.

Additional challenges:

  • Legacy automation scattered across 3 directories with duplicate YAML
  • Template bugs causing deployment failures
  • Poor support for cloud platforms (non-standard SSH users, key-based auth)
  • No firewall auto-detection—operators had to know each customer's setup
  • Manual host removal from Zabbix when customers cancelled
  • Inconsistent error handling and logging

The Solution

Modernised the automation from the ground up:

Consolidation & Code Quality

  • Single Source of Truth: Merged 3 legacy directories into one unified, maintainable codebase
  • Eliminated Duplication: Removed duplicate YAML blocks and fixed the template bugs that caused deployment failures
  • Modern Ansible: Implemented proper Ansible FQCNs and best practices
  • Git-Ready: Modular design structured for version control and team collaboration

Cloud & Modern OS Support

  • Non-Root SSH Users: Support for cloud-standard accounts (rocky, ubuntu, debian, centos, cloud-user, almalinux)
  • SSH Key Authentication: Native support for key-based auth (no passwords)
  • Automatic Sudo Escalation: Handles privilege escalation transparently
  • Cloud Platform Ready: Tested and validated on OpenStack, AWS, and traditional cPanel environments
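
As a rough illustration of the non-root support above, the sketch below builds Ansible connection variables for one host. The variable names (ansible_user, ansible_become, ansible_become_method, ansible_ssh_private_key_file) are standard Ansible inventory parameters; the distro-to-account mapping, the root fallback, and the function name connection_vars are assumptions made for this example, not the tool's actual code.

```python
from typing import Dict, Optional

# Login accounts taken from the list above. Which distro maps to which
# account is an illustrative assumption.
CLOUD_USERS: Dict[str, str] = {
    "rocky": "rocky",
    "almalinux": "almalinux",
    "ubuntu": "ubuntu",
    "debian": "debian",
    "centos": "centos",
    "rhel": "cloud-user",
}


def connection_vars(distro: str, key_file: Optional[str] = None) -> Dict[str, object]:
    """Build Ansible connection variables for a single host."""
    # Unknown distros fall back to root (an assumption for this sketch).
    user = CLOUD_USERS.get(distro.lower(), "root")
    host_vars: Dict[str, object] = {"ansible_user": user}
    if user != "root":
        # Non-root logins need sudo for package installs and service control.
        host_vars["ansible_become"] = True
        host_vars["ansible_become_method"] = "sudo"
    if key_file:
        # Key-based authentication, so no password is stored or prompted for.
        host_vars["ansible_ssh_private_key_file"] = key_file
    return host_vars
```

The resulting dictionary could be written into an inventory file or passed as extra variables, so the playbook itself stays identical for root and non-root hosts.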

Intelligent Firewall Detection

  • Priority-Order Detection: CSF → iptables → nftables → firewalld → UFW
  • RHEL 8/9 Compatibility: Fixed compatibility issues with modern RedHat derivatives
  • Rich Rules Support: Properly handles firewalld rich rules syntax
  • Validation: Verifies rules applied correctly after configuration
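
To make the per-firewall syntax concrete, here is a minimal sketch that returns one allow rule per firewall type as an argument list. It assumes the monitoring server connects to the agent on port 10050 (Zabbix's default passive-agent port), that an nftables table "inet filter" with an "input" chain already exists, and that rules are restricted by source address. The function name allow_rule is hypothetical, and the real playbook may apply rules differently.

```python
import ipaddress
from typing import List

ZABBIX_AGENT_PORT = 10050  # Zabbix's default passive-agent port


def allow_rule(firewall: str, server_ip: str, port: int = ZABBIX_AGENT_PORT) -> List[str]:
    """Return an argv list that lets server_ip reach the agent port."""
    # Validate the address first; IPv4Address raises ValueError on bad input.
    ip = str(ipaddress.IPv4Address(server_ip))
    if firewall == "csf":
        # csf -a allowlists the whole address, not a single port.
        return ["csf", "-a", ip, "zabbix-server"]
    if firewall == "iptables":
        return ["iptables", "-I", "INPUT", "-p", "tcp", "-s", ip,
                "--dport", str(port), "-j", "ACCEPT"]
    if firewall == "nftables":
        # Assumes an existing table "inet filter" with chain "input".
        return ["nft", "add", "rule", "inet", "filter", "input",
                "ip", "saddr", ip, "tcp", "dport", str(port), "accept"]
    if firewall == "firewalld":
        rich_rule = (f'rule family="ipv4" source address="{ip}" '
                     f'port port="{port}" protocol="tcp" accept')
        return ["firewall-cmd", "--permanent", "--add-rich-rule", rich_rule]
    if firewall == "ufw":
        return ["ufw", "allow", "from", ip, "to", "any",
                "port", str(port), "proto", "tcp"]
    raise ValueError(f"unsupported firewall: {firewall}")
```

Returning an argument list instead of a shell string avoids quoting problems with the rich-rule syntax. A permanent firewalld rule still needs a reload (firewall-cmd --reload) before it takes effect, which is one reason a post-change validation step is useful.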

Host Lifecycle Management

  • Agent Deployment: One-command onboarding with dependency checks
  • Host Removal Workflow: Interactive removal with safety confirmations
  • Two Removal Modes: API-only removal (Zabbix data preserved) or full removal (agent + Zabbix)
  • Audit Trail: Structured logging documents every action

Operational Excellence

  • Pre-Flight Checks: Validates dependencies before starting deployment
  • Structured Logging: Detailed logs with 14-day automatic cleanup
  • Command Preview: Shows what will be executed before confirming
  • Clear Error Handling: Specific error messages aid troubleshooting
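
The 14-day log cleanup can be sketched as a small retention function. The "*.log" file pattern, the use of modification time as the age criterion, and the name purge_old_logs are assumptions for illustration; the tool's actual cleanup mechanism is not described here.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 14  # retention period stated above


def purge_old_logs(log_dir: Path, days: int = RETENTION_DAYS,
                   now: Optional[float] = None) -> List[Path]:
    """Delete *.log files older than `days` and return the removed paths."""
    current = time.time() if now is None else now
    cutoff = current - days * 86400  # seconds per day
    removed: List[Path] = []
    for path in sorted(log_dir.glob("*.log")):
        # Only regular files whose last modification predates the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Running a function like this at the start of every deployment keeps the log directory bounded without needing a separate cron job.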

Key Achievements

Metric | Result
Time Saved Per Server | 1-2 hours → ~5 minutes
Servers Onboarded | 500+ deployments
Modern OS Failures | Reduced to zero (Rocky 9, AlmaLinux 9)
Legacy Directories | 3 → 1 unified codebase
Supported Linux Distros | Rocky, Alma, Ubuntu, Debian, CentOS
Manual UI Work | Automated (host removal via API)

Technical Architecture

Deployment Workflow

  • Pre-Flight Checks: Verify SSH access, check dependencies (curl, systemctl)
  • Agent Installation: Detect the OS, then install a Zabbix agent version compatible with the Zabbix server
  • Configuration: Generate agent config pointing to monitoring server, set PSK/TLS if required
  • Firewall Rules: Auto-detect firewall type, add Zabbix port rules
  • Service Start: Enable and start Zabbix agent, verify running
  • API Registration: Create host in Zabbix via API, assign templates
  • Verification: Test connectivity, confirm monitoring data flowing
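
The workflow above is essentially an ordered, fail-fast pipeline. The sketch below shows one way to express that: each stage is a named callable, and the first failure stops the run and reports which stage broke. The names run_pipeline, preflight, and StepFailed are hypothetical, and checking for curl and systemctl with shutil.which on the local machine is a simplification; the real checks presumably run against the target server.

```python
import shutil
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


class StepFailed(RuntimeError):
    """Raised with the name of the stage that failed."""


def preflight(required: Sequence[str] = ("curl", "systemctl"),
              which: Callable[[str], Optional[str]] = shutil.which) -> None:
    """Fail early if a required tool is missing."""
    missing = [tool for tool in required if which(tool) is None]
    if missing:
        raise RuntimeError("missing dependencies: " + ", ".join(missing))


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run stages in order; stop at the first failure."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Prefix the stage name so the operator knows where to look.
            raise StepFailed(f"{name}: {exc}") from exc
        completed.append(name)
    return completed
```

Because later stages never run after a failure, a server is not registered in Zabbix unless the agent was actually installed, configured, and started.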

Firewall Intelligence

Detection Logic (Priority Order):

  • Check for CSF (Config Server Firewall) first
  • Fall back to iptables if CSF not found
  • Check for nftables
  • Check for firewalld (RHEL 7+)
  • Fall back to UFW as last resort

Each firewall type has properly formatted rule syntax, ensuring compatibility across systems.
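
The priority order above can be sketched as a first-match loop. Treating the presence of a firewall's command-line binary as the detection probe is an assumption made for this example; the actual playbook may instead check running services or configuration files. The names FIREWALL_PRIORITY and detect_firewall are hypothetical.

```python
import shutil
from typing import Callable, List, Optional, Tuple

# (firewall name, binary used as probe), in the priority order listed above.
FIREWALL_PRIORITY: List[Tuple[str, str]] = [
    ("csf", "csf"),
    ("iptables", "iptables"),
    ("nftables", "nft"),
    ("firewalld", "firewall-cmd"),
    ("ufw", "ufw"),
]


def detect_firewall(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first firewall whose probe succeeds, or None."""
    for name, binary in FIREWALL_PRIORITY:
        if which(binary):
            return name
    return None
```

Passing the probe in as a parameter keeps the ordering logic testable without touching the host, for example detect_firewall(which=lambda b: "/usr/sbin/ufw" if b == "ufw" else None).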

Removal Workflow

  • Interactive Prompts: Confirm target host, removal scope
  • Two Modes: API-only (soft delete) or full (agent + Zabbix)
  • Safety Confirmations: Require explicit approval for destructive actions
  • Audit Log: Document who removed what, when, and why
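
A minimal sketch of the safety confirmation and audit entry might look like the following. Requiring the operator to retype the hostname, the JSON record layout, and the names REMOVAL_SCOPES, confirm_removal, and audit_record are all assumptions for illustration; only the two modes and the who/what/when/why fields come from the description above.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

# What each removal mode touches, per the two modes described above.
REMOVAL_SCOPES: Dict[str, Tuple[str, ...]] = {
    "api-only": ("zabbix",),
    "full": ("agent", "zabbix"),
}


def confirm_removal(hostname: str, typed: str) -> bool:
    """Proceed only if the operator retyped the exact hostname."""
    return typed.strip() == hostname


def audit_record(operator: str, hostname: str, mode: str, reason: str,
                 when: Optional[datetime] = None) -> str:
    """Serialise who removed what, when, and why as one JSON line."""
    if mode not in REMOVAL_SCOPES:
        raise ValueError(f"unknown removal mode: {mode}")
    timestamp = when or datetime.now(timezone.utc)
    return json.dumps({
        "when": timestamp.isoformat(),
        "who": operator,
        "what": hostname,
        "mode": mode,
        "scope": list(REMOVAL_SCOPES[mode]),
        "why": reason,
    }, sort_keys=True)
```

One JSON object per line keeps the audit log easy to search with standard text tools and straightforward to parse later.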

Stack

  • Orchestration: Ansible 2.9+
  • Scripting: Bash for edge cases and system integration
  • APIs: Zabbix API (host creation, template assignment)
  • Version Control: Git-ready modular structure
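
For the Zabbix API item above, host creation and template assignment can be done in a single JSON-RPC call. The sketch below only builds the request body for the host.create method; the function name host_create_request, the example IDs in the usage note, and the choice of an IP-based agent interface on port 10050 are assumptions. Authentication is left out because how the token is passed differs between Zabbix versions.

```python
from typing import Any, Dict, List


def host_create_request(hostname: str, ip: str, group_ids: List[str],
                        template_ids: List[str], port: int = 10050,
                        request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 body for the Zabbix host.create method."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            # type 1 = agent interface, main 1 = default, useip 1 = connect by IP
            "interfaces": [{
                "type": 1,
                "main": 1,
                "useip": 1,
                "ip": ip,
                "dns": "",
                "port": str(port),
            }],
            "groups": [{"groupid": gid} for gid in group_ids],
            "templates": [{"templateid": tid} for tid in template_ids],
        },
        "id": request_id,
    }
```

A call such as host_create_request("web01.example.com", "192.0.2.10", ["2"], ["10001"]) yields a dictionary that can be serialised with json.dumps and posted to the server's api_jsonrpc.php endpoint; the group and template IDs here are placeholders.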

Business Impact

  • Massive Time Savings: 1-2 hours per server → ~5 minutes. Across hundreds of customers, this is a permanent reduction in operational overhead.
  • Same-Day Onboarding: New customers can now be set up for monitoring within hours instead of days.
  • Reliability: Zero failures on modern OS releases gives confidence in each deployment and reduces support tickets for failed onboarding.
  • Cloud-Ready: Enabled rapid onboarding of customers on the new Fabrc cloud platform without special handling.
  • Operational Continuity: Automated host removal eliminates manual Zabbix UI work for cancellations, so the team no longer has to remember to clean up.
  • Scalability: Manual process was a bottleneck. Automation unlocks capacity for faster customer growth.
  • Knowledge Distribution: A single, documented playbook replaces tribal knowledge of "how to set up Zabbix".

Real-World Impact

Before: New customer onboarding required manual SSH, configuration edits, Zabbix UI navigation, and template assignment. An operator could do 2-3 per day if lucky.

Now: One command per server. Monitoring is live within minutes. The team can focus on higher-value work instead of repetitive setup.

The system has handled 500+ deployments across diverse environments (traditional bare metal, VPS, cloud platforms, different Linux distributions) with zero modern OS failures.

Lessons Learned

  • Firewall Detection is Non-Trivial: Every environment is different. Priority-order detection with fallbacks handles real-world chaos.
  • Cloud Changes Everything: Non-root SSH users and key-based auth are standard in cloud, but required explicit support.
  • API Integration Challenges: Zabbix API rate limits and version differences required careful error handling and retries.
  • Modular Design Pays Off: Splitting deployment into discrete steps means operators can troubleshoot failures step-by-step.
  • Automation Isn't "Set and Forget": Requirements change (new OS, new firewall type, new cloud platform). Built with modularity to accommodate evolution.
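
The retry handling mentioned under API Integration Challenges can be sketched as exponential backoff around a single API call. The attempt count, the delays, the set of exceptions treated as transient, and the name call_with_retries are assumptions for this example, not the tool's actual settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[], T], attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures, doubling the delay after each attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Only connection and timeout errors are retried here; an error the API returns deliberately, such as a rejected request for a host that already exists, should fail immediately so the operator sees the specific message.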