Understanding Nagios Architecture and Check Processing
Active vs. Passive Checks
Active checks are initiated by the Nagios engine itself, which polls hosts and services on a schedule. Passive checks, in contrast, are results submitted to Nagios from the outside via NRDP, NSCA, or custom scripts that write to the external command file. They are common in environments with asynchronous event detection or federated agents.
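Under the hood, every passive result ends up as a PROCESS_SERVICE_CHECK_RESULT external command written to the Nagios command file; NRDP, NSCA, and custom scripts are simply different transports for producing it. A minimal sketch, assuming the default command file location and a host/service pair that already exists in your configuration:

# Write one passive OK result directly to the external command file
echo "[$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;db-node-01;Replication Status;0;OK - replication healthy" > /usr/local/nagios/var/rw/nagios.cmd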
How Check Results Are Handled
Passive check results are queued as spooled result files or API submissions and then picked up by Nagios's result-processing loop. Misconfigured buffers, delayed cron-based submissions, or improper freshness thresholds can leave services stuck reporting stale state.
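For example, a passive-only service that enforces a five-minute freshness window can be defined as follows: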
define service{
    host_name               db-node-01
    service_description     Replication Status
    check_command           check_dummy!0
    passive_checks_enabled  1
    active_checks_enabled   0
    freshness_threshold     300
    check_freshness         1
}
Symptoms and Root Causes
Common Symptoms
- Alerts remain in WARNING or CRITICAL state despite resolution
- Services marked as "Stale" or "Check Freshness Failed"
- Host/service status flapping due to inconsistent timestamps
Root Causes
- Missing or malformed passive check submissions
- Incorrect clock synchronization between agents and central server
- Event broker misconfigurations preventing check result ingestion
- Overloaded result buffer queues with delayed processing
Diagnostic Techniques
1. Validate Freshness and Timing Settings
Check `status.dat` or use the Nagios Web UI to inspect last check time and freshness threshold.
# Inspect last/next check timestamps for services on db-node-01 in status.dat
grep -A 30 'host_name=db-node-01' /usr/local/nagios/var/status.dat | grep -E 'service_description|has_been_checked|last_check|next_check'
2. Examine External Command File
Verify that external command file permissions allow passive result submissions. Also, check for delays in NRDP or NSCA daemon log files.
# Example NRDP single-check submission (URL-encode fields that contain spaces)
curl -fsS "http://nagios-server/nrdp/" \
  --data-urlencode 'token=XYZ123' \
  --data-urlencode 'cmd=submitcheck' \
  --data-urlencode 'hostname=db-node-01' \
  --data-urlencode 'servicename=Replication Status' \
  --data-urlencode 'state=0' \
  --data-urlencode 'output=OK'
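To rule out permission problems on the command file itself, verify that the pipe exists and is writable by the submitting account; the path and group name below reflect a default source install and may differ on your system:

# The external command file should be a named pipe, typically owned by nagios:nagcmd
ls -l /usr/local/nagios/var/rw/nagios.cmd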
3. Sync Time Sources
Time drift between agent nodes and the central server can cause passive checks to be discarded as outdated. Ensure consistent NTP configuration across all hosts.
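A quick way to spot drift is to compare clock offsets on the agents and the central server, using whichever time daemon is in place:

# ntpd: list peers and their current offsets
ntpq -p
# chrony: show the current offset from the selected time source
chronyc tracking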
Step-by-Step Remediation
Step 1: Enable and Calibrate Freshness Checks
Ensure `check_freshness` is enabled and set an appropriate `freshness_threshold`. Avoid overly aggressive thresholds that mark a service stale before a legitimate passive result has had time to arrive.
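When the freshness check fires, Nagios forces an active check using the service's check_command, so a common pattern is to point that command at a plugin that raises an alert rather than silently returning OK. A sketch, building on the earlier passive-only definition and assuming the stock check_dummy plugin with a matching command definition:

# In the passive-only service definition shown earlier, have the freshness
# fallback raise a CRITICAL instead of returning OK
check_command           check_dummy!2!"No passive result received within threshold"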
Step 2: Monitor Passive Submission Pipelines
Use cron job monitoring (e.g., `check_cron`) to ensure NRDP/NSCA senders are executing regularly. Also monitor the Nagios check result spool directory for growth, which indicates that results are queuing faster than they are processed.
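The default check_result_path can be watched with a one-liner:

# Count queued check result files awaiting processing
find /usr/local/nagios/var/spool/checkresults -type f | wc -l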
Step 3: Adjust External Command Buffering
For high-throughput environments, tune the Nagios Core settings:
# nagios.cfg
# Check the external command file as often as once per second
command_check_interval=1s
# Allow each check result reaping pass to run for up to 30 seconds
max_check_result_reaper_time=30
Step 4: Normalize Time Across Agents
Deploy NTP with stratum-aware servers to prevent skew. Time discrepancies beyond 5–10 seconds can lead to rejection of passive results.
Step 5: Audit Event Broker Modules
Ensure third-party modules (e.g., for Graphite, InfluxDB, or PagerDuty) are not disrupting check result processing. Temporarily disable brokers to isolate the issue.
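To see exactly which NEB modules are loaded, list the broker_module directives in nagios.cfg; commenting one out and restarting Nagios is a quick way to test whether it interferes with result processing:

# List loaded event broker modules
grep -E '^[[:space:]]*broker_module' /usr/local/nagios/etc/nagios.cfg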
Long-Term Architectural Considerations
Centralized Message Queues
Instead of relying solely on file-based or HTTP passive check submission, use message brokers like RabbitMQ or Kafka for queueing passive results. This provides durability and decouples Nagios from the transport layer.
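The consumer side of such a pipeline stays simple: whatever drains the queue only has to translate each message into a PROCESS_SERVICE_CHECK_RESULT command and write it to the Nagios command file. A sketch, assuming messages are delivered on stdin as newline-delimited host;service;state;output records (the queue tooling that feeds it will vary):

#!/bin/sh
# Bridge queue messages of the form "host;service;state;output" into passive results
CMD_FILE=/usr/local/nagios/var/rw/nagios.cmd
while IFS= read -r result; do
  echo "[$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;$result" > "$CMD_FILE"
done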
Distributed Polling with Redundant Schedulers
Deploy redundant Nagios satellites (e.g., via Mod-Gearman) for active checks, while retaining passive submission from edge agents for hybrid resiliency.
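With Mod-Gearman, the central instance loads a NEB module that hands active check execution to distributed workers while the external command interface remains available for passive submissions. The module is enabled with a single line in nagios.cfg; the paths below are illustrative and depend on how Mod-Gearman was packaged:

# nagios.cfg: offload active checks to Mod-Gearman workers
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/module.conf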
Alert Normalization and Suppression
Implement correlation engines or alert suppression logic to prevent flapping caused by stale data. Tools like Nagios Reactor or custom event handlers can help.
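Within Nagios itself, flap detection thresholds and an event handler can absorb some of this noise before it reaches the paging layer. The directives below can be added to an existing service definition (the handler command name is illustrative):

    flap_detection_enabled  1
    low_flap_threshold      10
    high_flap_threshold     30
    event_handler_enabled   1
    event_handler           handle-replication-state-change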
Conclusion
Ghost alerts and stale passive check results are symptoms of deeper synchronization and configuration issues in Nagios-based monitoring. By tuning freshness settings, verifying time synchronization, and decoupling submission pipelines, teams can build a robust and scalable monitoring infrastructure. Proactive architecture and diagnostics are key to keeping Nagios effective in modern DevOps environments.
FAQs
1. Why do passive check results not update service status?
This typically occurs due to timestamp mismatches, malformed submissions, or disabled freshness checking. Confirm submission format and freshness configuration.
2. How can I confirm NRDP submissions are being received?
Check the NRDP transfer.log and the web server logs for POST entries. Also, monitor `/usr/local/nagios/var/spool/checkresults` for processing activity.
3. Can I disable freshness checking entirely?
Yes, but it's not recommended. Without freshness, Nagios cannot detect stale states for services depending on passive updates.
4. What causes flapping with passive checks?
Inconsistent check timing, overlapping state changes, or erratic passive submissions can trigger flapping. Normalize input intervals and suppress duplicates.
5. Are there modern alternatives to passive check submission?
Yes. Message queue integrations, agent-based monitoring (e.g., NRPE with polling), or Prometheus exporters with Nagios bridges can replace traditional passive flows.