Troubleshooting Nagios Services Stuck in PENDING State

Details: Category: DevOps Tools; By Mindful Chase; 05.Aug; Hits: 106

Nagios has long been a cornerstone of infrastructure monitoring, especially in traditional DevOps environments. Despite its reliability and extensibility, Nagios users frequently encounter a critical and often misunderstood issue: "Nagios services stuck in PENDING state". This condition results in services never transitioning to OK, WARNING, or CRITICAL, effectively making monitoring blind. When left unresolved, this can lead to missed alerts, compliance risks, and undetected outages—making it a high-priority concern for SREs, DevOps engineers, and enterprise infrastructure teams.

Understanding the Issue

What Does PENDING State Mean?

In Nagios, the PENDING state indicates that a host or service check has not yet been executed. While this is expected during initial startup or configuration reloads, prolonged PENDING status suggests deeper scheduling or configuration issues.
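To see which services are affected, one option is to read the status file directly. The following is a minimal sketch, assuming the Nagios Core status file layout in which each `servicestatus { ... }` block carries `host_name`, `service_description`, and `has_been_checked` fields; the function name `list_never_checked_services` is hypothetical.

```shell
# list_never_checked_services FILE
# Prints "host;service" for every servicestatus block whose
# has_been_checked field is 0 (the service is still PENDING).
# Assumes the status.dat block layout described in the lead-in.
list_never_checked_services() {
  awk '
    /^[ \t]*servicestatus[ \t]*[{]/ { in_svc = 1; host = ""; svc = ""; checked = ""; next }
    in_svc && /^[ \t]*[}]/ {
      if (checked == "0") print host ";" svc
      in_svc = 0
      next
    }
    in_svc {
      line = $0
      sub(/^[ \t]+/, "", line)          # drop indentation
      eq = index(line, "=")
      if (eq == 0) next
      key = substr(line, 1, eq - 1)
      val = substr(line, eq + 1)
      if (key == "host_name") host = val
      else if (key == "service_description") svc = val
      else if (key == "has_been_checked") checked = val
    }
  ' "$1"
}
```

On a source install this could be called as `list_never_checked_services /usr/local/nagios/var/status.dat`; the path may differ on packaged installations.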

Impact on Monitoring Operations

Services stuck in PENDING don't trigger notifications, logs, or escalations. In large-scale deployments, this could silently suppress entire layers of monitoring visibility, delaying incident response.

Root Causes

Improper check_interval or retry_interval: Extremely high intervals or missing values can delay or prevent execution. Both are expressed in units of `interval_length`, which defaults to 60 seconds.
Disabled active checks: Active checks turned off per service (`active_checks_enabled 0`) or globally (`execute_service_checks=0` in `nagios.cfg`), or a misconfigured `check_command`.
Scheduler overload: Large environments with too many checks and insufficient Nagios daemon resources.
Corrupted retention files: Inconsistencies in `retention.dat` or `status.dat` after an abrupt restart of the Nagios service.
Missing or misconfigured timeperiods: Services scheduled outside valid time windows won’t run.
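Several of these causes come down to a few global directives in the main configuration file. As a rough sketch, the helper below prints the values of `execute_service_checks`, `max_concurrent_checks`, and `interval_length` when they are set explicitly; the function name `global_check_settings` is hypothetical, and directives that are absent fall back to Nagios defaults that this sketch does not report.

```shell
# global_check_settings CFG
# Prints "name=value" for a few scheduler-related directives that are
# explicitly set in the main Nagios config file passed as CFG.
global_check_settings() {
  awk -F= '
    /^[ \t]*#/ { next }                 # skip commented-out lines
    {
      key = $1
      gsub(/[ \t]/, "", key)
      if (key == "execute_service_checks" ||
          key == "max_concurrent_checks" ||
          key == "interval_length") {
        val = $2
        gsub(/[ \t\r]/, "", val)
        print key "=" val
      }
    }
  ' "$1"
}
```

A result of `execute_service_checks=0` would mean active service checks are disabled for the whole instance, which by itself is enough to keep actively checked services in PENDING.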

Diagnostic Workflow

1. Check Service Definition

Ensure each service has a valid `check_command` and intervals set:

define service {
  host_name            myserver
  service_description  CPU Load
  check_command        check_load
  check_interval       5
  retry_interval       1
  active_checks_enabled 1
}
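Because services usually inherit settings from templates, the effective values are easier to audit in the object cache that Nagios writes at startup than in the raw config files. The sketch below assumes the cache uses `define service { ... }` blocks with whitespace-separated `key value` lines and that it lives somewhere like `/usr/local/nagios/var/objects.cache` (the `object_cache_file` setting); the function name and the reason labels are hypothetical.

```shell
# flag_suspect_services OBJECTS_CACHE
# Prints "host;service: reasons" for services that have no check_command,
# have active checks disabled, or have a check_interval of 0.
flag_suspect_services() {
  awk '
    /^[ \t]*define[ \t]+service[ \t]*[{]/ {
      in_svc = 1; host = ""; svc = ""; cmd = ""; interval = ""; active = ""
      next
    }
    in_svc && /^[ \t]*[}]/ {
      reason = ""
      if (cmd == "") reason = reason " no-check_command"
      if (active == "0") reason = reason " active-checks-disabled"
      if (interval != "" && interval + 0 == 0) reason = reason " check_interval-is-0"
      if (reason != "") print host ";" svc ":" reason
      in_svc = 0
      next
    }
    in_svc {
      line = $0
      sub(/^[ \t]+/, "", line)                  # drop indentation
      if (match(line, /[ \t]+/) == 0) next      # no value on this line
      key = substr(line, 1, RSTART - 1)
      val = substr(line, RSTART + RLENGTH)
      if (key == "host_name") host = val
      else if (key == "service_description") svc = val
      else if (key == "check_command") cmd = val
      else if (key == "check_interval") interval = val
      else if (key == "active_checks_enabled") active = val
    }
  ' "$1"
}
```

Services that are intentionally passive-only will also be flagged, so treat the output as a list of candidates to review rather than a list of errors.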

2. Review Nagios Scheduler Load

Use the Nagios web UI or the CLI to compare the check load against the capacity of the monitoring host:

ps -ef | grep nagios
top
iostat -xz 1

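If CPU or I/O is saturated, reduce check frequency or distribute the load across additional workers. If the host has spare capacity but `max_concurrent_checks` is set to a low non-zero value, raise it (0 means no limit).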
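Check latency, the delay between when a check was scheduled and when it actually ran, is a more direct sign of scheduler overload than raw CPU figures. The following sketch summarizes it from the status file, assuming each `servicestatus` block contains a `check_latency` field in seconds; the function name `service_latency_summary` is hypothetical.

```shell
# service_latency_summary STATUS_FILE
# Prints the number of services plus the average and maximum
# check_latency found in servicestatus blocks (hoststatus is ignored).
service_latency_summary() {
  awk -F= '
    /^[ \t]*servicestatus[ \t]*[{]/ { in_svc = 1; next }
    /^[ \t]*[}]/ { in_svc = 0; next }
    in_svc {
      key = $1
      gsub(/[ \t]/, "", key)
      if (key == "check_latency") {
        n++
        sum += $2
        if ($2 + 0 > max) max = $2 + 0
      }
    }
    END {
      if (n == 0) { print "services=0"; exit }
      printf "services=%d avg_latency=%.3f max_latency=%.3f\n", n, sum / n, max
    }
  ' "$1"
}
```

Latency that keeps growing between runs suggests the scheduler is falling behind, which is when distributing load is more likely to help than tuning individual services.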

3. Inspect nagios.log

Key entries like "Warning: The check of service ... was not executed" may indicate config issues:

/usr/local/nagios/var/nagios.log
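A busy log is easier to scan once it is filtered down to warnings and errors. This is a small sketch that assumes the usual Nagios log line format of an epoch timestamp in square brackets followed by the message; the function name `log_problems` is hypothetical.

```shell
# log_problems LOGFILE [COUNT]
# Prints the last COUNT (default 20) log lines whose message starts
# with "Warning:" or "Error:".
log_problems() {
  grep -E '^\[[0-9]+\] (Warning|Error):' "$1" | tail -n "${2:-20}"
}
```

For example, `log_problems /usr/local/nagios/var/nagios.log 50` would show the fifty most recent matching entries.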

4. Validate Time Periods

Ensure defined time periods are not excluding checks unintentionally:

define timeperiod {
  timeperiod_name  24x7
  alias            24 Hours A Day, 7 Days A Week
  sunday           00:00-24:00
  monday           00:00-24:00
  ...
}
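If many PENDING services share one check period, that timeperiod is a good place to start looking. The sketch below groups never-checked services by their `check_period`, assuming the status file records both `check_period` and `has_been_checked` in each `servicestatus` block; the function name `pending_by_check_period` is hypothetical.

```shell
# pending_by_check_period STATUS_FILE
# Prints "count period" for never-checked services, highest count first.
pending_by_check_period() {
  awk -F= '
    /^[ \t]*servicestatus[ \t]*[{]/ { in_svc = 1; period = ""; checked = ""; next }
    in_svc && /^[ \t]*[}]/ {
      if (checked == "0") count[period]++
      in_svc = 0
      next
    }
    in_svc {
      key = $1
      gsub(/[ \t]/, "", key)
      if (key == "check_period") period = $2
      else if (key == "has_been_checked") checked = $2
    }
    END { for (p in count) print count[p], p }
  ' "$1" | sort -rn
}
```

A period other than `24x7` at the top of the output is worth comparing against its `define timeperiod` block to confirm that the current day and time fall inside it.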

5. Restart Nagios Cleanly

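Stop Nagios, then clear the status and retention files. Removing `retention.dat` also discards retained state such as acknowledgements, comments, and scheduled downtime, so keep a copy if you need that data: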

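service nagios stop
# Keep a copy of the retained state in case it needs to be restored
cp -p /usr/local/nagios/var/retention.dat /usr/local/nagios/var/retention.dat.bak
rm -f /usr/local/nagios/var/status.dat
rm -f /usr/local/nagios/var/retention.dat
service nagios start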

Monitor whether services begin transitioning from PENDING.
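One rough way to watch the transition is to count how many objects in the status file have still never been checked. This sketch assumes the `has_been_checked=0` marker appears once per unchecked host or service, so the number covers both; the function name `count_never_checked` is hypothetical.

```shell
# count_never_checked STATUS_FILE
# Prints how many host and service status blocks still have
# has_been_checked=0. Always prints a number, including 0.
count_never_checked() {
  awk '/has_been_checked=0/ { n++ } END { print n + 0 }' "$1"
}
```

Running `count_never_checked /usr/local/nagios/var/status.dat` every minute or so after the restart should show the number falling toward zero. A count that stays flat points back to the configuration and scheduling checks in the earlier steps.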

Best Practices and Long-Term Fixes

Use proper intervals: Avoid extreme check/retry intervals. Stick to 5-10 min checks and 1-2 min retries for most services.
Scale out with distributed monitoring: Use Nagios Remote Plugin Executor (NRPE) to run plugins on the monitored hosts, or mod_gearman to spread check execution across worker nodes.
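Validate all objects post-edit: Always run `nagios -v` against your main configuration file after config changes, for example `nagios -v /etc/nagios/nagios.cfg` (source installs typically use `/usr/local/nagios/etc/nagios.cfg`).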
Monitor Nagios itself: Set alerts for scheduler latency, check queue depth, and daemon uptime.
Version upgrades: Ensure you are not affected by known bugs in older versions related to the event scheduler.
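The validation habit is easier to keep when it is wired into the restart itself. The wrapper below is a sketch that runs a given command only if `nagios -v` succeeds; the `NAGIOS_BIN` and `NAGIOS_CFG` defaults are assumptions based on a source install, and the function name `validate_then` is hypothetical.

```shell
# Both defaults are assumptions; override them to match your installation.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# validate_then COMMAND [ARGS...]
# Runs COMMAND only when the configuration passes "nagios -v".
validate_then() {
  if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
    "$@"
  else
    echo "config validation failed; not running: $*" >&2
    return 1
  fi
}
```

A call such as `validate_then service nagios restart` would then refuse to restart the daemon with a broken configuration, which avoids one common way of ending up with a stopped or half-loaded scheduler.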

Conclusion

When services remain in a PENDING state in Nagios, it's often symptomatic of deeper systemic issues—be it resource constraints, misconfiguration, or scheduling conflicts. Through structured diagnosis—starting from service definitions and check intervals to scheduler load and time periods—engineers can restore full monitoring functionality. In mission-critical environments, proactively applying these best practices ensures that Nagios remains a reliable part of your DevOps toolchain.

FAQs

1. Is it normal for services to be in PENDING after a Nagios restart?

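Yes, briefly. Nagios spreads initial checks out after startup, so services should transition to OK/WARNING/CRITICAL within the first check interval (typically a few minutes) or, in larger installations, within the window set by `max_service_check_spread`.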

2. Can I force Nagios to immediately check all PENDING services?

Yes. Use the "Schedule a forced check" option in the web UI, or write a `SCHEDULE_FORCED_SVC_CHECK` external command to the Nagios command file for each affected service.
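As an illustration of the command-line route, the sketch below builds one `SCHEDULE_FORCED_SVC_CHECK` line in the external command format (`[timestamp] COMMAND;host;service;check_time`). It only prints the line, so nothing is sent until the output is redirected; the function name `forced_check_line` is hypothetical.

```shell
# forced_check_line HOST SERVICE
# Prints an external command line that asks Nagios to check
# SERVICE on HOST immediately, regardless of schedule or timeperiod.
forced_check_line() {
  now=$(date +%s)
  printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$now" "$1" "$2" "$now"
}
```

On a source install the line would typically be sent with `forced_check_line myserver "CPU Load" > /usr/local/nagios/var/rw/nagios.cmd`. That path is an assumption, and external commands must be enabled (`check_external_commands=1`) for Nagios to read it.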

3. Do passive checks affect PENDING status?

Yes. If a service relies only on passive checks and none have been submitted, it will remain in PENDING until one arrives.
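To confirm that a passive-only service can leave PENDING, a test result can be submitted by hand. This sketch formats a `PROCESS_SERVICE_CHECK_RESULT` external command (`host;service;return_code;plugin_output`) and prints it without sending it; the function name `passive_result_line` is hypothetical.

```shell
# passive_result_line HOST SERVICE RETURN_CODE OUTPUT
# Prints an external command line that submits a passive result.
# RETURN_CODE follows plugin conventions: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
passive_result_line() {
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$(date +%s)" "$1" "$2" "$3" "$4"
}
```

Redirecting the output to the Nagios command file, as with any other external command, should move the service out of PENDING if passive checks are accepted for it. If it stays PENDING, check that `passive_checks_enabled` is set on the service and that external commands are enabled.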

4. What logs help the most when troubleshooting PENDING states?

The main Nagios log (`nagios.log`) and configuration validation output are the most helpful for identifying root causes.

5. Is there a risk in deleting status.dat or retention.dat?

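Deleting them while Nagios is running is unsafe, so stop the service first. Both files are regenerated on restart, but deleting `retention.dat` discards retained state, including acknowledgements, comments, and scheduled downtime. Keep a backup if that data matters.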
