Understanding Node NotReady in Kubernetes
Background and Root Causes
Nodes in Kubernetes regularly send heartbeats to the control plane via the kubelet, both as node status updates and as a Lease object in the kube-node-lease namespace. If the API server stops receiving these updates within the configured grace period, the node controller marks the node's Ready condition Unknown, which kubectl reports as NotReady. This is often caused by one of the following (a quick way to inspect the heartbeat data itself is shown after the list):
- Network failures between node and API server.
- Disk pressure or filesystem corruption preventing kubelet operations.
- Resource exhaustion (CPU, memory) starving kubelet and critical daemons.
- Kubelet misconfiguration or crash loops.
- Cloud provider VM or hypervisor failures.
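To see the heartbeat data the control plane is evaluating, you can inspect the node's Ready condition and its Lease directly (NODE_NAME is a placeholder throughout):
kubectl get node NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
kubectl get lease NODE_NAME -n kube-node-lease -o yaml
The condition output includes lastHeartbeatTime, reason, and message, which usually point toward one of the causes above.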
Architectural Implications
In clusters hosting critical workloads, a NotReady node can cause pods to be evicted and rescheduled, consuming additional resources and potentially breaching latency SLAs. In stateful applications, forced pod eviction may result in data re-replication, impacting performance cluster-wide. If multiple nodes become NotReady simultaneously (e.g., due to an AZ-wide outage), recovery can be slow and resource-intensive.
Diagnostics
Immediate Status Checks
Begin with a cluster-wide node status check:
kubectl get nodes -o wide
To inspect the specific node's condition and events:
kubectl describe node NODE_NAME
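Node-scoped events can also be pulled on their own, which is often faster to scan than the full describe output:
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=NODE_NAME --sort-by=.lastTimestamp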
Investigating Kubelet Health
Access the node directly and check kubelet logs:
journalctl -u kubelet --since "1 hour ago"
Look for signs of:
- Container runtime errors.
- Filesystem I/O errors.
- Authentication or TLS certificate issues with the API server.
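To surface these signs quickly, a rough filter over the kubelet journal can help (the grep patterns are illustrative, not exhaustive):
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE "error|fail|x509|certificate|timeout|runtime"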
Network Diagnostics
From the node, test API server reachability:
curl -k https://API_SERVER:6443/healthz --cert /var/lib/kubelet/pki/kubelet-client-current.pem --key /var/lib/kubelet/pki/kubelet-client-current.pem
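If that request fails, it helps to separate basic reachability from TLS and authentication problems. A rough sketch, assuming nc is available on the node (API_SERVER is a placeholder):
getent hosts API_SERVER                    # DNS / hosts resolution
nc -vz API_SERVER 6443                     # raw TCP reachability to the API port
curl -k https://API_SERVER:6443/healthz    # unauthenticated probe; even a 401/403 proves the network path works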
Common Pitfalls in Fix Attempts
- Rebooting the node without root cause analysis — may hide deeper issues.
- Manually removing the node without draining pods — risks data loss in stateful sets.
- Blaming kubelet without checking underlying infrastructure failures.
Step-by-Step Fixes
1. Resolve Disk Pressure
Check disk usage and clear unused images/containers:
df -h
docker system prune -af
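Note that docker system prune only applies when the node runs the Docker engine. On containerd-based nodes (an assumption that holds for most recent clusters), a rough equivalent is:
df -h /var/lib/kubelet /var/lib/containerd   # the paths kubelet and the runtime actually write to
crictl rmi --prune                           # remove images not used by any container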
2. Restart and Validate Kubelet
systemctl restart kubelet
systemctl status kubelet
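After the restart, confirm the node actually returns to Ready rather than flapping:
kubectl get node NODE_NAME -w      # watch the Ready status change
journalctl -u kubelet -f           # follow kubelet logs while it re-registers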
3. Restore Network Connectivity
For CNI-related failures, restart the CNI pods on the node:
kubectl delete pod -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=NODE_NAME --force --grace-period=0
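The k8s-app=calico-node label assumes a Calico CNI; adjust the selector for your plugin. Afterwards, confirm a fresh CNI pod is Running on the affected node:
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide --field-selector spec.nodeName=NODE_NAME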
4. Drain and Rejoin Node
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
kubectl delete node NODE_NAME
# Re-register after fixing issues
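How the node re-registers depends on how the cluster was provisioned. For a kubeadm-managed cluster (an assumption; managed node groups usually replace the instance instead), the rejoin flow is roughly:
kubeadm token create --print-join-command   # on a control-plane node: prints a fresh join command
kubeadm reset -f                            # on the repaired node: clear stale state first
kubeadm join API_SERVER:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>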
Best Practices for Prevention
- Enable node problem detector to surface kernel/disk/network issues early.
- Set up proactive alerts on NodeNotReady events and kubelet health (a minimal check is sketched after this list).
- Implement autoscaling groups or node pools with health checks for automatic replacement.
- Regularly rotate TLS certificates and verify kubelet authentication.
- Simulate node failures in staging to test workload resilience.
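For the alerting practice above, most teams feed NodeNotReady conditions into Prometheus and Alertmanager; kube-state-metrics exposes a kube_node_status_condition metric for exactly this. As a minimal script-based sketch, the following one-liner lists any node whose Ready condition is not True:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | awk '$2 != "True"'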
Conclusion
Persistent Node NotReady conditions in Kubernetes can degrade reliability and user experience if left unaddressed. By combining low-level node diagnostics, infrastructure checks, and preventive automation, DevOps teams can quickly remediate failures and build clusters that recover gracefully from disruptions.
FAQs
1. Can a NodeNotReady state be caused by API server overload?
Yes. If the API server is under heavy load or experiencing network latency, heartbeats from healthy nodes may be delayed, leading to false NotReady statuses.
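To rule this out, query the API server's own health endpoints; the verbose readyz output breaks the result down per internal check:
kubectl get --raw='/readyz?verbose'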
2. Should I always remove NotReady nodes?
No. Investigate and fix the issue first. Removing without diagnosis can cause unnecessary pod rescheduling and data replication overhead.
3. How long does Kubernetes wait before marking a node NotReady?
By default, the kube-controller-manager marks nodes NotReady after approximately 40 seconds without a successful status update from the kubelet.
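This window is controlled by the kube-controller-manager's --node-monitor-grace-period flag. On a kubeadm-style cluster (an assumption), you can check whether it has been overridden with something like:
kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep node-monitor   # no output means the default is in effect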
4. Can cordoning a node help in troubleshooting?
Yes. Cordoning prevents new pods from being scheduled on the node while you investigate the root cause.
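The commands are symmetrical, so it is easy to undo once the investigation is done:
kubectl cordon NODE_NAME     # mark the node unschedulable
kubectl uncordon NODE_NAME   # make it schedulable again afterwards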
5. Does managed Kubernetes (EKS, GKE, AKS) handle NodeNotReady automatically?
Many managed services will attempt to replace unhealthy nodes automatically, but diagnosis is still critical to prevent repeated failures.