Understanding Node NotReady in Kubernetes
Background and Root Causes
Nodes in Kubernetes regularly send heartbeats to the control plane via the kubelet, both as node status updates and as a Lease object in the kube-node-lease namespace. If the API server stops receiving these updates within the configured grace period, the node controller marks the node's Ready condition Unknown, which kubectl reports as NotReady. This is often caused by one of the following (a quick way to inspect the heartbeat data itself is shown after the list):
- Network failures between node and API server.
- Disk pressure or filesystem corruption preventing kubelet operations.
- Resource exhaustion (CPU, memory) starving kubelet and critical daemons.
- Kubelet misconfiguration or crash loops.
- Cloud provider VM or hypervisor failures.
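To see the heartbeat data the control plane is evaluating, you can inspect the node's Ready condition and its Lease directly (NODE_NAME is a placeholder throughout):
kubectl get node NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
kubectl get lease NODE_NAME -n kube-node-lease -o yaml
The condition output includes lastHeartbeatTime, reason, and message, which usually point toward one of the causes above.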
Architectural Implications
In clusters hosting critical workloads, a NotReady node can cause pods to be evicted and rescheduled, consuming additional resources and potentially breaching latency SLAs. In stateful applications, forced pod eviction may result in data re-replication, impacting performance cluster-wide. If multiple nodes become NotReady simultaneously (e.g., due to an AZ-wide outage), recovery can be slow and resource-intensive.
Diagnostics
Immediate Status Checks
Begin with a cluster-wide node status check:
kubectl get nodes -o wide
To inspect the specific node's condition and events:
kubectl describe node NODE_NAME
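Node-scoped events can also be pulled on their own, which is often faster to scan than the full describe output:
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=NODE_NAME --sort-by=.lastTimestamp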
Investigating Kubelet Health
Access the node directly and check kubelet logs:
journalctl -u kubelet --since "1 hour ago"
Look for signs of:
- Container runtime errors.
- Filesystem I/O errors.
- Authentication or TLS certificate issues with the API server.
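To surface these signs quickly, a rough filter over the kubelet journal can help (the grep patterns are illustrative, not exhaustive):
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE "error|fail|x509|certificate|timeout|runtime"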
Network Diagnostics
From the node, test API server reachability:
curl -k https://API_SERVER:6443/healthz --cert /var/lib/kubelet/pki/kubelet-client-current.pem --key /var/lib/kubelet/pki/kubelet-client-current.pem
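If that request fails, it helps to separate basic reachability from TLS and authentication problems. A rough sketch, assuming nc is available on the node (API_SERVER is a placeholder):
getent hosts API_SERVER                    # DNS / hosts resolution
nc -vz API_SERVER 6443                     # raw TCP reachability to the API port
curl -k https://API_SERVER:6443/healthz    # unauthenticated probe; even a 401/403 proves the network path works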
Common Pitfalls in Fix Attempts
- Rebooting the node without root cause analysis — may hide deeper issues.
- Manually removing the node without draining pods — risks data loss in stateful sets.
- Blaming kubelet without checking underlying infrastructure failures.
Step-by-Step Fixes
1. Resolve Disk Pressure
Check disk usage and clear unused images/containers:
df -h
docker system prune -af
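Note that docker system prune only applies when the node runs the Docker engine. On containerd-based nodes (an assumption that holds for most recent clusters), a rough equivalent is:
df -h /var/lib/kubelet /var/lib/containerd   # the paths kubelet and the runtime actually write to
crictl rmi --prune                           # remove images not used by any container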
2. Restart and Validate Kubelet
systemctl restart kubelet
systemctl status kubelet
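After the restart, confirm the node actually returns to Ready rather than flapping:
kubectl get node NODE_NAME -w      # watch the Ready status change
journalctl -u kubelet -f           # follow kubelet logs while it re-registers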
3. Restore Network Connectivity
For CNI-related failures, restart the CNI pods on the node:
kubectl delete pod -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=NODE_NAME --force --grace-period=0
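The k8s-app=calico-node label assumes a Calico CNI; adjust the selector for your plugin. Afterwards, confirm a fresh CNI pod is Running on the affected node:
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide --field-selector spec.nodeName=NODE_NAME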
4. Drain and Rejoin Node
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
kubectl delete node NODE_NAME
# Re-register after fixing issues
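How the node re-registers depends on how the cluster was provisioned. For a kubeadm-managed cluster (an assumption; managed node groups usually replace the instance instead), the rejoin flow is roughly:
kubeadm token create --print-join-command   # on a control-plane node: prints a fresh join command
kubeadm reset -f                            # on the repaired node: clear stale state first
kubeadm join API_SERVER:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>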
Best Practices for Prevention
- Enable node problem detector to surface kernel/disk/network issues early.
- Set up proactive alerts on NodeNotReady events and kubelet health (a minimal check is sketched after this list).
- Implement autoscaling groups or node pools with health checks for automatic replacement.
- Regularly rotate TLS certificates and verify kubelet authentication.
- Simulate node failures in staging to test workload resilience.
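For the alerting practice above, most teams feed NodeNotReady conditions into Prometheus and Alertmanager; kube-state-metrics exposes a kube_node_status_condition metric for exactly this. As a minimal script-based sketch, the following one-liner lists any node whose Ready condition is not True:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | awk '$2 != "True"'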
Conclusion
Persistent Node NotReady conditions in Kubernetes can degrade reliability and user experience if left unaddressed. By combining low-level node diagnostics, infrastructure checks, and preventive automation, DevOps teams can quickly remediate failures and build clusters that recover gracefully from disruptions.
FAQs
1. Can a NodeNotReady state be caused by API server overload?
Yes. If the API server is under heavy load or experiencing network latency, heartbeats from healthy nodes may be delayed, leading to false NotReady statuses.
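To rule this out, query the API server's own health endpoints; the verbose readyz output breaks the result down per internal check:
kubectl get --raw='/readyz?verbose'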
2. Should I always remove NotReady nodes?
No. Investigate and fix the issue first. Removing without diagnosis can cause unnecessary pod rescheduling and data replication overhead.
3. How long does Kubernetes wait before marking a node NotReady?
By default, the kube-controller-manager marks nodes NotReady after approximately 40 seconds without a successful status update from the kubelet.
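This window is controlled by the kube-controller-manager's --node-monitor-grace-period flag. On a kubeadm-style cluster (an assumption), you can check whether it has been overridden with something like:
kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep node-monitor   # no output means the default is in effect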
4. Can cordoning a node help in troubleshooting?
Yes. Cordoning prevents new pods from being scheduled on the node while you investigate the root cause.
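The commands are symmetrical, so it is easy to undo once the investigation is done:
kubectl cordon NODE_NAME     # mark the node unschedulable
kubectl uncordon NODE_NAME   # make it schedulable again afterwards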
5. Does managed Kubernetes (EKS, GKE, AKS) handle NodeNotReady automatically?
Many managed services will attempt to replace unhealthy nodes automatically, but diagnosis is still critical to prevent repeated failures.