Node NotReady — Runbook
Audience: Operations / SRE / Platform Team
Covers: node conditions, kubelet health, resource pressure (CPU/Mem/Disk), storage/IO, CRI-O runtime, MachineConfigPool, forensic Prometheus queries.
1. Setup

    NODE=<node-name>   # e.g. NODE=ocp4-worker-0

2. Node status & conditions

    oc get node $NODE -o wide
    oc describe node $NODE | sed -n '/Conditions:/,/Addresses:/p'
    oc describe node $NODE | sed -n '/Events:/,$p'

What to look for: Ready=False (or Unknown, when the kubelet has stopped reporting), KubeletNotReady, MemoryPressure=True, DiskPressure=True, PIDPressure=True, NetworkUnavailable=True.
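The condition check above can be reduced to a filter that prints only the off-nominal conditions. This is a minimal sketch: the function name `flag_conditions` and the `Type=Status` input format are assumptions made for illustration, not part of the original runbook.

```shell
# Print node conditions that are off-nominal.
# Input: one "Type=Status" pair per line on stdin (hypothetical format).
# Ready is healthy only when True; every other condition is healthy only when False.
flag_conditions() {
  awk -F= '
    NF < 2 { next }                                   # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print "ALERT " $0; next }
    $1 != "Ready" && $2 != "False" { print "ALERT " $0 }
  '
}

# Usage against a live cluster (needs oc and a logged-in session):
#   oc get node "$NODE" \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | flag_conditions
```

No output means every condition is in its healthy state.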
3. Kubelet health (API proxy check)

    oc get --raw /api/v1/nodes/$NODE/proxy/healthz ; echo
    oc get --raw /api/v1/nodes/$NODE/proxy/stats/summary | head

| Result | Interpretation |
|---|---|
| Timeout | kubelet down / node frozen |
| 200 OK | kubelet running |
| Stats error | runtime/storage problem |
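The healthz interpretation can be scripted so that a hung call does not block the shell. A minimal sketch, assuming the kubelet answers `ok` on success; `classify_kubelet` is a hypothetical helper name, and a non-zero exit can also mean an API or authentication failure rather than a dead kubelet.

```shell
# Map the result of the healthz call onto the interpretations in the table.
# $1: exit status of the oc call, $2: response body.
classify_kubelet() {
  if [ "$1" -ne 0 ]; then
    echo "healthz unreachable: kubelet down or node frozen"
  elif [ "$2" = "ok" ]; then
    echo "kubelet running"
  else
    echo "unexpected healthz response: $2"
  fi
}

# Usage against a live cluster (--request-timeout bounds the wait):
#   body=$(oc get --raw "/api/v1/nodes/$NODE/proxy/healthz" --request-timeout=10s)
#   classify_kubelet $? "$body"
```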
4. Debug shell on the node

    oc debug node/$NODE -- chroot /host bash

4.1 Reboot check

    uptime
    who -b
    journalctl -b -1 -n 200 --no-pager

4.2 Memory check

    free -m
    vmstat 1 5
    journalctl -k | grep -Ei "oom|out of memory"

4.3 Disk & inode check

    df -hT
    df -hi

Critical paths: /, /var/lib/containers, /var/lib/kubelet. If usage exceeds 85%, investigate immediately.
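The 85% rule can be applied mechanically to `df` output. A minimal sketch, assuming POSIX-format output (`df -P`) and mount points without spaces; `flag_full_filesystems` is a hypothetical helper name.

```shell
# Print "mountpoint usage%" for every filesystem above the threshold.
# $1: threshold in percent (default 85). Input: `df -P` output on stdin.
flag_full_filesystems() {
  threshold="${1:-85}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5                    # Capacity column, e.g. "90%"
      sub(/%/, "", use)
      if (use + 0 > t) print $6, $5
    }
  '
}

# Usage on the node, from the debug shell:
#   df -P  / /var/lib/containers /var/lib/kubelet | flag_full_filesystems 85
#   df -Pi / /var/lib/containers /var/lib/kubelet | flag_full_filesystems 85   # inodes
```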
4.4 Storage / IO errors

    journalctl -k | grep -Ei "I/O error|blocked for more than|hung|reset|xfs|ext4"

4.5 Kubelet logs

    journalctl -u kubelet -n 200 --no-pager

Look for: PLEG is not healthy, Container runtime is down, DeadlineExceeded, eviction manager errors.
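To avoid scanning 200 log lines by eye, the patterns listed above can be applied as a filter. A minimal sketch; `kubelet_red_flags` is a hypothetical helper name, and the match is case-insensitive because the exact log wording may differ between kubelet versions.

```shell
# Keep only kubelet log lines that match the known failure signatures.
# Input: kubelet log lines on stdin.
kubelet_red_flags() {
  grep -Ei 'PLEG is not healthy|container runtime is down|DeadlineExceeded|eviction manager'
}

# Usage on the node, from the debug shell:
#   journalctl -u kubelet -n 200 --no-pager | kubelet_red_flags
```

An empty result only means that none of these four signatures appear in the sampled window; raise `-n` or add `--since` to widen it.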
4.6 CRI-O logs

    journalctl -u crio -n 200 --no-pager

Look for: storage timeouts, overlay errors, rpc timeout.
4.7 Runtime check

    crictl ps -a

5. Cluster resource pressure

    oc adm top nodes
    oc adm top pods -A --sort-by=memory | head -20
    oc adm top pods -A --sort-by=cpu | head -20

6. MachineConfigPool

    oc get mcp
    oc describe mcp worker

If Updating=True or Degraded=True, the node may be rebooting as part of an MCO update.
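The MachineConfigPool check can be expressed as a one-line-per-pool filter. A minimal sketch: `flag_mcp` and its three-column input (`name updating degraded`) are assumptions for illustration, and the jsonpath expression in the usage comment should be verified against your cluster version.

```shell
# Print pools that are currently updating or degraded.
# Input: "name updating degraded" per line on stdin (hypothetical format).
flag_mcp() {
  awk '$2 == "True" || $3 == "True" {
    print $1 ": updating=" $2 " degraded=" $3
  }'
}

# Usage against a live cluster:
#   oc get mcp -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[?(@.type=="Updating")].status} {.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}' \
#     | flag_mcp
```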
7. Prometheus queries (forensics)

    # Node Ready state
    kube_node_status_condition{condition="Ready"}

    # Nodes NotReady (each status label is its own 0/1 series, so filter on == 1)
    kube_node_status_condition{condition="Ready", status="false"} == 1

    # Memory usage %
    100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

    # OOM events
    increase(node_vmstat_oom_kill[1h])

    # Root filesystem usage %
    100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

    # Container runtime disk usage %
    100 - (node_filesystem_avail_bytes{mountpoint="/var/lib/containers"} / node_filesystem_size_bytes{mountpoint="/var/lib/containers"}) * 100

    # Disk IO saturation
    rate(node_disk_io_time_seconds_total[5m])

    # Kubelet restart detection
    changes(process_start_time_seconds{job="kubelet"}[1h])

8. Common root causes

| Symptom | Root cause |
|---|---|
| healthz timeout | kubelet crash / node frozen |
| OOM events | memory exhaustion |
| DiskPressure | disk full |
| High IO wait | storage backend latency |
| kubelet restarts | runtime crash |
| MCP Updating | reboot triggered by MCO |
9. Escalation
- Infrastructure team → IO errors, storage latency, VM reboot
- Application team → memory leak, high CPU
- Platform team → MCO degraded, kubelet crash loop
10. Immediate actions (if workloads are impacted)

    oc adm cordon $NODE
    oc adm drain $NODE --ignore-daemonsets --delete-emptydir-data --force --grace-period=60 --timeout=10m

Then investigate offline.
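Because a drain is disruptive, it can help to preview the exact commands before running them. A minimal sketch: `run`, `evacuate_node` and the `DRY_RUN` variable are hypothetical names introduced here, and dry-run is the default so that nothing executes by accident.

```shell
# Echo the command when DRY_RUN=1 (default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Cordon first so no new pods are scheduled, then drain with the runbook's flags.
evacuate_node() {
  node="$1"
  run oc adm cordon "$node" &&
    run oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data \
      --force --grace-period=60 --timeout=10m
}

# Preview:  evacuate_node "$NODE"
# Execute:  DRY_RUN=0 evacuate_node "$NODE"
```

Note that `--force` also evicts pods that are not managed by a controller, and `--delete-emptydir-data` discards emptyDir contents; review the preview with that in mind.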
11. Incident closure

- capture the Prometheus graphs
- attach the kubelet logs
- attach the df output
- document the timeline (first NotReady occurrence)