Restart / CrashLoop / OOM
Restart
Restarts per pod (15m)

    increase(
      kube_pod_container_status_restarts_total{namespace!~"kube-.*|openshift-.*"}[15m]
    )
Crash looping (>3 restarts in 1m)

    increase(kube_pod_container_status_restarts_total[1m]) > 3
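The kubelet restarts a crashing container with an increasing back-off delay, so a container that has been looping for a while may restart fewer than three times within any single minute and slip past the restart-count query above. As a complementary sketch, assuming kube-state-metrics exposes `kube_pod_container_status_waiting_reason` in this cluster, the following lists containers that are currently waiting in `CrashLoopBackOff`. The namespace filter is copied from the other queries on this page.

```promql
# Containers currently in CrashLoopBackOff, as reported by kube-state-metrics.
# The metric is a 0/1 gauge, so "== 1" keeps only the active series.
kube_pod_container_status_waiting_reason{
  namespace!~"kube-.*|openshift-.*",
  reason="CrashLoopBackOff"
} == 1
```

The two queries can be used side by side: the restart count catches fast loops, and the waiting reason catches loops that have already backed off.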
OOMKilled
Current state

    kube_pod_container_status_last_terminated_reason{
      namespace!~"kube-.*|openshift-.*",
      reason="OOMKilled"
    }
History (24h)

    increase(
      kube_pod_container_status_last_terminated_reason{
        namespace!~"kube-.*|openshift-.*",
        reason="OOMKilled"
      }[24h]
    )
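`kube_pod_container_status_last_terminated_reason` is a gauge rather than a counter, so `increase()` over it can return 0 for a series that stays at 1 for the whole window. A possible alternative, offered as a sketch and not as a drop-in replacement, is to ask whether the gauge was ever 1 during the last 24 hours:

```promql
# Containers whose last termination reason was OOMKilled
# at any point in the last 24h (the gauge reached 1 at least once).
max_over_time(
  kube_pod_container_status_last_terminated_reason{
    namespace!~"kube-.*|openshift-.*",
    reason="OOMKilled"
  }[24h]
) == 1
```

This reports which containers were affected, not how many times each was killed; the restart counter is the better source for counts.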
OOM kills correlated with the specific container (alert-ready)

    (
      kube_pod_container_status_restarts_total
      - kube_pod_container_status_restarts_total offset 10m >= 1
    )
    and ignoring (reason)
    min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
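The subtraction with `offset 10m` in the query above returns nothing for a series that did not yet exist 10 minutes ago, for example a pod created in the meantime. A variant sketch that uses `increase()` on the restart counter, which needs only two samples inside the window, might look like this. It keeps the same 10-minute window and the same `OOMKilled` condition; whether it fits better depends on the scrape interval and on how the alert is evaluated.

```promql
# At least one restart in the last 10m ...
(increase(kube_pod_container_status_restarts_total[10m]) >= 1)
# ... on a container whose last termination reason was OOMKilled for the whole window.
and ignoring (reason)
(min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1)
```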
Node-level OOM (kernel)

    increase(node_vmstat_oom_kill[1h])
General pod status
Pods not Ready

    kube_pod_status_ready{
      namespace!~"kube-.*|openshift-.*",
      condition="false"
    } == 1
Unhealthy pods (Pending/Unknown/Failed)

    sum by (namespace, pod) (
      kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}
    ) > 0
Pod count per namespace

    count by (namespace) (
      kube_pod_info{namespace!~"kube-.*|openshift-.*"}
    )
Node out of pod capacity (>90% of allocatable pods)

    sum by (node) (
      (kube_pod_status_phase{phase="Running"} == 1)
      + on(uid) group_left(node)
      (0 * kube_pod_info{pod_template_hash=""})
    )
    /
    sum by (node) (kube_node_status_allocatable{resource="pods"})
    * 100 > 90