Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Node Problem Detector (NPD) is an open source Kubernetes component that detects node-related problems and reports on them. It runs as a systemd serviced on each node in the cluster and collects various metrics and system information, such as CPU usage, disk usage, and network connectivity. When it detects a problem, it generates events and/or node conditions. Azure Kubernetes Service (AKS) uses NPD to monitor and manage nodes in a Kubernetes cluster running on the Azure cloud platform. The AKS Linux extension enables NPD by default.
Note
Upgrades to NPD are independent of the node image and Kubernetes version upgrade processes. If a node pool is unhealthy (that is, in a failed state), new NPD versions aren't installed.
Node conditions
Node conditions indicate a permanent problem that makes the node unavailable. AKS uses the following node conditions from NPD to expose permanent problems on the node. NPD also emits corresponding Kubernetes events.
Problem Daemon type | NodeCondition | Reason |
---|---|---|
CustomPluginMonitor | FilesystemCorruptionProblem | FilesystemCorruptionDetected |
CustomPluginMonitor | KubeletProblem | KubeletIsDown |
CustomPluginMonitor | ContainerRuntimeProblem | ContainerRuntimeIsDown |
CustomPluginMonitor | VMEventScheduled | VMEventScheduled |
CustomPluginMonitor | FrequentUnregisterNetDevice | UnregisterNetDevice |
CustomPluginMonitor | FrequentKubeletRestart | FrequentKubeletRestart |
CustomPluginMonitor | FrequentContainerdRestart | FrequentContainerdRestart |
CustomPluginMonitor | FrequentDockerRestart | FrequentDockerRestart |
SystemLogMonitor | KernelDeadlock | DockerHung |
SystemLogMonitor | ReadonlyFilesystem | FilesystemIsReadOnly |
Events
NPD emits events with relevant information to help you diagnose underlying issues.
Problem Daemon type | Reason | Frequency | Description | Action |
---|---|---|---|---|
CustomPluginMonitor | EgressBlocked | 30 min | This event checks for connectivity to external endpoints | Check if a firewall or NSG blocking the connectivity to the endpoint getting flagged |
CustomPluginMonitor | FilesystemCorruptionDetected | 5min | This checks for filesystem corruption surfaced by docker | |
CustomPluginMonitor | KubeletIsDown | 30s | This checks if kubelet service is running and healthy | |
CustomPluginMonitor | ContainerRuntimeIsDown | 30s | This event checks if the container runtime eg: containerd is running and healthy | |
CustomPluginMonitor | FreezeScheduled | 1min | This event checks if a Freeze Event is scheduled on the node. Check https://aka.ms/aks/scheduledevents for more information | |
CustomPluginMonitor | RebootScheduled | 1min | This event checks if a Reboot Event is scheduled on the node Check https://aka.ms/aks/scheduledevents for more information | |
CustomPluginMonitor | RedeployScheduled | 1min | This event checks if a Redeploy Event is scheduled on the node. Check https://aka.ms/aks/scheduledevents for more information | |
CustomPluginMonitor | TerminateScheduled | 1min | This event checks if a Terminate Event is scheduled on the node. Check https://aka.ms/aks/scheduledevents for more information | |
CustomPluginMonitor | PreemptScheduled | 2s | This event checks if a Preempt Event is scheduled on the node. Check https://aka.ms/aks/scheduledevents for more information | |
CustomPluginMonitor | DNSProblem | |||
SystemLogMonitor | OOMKilling | |||
SystemLogMonitor | TaskHung | |||
SystemLogMonitor | UnregisterNetDevice | |||
SystemLogMonitor | KernelOops | |||
SystemLogMonitor | DockerSocketCannotConnect | |||
SystemLogMonitor | KubeletRPCDeadlineExceeded | |||
SystemLogMonitor | KubeletRPCNoSuchContainer | |||
SystemLogMonitor | CNICannotStatFS | |||
SystemLogMonitor | PLEGUnhealthy | |||
SystemLogMonitor | KubeletStart | |||
SystemLogMonitor | DockerStart | |||
SystemLogMonitor | ContainerdStart |
In certain instances, AKS automatically cordons and drains the node to minimize disruption to workloads. For more information about the events and actions, see Node autodrain.
EgressBlocked
The list of endpoints checked by the EgressBlocked are listed below
Note
The actual endpoints will depend on the type of the cluster and the location where it's hosted (Public cloud vs Airgapped clouds). Review the documentation for outbound access here. The documentation is for public clouds
Type | Example | Note |
---|---|---|
MCR | https://mcr.microsoft.com | |
Microsoft Entra ID | https://login.microsoftonline.com" | |
Resource Manager | https://management.azure.com | |
Packages | https://packages.microsoft.com | |
Kube Binary | https://acs-mirror.azureedge.net/acs-mirror/healthz,https://packages.aks.azure.com/acs-mirror/healthz |
Check the node conditions and events
Check the node conditions and events using the
kubectl describe node
command.kubectl describe node my-aks-node
Your output should look similar to the following example condensed output:
... ... Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- VMEventScheduled False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 NoVMEventScheduled VM has no scheduled event FrequentContainerdRestart False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 NoFrequentContainerdRestart containerd is functioning properly FrequentDockerRestart False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 NoFrequentDockerRestart docker is functioning properly FilesystemCorruptionProblem False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 FilesystemIsOK Filesystem is healthy FrequentUnregisterNetDevice False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 NoFrequentUnregisterNetDevice node is functioning properly ContainerRuntimeProblem False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:40 +0000 ContainerRuntimeIsUp container runtime service is up KernelDeadlock False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 KernelHasNoDeadlock kernel has no deadlock FrequentKubeletRestart False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 NoFrequentKubeletRestart kubelet is functioning properly KubeletProblem False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 KubeletIsUp kubelet service is up ReadonlyFilesystem False Thu, 01 Jun 2023 19:14:25 +0000 Thu, 01 Jun 2023 03:57:41 +0000 FilesystemIsNotReadOnly Filesystem is not read-only NetworkUnavailable False Thu, 01 Jun 2023 03:58:39 +0000 Thu, 01 Jun 2023 03:58:39 +0000 RouteCreated RouteController created a route MemoryPressure True Thu, 01 Jun 2023 19:16:50 +0000 Thu, 01 Jun 2023 19:16:50 +0000 KubeletHasInsufficientMemory kubelet has insufficient memory available DiskPressure False Thu, 01 Jun 2023 19:16:50 +0000 Thu, 01 Jun 2023 03:57:22 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Thu, 01 Jun 2023 19:16:50 +0000 Thu, 01 Jun 2023 03:57:22 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Thu, 01 Jun 2023 19:16:50 +0000 Thu, 01 Jun 2023 03:57:23 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled ... ... ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeHasSufficientMemory 94s (x176 over 15h) kubelet Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory
These events are also available in Container Insights through KubeEvents.
Metrics
NPD also exposes Prometheus metrics based on the node problems, which you can use for monitoring and alerting. These metrics are exposed on port 20257 of the Node IP and Prometheus can scrape them.
The following example YAML shows a scrape config you can use with the Azure Managed Prometheus add on as a DaemonSet:
kind: ConfigMap
apiVersion: v1
metadata:
name: ama-metrics-prometheus-config-node
namespace: kube-system
data:
prometheus-config: |-
global:
scrape_interval: 1m
scrape_configs:
- job_name: node-problem-detector
scrape_interval: 1m
scheme: http
metrics_path: /metrics
relabel_configs:
- source_labels: [__metrics_path__]
regex: (.*)
target_label: metrics_path
- source_labels: [__address__]
replacement: '$NODE_NAME'
target_label: instance
static_configs:
- targets: ['$NODE_IP:20257']
The following example shows the scraped metrics:
problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0
Next steps
For more information on NPD, see kubernetes/node-problem-detector.
Azure Kubernetes Service