Configure GPU monitoring with Container insights

Starting with agent version ciprod03022019, the Container insights integrated agent supports monitoring GPU (graphics processing unit) usage on GPU-aware Kubernetes cluster nodes, and monitoring pods and containers that request and use GPU resources.

Note

Per the Kubernetes upstream announcement, Kubernetes is deprecating the GPU metrics reported by the kubelet in Kubernetes version 1.20 and later. As a result, Container insights will no longer be able to collect the following metrics out of the box:

  • containerGpuDutyCycle
  • containerGpumemoryTotalBytes
  • containerGpumemoryUsedBytes

To continue collecting GPU metrics through Container insights, migrate to your GPU vendor-specific metrics exporter by December 31, 2022, and configure Prometheus scraping to collect metrics from the deployed exporter.
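
For example, if you deploy the NVIDIA DCGM exporter as your vendor-specific exporter, you can spot-check that it is exposing Prometheus metrics before you configure scraping. The following is a minimal sketch, assuming the exporter's metrics endpoint has been port-forwarded to localhost; the namespace, DaemonSet name, port, and DCGM metric-name prefixes shown are assumptions that depend on your exporter deployment and version.

```python
import requests

# Assumes the exporter's metrics endpoint has been port-forwarded locally, for example:
#   kubectl -n gpu-resources port-forward ds/dcgm-exporter 9400:9400
# The namespace, DaemonSet name, and port above are illustrative and depend on your deployment.
EXPORTER_URL = "http://localhost:9400/metrics"

resp = requests.get(EXPORTER_URL, timeout=10)
resp.raise_for_status()

# Print GPU utilization and GPU memory samples. The DCGM metric-name prefixes below are
# examples; the exact names exposed vary with the exporter version and configuration.
for line in resp.text.splitlines():
    if line.startswith(("DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED")):
        print(line)
```

If the endpoint returns metrics like these, you can then point Container insights Prometheus scraping at the exporter service or pods.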

Supported GPU vendors

Container insights supports monitoring GPU clusters from the following GPU vendors:

  • NVIDIA
  • AMD

Container insights automatically starts monitoring GPU usage on nodes, and GPU-requesting pods and workloads, by collecting the following metrics at 60-second intervals and storing them in the InsightsMetrics table.

Note

After provisioning a cluster with GPU nodes, ensure that the GPU driver is installed as required by AKS to run GPU workloads. Container insights collects GPU metrics through GPU driver pods running on the node.
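
Because metric collection depends on the driver pods, it can help to confirm that your nodes advertise GPU capacity and that the driver/device-plugin pods are running. The sketch below uses the Kubernetes Python client; the nvidia.com/gpu resource name and the nvidia-device-plugin pod-name filter are NVIDIA-specific assumptions, so adjust them for your GPU vendor and driver installation.

```python
from kubernetes import client, config

# Use config.load_incluster_config() instead when running inside the cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# 1. Confirm that GPU nodes advertise GPU capacity to Kubernetes.
#    "nvidia.com/gpu" is the NVIDIA resource name; other vendors register their own.
for node in v1.list_node().items:
    capacity = node.status.capacity or {}
    print(f"{node.metadata.name}: nvidia.com/gpu capacity = {capacity.get('nvidia.com/gpu', '0')}")

# 2. Confirm that the GPU driver / device-plugin pods are running.
#    The "nvidia-device-plugin" name filter is an assumption; match it to your driver installation.
for pod in v1.list_namespaced_pod("kube-system").items:
    if "nvidia-device-plugin" in pod.metadata.name:
        print(f"{pod.metadata.name}: {pod.status.phase}")
```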

| Metric name | Metric dimension (tags) | Description |
|---|---|---|
| containerGpuDutyCycle* | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Percentage of time over the past sample period (60 seconds) during which the GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
| containerGpuLimits | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpuRequests | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpumemoryTotalBytes* | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU memory in bytes available to use for a specific container. |
| containerGpumemoryUsedBytes* | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU memory in bytes used by a specific container. |
| nodeGpuAllocatable | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Number of GPUs in a node that can be used by Kubernetes. |
| nodeGpuCapacity | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Total number of GPUs in a node. |

* Based on Kubernetes upstream changes, these metrics are no longer collected out of the box. As a temporary hotfix for AKS, upgrade your GPU node pool to the latest version or *-2022.06.08 or higher. For Azure Arc-enabled Kubernetes, set the feature gate DisableAcceleratorUsageMetrics=false in the kubelet configuration of the node and restart the kubelet. Once the upstream changes reach general availability, this fix will no longer work, so plan to migrate to your GPU vendor-specific metrics exporter by December 31, 2022.
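
Once the metrics are flowing, you can query them from the InsightsMetrics table with a Log Analytics (KQL) query, interactively or programmatically. The sketch below uses the azure-monitor-query Python SDK under a few assumptions: the workspace ID placeholder must be replaced with your Log Analytics workspace ID, and the query simply pulls the latest value of some of the GPU metrics listed in the table above.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

# Placeholder -- replace with the workspace ID of the Log Analytics workspace
# that your Container insights cluster sends data to.
WORKSPACE_ID = "<log-analytics-workspace-id>"

# Pull the most recent value of a few GPU metrics from the InsightsMetrics table.
QUERY = """
InsightsMetrics
| where Name in ("nodeGpuCapacity", "nodeGpuAllocatable", "containerGpuRequests", "containerGpuLimits")
| summarize arg_max(TimeGenerated, Val) by Computer, Name
| order by Computer asc, Name asc
"""

log_client = LogsQueryClient(DefaultAzureCredential())
response = log_client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(row)
else:
    # Partial results: inspect the error before trusting the data.
    print(response.partial_error)
```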

GPU performance charts

Container insights includes pre-configured charts for the metrics listed earlier in the table as a GPU workbook for every cluster. See Workbooks in Container insights for a description of the workbooks available for Container insights.

Next steps