Monitor GPU metrics from NVIDIA DCGM exporter with Azure Managed Prometheus and Managed Grafana on Azure Kubernetes Service (AKS)

In this article, you learn how to monitor GPU metrics collected by the NVIDIA Data Center GPU Manager (DCGM) exporter in Azure Kubernetes Service (AKS) using Azure Managed Prometheus and Azure Managed Grafana.

Prerequisites

  * An AKS cluster with a GPU-enabled node pool.
  * An Azure Monitor workspace with Azure Managed Prometheus enabled and linked to the cluster.
  * An Azure Managed Grafana instance linked to the Azure Monitor workspace.
  * An Azure Container Registry (ACR) you can push Helm charts to.
  * Azure CLI, Helm version 3, kubectl, and git installed on your local machine.
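
If you don't already have a GPU-enabled node pool, the following sketch shows one way to create one whose label and taint match the nodeSelector and tolerations values used later in this article. The resource names and VM size are placeholders, not values from this article.

    # Sketch: add a GPU node pool whose label and taint match the values.yaml
    # settings configured later in this article. Names and VM size are placeholders.
    az aks nodepool add \
        --resource-group <resource_group> \
        --cluster-name <cluster_name> \
        --name gpunp \
        --node-count 1 \
        --node-vm-size Standard_NC6s_v3 \
        --node-taints sku=gpu:NoSchedule \
        --labels accelerator=nvidia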

Install NVIDIA DCGM Exporter

NVIDIA DCGM Exporter collects and exports GPU metrics. It runs as a pod on your AKS cluster and gathers metrics such as utilization, memory usage, temperature, and power consumption. For more information, see the NVIDIA DCGM Exporter documentation.
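
The exporter publishes these metrics in Prometheus exposition format under field names defined by DCGM. A few representative names (the full list, and which fields are enabled by default, are defined by NVIDIA's exporter configuration):

    DCGM_FI_DEV_GPU_UTIL      # GPU utilization (%)
    DCGM_FI_DEV_FB_USED       # framebuffer (GPU memory) used (MiB)
    DCGM_FI_DEV_GPU_TEMP      # GPU temperature (degrees C)
    DCGM_FI_DEV_POWER_USAGE   # power draw (W)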

Important

Open-source software is mentioned throughout AKS documentation and samples. Software that you deploy is excluded from AKS service-level agreements, limited warranty, and Azure support. As you use open-source technology alongside AKS, consult the support options available from the respective communities and project maintainers to develop a plan.

For example, the Ray GitHub repository describes several platforms that vary in response time, purpose, and support level.

Microsoft takes responsibility for building the open-source packages that we deploy on AKS. That responsibility includes having complete ownership of the build, scan, sign, validate, and hotfix process, along with control over the binaries in container images. For more information, see Vulnerability management for AKS and AKS support coverage.

Update default configurations of the NVIDIA DCGM Exporter

  1. Clone the NVIDIA/dcgm-exporter GitHub repository.

    git clone https://github.com/NVIDIA/dcgm-exporter.git
    
  2. Navigate to the new dcgm-exporter directory.

    cd dcgm-exporter
    
  3. Open the service-monitor.yaml file (under the deployment/templates directory) and update the apiVersion key to azmonitoring.coreos.com/v1. This change allows the NVIDIA DCGM exporter to surface metrics in Azure Managed Prometheus.

    apiVersion: azmonitoring.coreos.com/v1
    ...
    ...
    
  4. Navigate to the deployment directory and open the values.yaml file. Update the following fields in this YAML manifest. The nodeSelector and tolerations must match the label and taint on your GPU node pool; a quick way to check this follows the list.

    ...
    ...
    serviceMonitor:
      apiVersion: "azmonitoring.coreos.com/v1"
    ...
    ...
    nodeSelector:
      accelerator: "nvidia"

    tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    ...
    ...
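
As a sketch (assuming the label and taint values shown above), you can confirm that your GPU nodes carry the matching label and taint with kubectl:

    # List nodes carrying the label referenced by nodeSelector above.
    kubectl get nodes -l accelerator=nvidia
    # Inspect the taints on a GPU node; expect sku=gpu:NoSchedule to match the toleration.
    kubectl describe node <gpu_node_name> | grep -i taints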
    

Push the NVIDIA DCGM exporter Helm chart to your Azure Container Registry

  1. Navigate to the deployment folder of the cloned repository, and then package the Helm chart using the helm package command.

    cd deployment   # from the root of the cloned repository, if you aren't already there
    helm package .
    
  2. Authenticate Helm with your ACR using the helm registry login command. Replace <acr_url>, <user_name>, and <password> with your ACR details. For more detailed instructions, see Authenticate Helm with Azure Container Registry. A token-based alternative is sketched after this list.

    helm registry login <acr_url> --username <user_name> --password <password>
    
  3. Push the Helm chart to your ACR using the helm push command. Replace <dcgm_exporter_version> with the version noted in the output of the helm package command and <acr_url> with your ACR URL. A filled-in example follows this list.

    helm push dcgm-exporter-<dcgm_exporter_version>.tgz oci://<acr_url>/helm
    
  4. Install the Helm chart on your AKS cluster using the helm install command, targeting the namespace where your GPU workloads run. Replace <acr_url> with your ACR URL and <gpu_namespace> with that namespace.

    helm install dcgm-nvidia oci://<acr_url>/helm/dcgm-exporter -n <gpu_namespace>
    
  5. Check the installation on your AKS cluster using the helm list command.

    helm list -n <gpu_namespace>
    
  6. Verify the NVIDIA DCGM exporter is running on your GPU node pool using the kubectl get pods and kubectl get ds commands. A direct check of the exporter's metrics endpoint is also sketched after this list.

    kubectl get pods -n <gpu_namespace>
    kubectl get ds -n <gpu_namespace>
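
If you prefer not to pass a static username and password in step 2, one pattern from the ACR documentation is to log in with a short-lived ACR access token from the Azure CLI. A sketch, assuming a registry named <acr_name>; the GUID username is the fixed value ACR expects for token-based logins:

    # Sketch: token-based Helm login to ACR, assuming a registry named <acr_name>.
    USER_NAME="00000000-0000-0000-0000-000000000000"
    PASSWORD=$(az acr login --name <acr_name> --expose-token --output tsv --query accessToken)
    helm registry login <acr_name>.azurecr.io --username $USER_NAME --password $PASSWORD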
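
As a filled-in example for step 3, assuming helm package produced dcgm-exporter-3.1.5.tgz and your registry's login server is myregistry.azurecr.io (both illustrative values, not from this article):

    # Illustrative values only; substitute the filename and registry from your environment.
    helm push dcgm-exporter-3.1.5.tgz oci://myregistry.azurecr.io/helm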
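
To confirm the exporter is actually serving metrics, you can port-forward to one of its pods and query the endpoint directly. A sketch, assuming the exporter's default port 9400; the pod name and the output line shown are illustrative:

    # Forward the exporter's metrics port (9400 by default) from one of its pods.
    kubectl port-forward -n <gpu_namespace> pod/<dcgm_exporter_pod_name> 9400:9400
    # In another terminal, fetch a metric; the output line below is illustrative.
    curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
    # DCGM_FI_DEV_GPU_UTIL{gpu="0",device="nvidia0",...} 0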
    

Export GPU Prometheus metrics and configure the NVIDIA Grafana dashboard

Once the NVIDIA DCGM exporter is successfully deployed to your GPU node pool, export the GPU metrics it enables by default to Azure Managed Prometheus by deploying a Kubernetes PodMonitor resource.

  1. Create a file named pod-monitor.yaml and add the following configuration to it:

    apiVersion: azmonitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: nvidia-dcgm-exporter
      labels:
        app.kubernetes.io/name: nvidia-dcgm-exporter
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: nvidia-dcgm-exporter
      podMetricsEndpoints:
      - port: metrics
        interval: 30s
      podTargetLabels:
    
  2. Apply this PodMonitor configuration to your AKS cluster using the kubectl apply command in the kube-system namespace.

    kubectl apply -f pod-monitor.yaml -n kube-system
    
  3. Verify the PodMonitor was successfully created using the kubectl get podmonitor command.

    kubectl get podmonitor -n kube-system
    
  4. In the Azure portal, navigate to the Managed Prometheus > Prometheus explorer section of your Azure Monitor workspace. Select the Grid tab and search for an example DCGM GPU metric in the PromQL box, for example DCGM_FI_DEV_SM_CLOCK. More example queries follow at the end of this list:

    Screenshot of the Metrics section of an Azure Monitor workspace in the Azure portal.

  5. Import the dcgm-exporter-dashboard.json file (found in the grafana directory of the cloned repository) into your Managed Grafana instance using the steps in Create a dashboard in Azure Managed Grafana. After importing the JSON, the dashboard displaying GPU metrics should be visible in your Grafana instance.

    Screenshot of the NVIDIA DCGM Exporter dashboard.
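
Here are a few more example queries you can paste into the PromQL box in Prometheus explorer. These are sketches: the metric names are standard DCGM fields, but label names such as gpu depend on how the exporter is configured in your cluster.

    # SM clock per GPU (the example metric from step 4)
    DCGM_FI_DEV_SM_CLOCK
    # Average GPU utilization across all GPUs
    avg(DCGM_FI_DEV_GPU_UTIL)
    # GPU memory used, grouped by GPU index (the "gpu" label name is an assumption)
    sum by (gpu) (DCGM_FI_DEV_FB_USED)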

Next steps