In this article, you learn how to monitor GPU metrics collected by the NVIDIA Data Center GPU Manager (DCGM) exporter in Azure Kubernetes Service (AKS) using Azure Managed Prometheus and Azure Managed Grafana.
Prerequisites
- An AKS cluster with one or more NVIDIA GPU-enabled node pools, with schedulable GPUs.
- A sample GPU workload deployed to your node pool.
- Azure Managed Prometheus and Grafana enabled on your AKS cluster.
- An Azure Container Registry (ACR) integrated with your AKS cluster.
- Helm version 3 or later installed on your machine.
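Before installing the exporter, it can help to confirm that your GPU nodes actually advertise schedulable GPUs. The following sketch assumes the NVIDIA device plugin's standard `nvidia.com/gpu` resource name; it lists each node's allocatable GPU count and keeps only nodes that expose at least one:

```shell
# List node names with their allocatable GPU count, then keep only
# nodes that expose at least one nvidia.com/gpu resource.
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
  | awk -F'\t' '$2 > 0 { print $1 }'
```

An empty result usually means the device plugin isn't running on the node pool or the pool has no GPU nodes.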
Install NVIDIA DCGM Exporter
NVIDIA DCGM Exporter collects and exports GPU metrics. It runs as a pod on your AKS cluster and gathers metrics such as utilization, memory usage, temperature, and power consumption. For more information, see the NVIDIA DCGM Exporter documentation.
Important
Open-source software is mentioned throughout AKS documentation and samples. Software that you deploy is excluded from AKS service-level agreements, limited warranty, and Azure support. As you use open-source technology alongside AKS, consult the support options available from the respective communities and project maintainers to develop a plan.
For example, the Ray GitHub repository describes several platforms that vary in response time, purpose, and support level.
Microsoft takes responsibility for building the open-source packages that we deploy on AKS. That responsibility includes having complete ownership of the build, scan, sign, validate, and hotfix process, along with control over the binaries in container images. For more information, see Vulnerability management for AKS and AKS support coverage.
Update default configurations of the NVIDIA DCGM Exporter
1. Clone the NVIDIA/dcgm-exporter GitHub repository.

    ```bash
    git clone https://github.com/NVIDIA/dcgm-exporter.git
    ```

2. Navigate to the new `dcgm-exporter` directory.

    ```bash
    cd dcgm-exporter
    ```

3. Open the `service-monitor.yaml` file and update the `apiVersion` key to `azmonitoring.coreos.com/v1`. This change allows the NVIDIA DCGM exporter to surface metrics in Azure Managed Prometheus.

    ```yaml
    apiVersion: azmonitoring.coreos.com/v1
    ...
    ...
    ```

4. Navigate to the `deployment` directory and open the `values.yaml` file. Update the following fields in this YAML manifest:

    ```yaml
    ...
    ...
    serviceMonitor:
      apiVersion: "azmonitoring.coreos.com/v1"
    ...
    ...
    nodeSelector:
      accelerator: "nvidia"

    tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
    ...
    ...
    ```
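The `nodeSelector` and `tolerations` above only place the exporter on nodes that carry a matching label and taint. A quick way to check, assuming the `accelerator=nvidia` label and `sku=gpu:NoSchedule` taint shown in the manifest, is to print each node's accelerator label alongside its taint keys:

```shell
# Show the accelerator label and taint keys on each node so you can
# confirm they match the nodeSelector/tolerations in values.yaml.
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.accelerator}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
  | awk -F'\t' '$2 == "nvidia" { print $1, "taints:", $3 }'
```

If no nodes are printed, or the taint keys don't match your tolerations, adjust `values.yaml` (or the node pool's labels and taints) before installing the chart.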
Push the NVIDIA DCGM exporter Helm chart to your Azure Container Registry
1. Navigate to the `deployment` folder of the cloned repository, and then package the Helm chart using the `helm package` command.

    ```bash
    helm package .
    ```

2. Authenticate Helm with your ACR using the `helm registry login` command. Replace `<acr_url>`, `<user_name>`, and `<password>` with your ACR details. For more detailed instructions, see Authenticate Helm with Azure Container Registry.

    ```bash
    helm registry login <acr_url> --username <user_name> --password <password>
    ```

3. Push the Helm chart to your ACR using the `helm push` command. Replace `<dcgm_exporter_version>` with the version noted in the output of the `helm package` command and `<acr_url>` with your ACR URL.

    ```bash
    helm push dcgm-exporter-<dcgm_exporter_version>.tgz oci://<acr_url>/helm
    ```

4. Install the Helm chart on your AKS cluster using the `helm install` command, in the same namespace as your GPU-enabled node pool. Replace `<acr_url>` with your ACR URL.

    ```bash
    helm install dcgm-nvidia oci://<acr_url>/helm/dcgm-exporter -n <gpu_namespace>
    ```

5. Check the installation on your AKS cluster using the `helm list` command.

    ```bash
    helm list -n <gpu_namespace>
    ```

6. Verify that the NVIDIA DCGM Exporter is running on your GPU node pool using the `kubectl get pods` and `kubectl get ds` commands.

    ```bash
    kubectl get pods -n <gpu_namespace>
    kubectl get ds -n <gpu_namespace>
    ```
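Beyond checking pod status, you can spot-check the exporter's metrics endpoint directly. This sketch port-forwards the exporter DaemonSet (the DCGM exporter serves metrics on port 9400 by default; `<dcgm_exporter_daemonset>` is a placeholder for the DaemonSet name reported by `kubectl get ds`) and filters for DCGM series:

```shell
# Forward the exporter's metrics port locally (9400 is the default),
# then fetch /metrics and keep only DCGM series.
kubectl port-forward -n <gpu_namespace> ds/<dcgm_exporter_daemonset> 9400:9400 &
PF_PID=$!
sleep 3
curl -s http://localhost:9400/metrics | grep '^DCGM_' | head
kill $PF_PID
```

If no `DCGM_*` series appear, the exporter pod may not be scheduled on a GPU node, or DCGM may not detect any GPUs.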
Export GPU Prometheus metrics and configure the NVIDIA Grafana dashboard
Once the NVIDIA DCGM Exporter is successfully deployed to your GPU node pool, you need to export the GPU metrics enabled by default to Azure Managed Prometheus by deploying a Kubernetes `PodMonitor` resource.
1. Create a file named `pod-monitor.yaml` and add the following configuration to it:

    ```yaml
    apiVersion: azmonitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: nvidia-dcgm-exporter
      labels:
        app.kubernetes.io/name: nvidia-dcgm-exporter
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: nvidia-dcgm-exporter
      podMetricsEndpoints:
      - port: metrics
        interval: 30s
      podTargetLabels:
    ```

2. Apply this PodMonitor configuration to your AKS cluster in the `kube-system` namespace using the `kubectl apply` command.

    ```bash
    kubectl apply -f pod-monitor.yaml -n kube-system
    ```

3. Verify the PodMonitor was successfully created using the `kubectl get podmonitor` command.

    ```bash
    kubectl get podmonitor -n kube-system
    ```
In the Azure portal, navigate to the Managed Prometheus > Prometheus explorer section of your Azure Monitor workspace. Select the Grid tab and search for an example DCGM GPU metric, such as `DCGM_FI_DEV_SM_CLOCK`, in the PromQL box.

Import the dcgm-exporter-dashboard.json into your Managed Grafana instance using the steps in Create a dashboard in Azure Managed Grafana. After importing the JSON, the dashboard displaying GPU metrics should be visible in your Grafana instance.
Next steps
- Deploy a multi-instance GPU (MIG) workload on AKS.
- Explore the AI toolchain operator add-on (preview) for AI inferencing and fine-tuning.
- Learn more about Ray clusters on AKS.