GPU-based workloads, such as AI inference services, can be memory-intensive and difficult to optimize and debug without deep visibility into what the GPU is actually doing. You might see out-of-memory (OOM) errors, unexpected latency spikes, or rising GPU memory pressure, but traditional Kubernetes metrics don't tell you where in the code the memory is being allocated. Profiling shows you the exact functions responsible for GPU memory usage.
Important
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:
This article walks you through how to use GPU observability on AKS:
- Deploy a real-time GPU observability agent: use eBPF-based instrumentation to trace and profile GPU memory allocations.
- Read flame graphs: learn how to interpret the profiling output to find the exact functions consuming the most GPU memory.
Important
Open-source software is mentioned throughout AKS documentation and samples. Software that you deploy is excluded from AKS service-level agreements, limited warranty, and Azure support. As you use open-source technology alongside AKS, consult the support options available from the respective communities and project maintainers to develop a plan.
Microsoft takes responsibility for building the open-source packages that we deploy on AKS. That responsibility includes having complete ownership of the build, scan, sign, validate, and hotfix process, along with control over the binaries in container images. For more information, see Vulnerability management for AKS and AKS support coverage.
Deploy advanced GPU observability - GPU memory profiling on AKS
Prerequisites
- An AKS cluster with at least one GPU-enabled node pool.
- Azure CLI version 2.72.0 or later installed. Run az --version to check.
- Helm version 3.x or later installed. Run helm version to check.
- Azure Monitor (optional, can use your own monitoring setup if preferred).
- Azure Managed Grafana (optional, for visualization).
Step 1: Install Inspektor Gadget
Inspektor Gadget is an open source eBPF-based observability framework for Kubernetes. For GPU profiling, it traces Compute Unified Device Architecture (CUDA) memory allocation calls without requiring code changes, sidecars, or pod restarts.
helm install -n gadget inspektor-gadget \
oci://mcr.microsoft.com/microsoft.inspektor-gadget/helmcharts/inspektor-gadget:0.53.0-0 \
--create-namespace \
--set gpuObservability.enabled=true \
--set azureMonitor.enabled=true
Note
This step assumes you have already enabled Azure Monitor on your AKS cluster. If you plan to use your own Prometheus setup, remove --set azureMonitor.enabled=true.
Verify that pods are running:
kubectl get pods -n gadget -l k8s-app=gadget
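If you script this step, a blocking wait can stand in for the manual check. The following is a minimal sketch, assuming the pods carry the k8s-app=gadget label used above. The wait_for_pods_ready function name and the 180-second default timeout are illustrative choices, not part of the Inspektor Gadget chart.

```shell
#!/usr/bin/env bash
# Block until every pod matching the selector reports Ready,
# or return non-zero once the timeout expires.
# wait_for_pods_ready is a hypothetical helper name.
wait_for_pods_ready() {
  local namespace="$1" selector="$2" timeout="${3:-180s}"
  kubectl wait pod \
    --namespace "$namespace" \
    --selector "$selector" \
    --for=condition=Ready \
    --timeout="$timeout"
}

# Example call (commented out so the snippet has no side effects):
# wait_for_pods_ready gadget k8s-app=gadget 300s
```

A non-zero exit status from the function means at least one pod did not become Ready in time, which is a reasonable point to stop a deployment script and inspect the pods.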
Step 2: Enable profile visualization with Pyroscope
Note
If you have an existing Grafana/Pyroscope stack in your cluster, you can skip this step.
Pyroscope is an open source project that lets you visualize and store performance profiles, which are needed for memory optimization and troubleshooting. Run the following command to deploy a single Pyroscope instance to your cluster:
helm install pyroscope -n gadget \
oci://ghcr.io/grafana/helm-charts/pyroscope \
--version 1.15.0 \
--set pyroscope.image.repository=grafana/pyroscope \
--set-string pyroscope.image.tag=1.15.0 \
--set pyroscope.replicaCount=1 \
--set pyroscope.structuredConfig.self_profiling.disable_push=true \
--set pyroscope.structuredConfig.storage.backend=filesystem \
--set pyroscope.service.type=LoadBalancer \
--set pyroscope.service.port=4040 \
--set-string pyroscope.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-internal"=true \
--set-string pyroscope.service.annotations."service\.beta\.kubernetes\.io/azure-pls-create"=true \
--set-string pyroscope.service.annotations."service\.beta\.kubernetes\.io/azure-pls-name"=pyroscope-pls \
--set-string pyroscope.service.annotations."service\.beta\.kubernetes\.io/azure-pls-proxy-protocol"=false \
--set-string pyroscope.service.annotations."service\.beta\.kubernetes\.io/azure-pls-visibility"='*' \
--set alloy.enabled=false \
--set minio.enabled=false
Verify that pods are running:
kubectl get pods -n gadget pyroscope-0
Note
If you would like to deploy a highly available Pyroscope setup, refer to the Pyroscope microservices documentation for configuration options.
Step 3: Connect Pyroscope to Azure Managed Grafana
Tip
You can directly view your workload profiles using kubectl port-forward -n gadget pyroscope-0 4040:4040 to connect to the Pyroscope UI.
Connecting Pyroscope to Azure Managed Grafana lets you visualize the GPU profiles in Grafana dashboards. Because Pyroscope runs as a pod inside the cluster, Azure Managed Grafana needs a secure network path to reach it. This article uses Azure Private Link for that connection. Start by setting up cluster-related environment variables:
export RESOURCE_GROUP="<your-resource-group>"
export AKS_CLUSTER="<your-aks-cluster-name>"
export LOCATION="<your-aks-cluster-location>"
export GRAFANA_NAME="<your-azure-managed-grafana-name>"
export AKS_NODE_RG=$(az aks show -g "$RESOURCE_GROUP" -n "$AKS_CLUSTER" --query 'nodeResourceGroup' -o tsv)
Tip
If you don't have an existing Azure Managed Grafana instance, run az grafana create -n "$GRAFANA_NAME" -g "$RESOURCE_GROUP" --location "$LOCATION" -o none to create one.
Create a private endpoint to connect Pyroscope to Azure Managed Grafana. Export variables for the private endpoint:
export PYROSCOPE_PLS="pyroscope-pls"
export PYROSCOPE_MPE="pyroscope-mpe"
export PYROSCOPE_PORT="4040"
Create the private link:
Note
Creating the private link can take a few minutes.
# Check amg extension version
ver=$(az extension show --name amg --query version -o tsv)
[[ "${ver%%.*}" -ge 3 ]] && MPE="managed-private-endpoint" || MPE="mpe"
# Ensure Pyroscope PLS is present
until az network private-link-service show -n "$PYROSCOPE_PLS" -g "$AKS_NODE_RG" -o none 2>/dev/null; do
sleep 10
done
# Get the PLS resource ID
PYRO_PLS_ID=$(az network private-link-service show \
-n "$PYROSCOPE_PLS" -g "$AKS_NODE_RG" --query 'id' -o tsv)
# Create the MPE in Grafana
az grafana $MPE create \
--workspace-name "$GRAFANA_NAME" \
--resource-group "$RESOURCE_GROUP" \
--name "$PYROSCOPE_MPE" \
--private-link-resource-id "$PYRO_PLS_ID" \
--location "$LOCATION" -o none
# Wait for MPE to be ready
sleep 30
# Find the pending connection created by Grafana
PYRO_CONN=$(az network private-link-service show \
-n "$PYROSCOPE_PLS" -g "$AKS_NODE_RG" \
--query "privateEndpointConnections[?privateLinkServiceConnectionState.status=='Pending' && starts_with(name, 'grafana-${GRAFANA_NAME}')].name | [0]" -o tsv)
# Approve it
az network private-link-service connection update \
--name "$PYRO_CONN" \
--service-name "$PYROSCOPE_PLS" \
--resource-group "$AKS_NODE_RG" \
--connection-status Approved -o none
# Refresh Grafana so it sees the approval
az grafana $MPE refresh \
--workspace-name "$GRAFANA_NAME" \
--resource-group "$RESOURCE_GROUP" -o none
echo "Successfully created private-link"
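The fixed sleep 30 in the script above is an estimate of how long the pending connection takes to appear. If the connection isn't listed yet, PYRO_CONN ends up empty and the approval call has no connection name to act on. One possible alternative is to poll for the value. The sketch below is a generic helper under that assumption; poll_until_nonempty and find_pending_conn are hypothetical names, not Azure CLI commands.

```shell
#!/usr/bin/env bash
# Re-run a command until it prints a non-empty value.
# Arguments: <max attempts> <delay in seconds> <command...>
# Prints the value and returns 0 on success; returns 1 if every attempt is empty.
poll_until_nonempty() {
  local attempts="$1" delay="$2"
  shift 2
  local value i
  for ((i = 1; i <= attempts; i++)); do
    # A failing command is treated the same as empty output.
    value=$("$@" 2>/dev/null) || value=""
    if [[ -n "$value" ]]; then
      printf '%s\n' "$value"
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage: wrap the lookup from the script above in a function, then poll it.
# find_pending_conn() {
#   az network private-link-service show -n "$PYROSCOPE_PLS" -g "$AKS_NODE_RG" \
#     --query "<same query as in the script above>" -o tsv
# }
# PYRO_CONN=$(poll_until_nonempty 30 10 find_pending_conn) || exit 1
```

Polling bounds the total wait (attempts multiplied by delay) while returning as soon as the connection shows up, which a single fixed sleep can't do.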
Create the data source in Azure Managed Grafana:
# Check amg extension version
ver=$(az extension show --name amg --query version -o tsv)
[[ "${ver%%.*}" -ge 3 ]] && MPE="managed-private-endpoint" || MPE="mpe"
# Grab the private IP
PYRO_IP=$(az grafana $MPE show \
--workspace-name "$GRAFANA_NAME" \
--resource-group "$RESOURCE_GROUP" \
--name "$PYROSCOPE_MPE" \
--query 'privateLinkServicePrivateIP' -o tsv)
# Prepare Pyroscope URL
export PYROSCOPE_URL="http://${PYRO_IP}:${PYROSCOPE_PORT}"
# Create Pyroscope data source in Grafana
az grafana data-source create -n "$GRAFANA_NAME" -g "$RESOURCE_GROUP" --definition "{
\"name\": \"local-pyroscope\",
\"uid\": \"local-pyroscope\",
\"type\": \"grafana-pyroscope-datasource\",
\"access\": \"proxy\",
\"url\": \"${PYROSCOPE_URL}\",
\"jsonData\": { \"keepCookies\": [\"pyroscope_git_session\"] }
}" -o none
echo "Successfully created local-pyroscope data-source"
Note
If you have an existing Grafana/Pyroscope stack in your cluster, you can skip this step.
Verify the data source has a valid URL:
az grafana data-source show -n "$GRAFANA_NAME" -g "$RESOURCE_GROUP" --data-source local-pyroscope
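Because PYROSCOPE_URL is assembled from PYRO_IP, an empty IP lookup silently produces a URL such as http://:4040. A small format check can catch that case. This is a sketch only: is_valid_endpoint_url is a hypothetical helper, and it assumes the endpoint is a plain http://IPv4:port address, as in the steps above.

```shell
#!/usr/bin/env bash
# Return 0 only when the argument looks like http://<IPv4 address>:<port>.
# is_valid_endpoint_url is a hypothetical helper name.
is_valid_endpoint_url() {
  local url="$1"
  # Keep the pattern in a variable so bash treats it as a regex, not a glob.
  local re='^http://([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{1,5}$'
  [[ "$url" =~ $re ]]
}

# Example call (commented out; PYROSCOPE_URL comes from the earlier step):
# is_valid_endpoint_url "$PYROSCOPE_URL" || echo "Pyroscope URL looks wrong: $PYROSCOPE_URL" >&2
```

The check is purely syntactic. It doesn't confirm that Pyroscope is reachable at that address, only that the IP and port were filled in.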
Step 4: Connect Grafana to Azure Monitor managed service for Prometheus
Note
These steps are based on Connect Azure Monitor managed service for Prometheus to Grafana.
Export the required variables:
export AMP_NAME="<your-amp-workspace-name>"
export RESOURCE_GROUP="<your-resource-group>"
Tip
Run az resource list --resource-type Microsoft.Monitor/accounts -g $RESOURCE_GROUP -o table to list Azure Monitor workspace information.
# Get AMW endpoint
AMP_ENDPOINT=$(az resource show --resource-type Microsoft.Monitor/accounts \
-n "$AMP_NAME" -g "$RESOURCE_GROUP" \
--query properties.metrics.prometheusQueryEndpoint -o tsv)
# Create Prometheus data source with MSI auth
az grafana data-source create -n "$GRAFANA_NAME" -g "$RESOURCE_GROUP" \
--definition "{
\"name\": \"$AMP_NAME\",
\"type\": \"prometheus\",
\"access\": \"proxy\",
\"url\": \"$AMP_ENDPOINT\",
\"jsonData\": {
\"httpMethod\": \"POST\",
\"azureCredentials\": { \"authType\": \"msi\" }
}
}"
Verify the data source:
az grafana data-source show -n "$GRAFANA_NAME" -g "$RESOURCE_GROUP" --data-source "$AMP_NAME"
Step 5: Set up dashboards in Grafana
Import the Advanced GPU Observability dashboard into Azure Managed Grafana:
az grafana dashboard create \
-n "$GRAFANA_NAME" \
-g "$RESOURCE_GROUP" \
--definition "$(curl -sSL https://raw.githubusercontent.com/inspektor-gadget/grafana-dashboards/refs/heads/main/dashboards/gpu-observability/AdvancedGPUObservability.json)"
Access the dashboard at:
GRAFANA_URL=$(az grafana show -n "$GRAFANA_NAME" -g "$RESOURCE_GROUP" --query properties.endpoint -o tsv)
echo "${GRAFANA_URL}/d/AdvancedGPUObservability"
For more information about reading the flame graphs shown in Grafana, see Reading flame graphs.
Clean up resources
To remove the in-cluster GPU observability stack:
helm uninstall inspektor-gadget -n gadget
helm uninstall pyroscope -n gadget
kubectl delete namespace gadget
Reading flame graphs
After profiling data is flowing into Pyroscope and Grafana, you'll see flame graphs showing which functions consume the most GPU memory. The following sections explain how to read these visualizations.
What is a flame graph?
A flame graph is a visualization of profiled call stacks. Each bar represents a function, and bars are stacked to show the call chain (who called whom). The width of each bar represents the amount of the measured resource (CPU time, GPU memory allocated, and so on) that flows through that function.
Key rule: The wider a bar, the more of the measured resource flows through that function.
Tip
Use Expand all groups in Grafana's flame graph panel to see the full call stack without collapsing. Use the Search box to find specific functions or keywords.
Read the symbols
Flame graph labels follow these conventions:
| Symbol format | Meaning |
|---|---|
| Foo → bar | class Foo: method def bar() |
| Foo → __init__ | Constructor of class Foo |
| bar (alone) | Standalone def bar() function |
| <interpreter trampoline> | CPython overhead; safe to ignore |
| <raw-address>, e.g. 0x7f151 | Native C/CUDA code; no Python symbol available |
Examples:
- GPUModelRunner/_allocate_kv_cache_tensors: A method on a class. Reads as class GPUModelRunner: def _allocate_kv_cache_tensors(self).
- LlamaMLP/__init__: A constructor. Called when creating a LlamaMLP(...) object.
- _compile_fx_inner: A standalone module-level function, not inside any class.
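The conventions above can be expressed as a small classifier, which might help when you filter exported labels in a script. This is an illustrative sketch, not part of Inspektor Gadget or Pyroscope. classify_symbol is a hypothetical name, and it assumes a class and its method are joined by " → ", "/", or ".", because all three spellings appear in this article.

```shell
#!/usr/bin/env bash
# Map a flame graph label to one of the categories from the table above.
# classify_symbol is a hypothetical helper name.
classify_symbol() {
  local label="$1"
  case "$label" in
    '<interpreter trampoline>')               echo "interpreter overhead" ;;
    '<module>')                               echo "module top level" ;;
    0x*)                                      echo "native code" ;;
    *' → __init__' | */__init__ | *.__init__) echo "constructor" ;;
    *' → '* | */* | *.*)                      echo "method" ;;
    *)                                        echo "function" ;;
  esac
}

classify_symbol "GPUModelRunner/_allocate_kv_cache_tensors"
classify_symbol "LlamaMLP/__init__"
classify_symbol "_compile_fx_inner"
```

The order of the case branches matters: constructors are matched before the general method pattern. A module-qualified standalone function that contains a dot would be reported as a method, so treat the result as a hint rather than a definitive answer.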
Understand self vs total
The distinction between self and total is the most important concept when analyzing flame graphs.
- Total—the resource consumed by a function plus everything it calls. A function can have a large total but allocate nothing itself—it's just a call chain.
- Self—the resource consumed directly by the function, excluding its children. A high self value means this function is where the resource is actually consumed.
Example: GPUModelRunner._allocate_kv_cache_tensors has 55.1 GB self—it's the function that actually calls torch.empty() to create the KV cache tensors.
Navigation tips:
- Leaf nodes (bars with nothing above them)—their entire width is self. Start here to find allocation hotspots.
- Wide bar with 0 self—an orchestrator function that just calls others. Safe to skip when hunting for allocations.
- Wide bar with high self—your optimization target.
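To make the self and total definitions concrete, the sketch below computes both from a few call stacks. It assumes the profile is available as "folded" text, one line per stack in the form root;child;leaf followed by a value. That input format, the summarize_stacks name, and the sample function names and numbers are all illustrative assumptions, not output from Inspektor Gadget or Pyroscope.

```shell
#!/usr/bin/env bash
# Summarize folded stacks ("root;child;leaf <value>") read from stdin.
# Output columns (tab-separated): function, total, self.
# Rows are sorted by self, then total, both descending.
summarize_stacks() {
  awk '
    {
      value = $NF                      # last field is the measured amount
      stack = $0
      sub(/ +[^ ]+$/, "", stack)       # drop the trailing value
      n = split(stack, frames, ";")
      split("", seen)                  # reset the per-stack set
      for (i = 1; i <= n; i++) {
        f = frames[i]
        if (!(f in seen)) {            # count a recursive frame once per stack
          total[f] += value
          seen[f] = 1
        }
      }
      self[frames[n]] += value         # only the leaf frame consumes directly
    }
    END {
      for (f in total) printf "%s\t%d\t%d\n", f, total[f], self[f]
    }
  ' | sort -t "$(printf '\t')" -k3,3nr -k2,2nr -k1,1
}

# Hypothetical sample: values are in MB.
summarize_stacks <<'EOF'
main;load_model;alloc_weights 16000
main;init_kv_cache;alloc_kv 55100
main;init_kv_cache 200
EOF
```

In this sample, main has the largest total but a self value of 0, so it is only a call chain. The leaf alloc_kv sorts first because it has the highest self value, which makes it the place to start looking.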
Find the biggest resource consumer
Use the following steps to identify hotspots:
- Look at the widest bars at the top of the graph: These are leaf functions where memory is actually being allocated. The wider the bar, the more it consumes.
- Check self vs total: A wide bar at the bottom with self: 0 is just a call chain. Follow it upward until you find a bar with high self allocation.
- Read the call stack bottom-to-top: The ordering tells you why the function was called. For example:
<raw-address>, e.g. 0x7f151 → native code entry
<interpreter trampoline> → CPython dispatch
<module> → script top-level
EngineCoreProc.run_engine_core → vLLM engine startup
EngineCore.__init__ → engine initialization
EngineCore._initialize_kv_caches → KV cache setup
Worker.initialize_from_config → worker setup
GPUModelRunner.initialize_kv_cache → model runner
GPUModelRunner._allocate_kv_cache_tensors → 💥 actual allocation
| Goal | What to look for |
|---|---|
| What allocates the most | Widest leaf bar (top of stack) |
| What's responsible for the most | Widest bar (bottom of stack) |
| Optimization targets | Bars with wide self—that's where the resource is consumed |
| Functions to ignore | Wide bars with 0 self—they just call others |