This article provides guidance for configuring integrated monitoring for Azure CycleCloud clusters by using the Prometheus self-agent and Azure Managed Grafana. This solution enables real-time collection and visualization of GPU, InfiniBand, and system metrics for high-performance computing (HPC) and artificial intelligence (AI) workloads.
Overview
Starting with CycleCloud 8.8.1, you can monitor your Slurm clusters out of the box. The Prometheus self-agent automates the collection of metrics from compute nodes and Slurm jobs, giving you real-time insight into cluster performance and resource utilization. Azure Managed Grafana visualizes these metrics through customizable dashboards, so you can track system health, identify bottlenecks, and optimize workloads.
This integrated monitoring solution reduces operational overhead and improves the reliability of your HPC environment.
The monitoring setup includes installation and configuration of:
- Prometheus Node Exporter (with InfiniBand support)
- NVIDIA DCGM exporter (for NVIDIA GPU nodes)
- SchedMD Slurm exporter (for Slurm scheduler nodes)
Prerequisites
Before you begin, ensure you have:
- An Azure subscription with permissions to create resources and register resource providers
- Azure CLI installed on your local machine (don't run the deployment scripts from the CycleCloud VM or Cloud Shell)
- An existing CycleCloud 8.8.1 or later environment, or Azure CycleCloud Workspace for Slurm deployment
- Permissions to create Azure Monitor Workspace and Azure Managed Grafana resources
Architecture
The monitoring solution consists of:
| Component | Description |
|---|---|
| Prometheus self-agent | Collects GPU, InfiniBand, and system metrics from each cluster node |
| Azure Monitor Workspace for Prometheus | Stores and manages the collected metrics |
| Azure Managed Grafana | Provides customizable dashboards for metric visualization |
| Managed Identity | Provides secure authentication for metric transmission |
Create the managed monitoring infrastructure
Set up Azure Monitor Workspace for Prometheus and Azure Managed Grafana to receive and visualize metrics from your CycleCloud clusters.
Step 1: Create a resource group
Create a dedicated resource group for the monitoring infrastructure:
```
az group create --location <location> --name <monitoring_resource_group>
```
Step 2: Deploy the monitoring infrastructure
Clone the monitoring repository and run the deployment script:
```
git clone https://github.com/Azure/cyclecloud-monitoring.git
cd cyclecloud-monitoring
./infra/deploy.sh <monitoring_resource_group>
```
After deployment, the script creates the following resources in your specified resource group:
- An Azure Monitor Workspace named `ccw-mon-xxx`
- An Azure Managed Grafana instance named `ccw-graf-xxx`
- Preconfigured dashboards for cluster monitoring
Step 3: Grant monitoring permissions to managed identity
To publish metrics to the Azure Monitor Workspace, the managed identity needs the Monitoring Metrics Publisher role. Assign this role to the managed identity that your cluster nodes use.
For CycleCloud Workspace for Slurm deployments
If you're using CycleCloud Workspace for Slurm (CCWS), assign the role to the ccwLockerManagedIdentity that the CCWS deployment creates:
```
./infra/add_publisher.sh <ccws_resource_group> ccwLockerManagedIdentity
```
For custom managed identity deployments
If you're using your own managed identity, run the same script with your managed identity name:
```
./infra/add_publisher.sh <umi_resource_group> <umi_name>
```
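If you prefer to assign the role directly with the Azure CLI instead of running the script, the underlying role assignment looks roughly like the following sketch. The placeholders are values you must supply yourself (the managed identity's `principalId` and the full resource ID of the Data Collection Rule); the script may scope the assignment differently in your deployment, so treat this as illustrative rather than a drop-in replacement.

```shell
# Sketch of the role assignment that add_publisher.sh performs.
# <principal_id> and <dcr_resource_id> are placeholders, not literal values.
az role assignment create \
  --assignee "<principal_id>" \
  --role "Monitoring Metrics Publisher" \
  --scope "<dcr_resource_id>"
```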
Retrieve monitoring configuration parameters
To enable monitoring on your cluster nodes, you need three configuration parameters.
Get the managed identity client ID
Retrieve the client ID of the managed identity with the Monitoring Metrics Publisher role:
```
az identity show --name <umi_name> --resource-group <umi_resource_group> --query 'clientId' --output tsv
```
For CCWS deployments, use ccwLockerManagedIdentity as the identity name.
Get the ingestion endpoint
Retrieve the ingestion endpoint from the deployment outputs:
```
jq -r '.properties.outputs.ingestionEndpoint.value' <infra_monitoring_dir>/outputs.json
```
Alternatively, you can find these values in the Azure portal by navigating to your Azure Monitor Workspace properties.
Get the data collection rules
You can retrieve the data collection rules from the Azure Monitor Workspace properties in the Azure portal.
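If your deployment's `outputs.json` also exposes the data collection rule as an output, you can extract it with the same `jq` pattern used for the ingestion endpoint. The `dataCollectionRule` field name and the file structure below are assumptions inferred from the ingestion-endpoint query above; check your actual outputs file. A runnable sketch against a mock file:

```shell
# Mock outputs.json with the structure implied by the ingestionEndpoint
# query above; the dataCollectionRule field name is an assumption.
cat > /tmp/outputs.json <<'EOF'
{
  "properties": {
    "outputs": {
      "ingestionEndpoint": { "value": "https://ccw-mon-xxx.eastus-1.metrics.ingest.monitor.azure.com" },
      "dataCollectionRule": { "value": "ccw-dcr-xxx" }
    }
  }
}
EOF

# Same jq pattern as for the ingestion endpoint.
jq -r '.properties.outputs.dataCollectionRule.value' /tmp/outputs.json
```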
Enable monitoring
You can enable monitoring during initial deployment or on an existing cluster.
Option 1: Enable during CycleCloud Workspace for Slurm deployment
When you deploy Azure CycleCloud Workspace for Slurm from the Azure Marketplace:
- Go to the Monitoring section in the deployment wizard.
- Select the checkbox to enable monitoring.
- Enter the Monitoring ingestion endpoint from your Azure Monitor Workspace properties.
- Enter the Data collection rules from your Azure Monitor Workspace properties.
Option 2: Enable in CycleCloud cluster settings
Starting with CycleCloud 8.8.1, the monitoring option is included in the Slurm default template:
- Open the CycleCloud portal and go to your cluster.
- Select Edit on the cluster configuration.
- Go to the Monitoring tab.
- Enable monitoring and provide the following values:
  - Client ID: The client ID of the managed identity with the `Monitoring Metrics Publisher` role (for CCWS, use `ccwLockerManagedIdentity`)
  - Ingestion Endpoint: The Azure Monitor Workspace ingestion endpoint
Option 3: Configure manually for other cluster types
For non-Slurm cluster types, add the cluster-init project and set the monitoring parameters:
- Add the following line to your cluster template after each node array configuration:

  ```
  [[[cluster-init cyclecloud/monitoring:default]]]
  ```

- Set the monitoring parameters for each node and node array:

  ```
  cyclecloud.monitoring.enabled = true
  cyclecloud.monitoring.identity_client_id = <client_id>
  cyclecloud.monitoring.ingestion_endpoint = <ingestion_endpoint>
  ```

- In the CycleCloud portal, select your cluster and edit the scheduler node settings.
- In Software/Configuration, paste the three monitoring parameters.
- Save and repeat for each node array and individual nodes.
Access Grafana dashboards
After your cluster starts with monitoring enabled, access the visualization dashboards:
- Navigate to your Azure Managed Grafana instance in the Azure portal.
- Select the Endpoint URL to open the Grafana portal.
- Expand Dashboards > Azure CycleCloud to view the available dashboards.
Available metrics
Depending on the node type, the monitoring solution collects the following metrics:
GPU metrics
For nodes with NVIDIA GPUs, track:
- GPU utilization rates
- Memory copy utilization
- Clock speeds (graphics, SM, memory)
- Temperature
- Power consumption
- ECC error counts
- NVLink throughput statistics
InfiniBand metrics
For nodes with InfiniBand networking, monitor:
- Port throughput (transmit and receive)
- Error occurrences
- Link status
System metrics
For all nodes, track:
- CPU usage and frequency
- Memory utilization
- Disk space usage
- Network activity
- File system capacity
- NFS operations and throughput
Slurm metrics
For Slurm scheduler nodes, monitor:
- Job queue statistics
- Node state information
- Partition utilization
Verify metrics collection
After you enable monitoring, verify that metrics are collected correctly.
Check metrics in Azure Monitor
- Go to your Azure Monitor Workspace in the Azure portal.
- Select Managed Prometheus > Prometheus explorer from the left panel.
- Use the `up` PromQL query to verify that configured nodes are listed.
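Beyond `up`, you can run a few simple PromQL queries in the Prometheus explorer to spot-check the exporters listed earlier. The metric names below are the standard names emitted by Node Exporter and the DCGM exporter; exact label sets vary by deployment, so adjust labels as needed:

```promql
# Scrape targets and their health (1 = up, 0 = down)
up

# Per-node CPU usage rate, from Node Exporter
rate(node_cpu_seconds_total{mode!="idle"}[5m])

# GPU utilization per GPU, from the DCGM exporter
DCGM_FI_DEV_GPU_UTIL
```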
Verify exporters on nodes
Connect to a cluster node and run the following commands to verify each exporter:
```
# Node Exporter (available on all nodes)
curl -s http://localhost:9100/metrics

# DCGM Exporter (only on nodes with NVIDIA GPUs)
curl -s http://localhost:9400/metrics

# Slurm Exporter (only on Slurm scheduler node)
curl -s http://localhost:9200/metrics
```
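The three checks above can be wrapped in a small loop that flags any exporter that isn't serving metrics. The port numbers are the defaults used above (adjust if you've customized them), and the leading-`#` check relies on Prometheus exposition output starting with `# HELP`/`# TYPE` comment lines:

```shell
# Probe one exporter's metrics endpoint; a leading '#' in the first line
# is a cheap sanity check that we got metrics rather than an error page.
check_exporter() {
  local port="$1"
  if curl -fs --max-time 5 "http://localhost:${port}/metrics" | head -n 1 | grep -q '^#'; then
    echo "exporter on :${port} OK"
  else
    echo "exporter on :${port} not responding"
  fi
}

# Default ports: 9100 Node Exporter, 9400 DCGM, 9200 Slurm exporter.
for port in 9100 9400 9200; do check_exporter "$port"; done
```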
Troubleshooting
Metrics aren't appearing in Azure Monitor
- Verify the managed identity has the `Monitoring Metrics Publisher` role on the Data Collection Rule.
- Check that the ingestion endpoint and client ID are correctly configured.
- Ensure nodes can reach the Azure Monitor endpoint (check network security groups and firewall rules).
- Review node logs for authentication or connectivity errors.
GPU metrics are missing
- Verify NVIDIA drivers are installed: `nvidia-smi`
- Check DCGM exporter status: `curl -s http://localhost:9400/metrics`
- Ensure the node has an NVIDIA GPU attached.
InfiniBand metrics are missing
- Verify InfiniBand drivers: `ibstat`
- Check port status: `ibstatus`
- Ensure InfiniBand network connectivity.
Grafana dashboards show no data
- Verify the Grafana instance has access to the Azure Monitor Workspace.
- Check that the correct time range is selected in Grafana.
- Ensure the cluster has been running long enough for metrics to be collected and ingested.
Limitations
Azure Monitor Workspace has default ingestion limits:
- 1 million time series per minute
- 1 million events per minute
Reaching these limits results in throttling and delayed ingestion. Based on current exporters, you reach these limits at approximately:
| VM Type | Approximate Node Limit |
|---|---|
| HBv4 (176 cores) | ~125 nodes |
| NDv5 (96 cores) | ~154 nodes |
| NCv4 (48 cores) | ~285 nodes |
If you need to monitor larger clusters, request an increase in ingestion limits.
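The node limits in the table follow from dividing the ingestion limit by the number of time series each node emits. For example, assuming an HBv4 node emits roughly 8,000 time series per minute (a figure inferred from the ~125-node limit above, not a published number):

```shell
# 1,000,000 series/minute ingestion limit divided by an assumed
# ~8,000 series per HBv4 node gives the approximate node ceiling.
limit=1000000
per_node=8000
echo "$(( limit / per_node )) nodes"   # prints "125 nodes"
```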