Monitor Azure Machine Learning
When you have critical applications and business processes relying on Azure resources, you want to monitor those resources for their availability, performance, and operation. This article describes the monitoring data generated by Azure Machine Learning and how to analyze and alert on this data with Azure Monitor.
Tip
The information in this document is primarily for administrators, as it describes monitoring for the Azure Machine Learning service and associated Azure services. If you are a data scientist or developer, and want to monitor information specific to your model training runs, see the following documents:
- Start, monitor, and cancel training runs
- Log metrics for training runs
- Track experiments with MLflow
If you want to monitor information generated by models deployed to online endpoints, see Monitor online endpoints.
What is Azure Monitor?
Azure Machine Learning creates monitoring data using Azure Monitor, which is a full-stack monitoring service in Azure. Azure Monitor provides a complete set of features to monitor your Azure resources. It can also monitor resources in other clouds and on-premises.
Start with the article Monitoring Azure resources with Azure Monitor, which describes the following concepts:
- What is Azure Monitor?
- Costs associated with monitoring
- Monitoring data collected in Azure
- Configuring data collection
- Standard tools in Azure for analyzing and alerting on monitoring data
The following sections build on this article by describing the specific data gathered for Azure Machine Learning. These sections also provide examples for configuring data collection and analyzing this data with Azure tools.
Tip
To understand costs associated with Azure Monitor, see Azure Monitor cost and usage. To understand the time it takes for your data to appear in Azure Monitor, see Log data ingestion time.
Monitoring data from Azure Machine Learning
Azure Machine Learning collects the same kinds of monitoring data as other Azure resources that are described in Monitoring data from Azure resources.
See Azure Machine Learning monitoring data reference for a detailed reference of the logs and metrics created by Azure Machine Learning.
Collection and routing
Tip
Logs are grouped into Category groups. Category groups are a collection of different logs to help you achieve different monitoring goals. These groups are defined dynamically and may change over time as new resource logs become available and are added to the category group. Note that this may incur additional charges.
The audit resource log category group allows you to select the resource logs that are necessary for auditing your resource. For more information, see Diagnostic settings in Azure Monitor Resource logs.
Platform metrics and the Activity log are collected and stored automatically, but can be routed to other locations by using a diagnostic setting.
Resource Logs are not collected and stored until you create a diagnostic setting and route them to one or more locations. When you need to manage multiple Azure Machine Learning workspaces, you could route logs for all workspaces into the same logging destination and query all logs from a single place.
See Create diagnostic setting to collect platform logs and metrics in Azure for the detailed process for creating a diagnostic setting using the Azure portal, the Azure CLI, or PowerShell. When you create a diagnostic setting, you specify which categories of logs to collect. The categories for Azure Machine Learning are listed in Azure Machine Learning monitoring data reference.
Important
Enabling these settings requires additional Azure services (storage account, event hub, or Log Analytics), which may increase your cost. To calculate an estimated cost, visit the Azure pricing calculator.
You can configure the following logs for Azure Machine Learning:
Category | Description |
---|---|
AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters. |
AmlComputeClusterNodeEvent (deprecated) | Events from nodes within an Azure Machine Learning compute cluster. |
AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute. |
AmlComputeCpuGpuUtilization | ML services compute CPU and GPU utilization logs. |
AmlOnlineEndpointTrafficLog | Logs for traffic to online endpoints. |
AmlOnlineEndpointConsoleLog | Logs that the containers for online endpoints write to the console. |
AmlOnlineEndpointEventLog | Logs for events regarding the life cycle of online endpoints. |
AmlRunStatusChangedEvent | ML run status changes. |
ModelsChangeEvent | Events when an ML model is created or deleted. |
ModelsReadEvent | Events when ML model is read. |
ModelsActionEvent | Events when ML model is accessed. |
DeploymentReadEvent | Events when a model deployment is read. |
DeploymentEventACI | Events when a model deployment happens on ACI (very chatty). |
DeploymentEventAKS | Events when a model deployment happens on AKS (very chatty). |
InferencingOperationAKS | Events for inference or related operation on AKS compute type. |
InferencingOperationACI | Events for inference or related operation on ACI compute type. |
EnvironmentChangeEvent | Events when ML environment configurations are created or deleted. |
EnvironmentReadEvent | Events when ML environment configurations are read (very chatty). |
DataLabelChangeEvent | Events when data labels or their projects are created or deleted. |
DataLabelReadEvent | Events when data labels or their projects are read. |
ComputeInstanceEvent | Events when ML Compute Instance is accessed (very chatty). |
DataStoreChangeEvent | Events when ML datastore is created or deleted. |
DataStoreReadEvent | Events when ML datastore is read. |
DataSetChangeEvent | Events when a registered or unregistered ML dataset is created or deleted. |
DataSetReadEvent | Events when a registered or unregistered ML dataset is read. |
PipelineChangeEvent | Events when an ML pipeline draft, endpoint, or module is created or deleted. |
PipelineReadEvent | Events when an ML pipeline draft, endpoint, or module is read. |
RunEvent | Events when ML experiments are created or deleted. |
RunReadEvent | Events when ML experiments are read. |
Note
Effective February 2022, the AmlComputeClusterNodeEvent category will be deprecated. We recommend that you instead use the AmlComputeClusterEvent category.
Note
When you enable metrics in a diagnostic setting, dimension information is not currently included as part of the information sent to a storage account, event hub, or Log Analytics.
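As a rough illustration of choosing log categories for a diagnostic setting, the following Python sketch assembles a request body as a plain dictionary. The body layout (`properties.workspaceId`, `logs`, `metrics`) and the `AllMetrics` category follow the general Azure Monitor diagnostic settings resource format as an assumption; verify them against the Create diagnostic setting article before use. The helper name, the `include_chatty` flag, and the resource ID are hypothetical.

```python
import json

# Categories this article flags as "very chatty". They are skipped by
# default to limit ingestion volume and cost.
CHATTY_CATEGORIES = {
    "DeploymentEventACI",
    "DeploymentEventAKS",
    "InferencingOperationACI",
    "InferencingOperationAKS",
    "EnvironmentReadEvent",
    "ComputeInstanceEvent",
}


def build_diagnostic_setting(log_analytics_resource_id, categories, include_chatty=False):
    """Return a diagnostic-setting request body as a plain dict (assumed layout)."""
    selected = [
        c for c in categories
        if include_chatty or c not in CHATTY_CATEGORIES
    ]
    return {
        "properties": {
            "workspaceId": log_analytics_resource_id,
            "logs": [{"category": c, "enabled": True} for c in selected],
            "metrics": [{"category": "AllMetrics", "enabled": True}],
        }
    }


# Hypothetical Log Analytics workspace resource ID, for illustration only.
WORKSPACE_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg/providers/Microsoft.OperationalInsights"
    "/workspaces/example-logs"
)

body = build_diagnostic_setting(
    WORKSPACE_ID,
    ["AmlComputeClusterEvent", "AmlComputeJobEvent", "EnvironmentReadEvent"],
)
print(json.dumps(body, indent=2))
```

Reusing the same destination ID for several Azure Machine Learning workspaces is one way to route all of their logs to a single place, as described earlier in this section.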
The metrics and logs you can collect are discussed in the following sections.
Analyzing metrics
You can analyze metrics for Azure Machine Learning, along with metrics from other Azure services, by opening Metrics from the Azure Monitor menu. See Analyze metrics with Azure Monitor metrics explorer for details on using this tool.
For a list of the platform metrics collected, see Monitoring Azure Machine Learning data reference metrics.
All metrics for Azure Machine Learning are in the namespace Machine Learning Service Workspace.
For reference, you can see a list of all resource metrics supported in Azure Monitor.
Tip
Azure Monitor metrics data is available for 90 days. However, a single chart can visualize only 30 days. For example, to visualize a 90-day period, you must break it into three 30-day charts within that period.
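The splitting described in this tip can be sketched in a few lines of Python. The function name `chart_windows` and the example dates are hypothetical; the sketch only computes the time ranges you would then select in metrics explorer.

```python
from datetime import datetime, timedelta


def chart_windows(start, end, max_days=30):
    """Split [start, end) into consecutive windows of at most max_days days."""
    windows = []
    cursor = start
    step = timedelta(days=max_days)
    while cursor < end:
        window_end = min(cursor + step, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


# Example: a 90-day retention period ending on an arbitrary date.
end = datetime(2024, 4, 1)
start = end - timedelta(days=90)
for window_start, window_end in chart_windows(start, end):
    print(window_start.date(), "to", window_end.date())
```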
Filtering and splitting
For metrics that support dimensions, you can apply filters using a dimension value. For example, filtering Active Cores for a Cluster Name of cpu-cluster.
You can also split a metric by dimension to visualize how different segments of the metric compare with each other. For example, splitting out the Pipeline Step Type to see a count of the types of steps used in the pipeline.
For more information on filtering and splitting, see Advanced features of Azure Monitor.
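To make the difference between filtering and splitting concrete, here is a minimal Python sketch over in-memory sample points. The data values and both helper functions are hypothetical; metrics explorer performs these operations for you, so this only illustrates the idea.

```python
from collections import defaultdict

# Hypothetical metric samples: (metric name, dimensions, value).
samples = [
    ("Active Cores", {"Cluster Name": "cpu-cluster"}, 8),
    ("Active Cores", {"Cluster Name": "gpu-cluster"}, 24),
    ("Active Cores", {"Cluster Name": "cpu-cluster"}, 4),
]


def filter_by_dimension(points, metric, dimension, value):
    """Keep only points of `metric` whose `dimension` equals `value`."""
    return [p for p in points if p[0] == metric and p[1].get(dimension) == value]


def split_by_dimension(points, metric, dimension):
    """Group the values of `metric` by each distinct value of `dimension`."""
    groups = defaultdict(list)
    for name, dims, val in points:
        if name == metric and dimension in dims:
            groups[dims[dimension]].append(val)
    return dict(groups)


# Filtering narrows the metric to one segment.
print(filter_by_dimension(samples, "Active Cores", "Cluster Name", "cpu-cluster"))
# Splitting keeps every segment but reports each one separately.
print(split_by_dimension(samples, "Active Cores", "Cluster Name"))
```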
Analyzing logs
Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and enable Send information to Log Analytics. For more information, see the Collection and routing section.
Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique properties. Azure Machine Learning stores data in the following tables:
Table | Description |
---|---|
AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters. |
AmlComputeClusterNodeEvent (deprecated) | Events from nodes within an Azure Machine Learning compute cluster. |
AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute. |
AmlComputeInstanceEvent | Events when ML Compute Instance is accessed (read/write). Category includes: ComputeInstanceEvent (very chatty). |
AmlDataLabelEvent | Events when data labels or their projects are accessed (read, created, or deleted). Category includes: DataLabelReadEvent, DataLabelChangeEvent. |
AmlDataSetEvent | Events when a registered or unregistered ML dataset is accessed (read, created, or deleted). Category includes: DataSetReadEvent, DataSetChangeEvent. |
AmlDataStoreEvent | Events when ML datastore is accessed (read, created, or deleted). Category includes: DataStoreReadEvent, DataStoreChangeEvent. |
AmlDeploymentEvent | Events when a model deployment happens on ACI or AKS. Category includes: DeploymentReadEvent, DeploymentEventACI, DeploymentEventAKS. |
AmlInferencingEvent | Events for inference or related operation on AKS or ACI compute type. Category includes: InferencingOperationACI (very chatty), InferencingOperationAKS (very chatty). |
AmlModelsEvent | Events when ML model is accessed (read, created, or deleted). Includes events when models and assets are packaged into ready-to-build packages. Category includes: ModelsReadEvent, ModelsActionEvent. |
AmlPipelineEvent | Events when an ML pipeline draft, endpoint, or module is accessed (read, created, or deleted). Category includes: PipelineReadEvent, PipelineChangeEvent. |
AmlRunEvent | Events when ML experiments are accessed (read, created, or deleted). Category includes: RunReadEvent, RunEvent. |
AmlEnvironmentEvent | Events when ML environment configurations are accessed (read, created, or deleted). Category includes: EnvironmentReadEvent (very chatty), EnvironmentChangeEvent. |
AmlOnlineEndpointTrafficLog | Logs for traffic to online endpoints. |
AmlOnlineEndpointConsoleLog | Logs that the containers for online endpoints write to the console. |
AmlOnlineEndpointEventLog | Logs for events regarding the life cycle of online endpoints. |
Note
Effective February 2022, the AmlComputeClusterNodeEvent table will be deprecated. We recommend that you instead use the AmlComputeClusterEvent table.
Important
When you select Logs from the Azure Machine Learning menu, Log Analytics is opened with the query scope set to the current workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other workspaces or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.
For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring data reference.
Sample Kusto queries
Following are queries that you can use to help you monitor your Azure Machine Learning resources:
Get failed jobs in the last five days:
AmlComputeJobEvent | where TimeGenerated > ago(5d) and EventType == "JobFailed" | project TimeGenerated , ClusterId , EventType , ExecutionState , ToolType
Get records for a specific job name:
AmlComputeJobEvent | where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup" | project TimeGenerated , ClusterId , EventType , ExecutionState , ToolType
Get cluster events in the last five days for clusters where the VM size is Standard_D1_V2:
AmlComputeClusterEvent | where TimeGenerated > ago(5d) and VmSize == "STANDARD_D1_V2" | project ClusterName , InitialNodeCount , MaximumNodeCount , QuotaAllocated , QuotaUtilized
Get the cluster node allocations in the last eight days:
AmlComputeClusterEvent | where TimeGenerated > ago(8d) and TargetNodeCount > CurrentNodeCount | project TimeGenerated, ClusterName, CurrentNodeCount, TargetNodeCount
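If you want to run queries like these from a script rather than the portal, one possible approach is sketched below in Python. `failed_jobs_query` is a hypothetical helper that rebuilds the failed-jobs query with a configurable lookback. `run_query` assumes the `azure-identity` and `azure-monitor-query` packages are installed, that you are already authenticated, and that the query succeeds completely; treat it as a starting point, not a definitive implementation.

```python
from datetime import timedelta


def failed_jobs_query(days=5):
    """Build the failed-jobs query with a configurable lookback in days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        "AmlComputeJobEvent\n"
        f'| where TimeGenerated > ago({days}d) and EventType == "JobFailed"\n'
        "| project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType"
    )


def run_query(log_analytics_workspace_id, query, days):
    """Run a query against a Log Analytics workspace and return rows as lists.

    Assumption: log_analytics_workspace_id is the Log Analytics workspace ID
    (a GUID), not the Azure Machine Learning workspace name.
    """
    # Imported lazily so that building queries works without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        log_analytics_workspace_id, query, timespan=timedelta(days=days)
    )
    # Assumes a fully successful result; partial results need extra handling.
    return [list(row) for table in response.tables for row in table.rows]


print(failed_jobs_query(days=7))
```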
When you connect multiple Azure Machine Learning workspaces to the same Log Analytics workspace, you can query across all resources.
Get number of running nodes across workspaces and clusters in the last day:
AmlComputeClusterEvent | where TimeGenerated > ago(1d) | summarize avgRunningNodes=avg(TargetNodeCount), maxRunningNodes=max(TargetNodeCount) by Workspace=tostring(split(_ResourceId, "/")[8]), ClusterName, ClusterType, VmSize, VmPriority
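The expression `split(_ResourceId, "/")[8]` in the query above relies on the segment layout of an Azure resource ID. The following Python sketch mirrors that indexing to show why index 8 yields the workspace name. The resource ID is a made-up example, and the function name is hypothetical.

```python
def workspace_from_resource_id(resource_id):
    """Return the workspace name, mirroring split(_ResourceId, "/")[8]."""
    parts = resource_id.split("/")
    # parts[0] is "" because the ID starts with "/", so the segments are:
    # 1 subscriptions, 2 <subscription id>, 3 resourcegroups, 4 <group name>,
    # 5 providers, 6 <provider namespace>, 7 workspaces, 8 <workspace name>
    if len(parts) <= 8 or parts[7].lower() != "workspaces":
        raise ValueError(f"not a workspace resource ID: {resource_id!r}")
    return parts[8]


# Made-up resource ID, for illustration only.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourcegroups/example-rg/providers/microsoft.machinelearningservices"
    "/workspaces/example-workspace"
)
print(workspace_from_resource_id(example_id))  # example-workspace
```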
Create a workspace monitoring dashboard by using a template
A dashboard is a focused and organized view of your cloud resources in the Azure portal. For more information about creating dashboards, see Create, view, and manage metric alerts using Azure Monitor.
To deploy a sample dashboard, you can use a publicly available template. The sample dashboard is based on Kusto queries, so you must enable Log Analytics data collection for your Azure Machine Learning workspace before you deploy the dashboard.
Alerts
You can access alerts for Azure Machine Learning by opening Alerts from the Azure Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details on creating alerts.
The following table lists common and recommended metric alert rules for Azure Machine Learning:
Alert type | Condition | Description |
---|---|---|
Model Deploy Failed | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When one or more model deployments have failed |
Quota Utilization Percentage | Aggregation type: Average, Operator: Greater than, Threshold value: 90 | When the quota utilization percentage is greater than 90% |
Unusable Nodes | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When there are one or more unusable nodes |
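To show how the conditions in the table above combine an aggregation type, an operator, and a threshold, here is a small Python sketch that evaluates them over one window of samples. The sample values are hypothetical, and Azure Monitor performs this evaluation itself; the sketch only illustrates the logic of each rule.

```python
from statistics import mean

# Aggregation types used by the rules in the table above.
AGGREGATIONS = {"Total": sum, "Average": mean}

# The three rules from the table: (metric, aggregation type, threshold).
# The operator is "Greater than" for all of them.
RULES = [
    ("Model Deploy Failed", "Total", 0),
    ("Quota Utilization Percentage", "Average", 90),
    ("Unusable Nodes", "Total", 0),
]


def fired_alerts(window):
    """Return the metrics whose aggregated value is greater than the threshold."""
    fired = []
    for metric, aggregation, threshold in RULES:
        values = window.get(metric, [])
        if values and AGGREGATIONS[aggregation](values) > threshold:
            fired.append(metric)
    return fired


# Hypothetical samples collected during one evaluation window.
window = {
    "Model Deploy Failed": [0, 0, 1],
    "Quota Utilization Percentage": [85, 92, 88],
    "Unusable Nodes": [0, 0, 0],
}
print(fired_alerts(window))  # ['Model Deploy Failed']
```

In this example the quota rule does not fire even though one sample exceeds 90, because the rule compares the average over the window (about 88.3) with the threshold.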
Next steps
- For a reference of the logs and metrics, see Monitoring Azure Machine Learning data reference.
- For information on working with quotas related to Azure Machine Learning, see Manage and request quotas for Azure resources.
- For details on monitoring Azure resources, see Monitoring Azure resources with Azure Monitor.