Monitor Azure Machine Learning

When you have critical applications and business processes relying on Azure resources, you want to monitor those resources for their availability, performance, and operation. This article describes the monitoring data generated by Azure Machine Learning and how to analyze and alert on this data with Azure Monitor.

Tip

The information in this document is primarily for administrators, as it describes monitoring for the Azure Machine Learning service and associated Azure services. If you are a data scientist or developer, and want to monitor information specific to your model training runs, see the following documents:

If you want to monitor information generated by models deployed to online endpoints, see Monitor online endpoints.

What is Azure Monitor?

Azure Machine Learning creates monitoring data using Azure Monitor, which is a full stack monitoring service in Azure. Azure Monitor provides a complete set of features to monitor your Azure resources. It can also monitor resources in other clouds and on-premises.

Start with the article Monitoring Azure resources with Azure Monitor, which describes the following concepts:

  • What is Azure Monitor?
  • Costs associated with monitoring
  • Monitoring data collected in Azure
  • Configuring data collection
  • Standard tools in Azure for analyzing and alerting on monitoring data

The following sections build on this article by describing the specific data gathered for Azure Machine Learning. These sections also provide examples for configuring data collection and analyzing this data with Azure tools.

Tip

To understand costs associated with Azure Monitor, see Azure Monitor cost and usage. To understand the time it takes for your data to appear in Azure Monitor, see Log data ingestion time.

Monitoring data from Azure Machine Learning

Azure Machine Learning collects the same kinds of monitoring data as other Azure resources that are described in Monitoring data from Azure resources.

See Azure Machine Learning monitoring data reference for a detailed reference of the logs and metrics created by Azure Machine Learning.

Collection and routing

Tip

Logs are grouped into Category groups. Category groups are a collection of different logs to help you achieve different monitoring goals. These groups are defined dynamically and may change over time as new resource logs become available and are added to the category group. Note that this may incur additional charges.

The audit resource log category group allows you to select the resource logs that are necessary for auditing your resource. For more information, see Diagnostic settings in Azure Monitor Resource logs.

Platform metrics and the Activity log are collected and stored automatically, but can be routed to other locations by using a diagnostic setting.

Resource Logs are not collected and stored until you create a diagnostic setting and route them to one or more locations. When you need to manage multiple Azure Machine Learning workspaces, you could route logs for all workspaces into the same logging destination and query all logs from a single place.

See Create diagnostic setting to collect platform logs and metrics in Azure for the detailed process for creating a diagnostic setting using the Azure portal, the Azure CLI, or PowerShell. When you create a diagnostic setting, you specify which categories of logs to collect. The categories for Azure Machine Learning are listed in Azure Machine Learning monitoring data reference.

Important

Enabling these settings requires additional Azure services (storage account, event hub, or Log Analytics), which may increase your cost. To calculate an estimated cost, visit the Azure pricing calculator.

You can configure the following logs for Azure Machine Learning:

Category Description
AmlComputeClusterEvent Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent (deprecated) Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent Events from jobs running on Azure Machine Learning compute.
AmlComputeCpuGpuUtilization ML services compute CPU and GPU utilization logs.
AmlOnlineEndpointTrafficLog Logs for traffic to online endpoints.
AmlOnlineEndpointConsoleLog Logs that the containers for online endpoints write to the console.
AmlOnlineEndpointEventLog Logs for events regarding the life cycle of online endpoints.
AmlRunStatusChangedEvent ML run status changes.
ModelsChangeEvent Events when ML model is accessed created or deleted.
ModelsReadEvent Events when ML model is read.
ModelsActionEvent Events when ML model is accessed.
DeploymentReadEvent Events when a model deployment is read.
DeploymentEventACI Events when a model deployment happens on ACI (very chatty).
DeploymentEventAKS Events when a model deployment happens on AKS (very chatty).
InferencingOperationAKS Events for inference or related operation on AKS compute type.
InferencingOperationACI Events for inference or related operation on ACI compute type.
EnvironmentChangeEvent Events when ML environment configurations are created or deleted.
EnvironmentReadEvent Events when ML environment configurations are read (very chatty).
DataLabelChangeEvent Events when data label(s) or its projects is created or deleted.
DataLabelReadEvent Events when data label(s) or its projects is read.
ComputeInstanceEvent Events when ML Compute Instance is accessed (very chatty).
DataStoreChangeEvent Events when ML datastore is created or deleted.
DataStoreReadEvent Events when ML datastore is read.
DataSetChangeEvent Events when ML datastore is created or deleted.
DataSetReadEvent Events when ML datastore is read.
PipelineChangeEvent Events when ML pipeline draft or endpoint or module are created or deleted.
PipelineReadEvent Events when ML pipeline draft or endpoint or module are read.
RunEvent Events when ML experiments are created or deleted.
RunReadEvent Events when ML experiments are read.

Note

Effective February 2022, the AmlComputeClusterNodeEvent category will be deprecated. We recommend that you instead use the AmlComputeClusterEvent category.

Note

When you enable metrics in a diagnostic setting, dimension information is not currently included as part of the information sent to a storage account, event hub, or log analytics.

The metrics and logs you can collect are discussed in the following sections.

Analyzing metrics

You can analyze metrics for Azure Machine Learning, along with metrics from other Azure services, by opening Metrics from the Azure Monitor menu. See Analyze metrics with Azure Monitor metrics explorer for details on using this tool.

For a list of the platform metrics collected, see Monitoring Azure Machine Learning data reference metrics.

All metrics for Azure Machine Learning are in the namespace Machine Learning Service Workspace.

Metrics Explorer with Machine Learning Service Workspace selected

For reference, you can see a list of all resource metrics supported in Azure Monitor.

Tip

Azure Monitor metrics data is available for 90 days. However, when creating charts only 30 days can be visualized. For example, if you want to visualize a 90 day period, you must break it into three charts of 30 days within the 90 day period.

Filtering and splitting

For metrics that support dimensions, you can apply filters using a dimension value. For example, filtering Active Cores for a Cluster Name of cpu-cluster.

You can also split a metric by dimension to visualize how different segments of the metric compare with each other. For example, splitting out the Pipeline Step Type to see a count of the types of steps used in the pipeline.

For more information of filtering and splitting, see Advanced features of Azure Monitor.

Analyzing logs

Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and enable Send information to Log Analytics. For more information, see the Collection and routing section.

Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique properties. Azure Machine Learning stores data in the following tables:

Table Description
AmlComputeClusterEvent Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent (deprecated) Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent Events from jobs running on Azure Machine Learning compute.
AmlComputeInstanceEvent Events when ML Compute Instance is accessed (read/write). Category includes:ComputeInstanceEvent (very chatty).
AmlDataLabelEvent Events when data label(s) or its projects is accessed (read, created, or deleted). Category includes:DataLabelReadEvent,DataLabelChangeEvent.
AmlDataSetEvent Events when a registered or unregistered ML dataset is accessed (read, created, or deleted). Category includes:DataSetReadEvent,DataSetChangeEvent.
AmlDataStoreEvent Events when ML datastore is accessed (read, created, or deleted). Category includes:DataStoreReadEvent,DataStoreChangeEvent.
AmlDeploymentEvent Events when a model deployment happens on ACI or AKS. Category includes:DeploymentReadEvent,DeploymentEventACI,DeploymentEventAKS.
AmlInferencingEvent Events for inference or related operation on AKS or ACI compute type. Category includes:InferencingOperationACI (very chatty),InferencingOperationAKS (very chatty).
AmlModelsEvent Events when ML model is accessed (read, created, or deleted). Includes events when packaging of models and assets happen into ready-to-build packages. Category includes:ModelsReadEvent,ModelsActionEvent .
AmlPipelineEvent Events when ML pipeline draft or endpoint or module are accessed (read, created, or deleted).Category includes:PipelineReadEvent,PipelineChangeEvent.
AmlRunEvent Events when ML experiments are accessed (read, created, or deleted). Category includes:RunReadEvent,RunEvent.
AmlEnvironmentEvent Events when ML environment configurations (read, created, or deleted). Category includes:EnvironmentReadEvent (very chatty),EnvironmentChangeEvent.
AmlOnlineEndpointTrafficLog Logs for traffic to online endpoints.
AmlOnlineEndpointConsoleLog Logs that the containers for online endpoints write to the console.
AmlOnlineEndpointEventLog Logs for events regarding the life cycle of online endpoints.

Note

Effective February 2022, the AmlComputeClusterNodeEvent table will be deprecated. We recommend that you instead use the AmlComputeClusterEvent table.

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics is opened with the query scope set to the current workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other databases or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring data reference.

Sample Kusto queries

Important

When you select Logs from the [service-name] menu, Log Analytics is opened with the query scope set to the current Azure Machine Learning workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other workspaces or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

Following are queries that you can use to help you monitor your Azure Machine Learning resources:

  • Get failed jobs in the last five days:

    AmlComputeJobEvent
    | where TimeGenerated > ago(5d) and EventType == "JobFailed"
    | project  TimeGenerated , ClusterId , EventType , ExecutionState , ToolType
    
  • Get records for a specific job name:

    AmlComputeJobEvent
    | where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup"
    | project  TimeGenerated , ClusterId , EventType , ExecutionState , ToolType
    
  • Get cluster events in the last five days for clusters where the VM size is Standard_D1_V2:

    AmlComputeClusterEvent
    | where TimeGenerated > ago(4d) and VmSize == "STANDARD_D1_V2"
    | project  ClusterName , InitialNodeCount , MaximumNodeCount , QuotaAllocated , QuotaUtilized
    
  • Get the cluster node allocations in the last eight days::

    AmlComputeClusterEvent
    | where TimeGenerated > ago(8d) and TargetNodeCount  > CurrentNodeCount
    | project TimeGenerated, ClusterName, CurrentNodeCount, TargetNodeCount
    

When you connect multiple Azure Machine Learning workspaces to the same Log Analytics workspace, you can query across all resources.

  • Get number of running nodes across workspaces and clusters in the last day:

    AmlComputeClusterEvent
    | where TimeGenerated > ago(1d)
    | summarize avgRunningNodes=avg(TargetNodeCount), maxRunningNodes=max(TargetNodeCount)
             by Workspace=tostring(split(_ResourceId, "/")[8]), ClusterName, ClusterType, VmSize, VmPriority
    

Create a workspace monitoring dashboard by using a template

A dashboard is a focused and organized view of your cloud resources in the Azure portal. For more information about creating dashboards, see Create, view, and manage metric alerts using Azure Monitor.

To deploy a sample dashboard, you can use a publicly available template. The sample dashboard is based on Kusto queries, so you must enable Log Analytics data collection for your Azure Machine Learning workspace before you deploy the dashboard.

Alerts

You can access alerts for Azure Machine Learning by opening Alerts from the Azure Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details on creating alerts.

The following table lists common and recommended metric alert rules for Azure Machine Learning:

Alert type Condition Description
Model Deploy Failed Aggregation type: Total, Operator: Greater than, Threshold value: 0 When one or more model deployments have failed
Quota Utilization Percentage Aggregation type: Average, Operator: Greater than, Threshold value: 90 When the quota utilization percentage is greater than 90%
Unusable Nodes Aggregation type: Total, Operator: Greater than, Threshold value: 0 When there are one or more unusable nodes

Next steps