Azure Machine Learning monitoring data reference
This article contains all the monitoring reference information for this service.
See Monitor Machine Learning for details on the data you can collect for Azure Machine Learning and how to use it.
Metrics
This section lists all the automatically collected platform metrics for this service. These metrics are also part of the global list of all platform metrics supported in Azure Monitor.
For information on metric retention, see Azure Monitor Metrics overview.
The resource provider for these metrics is Microsoft.MachineLearningServices/workspaces.
The metrics categories are Model, Quota, Resource, Run, and Traffic. Quota information is for Machine Learning compute only. Run provides information on training runs for the workspace.
Supported metrics for Microsoft.MachineLearningServices/workspaces
The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces resource type.
- Some columns might not be present in every table.
- Some columns might be beyond the viewing area of the page. Select Expand table to view all available columns.
Table headings
- Category - The metrics group or classification.
- Metric - The metric display name as it appears in the Azure portal.
- Name in REST API - The metric name as referred to in the REST API.
- Unit - Unit of measure.
- Aggregation - The default aggregation type. Valid values: Average (Avg), Minimum (Min), Maximum (Max), Total (Sum), Count.
- Dimensions - Dimensions available for the metric.
- Time Grains - Intervals at which the metric is sampled. For example, PT1M indicates that the metric is sampled every minute, PT30M every 30 minutes, PT1H every hour, and so on.
- DS Export - Whether the metric is exportable to Azure Monitor Logs via diagnostic settings. For information on exporting metrics, see Create diagnostic settings in Azure Monitor.
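The time grain values are ISO 8601 durations. As a minimal sketch, the following Python helper converts the time-only forms mentioned above (PT1M, PT30M, PT1H) into `timedelta` objects. The function name `parse_time_grain` is hypothetical, and the pattern deliberately ignores date components such as days.

```python
import re
from datetime import timedelta

# Matches only the time part of an ISO 8601 duration, for example PT1M, PT30M, PT1H.
# Date components (such as P1D) are intentionally not handled in this sketch.
_GRAIN = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")


def parse_time_grain(grain: str) -> timedelta:
    """Convert a time grain such as 'PT1M' into a timedelta."""
    match = _GRAIN.fullmatch(grain)
    # Reject strings that don't match, and a bare 'PT' with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported time grain: {grain!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

For example, `parse_time_grain("PT30M")` yields a 30-minute `timedelta`, which can then be used to compute how many samples to expect over a query window.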
Category | Metric | Name in REST API | Unit | Aggregation | Dimensions | Time Grains | DS Export |
---|---|---|---|---|---|---|---|
Quota | Active Cores: Number of active cores. | Active Cores | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Active Nodes: Number of active nodes. These are the nodes that are actively running a job. | Active Nodes | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Run | Cancel Requested Runs: Number of runs where cancel was requested for this workspace. Count is updated when a cancellation request has been received for a run. | Cancel Requested Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Cancelled Runs: Number of runs cancelled for this workspace. Count is updated when a run is successfully cancelled. | Cancelled Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Completed Runs: Number of runs completed successfully for this workspace. Count is updated when a run has completed and output has been collected. | Completed Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Resource | CpuCapacityMillicores: Maximum capacity of a CPU node in millicores. Capacity is aggregated in one-minute intervals. | CpuCapacityMillicores | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | CpuMemoryCapacityMegabytes: Maximum memory utilization of a CPU node in megabytes. Utilization is aggregated in one-minute intervals. | CpuMemoryCapacityMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | CpuMemoryUtilizationMegabytes: Memory utilization of a CPU node in megabytes. Utilization is aggregated in one-minute intervals. | CpuMemoryUtilizationMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | CpuMemoryUtilizationPercentage: Memory utilization percentage of a CPU node. Utilization is aggregated in one-minute intervals. | CpuMemoryUtilizationPercentage | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | CpuUtilization: Percentage of utilization on a CPU node. Utilization is reported at one-minute intervals. | CpuUtilization | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, runId, NodeId, ClusterName | PT1M | Yes |
Resource | CpuUtilizationMillicores: Utilization of a CPU node in millicores. Utilization is aggregated in one-minute intervals. | CpuUtilizationMillicores | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | CpuUtilizationPercentage: Utilization percentage of a CPU node. Utilization is aggregated in one-minute intervals. | CpuUtilizationPercentage | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | DiskAvailMegabytes: Available disk space in megabytes. Metrics are aggregated in one-minute intervals. | DiskAvailMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | DiskReadMegabytes: Data read from disk in megabytes. Metrics are aggregated in one-minute intervals. | DiskReadMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | DiskUsedMegabytes: Used disk space in megabytes. Metrics are aggregated in one-minute intervals. | DiskUsedMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | DiskWriteMegabytes: Data written to disk in megabytes. Metrics are aggregated in one-minute intervals. | DiskWriteMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Run | Errors: Number of run errors in this workspace. Count is updated whenever a run encounters an error. | Errors | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario | PT1M | Yes |
Run | Failed Runs: Number of runs failed for this workspace. Count is updated when a run fails. | Failed Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Finalizing Runs: Number of runs that entered the Finalizing state for this workspace. Count is updated when a run has completed but output collection is still in progress. | Finalizing Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Resource | GpuCapacityMilliGPUs: Maximum capacity of a GPU device in milli-GPUs. Capacity is aggregated in one-minute intervals. | GpuCapacityMilliGPUs | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | GpuEnergyJoules: Interval energy in joules on a GPU node. Energy is reported at one-minute intervals. | GpuEnergyJoules | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, runId, rootRunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | GpuMemoryCapacityMegabytes: Maximum memory capacity of a GPU device in megabytes. Capacity is aggregated at one-minute intervals. | GpuMemoryCapacityMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | GpuMemoryUtilization: Percentage of memory utilization on a GPU node. Utilization is reported at one-minute intervals. | GpuMemoryUtilization | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, runId, NodeId, DeviceId, ClusterName | PT1M | Yes |
Resource | GpuMemoryUtilizationMegabytes: Memory utilization of a GPU device in megabytes. Utilization is aggregated at one-minute intervals. | GpuMemoryUtilizationMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | GpuMemoryUtilizationPercentage: Memory utilization percentage of a GPU device. Utilization is aggregated at one-minute intervals. | GpuMemoryUtilizationPercentage | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | GpuUtilization: Percentage of utilization on a GPU node. Utilization is reported at one-minute intervals. | GpuUtilization | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, runId, NodeId, DeviceId, ClusterName | PT1M | Yes |
Resource | GpuUtilizationMilliGPUs: Utilization of a GPU device in milli-GPUs. Utilization is aggregated in one-minute intervals. | GpuUtilizationMilliGPUs | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | GpuUtilizationPercentage: Utilization percentage of a GPU device. Utilization is aggregated in one-minute intervals. | GpuUtilizationPercentage | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, DeviceId, ComputeName | PT1M | Yes |
Resource | IBReceiveMegabytes: Network data received over InfiniBand in megabytes. Metrics are aggregated in one-minute intervals. | IBReceiveMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName, DeviceId | PT1M | Yes |
Resource | IBTransmitMegabytes: Network data sent over InfiniBand in megabytes. Metrics are aggregated in one-minute intervals. | IBTransmitMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName, DeviceId | PT1M | Yes |
Quota | Idle Cores: Number of idle cores. | Idle Cores | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Idle Nodes: Number of idle nodes. Idle nodes are not running any jobs but can accept a new job if one is available. | Idle Nodes | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Leaving Cores: Number of leaving cores. | Leaving Cores | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Leaving Nodes: Number of leaving nodes. Leaving nodes just finished processing a job and will go to the Idle state. | Leaving Nodes | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Model | Model Deploy Failed: Number of model deployments that failed in this workspace. | Model Deploy Failed | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, StatusCode | PT1M | Yes |
Model | Model Deploy Started: Number of model deployments started in this workspace. | Model Deploy Started | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario | PT1M | Yes |
Model | Model Deploy Succeeded: Number of model deployments that succeeded in this workspace. | Model Deploy Succeeded | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario | PT1M | Yes |
Model | Model Register Failed: Number of model registrations that failed in this workspace. | Model Register Failed | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, StatusCode | PT1M | Yes |
Model | Model Register Succeeded: Number of model registrations that succeeded in this workspace. | Model Register Succeeded | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario | PT1M | Yes |
Resource | NetworkInputMegabytes: Network data received in megabytes. Metrics are aggregated in one-minute intervals. | NetworkInputMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName, DeviceId | PT1M | Yes |
Resource | NetworkOutputMegabytes: Network data sent in megabytes. Metrics are aggregated in one-minute intervals. | NetworkOutputMegabytes | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName, DeviceId | PT1M | Yes |
Run | Not Responding Runs: Number of runs not responding for this workspace. Count is updated when a run enters the Not Responding state. | Not Responding Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Not Started Runs: Number of runs in the Not Started state for this workspace. Count is updated when a request is received to create a run but run information has not yet been populated. | Not Started Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Quota | Preempted Cores: Number of preempted cores. | Preempted Cores | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Preempted Nodes: Number of preempted nodes. These are low-priority nodes that are taken away from the available node pool. | Preempted Nodes | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Run | Preparing Runs: Number of runs that are preparing for this workspace. Count is updated when a run enters the Preparing state while the run environment is being prepared. | Preparing Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Provisioning Runs: Number of runs that are provisioning for this workspace. Count is updated when a run is waiting on compute target creation or provisioning. | Provisioning Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Queued Runs: Number of runs that are queued for this workspace. Count is updated when a run is queued in the compute target. This can occur when waiting for required compute nodes to be ready. | Queued Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Quota | Quota Utilization Percentage: Percent of quota utilized. | Quota Utilization Percentage | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName, VmFamilyName, VmPriority | PT1M | Yes |
Run | Started Runs: Number of runs running for this workspace. Count is updated when a run starts running on required resources. | Started Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Run | Starting Runs: Number of runs started for this workspace. Count is updated after a request to create a run is received and run info, such as the Run Id, has been populated. | Starting Runs | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario, RunType, PublishedPipelineId, ComputeType, PipelineStepType, ExperimentName | PT1M | Yes |
Resource | StorageAPIFailureCount: Azure Blob Storage API calls failure count. | StorageAPIFailureCount | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Resource | StorageAPISuccessCount: Azure Blob Storage API calls success count. | StorageAPISuccessCount | Count | Average, Maximum, Minimum, Total (Sum) | RunId, InstanceId, ComputeName | PT1M | Yes |
Quota | Total Cores: Number of total cores. | Total Cores | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Total Nodes: Number of total nodes. This total includes some of Active Nodes, Idle Nodes, Unusable Nodes, Preempted Nodes, Leaving Nodes. | Total Nodes | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Unusable Cores: Number of unusable cores. | Unusable Cores | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Quota | Unusable Nodes: Number of unusable nodes. Unusable nodes are not functional due to some unresolvable issue. Azure will recycle these nodes. | Unusable Nodes | Count | Average, Maximum, Minimum, Total (Sum) | Scenario, ClusterName | PT1M | Yes |
Run | Warnings: Number of run warnings in this workspace. Count is updated whenever a run encounters a warning. | Warnings | Count | Total (Sum), Average, Minimum, Maximum, Count | Scenario | PT1M | Yes |
Supported metrics for Microsoft.MachineLearningServices/workspaces/onlineEndpoints
The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces/onlineEndpoints resource type.
Category | Metric | Name in REST API | Unit | Aggregation | Dimensions | Time Grains | DS Export |
---|---|---|---|---|---|---|---|
Traffic | Connections Active: The total number of concurrent TCP connections active from clients. | ConnectionsActive | Count | Average | <none> | PT1M | No |
Traffic | Data Collection Errors Per Minute: The number of data collection events dropped per minute. | DataCollectionErrorsPerMinute | Count | Minimum, Maximum, Average | deployment, reason, type | PT1M | No |
Traffic | Data Collection Events Per Minute: The number of data collection events processed per minute. | DataCollectionEventsPerMinute | Count | Minimum, Maximum, Average | deployment, type | PT1M | No |
Traffic | Network Bytes: The bytes per second served for the endpoint. | NetworkBytes | BytesPerSecond | Average | <none> | PT1M | No |
Traffic | New Connections Per Second: The average number of new TCP connections per second established from clients. | NewConnectionsPerSecond | CountPerSecond | Average | <none> | PT1M | No |
Traffic | Request Latency: The average complete interval of time taken to respond to a request, in milliseconds. | RequestLatency | Milliseconds | Average | deployment | PT1M | Yes |
Traffic | Request Latency P50: The average P50 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P50 | Milliseconds | Average | deployment | PT1M | Yes |
Traffic | Request Latency P90: The average P90 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P90 | Milliseconds | Average | deployment | PT1M | Yes |
Traffic | Request Latency P95: The average P95 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P95 | Milliseconds | Average | deployment | PT1M | Yes |
Traffic | Request Latency P99: The average P99 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P99 | Milliseconds | Average | deployment | PT1M | Yes |
Traffic | Requests Per Minute: The number of requests sent to the online endpoint within a minute. | RequestsPerMinute | Count | Average | deployment, statusCode, statusCodeClass, modelStatusCode | PT1M | No |
Supported metrics for Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments
The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments resource type.
Category | Metric | Name in REST API | Unit | Aggregation | Dimensions | Time Grains | DS Export |
---|---|---|---|---|---|---|---|
Resource | CPU Memory Utilization Percentage: Percentage of memory utilization on an instance. Utilization is reported at one-minute intervals. | CpuMemoryUtilizationPercentage | Percent | Minimum, Maximum, Average | instanceId | PT1M | Yes |
Resource | CPU Utilization Percentage: Percentage of CPU utilization on an instance. Utilization is reported at one-minute intervals. | CpuUtilizationPercentage | Percent | Minimum, Maximum, Average | instanceId | PT1M | Yes |
Resource | Data Collection Errors Per Minute: The number of data collection events dropped per minute. | DataCollectionErrorsPerMinute | Count | Minimum, Maximum, Average | instanceId, reason, type | PT1M | No |
Resource | Data Collection Events Per Minute: The number of data collection events processed per minute. | DataCollectionEventsPerMinute | Count | Minimum, Maximum, Average | instanceId, type | PT1M | No |
Resource | Deployment Capacity: The number of instances in the deployment. | DeploymentCapacity | Count | Minimum, Maximum, Average | instanceId, State | PT1M | No |
Resource | Disk Utilization: Percentage of disk utilization on an instance. Utilization is reported at one-minute intervals. | DiskUtilization | Percent | Minimum, Maximum, Average | instanceId, disk | PT1M | Yes |
Resource | GPU Energy in Joules: Interval energy in joules on a GPU node. Energy is reported at one-minute intervals. | GpuEnergyJoules | Count | Minimum, Maximum, Average | instanceId | PT1M | No |
Resource | GPU Memory Utilization Percentage: Percentage of GPU memory utilization on an instance. Utilization is reported at one-minute intervals. | GpuMemoryUtilizationPercentage | Percent | Minimum, Maximum, Average | instanceId | PT1M | Yes |
Resource | GPU Utilization Percentage: Percentage of GPU utilization on an instance. Utilization is reported at one-minute intervals. | GpuUtilizationPercentage | Percent | Minimum, Maximum, Average | instanceId | PT1M | Yes |
Traffic | Request Latency P50: The average P50 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P50 | Milliseconds | Average | <none> | PT1M | Yes |
Traffic | Request Latency P90: The average P90 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P90 | Milliseconds | Average | <none> | PT1M | Yes |
Traffic | Request Latency P95: The average P95 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P95 | Milliseconds | Average | <none> | PT1M | Yes |
Traffic | Request Latency P99: The average P99 request latency aggregated by all request latency values collected over the selected time period. | RequestLatency_P99 | Milliseconds | Average | <none> | PT1M | Yes |
Traffic | Requests Per Minute: The number of requests sent to the online deployment within a minute. | RequestsPerMinute | Count | Average | envoy_response_code | PT1M | No |
Metric dimensions
For information about what metric dimensions are, see Multi-dimensional metrics.
This service has the following dimensions associated with its metrics.
Dimension | Description |
---|---|
Cluster Name | The name of the compute cluster resource. Available for all quota metrics. |
Vm Family Name | The name of the VM family used by the cluster. Available for quota utilization percentage. |
Vm Priority | The priority of the VM. Available for quota utilization percentage. |
CreatedTime | Only available for CpuUtilization and GpuUtilization. |
DeviceId | ID of the device (GPU). Only available for GpuUtilization. |
NodeId | ID of the node created where job is running. Only available for CpuUtilization and GpuUtilization. |
RunId | ID of the run/job. Only available for CpuUtilization and GpuUtilization. |
ComputeType | The compute type that the run used. Only available for Completed runs, Failed runs, and Started runs. |
PipelineStepType | The type of PipelineStep used in the run. Only available for Completed runs, Failed runs, and Started runs. |
PublishedPipelineId | The ID of the published pipeline used in the run. Only available for Completed runs, Failed runs, and Started runs. |
RunType | The type of run. Only available for Completed runs, Failed runs, and Started runs. |
The valid values for the RunType dimension are:
Value | Description |
---|---|
Experiment | Non-pipeline runs. |
PipelineRun | A pipeline run, which is the parent of a StepRun. |
StepRun | A run for a pipeline step. |
ReusedStepRun | A run for a pipeline step that reuses a previous run. |
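When a metrics query splits or filters on the RunType dimension, only the four values above are meaningful. As a minimal sketch, the following Python helper validates the requested values and joins them into a filter expression. The function name is hypothetical, and the `eq` / `or` syntax is assumed to follow the Azure Monitor metrics `$filter` convention, which this article doesn't spell out.

```python
# The RunType values listed in the table above.
VALID_RUN_TYPES = ("Experiment", "PipelineRun", "StepRun", "ReusedStepRun")


def run_type_filter(*run_types: str) -> str:
    """Return a dimension filter that matches any of the given RunType values."""
    if not run_types:
        raise ValueError("at least one RunType value is required")
    unknown = [rt for rt in run_types if rt not in VALID_RUN_TYPES]
    if unknown:
        raise ValueError(f"unknown RunType value(s): {', '.join(unknown)}")
    # Assumed syntax: one "eq" clause per value, combined with "or".
    return " or ".join(f"RunType eq '{rt}'" for rt in run_types)
```

For example, `run_type_filter("PipelineRun", "StepRun")` produces `RunType eq 'PipelineRun' or RunType eq 'StepRun'`, which restricts a run metric such as Completed Runs to pipeline-related runs.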
Resource logs
This section lists the types of resource logs you can collect for this service. The section pulls from the list of all resource logs category types supported in Azure Monitor.
Supported resource logs for Microsoft.MachineLearningServices/registries
Category | Category display name | Log table | Supports basic log plan | Supports ingestion-time transformation | Example queries | Costs to export |
---|---|---|---|---|---|---|
RegistryAssetReadEvent | Registry Asset Read Event | | No | No | | Yes |
RegistryAssetWriteEvent | Registry Asset Write Event | AmlRegistryWriteEventsLog: Azure ML Registry Write events log. It keeps records of Write operations with registries data access (data plane), including user identity, asset name and version for each access event. | No | No | Queries | Yes |
Supported resource logs for Microsoft.MachineLearningServices/workspaces
Category | Category display name | Log table | Supports basic log plan | Supports ingestion-time transformation | Example queries | Costs to export |
---|---|---|---|---|---|---|
AmlComputeClusterEvent | AmlComputeClusterEvent | AmlComputeClusterEvent: AmlCompute Cluster events. | No | Yes | Queries | No |
AmlComputeClusterNodeEvent | AmlComputeClusterNodeEvent | | No | No | | Yes |
AmlComputeCpuGpuUtilization | AmlComputeCpuGpuUtilization | AmlComputeCpuGpuUtilization: Azure Machine Learning services CPU and GPU utilization logs. | No | Yes | Queries | No |
AmlComputeJobEvent | AmlComputeJobEvent | AmlComputeJobEvent: AmlCompute Job events. | No | Yes | Queries | No |
AmlRunStatusChangedEvent | AmlRunStatusChangedEvent | AmlRunStatusChangedEvent: Azure Machine Learning services run status event logs. | No | Yes | | No |
ComputeInstanceEvent | ComputeInstanceEvent | AmlComputeInstanceEvent: Events when ML Compute Instance is accessed (read/write). | No | Yes | | Yes |
DataLabelChangeEvent | DataLabelChangeEvent | AmlDataLabelEvent: Events when data labels or their projects are accessed (read, created, or deleted). | No | Yes | | Yes |
DataLabelReadEvent | DataLabelReadEvent | AmlDataLabelEvent: Events when data labels or their projects are accessed (read, created, or deleted). | No | Yes | | Yes |
DataSetChangeEvent | DataSetChangeEvent | AmlDataSetEvent: Events when a registered or unregistered ML datastore is accessed (read, created, or deleted). | No | Yes | Queries | Yes |
DataSetReadEvent | DataSetReadEvent | AmlDataSetEvent: Events when a registered or unregistered ML datastore is accessed (read, created, or deleted). | No | Yes | Queries | Yes |
DataStoreChangeEvent | DataStoreChangeEvent | AmlDataStoreEvent: Events when ML datastore is accessed (read, created, or deleted). | No | Yes | | Yes |
DataStoreReadEvent | DataStoreReadEvent | AmlDataStoreEvent: Events when ML datastore is accessed (read, created, or deleted). | No | Yes | | Yes |
DeploymentEventACI | DeploymentEventACI | AmlDeploymentEvent: Events when a model deployment happens on ACI or AKS. | No | Yes | | Yes |
DeploymentEventAKS | DeploymentEventAKS | AmlDeploymentEvent: Events when a model deployment happens on ACI or AKS. | No | Yes | | Yes |
DeploymentReadEvent | DeploymentReadEvent | AmlDeploymentEvent: Events when a model deployment happens on ACI or AKS. | No | Yes | | Yes |
EnvironmentChangeEvent | EnvironmentChangeEvent | AmlEnvironmentEvent: Events when ML environments are accessed (read, created, or deleted). | No | Yes | Queries | Yes |
EnvironmentReadEvent | EnvironmentReadEvent | AmlEnvironmentEvent: Events when ML environments are accessed (read, created, or deleted). | No | Yes | Queries | Yes |
InferencingOperationACI | InferencingOperationACI | AmlInferencingEvent: Events for inference or related operation on AKS or ACI compute type. | No | Yes | | Yes |
InferencingOperationAKS | InferencingOperationAKS | AmlInferencingEvent: Events for inference or related operation on AKS or ACI compute type. | No | Yes | | Yes |
ModelsActionEvent | ModelsActionEvent | AmlModelsEvent: Events when ML model is accessed (read, created, or deleted). Includes events when models and assets are packaged into ready-to-build packages. | No | Yes | Queries | Yes |
ModelsChangeEvent | ModelsChangeEvent | AmlModelsEvent: Events when ML model is accessed (read, created, or deleted). Includes events when models and assets are packaged into ready-to-build packages. | No | Yes | Queries | Yes |
ModelsReadEvent | ModelsReadEvent | AmlModelsEvent: Events when ML model is accessed (read, created, or deleted). Includes events when models and assets are packaged into ready-to-build packages. | No | Yes | Queries | Yes |
PipelineChangeEvent | PipelineChangeEvent | AmlPipelineEvent: Events when ML pipeline draft or endpoint or module are accessed (read, created, or deleted). | No | Yes | | Yes |
PipelineReadEvent | PipelineReadEvent | AmlPipelineEvent: Events when ML pipeline draft or endpoint or module are accessed (read, created, or deleted). | No | Yes | | Yes |
RunEvent | RunEvent | AmlRunEvent: Events when ML experiments are accessed (read, created, or deleted). | No | Yes | | Yes |
RunReadEvent | RunReadEvent | AmlRunEvent: Events when ML experiments are accessed (read, created, or deleted). | No | Yes | | Yes |
Supported resource logs for Microsoft.MachineLearningServices/workspaces/onlineEndpoints
Category | Category display name | Log table | Supports basic log plan | Supports ingestion-time transformation | Example queries | Costs to export |
---|---|---|---|---|---|---|
AmlOnlineEndpointConsoleLog | AmlOnlineEndpointConsoleLog | AmlOnlineEndpointConsoleLog: Azure ML online endpoints console logs. It provides console logs output from user containers. | No | Yes | Queries | Yes |
AmlOnlineEndpointEventLog | AmlOnlineEndpointEventLog | AmlOnlineEndpointEventLog: Azure ML online endpoints event logs. It provides event logs regarding the inference-server container's life cycle. | No | No | Queries | Yes |
AmlOnlineEndpointTrafficLog | AmlOnlineEndpointTrafficLog | AmlOnlineEndpointTrafficLog: Traffic logs for AzureML (machine learning) online endpoints. The table could be used to check the detailed information of the request to an online endpoint. For example, you could use it to check the request duration, the request failure reason, etc. | No | No | Queries | Yes |
Azure Monitor Logs tables
This section lists the Azure Monitor Logs tables relevant to this service, which are available for query by Log Analytics using Kusto queries. The tables contain resource log data and possibly more depending on what is collected and routed to them.
Machine Learning
Microsoft.MachineLearningServices/workspaces
- AzureActivity
- AMLOnlineEndpointConsoleLog
- AMLOnlineEndpointTrafficLog
- AMLOnlineEndpointEventLog
- AzureMetrics
- AMLComputeClusterEvent
- AMLComputeClusterNodeEvent
- AMLComputeJobEvent
- AMLRunStatusChangedEvent
- AMLComputeCpuGpuUtilization
- AMLComputeInstanceEvent
- AMLDataLabelEvent
- AMLDataSetEvent
- AMLDataStoreEvent
- AMLDeploymentEvent
- AMLEnvironmentEvent
- AMLInferencingEvent
- AMLModelsEvent
- AMLPipelineEvent
- AMLRunEvent
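Because these tables are queried with Kusto, it can help to generate simple queries programmatically. The following is a minimal Python sketch, not an official client: the helper name `build_recent_rows_query` and the `KNOWN_TABLES` set are hypothetical, and the table names are copied from the list above. Kusto identifiers are case-sensitive, so confirm the exact casing in your own Log Analytics workspace before you run the generated query.

```python
from datetime import timedelta

# Table names copied from the list above (assumption: casing matches your workspace).
KNOWN_TABLES = {
    "AMLComputeJobEvent",
    "AMLComputeClusterEvent",
    "AMLOnlineEndpointConsoleLog",
}


def build_recent_rows_query(table: str, lookback: timedelta, limit: int = 50) -> str:
    """Return a Kusto query for the newest rows of `table` within `lookback`."""
    if table not in KNOWN_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    minutes = int(lookback.total_seconds() // 60)
    if minutes <= 0:
        raise ValueError("lookback must be at least one minute")
    # TimeGenerated is listed in the schemas of all three tables above.
    return (
        f"{table}\n"
        f"| where TimeGenerated > ago({minutes}m)\n"
        f"| order by TimeGenerated desc\n"
        f"| take {limit}"
    )


print(build_recent_rows_query("AMLComputeJobEvent", timedelta(hours=1), limit=10))
```

The returned string can be pasted into Log Analytics or passed to whichever query client you use. Validating the table name against a fixed set keeps arbitrary text out of the query.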
Microsoft.MachineLearningServices/registries
Activity log
The linked table lists the operations that can be recorded in the activity log for this service. These operations are a subset of all the possible resource provider operations in the activity log.
For more information on the schema of activity log entries, see Activity Log schema.
The following table lists some operations related to Machine Learning that may be created in the activity log. For a complete listing of Microsoft.MachineLearningServices operations, see Microsoft.MachineLearningServices resource provider operations.
Operation | Description |
---|---|
Creates or updates a Machine Learning workspace | A workspace was created or updated |
CheckComputeNameAvailability | Check if a compute name is already in use |
Creates or updates the compute resources | A compute resource was created or updated |
Deletes the compute resources | A compute resource was deleted |
List secrets | An operation listed secrets for a Machine Learning workspace |
Log schemas
Azure Machine Learning uses the following schemas.
AmlComputeJobEvent table
Property | Description |
---|---|
TimeGenerated | Time when the log entry was generated |
OperationName | Name of the operation associated with the log event |
Category | Name of the log event |
JobId | ID of the Job submitted |
ExperimentId | ID of the Experiment |
ExperimentName | Name of the Experiment |
CustomerSubscriptionId | SubscriptionId where the Experiment and Job were submitted |
WorkspaceName | Name of the machine learning workspace |
ClusterName | Name of the Cluster |
ProvisioningState | State of the Job submission |
ResourceGroupName | Name of the resource group |
JobName | Name of the Job |
ClusterId | ID of the cluster |
EventType | Type of the Job event. For example, JobSubmitted, JobRunning, JobFailed, JobSucceeded. |
ExecutionState | State of the job (the Run). For example, Queued, Running, Succeeded, Failed |
ErrorDetails | Details of job error |
CreationApiVersion | API version used to create the job |
ClusterResourceGroupName | Resource group name of the cluster |
TFWorkerCount | Count of TF workers |
TFParameterServerCount | Count of TF parameter servers |
ToolType | Type of tool used |
RunInContainer | Flag describing if job should be run inside a container |
JobErrorMessage | Detailed message of the Job error |
NodeId | ID of the node created where job is running |
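As an illustration of how the properties above might be used after rows are exported, the following Python sketch counts failed jobs per cluster. It assumes each row is available as a dictionary keyed by the property names in the table. The function name and the sample rows are hypothetical and invented for illustration only.

```python
from collections import Counter
from typing import Iterable, Mapping


def failed_jobs_per_cluster(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rows whose EventType is JobFailed, grouped by ClusterName."""
    counts: Counter = Counter()
    for row in rows:
        if row.get("EventType") == "JobFailed":
            counts[row.get("ClusterName", "<unknown>")] += 1
    return counts


# Hypothetical sample rows, not real log data.
sample_rows = [
    {"JobId": "job-1", "ClusterName": "cpu-cluster", "EventType": "JobSubmitted"},
    {"JobId": "job-1", "ClusterName": "cpu-cluster", "EventType": "JobFailed"},
    {"JobId": "job-2", "ClusterName": "gpu-cluster", "EventType": "JobSucceeded"},
    {"JobId": "job-3", "ClusterName": "cpu-cluster", "EventType": "JobFailed"},
]

print(failed_jobs_per_cluster(sample_rows))
```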
AmlComputeClusterEvent table
Property | Description |
---|---|
TimeGenerated | Time when the log entry was generated |
OperationName | Name of the operation associated with the log event |
Category | Name of the log event |
ProvisioningState | Provisioning state of the cluster |
ClusterName | Name of the cluster |
ClusterType | Type of the cluster |
CreatedBy | User who created the cluster |
CoreCount | Count of the cores in the cluster |
VmSize | VM size of the cluster |
VmPriority | Priority of the nodes created inside a cluster (Dedicated or LowPriority) |
ScalingType | Type of cluster scaling (manual or auto) |
InitialNodeCount | Initial node count of the cluster |
MinimumNodeCount | Minimum node count of the cluster |
MaximumNodeCount | Maximum node count of the cluster |
NodeDeallocationOption | How the node should be deallocated |
Publisher | Publisher of the cluster type |
Offer | Offer with which the cluster is created |
Sku | SKU of the Node/VM created inside cluster |
Version | Version of the image used while Node/VM is created |
SubnetId | SubnetId of the cluster |
AllocationState | Cluster allocation state |
CurrentNodeCount | Current node count of the cluster |
TargetNodeCount | Target node count of the cluster while scaling up/down |
EventType | Type of event during cluster creation. |
NodeIdleTimeSecondsBeforeScaleDown | Idle time in seconds before cluster is scaled down |
PreemptedNodeCount | Preempted node count of the cluster |
IsResizeGrow | Flag indicating that cluster is scaling up |
VmFamilyName | Name of the VM family of the nodes that can be created inside cluster |
LeavingNodeCount | Leaving node count of the cluster |
UnusableNodeCount | Unusable node count of the cluster |
IdleNodeCount | Idle node count of the cluster |
RunningNodeCount | Running node count of the cluster |
PreparingNodeCount | Preparing node count of the cluster |
QuotaAllocated | Allocated quota to the cluster |
QuotaUtilized | Utilized quota of the cluster |
AllocationStateTransitionTime | Transition time from one state to another |
ClusterErrorCodes | Error code received during cluster creation or scaling |
CreationApiVersion | API version used while creating the cluster |
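The node-count properties above can be combined into a rough utilization figure. The sketch below divides RunningNodeCount by CurrentNodeCount for a single row. Treating CurrentNodeCount as the total that RunningNodeCount is measured against is an assumption made for illustration, not something the schema states, and the function name and sample row are hypothetical.

```python
from typing import Mapping, Optional, Union

Number = Union[int, str]


def running_node_fraction(row: Mapping[str, Number]) -> Optional[float]:
    """Return RunningNodeCount / CurrentNodeCount, or None for an empty cluster."""
    current = int(row["CurrentNodeCount"])
    if current <= 0:
        # Avoid dividing by zero when the cluster has scaled down to no nodes.
        return None
    return int(row["RunningNodeCount"]) / current


# Hypothetical row, not real log data.
sample_row = {
    "ClusterName": "cpu-cluster",
    "CurrentNodeCount": 4,
    "RunningNodeCount": 3,
    "IdleNodeCount": 1,
}

print(running_node_fraction(sample_row))
```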
AmlComputeInstanceEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlComputeInstanceEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
CorrelationId | A GUID used to group together a set of related events, when applicable. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlComputeInstanceName | The name of the compute instance associated with the log entry. |
AmlDataLabelEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlDataLabelEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
CorrelationId | A GUID used to group together a set of related events, when applicable. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlProjectId | The unique identifier of the Azure Machine Learning project. |
AmlProjectName | The name of the Azure Machine Learning project. |
AmlLabelNames | The label class names which are created for the project. |
AmlDataStoreName | The name of the data store where the project's data is stored. |
AmlDataSetEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlDataSetEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlDatasetId | The ID of the Azure Machine Learning Data Set. |
AmlDatasetName | The name of the Azure Machine Learning Data Set. |
AmlDataStoreEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlDataStoreEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlDatastoreName | The name of the Azure Machine Learning Data Store. |
AmlDeploymentEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlDeploymentEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlServiceName | The name of the Azure Machine Learning Service. |
AmlInferencingEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlInferencingEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlServiceName | The name of the Azure Machine Learning Service. |
AmlModelsEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlModelsEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
ResultSignature | The HTTP status code of the event. Typical values include 200, 201, 202 etc. |
AmlModelName | The name of the Azure Machine Learning Model. |
AmlPipelineEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlPipelineEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
AmlWorkspaceId | The name of the Azure Machine Learning workspace. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlModuleId | A GUID and unique ID of the module. |
AmlModelName | The name of the Azure Machine Learning Model. |
AmlPipelineId | The ID of the Azure Machine Learning pipeline. |
AmlParentPipelineId | The ID of the parent Azure Machine Learning pipeline (in the case of cloning). |
AmlPipelineDraftId | The ID of the Azure Machine Learning pipeline draft. |
AmlPipelineDraftName | The name of the Azure Machine Learning pipeline draft. |
AmlPipelineEndpointId | The ID of the Azure Machine Learning pipeline endpoint. |
AmlPipelineEndpointName | The name of the Azure Machine Learning pipeline endpoint. |
AmlRunEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlRunEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
RunId | The unique ID of the run. |
AmlEnvironmentEvent table
Property | Description |
---|---|
Type | Name of the log event, AmlEnvironmentEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlEnvironmentName | The name of the Azure Machine Learning environment configuration. |
AmlEnvironmentVersion | The name of the Azure Machine Learning environment configuration version. |
AMLOnlineEndpointTrafficLog table (preview)
Property | Description |
---|---|
Method | The requested method from client. |
Path | The requested path from client. |
SubscriptionId | The machine learning subscription ID of the online endpoint. |
AzureMLWorkspaceId | The machine learning workspace ID of the online endpoint. |
AzureMLWorkspaceName | The machine learning workspace name of the online endpoint. |
EndpointName | The name of the online endpoint. |
DeploymentName | The name of the online deployment. |
Protocol | The protocol of the request. |
ResponseCode | The final response code returned to the client. |
ResponseCodeReason | The final response code reason returned to the client. |
ModelStatusCode | The response status code from model. |
ModelStatusReason | The response status reason from model. |
RequestPayloadSize | The total bytes received from the client. |
ResponsePayloadSize | The total bytes sent back to the client. |
UserAgent | The user-agent header of the request, including comments but truncated to a maximum of 70 characters. |
XRequestId | The request ID generated by Azure Machine Learning for internal tracing. |
XMSClientRequestId | The tracking ID generated by the client. |
TotalDurationMs | Duration in milliseconds from the request start time to the last response byte sent back to the client. If the client disconnected, it measures from the start time to client disconnect time. |
RequestDurationMs | Duration in milliseconds from the request start time to the last byte of the request received from the client. |
ResponseDurationMs | Duration in milliseconds from the request start time to the first response byte read from the model. |
RequestThrottlingDelayMs | Delay in milliseconds in request data transfer due to network throttling. |
ResponseThrottlingDelayMs | Delay in milliseconds in response data transfer due to network throttling. |
For more information on this log, see Monitor online endpoints.
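As a sketch of how the traffic log properties might be summarized once rows are exported, the following Python example computes a request count, an error rate, and the mean of TotalDurationMs. It assumes ResponseCode holds an HTTP status code and treats any value of 400 or higher as a failure. That threshold, the function name, and the sample rows are assumptions made for illustration.

```python
from statistics import mean
from typing import Any, Dict, Iterable, Mapping, Optional


def summarize_traffic(rows: Iterable[Mapping[str, Any]]) -> Dict[str, Optional[float]]:
    """Summarize request count, error rate, and mean TotalDurationMs."""
    rows = list(rows)
    if not rows:
        return {"requests": 0, "error_rate": None, "mean_total_ms": None}
    # Assumption: response codes of 400 and above count as failed requests.
    errors = sum(1 for row in rows if int(row["ResponseCode"]) >= 400)
    return {
        "requests": len(rows),
        "error_rate": errors / len(rows),
        "mean_total_ms": mean(float(row["TotalDurationMs"]) for row in rows),
    }


# Hypothetical sample rows, not real log data.
sample_rows = [
    {"DeploymentName": "blue", "ResponseCode": 200, "TotalDurationMs": 10},
    {"DeploymentName": "blue", "ResponseCode": 200, "TotalDurationMs": 30},
    {"DeploymentName": "blue", "ResponseCode": 500, "TotalDurationMs": 50},
    {"DeploymentName": "blue", "ResponseCode": 429, "TotalDurationMs": 10},
]

print(summarize_traffic(sample_rows))
```

A mean hides tail latency, so a percentile of TotalDurationMs is often a more useful figure when sizing or troubleshooting a deployment.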
AMLOnlineEndpointConsoleLog
Property | Description |
---|---|
TimeGenerated | The timestamp (UTC) of when the log was generated. |
OperationName | The operation associated with log record. |
InstanceId | The ID of the instance that generated this log record. |
DeploymentName | The name of the deployment associated with the log record. |
ContainerName | The name of the container where the log was generated. |
Message | The content of the log. |
For more information on this log, see Monitor online endpoints.
AMLOnlineEndpointEventLog (preview)
Property | Description |
---|---|
TimeGenerated | The timestamp (UTC) of when the log was generated. |
OperationName | The operation associated with log record. |
InstanceId | The ID of the instance that generated this log record. |
DeploymentName | The name of the deployment associated with the log record. |
Name | The name of the event. |
Message | The content of the event. |
For more information on this log, see Monitor online endpoints.
Related content
- See Monitor Machine Learning for a description of monitoring Machine Learning.
- See Monitor Azure resources with Azure Monitor for details on monitoring Azure resources.