Send distributed training logs to Azure Application Insights

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Azure Machine Learning simplifies debugging and observability in distributed training scenarios. Training jobs generate multiple log files, often one per worker, which makes error diagnosis cumbersome. For example, a 10-node cluster with eight GPUs per node can produce 80 separate log files. You can now send these logs to a central Azure Application Insights AppTraces table, which enables fast, query-based detection of errors and exceptions.

Key benefits:

  • Centralized log access: Aggregates stdout and stderr from all workers into Application Insights.

  • Searchable logs: Use Kusto queries to filter errors, warnings, or custom patterns.

  • Improved debuggability: Reduces time spent manually inspecting multiple files.

  • Configurable retention and billing: Logs are retained for 90 days by default in an AppTraces table of type Analytics. Ingestion is billed by log volume, and retention beyond 90 days can be configured at additional cost. For more information, see Manage data retention.

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

  • An Azure Machine Learning workspace. For steps to create a workspace, see Create workspace resources.

  • An Application Insights resource configured to support local authentication for writing traces. Microsoft Entra ID-based authentication isn't supported yet.

  • A compute cluster that has network access to the linked Application Insights workspace.

  • Your Azure Machine Learning workspace must not be a Hub workspace.

  • The Log Analytics Reader role assigned in the Log Analytics workspace, so that you can query and search logs. For more information, see Manage access to Log Analytics workspaces.

Enable log forwarding to Application Insights

Set the AZUREML_COMMON_RUNTIME_USE_APPINSIGHTS_CAPABILITY environment variable to true in your training job configuration.

In Azure Machine Learning studio, add the environment variable when you configure your job:

  1. Go to your job configuration.
  2. In the Environment variables section, add the following values:
    • Name: AZUREML_COMMON_RUNTIME_USE_APPINSIGHTS_CAPABILITY
    • Value: true

Screenshot of portal environment variable configuration.
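If you submit jobs with the Azure CLI ml extension v2, you can set the same environment variable in the job YAML instead. A minimal sketch of a distributed command job; the code path, command, environment, and compute names are placeholders to replace with your own:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# Placeholder training script and assets; substitute your own.
code: ./src
command: python train.py
environment: azureml:my-training-environment@latest
compute: azureml:my-gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 8
resources:
  instance_count: 10
environment_variables:
  # Enables forwarding of per-worker stdout/stderr to Application Insights.
  AZUREML_COMMON_RUNTIME_USE_APPINSIGHTS_CAPABILITY: "true"
```

Submit the job as usual, for example with `az ml job create --file job.yml`.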

Query training job logs

After configuring log forwarding, you can query your training logs in Application Insights.

  1. Go to the job overview page in Azure Machine Learning studio.

  2. Select the Job Logs link.

    Screenshot of job overview page with Job Logs link.

  3. You're taken to an Application Insights workspace with a default query filtered by the job ID.

    Screenshot of application insights workspace with default query.

  4. Logs are written following the AppTraces schema. Edit the query to search for errors, exceptions, or other points of interest across nodes.

    Screenshot of query editor for searching logs.
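As a starting point, you can extend the default query to surface only error lines across all nodes. A sketch; the time window and match terms are examples to adjust, and column casing can differ between the Application Insights and Log Analytics query experiences:

```kusto
AppTraces
| where timestamp > ago(1d)
| where message has_any ("ERROR", "Exception", "Traceback")
| project timestamp, message, customDimensions
| order by timestamp asc
```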

Useful log fields

The most useful fields in the AppTraces table are:

  • timestamp – Timestamp of the log message
  • message – The log line from your training code
  • customDimensions – JSON with useful fields like job ID, source file name, source node, and more
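When you export query results (for example, as JSON), customDimensions arrives as a JSON-encoded string. A minimal Python sketch of decoding it to group log lines by worker; the key names shown (job_id, node_rank, file_name) are illustrative, so inspect your own records for the actual keys:

```python
import json

# One exported AppTraces row. The customDimensions keys shown here are
# illustrative; check your own records for the actual key names.
row = {
    "timestamp": "2025-01-15T10:32:11Z",
    "message": "RuntimeError: CUDA out of memory",
    "customDimensions": '{"job_id": "khaki_tree_abc123", "node_rank": "3", "file_name": "std_log_process_3.txt"}',
}

# customDimensions is a JSON-encoded string; decode it to a dict first.
dims = json.loads(row["customDimensions"])

# Tag each log line with its source node so lines can be grouped per worker.
print(f'{row["timestamp"]} [node {dims["node_rank"]}] {row["message"]}')
```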

Verify log ingestion

To verify that Application Insights receives your logs:

  1. Submit a test training job with the environment variable configured.
  2. Wait for the job to start running.
  3. Go to the job overview page and select the Job Logs link.
  4. Confirm that log traces appear in the Application Insights query results.
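A quick way to confirm ingestion is a summary query over a recent window. A sketch; adjust the window to cover when your job ran, and keep the default job ID filter if one is present:

```kusto
AppTraces
| where timestamp > ago(1h)
| summarize count() by bin(timestamp, 5m)
```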

If you don't see logs, check the troubleshooting section.

Troubleshooting

  • If you don't see any log messages for an older job, modify the default query. By default, it only searches the last few days of logs in Application Insights.

  • Verify that the environment variable is set by checking Job Overview > Job YAML.

  • Confirm that the workspace is linked to Application Insights: open the workspace resource in the Azure portal and check that the Application Insights field is populated.

  • Ensure the compute cluster has network access to the default Application Insights workspace linked to the Azure Machine Learning workspace.

  • Inspect appinsights-capability.log in system job logs for errors.

Next steps