Diagnostics logs and metrics for Workflow Orchestration Manager
Note
Workflow Orchestration Manager is powered by Apache Airflow.
This article walks you through the steps to:
- Enable diagnostics logs and metrics for Workflow Orchestration Manager in Azure Data Factory.
- View logs and metrics.
- Run a query.
- Monitor metrics and set the alert system in directed acyclic graph (DAG) failure.
Prerequisites
You need an Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
Enable diagnostics logs and metrics for Workflow Orchestration Manager
Open your Data Factory resource and select Diagnostic settings on the leftmost pane. Then select Add diagnostic setting.
Fill out the Diagnostic settings name. Select the following categories for the Airflow logs:
- Airflow task execution logs
- Airflow worker logs
- Airflow DAG processing logs
- Airflow scheduler logs
- Airflow web logs
- If you select AllMetrics, various Data Factory metrics are made available for you to monitor or raise alerts on. These metrics include the metrics for Data Factory activity and the Workflow Orchestration Manager integration runtime, such as
AirflowIntegrationRuntimeCpuUsage
andAirflowIntegrationRuntimeMemory
.
Under Destination details, select the Send to Log Analytics workspace checkbox.
Select Save.
View logs
After you add diagnostic settings, you can find them listed in the Diagnostic setting section. To access and view logs, select the Log Analytics workspace that you configured.
Under the section Maximize your Log Analytics experience, select View logs.
You're directed to your Log Analytics workspace where you can see that the tables you selected were imported into the workspace automatically.
Other useful links for the schema:
- Azure Monitor Logs reference - ADFAirflowSchedulerLogs | Microsoft Learn
- Azure Monitor Logs reference - ADFAirflowTaskLogs | Microsoft Learn
- Azure Monitor Logs reference - ADFAirflowWebLogs | Microsoft Learn
- Azure Monitor Logs reference - ADFAirflowWorkerLogs | Microsoft Learn
- Azure Monitor Logs reference - AirflowDagProcessingLogs | Microsoft Learn
Write a query
Let's start with the simplest query that returns all the records in
ADFAirflowTaskLogs
. You can double-click the table name to add it to a query window. You can also enter the table name directly in the window.To narrow down your search results, such as filtering them based on a specific task ID, you can use the following query:
ADFAirflowTaskLogs | where DagId == "<your_dag_id>" and TaskId == "<your_task_id>"
Similarly, you can create custom queries according to your needs by using any tables available in LogManagement
.
For more information, see:
Monitor metrics
Data Factory offers comprehensive metrics for Airflow integration runtimes, allowing you to effectively monitor the performance of your Airflow integration runtime and establish alerting mechanisms as needed.
Open your Data Factory resource.
In the leftmost pane, under the Monitoring section, select Metrics.
Select the Scope > Metric Namespace > Metric you want to monitor.
Review the multiline chart that visualizes the Integration Runtime CPU Percentage and Integration Runtime Dag Bag Size.
You can set up an alert rule that triggers when your metrics meet specific conditions. For more information, see Overview of Azure Monitor alerts.
Select Save to dashboard after your chart is finished or else your chart disappears.
Airflow metrics
The following table lists the metrics available for Workflow Orchestration Manager. The table headings are:
- Metric: The metric display name as it appears in the Azure portal.
- Name in REST API: The metric name as referred to in the REST API.
- Description: A description of the metric.
- Unit: Unit of measure.
- Aggregation: The default aggregation type. Valid values are Average, Minimum, Maximum, Total, and Count.
- Dimensions: Dimensions available for the metric.
- Time grains: Intervals at which the metric is sampled. For example, PT1M indicates that the metric is sampled every minute, PT30M every 30 minutes, PT1H every hour, and so on.
- DS export: Whether the metric is exportable to Azure Monitor Logs via diagnostic settings.
Metric | Name in REST API | Description | Unit | Aggregation | Dimensions | Time grains | DS export |
---|---|---|---|---|---|---|---|
Airflow Integration Runtime Celery Task Timeout Error | AirflowIntegrationRuntimeCeleryTaskTimeoutError |
Number of AirflowTaskTimeout errors raised when publishing Task to Celery Broker. |
Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Collect DB Dags | AirflowIntegrationRuntimeCollectDBDags |
Milliseconds taken for fetching all Serialized DAGs from database. | Milliseconds | Average | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Cpu Percentage | AirflowIntegrationRuntimeCpuPercentage |
CPU usage percentage of the Airflow integration runtime. | Percent | Average | IntegrationRuntimeName , ContainerName |
PT1M | No |
Airflow Integration Runtime Memory Usage | AirflowIntegrationRuntimeCpuUsage |
Millicores consumed by Airflow integration runtime, indicating the CPU resources used in thousandths of a CPU core. | Millicores | Average | IntegrationRuntimeName , ContainerName |
PT1M | Yes |
Airflow Integration Runtime Dag Bag Size | AirflowIntegrationRuntimeDagBagSize |
Number of DAGs found when the scheduler ran a scan based on its configuration. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Dag Callback Exceptions | AirflowIntegrationRuntimeDagCallbackExceptions |
Number of exceptions raised from DAG callbacks. When exceptions occur, it means DAG callback isn't working. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG File Refresh Error | AirflowIntegrationRuntimeDAGFileRefreshError |
Number of failures loading any DAG files. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG Processing Import Errors | AirflowIntegrationRuntimeDAGProcessingImportErrors |
Number of errors from trying to parse DAG files. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG Processing Last Duration | AirflowIntegrationRuntimeDAGProcessingLastDuration |
Seconds taken to load the specific DAG file. | Milliseconds | Average | IntegrationRuntimeName , DagFile |
PT1M | No |
Airflow Integration Runtime DAG Processing Last Run Seconds Ago | AirflowIntegrationRuntimeDAGProcessingLastRunSecondsAgo |
Seconds since <dag_file> was last processed. | Seconds | Average | IntegrationRuntimeName , DagFile |
PT1M | No |
Airflow Integration Runtime DAG ProcessingManager Stalls | AirflowIntegrationRuntimeDAGProcessingManagerStalls |
Number of stalled DagFileProcessorManager . |
Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG Processing Processes | AirflowIntegrationRuntimeDAGProcessingProcesses |
Relative number of currently running DAG parsing processes. (For example, this delta is negative when, since the last metric was sent, processes were completed.) | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG Processing Processor Timeouts | AirflowIntegrationRuntimeDAGProcessingProcessorTimeouts |
Number of file processors that were killed because they took too long. | Seconds | Average | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG Processing Total Parse Time | AirflowIntegrationRuntimeDAGProcessingTotalParseTime |
Seconds taken to scan and import dag_processing.file_path_queue_size DAG files. |
Seconds | Average | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime DAG Run Dependency Check | AirflowIntegrationRuntimeDAGRunDependencyCheck |
Milliseconds taken to check DAG dependencies. | Milliseconds | Average | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime DAG Run Duration Failed | AirflowIntegrationRuntimeDAGRunDurationFailed |
Seconds taken for a DagRun to reach failed state. |
Milliseconds | Average | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime DAG Run Duration Success | AirflowIntegrationRuntimeDAGRunDurationSuccess |
Seconds taken for a DagRun to reach success state. |
Milliseconds | Average | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime DAG Run First Task Scheduling Delay | AirflowIntegrationRuntimeDAGRunFirstTaskSchedulingDelay |
Seconds elapsed between the first task start_date and the DagRun expected start. |
Milliseconds | Average | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime DAG Run Schedule Delay | AirflowIntegrationRuntimeDAGRunScheduleDelay |
Seconds of delay between the scheduled DagRun start date and the actual DagRun start date. |
Milliseconds | Average | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime Executor Open Slots | AirflowIntegrationRuntimeExecutorOpenSlots |
Number of open slots on the executor. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Executor Queued Tasks | AirflowIntegrationRuntimeExecutorQueuedTasks |
Number of queued tasks on the executor. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Executor Running Tasks | AirflowIntegrationRuntimeExecutorRunningTasks |
Number of running tasks on the executor. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Job End | AirflowIntegrationRuntimeJobEnd |
Number of ended <job_name> job, for example, SchedulerJob and LocalTaskJob . |
Count | Total | IntegrationRuntimeName , Job |
PT1M | No |
Airflow Integration Runtime Heartbeat Failure | AirflowIntegrationRuntimeJobHeartbeatFailure |
Number of failed Heartbeats for a <job_name> job, for example, SchedulerJob and LocalTaskJob . |
Count | Total | IntegrationRuntimeName , Job |
PT1M | No |
Airflow Integration Runtime Job Start | AirflowIntegrationRuntimeJobStart |
Number of started <job_name> jobs, for example, SchedulerJob and LocalTaskJob . |
Count | Total | IntegrationRuntimeName , Job |
PT1M | No |
Airflow Integration Runtime Memory Percentage | AirflowIntegrationRuntimeMemoryPercentage |
Memory Percentage used by Airflow integration runtime environments. | Percent | Average | IntegrationRuntimeName , ContainerName |
PT1M | Yes |
Airflow Integration Runtime Node Count | AirflowIntegrationRuntimeNodeCount |
Count | Average | IntegrationRuntimeName , ComputeNodeSize |
PT1M | Yes | |
Airflow Integration Runtime Operator Failures | AirflowIntegrationRuntimeOperatorFailures |
Total operator failures. | Count | Total | IntegrationRuntimeName , Operator |
PT1M | No |
Airflow Integration Runtime Operator Successes | AirflowIntegrationRuntimeOperatorSuccesses |
Total operator successes. | Count | Total | IntegrationRuntimeName , Operator |
PT1M | No |
Airflow Integration Runtime Pool Open Slots | AirflowIntegrationRuntimePoolOpenSlots |
Number of open slots in the pool. | Count | Total | IntegrationRuntimeName , Pool |
PT1M | No |
Airflow Integration Runtime Pool Queued Slots | AirflowIntegrationRuntimePoolQueuedSlots |
Number of queued slots in the pool. | Count | Total | IntegrationRuntimeName , Pool |
PT1M | No |
Airflow Integration Runtime Pool Running Slots | AirflowIntegrationRuntimePoolRunningSlots |
Number of running slots in the pool. | Count | Total | IntegrationRuntimeName , Pool |
PT1M | No |
Airflow Integration Runtime Pool Starving Tasks | AirflowIntegrationRuntimePoolStarvingTasks |
Number of starving tasks in the pool. | Count | Total | IntegrationRuntimeName , Pool |
PT1M | No |
Airflow Integration Runtime Scheduler Critical Section Busy | AirflowIntegrationRuntimeSchedulerCriticalSectionBusy |
Count of times a scheduler process tried to get a lock on the critical section (needed to send tasks to the executor) and found it locked by another process. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Critical Section Duration | AirflowIntegrationRuntimeSchedulerCriticalSectionDuration |
Milliseconds spent in the critical section of a scheduler loop. Only a single scheduler can enter this loop at a time. | Milliseconds | Average | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Failed SLA Email Attempts | AirflowIntegrationRuntimeSchedulerFailedSLAEmailAttempts |
Number of failed SLA miss email notification attempts. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Heartbeats | AirflowIntegrationRuntimeSchedulerHeartbeat |
Scheduler heartbeats. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Orphaned Tasks Adopted | AirflowIntegrationRuntimeSchedulerOrphanedTasksAdopted |
Number of orphaned tasks adopted by the Scheduler. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Orphaned Tasks Cleared | AirflowIntegrationRuntimeSchedulerOrphanedTasksCleared |
Number of orphaned tasks cleared by the Scheduler. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Tasks Executable | AirflowIntegrationRuntimeSchedulerTasksExecutable |
Number of tasks that are ready for execution (set to queued) with respect to pool limits, DAG concurrency, executor state, and priority. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Tasks Killed Externally | AirflowIntegrationRuntimeSchedulerTasksKilledExternally |
Number of tasks killed externally. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Scheduler Tasks Running | AirflowIntegrationRuntimeSchedulerTasksRunning |
Count | Total | IntegrationRuntimeName |
PT1M | No | |
Airflow Integration Runtime Scheduler Tasks Starving | AirflowIntegrationRuntimeSchedulerTasksStarving |
Number of tasks that can't be scheduled because of no open slot in the pool. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Started Task Instances | AirflowIntegrationRuntimeStartedTaskInstances |
Count | Total | IntegrationRuntimeName , DagId , TaskId |
PT1M | No | |
Airflow Integration Runtime Task Instance Created Using Operator | AirflowIntegrationRuntimeTaskInstanceCreatedUsingOperator |
Number of task instances created for a specific operator. | Count | Total | IntegrationRuntimeName , Operator |
PT1M | No |
Airflow Integration Runtime Task Instance Duration | AirflowIntegrationRuntimeTaskInstanceDuration |
Milliseconds | Average | IntegrationRuntimeName , DagId , TaskID |
PT1M | No | |
Airflow Integration Runtime Task Instance Failures | AirflowIntegrationRuntimeTaskInstanceFailures |
Overall task instances failures. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Task Instance Finished | AirflowIntegrationRuntimeTaskInstanceFinished |
Overall task instances finished. | Count | Total | IntegrationRuntimeName , DagId , TaskId , State |
PT1M | No |
Airflow Integration Runtime Task Instance Previously Succeeded | AirflowIntegrationRuntimeTaskInstancePreviouslySucceeded |
Number of previously succeeded task instances. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Task Instance Successes | AirflowIntegrationRuntimeTaskInstanceSuccesses |
Overall task instance successes. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Task Removed From DAG | AirflowIntegrationRuntimeTaskRemovedFromDAG |
Number of tasks removed for a specific DAG. (That is, the task no longer exists in DAG.) | Count | Total | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime Task Restored To DAG | AirflowIntegrationRuntimeTaskRestoredToDAG |
Number of tasks restored for a specific DAG. (That is, a task instance that was previously in a REMOVED state in the database is added to a DAG file.) | Count | Total | IntegrationRuntimeName , DagId |
PT1M | No |
Airflow Integration Runtime Triggers Blocked Main Thread | AirflowIntegrationRuntimeTriggersBlockedMainThread |
Number of triggers that blocked the main thread (likely because they weren't fully asynchronous). | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Triggers Failed | AirflowIntegrationRuntimeTriggersFailed |
Number of triggers that errored before they could fire an event. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Triggers Running | AirflowIntegrationRuntimeTriggersRunning |
Number of triggers currently running for a triggerer (described by hostname). | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Triggers Succeeded | AirflowIntegrationRuntimeTriggersSucceeded |
Number of triggers that fired at least one event. | Count | Total | IntegrationRuntimeName |
PT1M | No |
Airflow Integration Runtime Zombie Tasks Killed | AirflowIntegrationRuntimeZombiesKilled |
Zombie tasks killed. | Count | Total | IntegrationRuntimeName |
PT1M | No |
For more information, see Supported metrics for Microsoft.DataFactory/factories.
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for