ADF managed Airflow tasks fail without running or logs

Peter Yacoub 0 Reputation points
2024-07-21T12:41:08.4033333+00:00

Hi,

I have a managed Airflow instance inside Azure Data Factory. On a roughly semi-weekly basis, tasks scheduled to run suddenly fail with no logs. Whenever I retry any task, regardless of how heavy or light it is, it instantly fails again with no logs and no record of a run at all.

Hitting the "/health" endpoint returns that the metadata database, scheduler, and triggerer are all healthy and sending heartbeats.
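For reference, I poll that endpoint roughly like this (a minimal sketch; the base URL is a placeholder for my instance, and I'm assuming the endpoint is reachable without extra authentication from where the script runs):

```python
import requests

# Placeholder: replace with your managed Airflow instance's endpoint.
AIRFLOW_BASE_URL = "https://<your-airflow-endpoint>"

# In Airflow 2.6.x, /health reports the metadatabase, scheduler,
# and triggerer status as JSON.
resp = requests.get(f"{AIRFLOW_BASE_URL}/health", timeout=10)
resp.raise_for_status()

for component, state in resp.json().items():
    # Each component dict has a "status" field; the scheduler and
    # triggerer also report their latest heartbeat timestamps.
    print(component, "->", state)
```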

My main guess is that the Airflow workers/executors died at some point and were not replaced.

I looked through the metrics: average CPU usage peaked at 50% and average memory usage peaked at 70%.

I am running an auto-scaling configuration with a minimum of 3 and a maximum of 8 "Small" nodes.

Right now the only option I seem to have is to restart the Airflow instance as a whole and then manually go through each Airflow task and restart it.

I would appreciate any help or guidance on how to fix this issue.

I am running Airflow 2.6.3, which seems to be the only version available.


1 answer

  1. Vinodh247 34,661 Reputation points MVP Volunteer Moderator
    2024-07-21T16:24:44.76+00:00

    Hi Peter Yacoub,

    Thanks for reaching out to Microsoft Q&A.

    The issue you're experiencing with ADF managed Airflow tasks failing without running or producing logs can be challenging to diagnose. Below are the potential causes; please check each of them.

    • Executor/Worker Failures: As you suspected, if the Airflow workers or executors have died and were not replaced, this could lead to tasks failing without logs. This situation can occur if the workers are overwhelmed or if there's a resource allocation issue.
    • Resource Constraints: Although you mentioned that the average CPU usage peaked at 50% and memory usage at 70%, these metrics can sometimes be misleading. If there are spikes in resource usage that exceed the limits during peak times, it could lead to worker failures.
    • Auto-Scaling Configuration: Your auto-scaling configuration with a minimum of 3 and a maximum of 8 nodes might not be sufficient during high loads. If tasks are queued and the system cannot scale up quickly enough, it may lead to failures.
    • Database Connection Issues: Even if the health check shows that the metadata database, scheduler, and triggerer are healthy, intermittent connectivity issues could still cause task failures.
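
    To narrow down the "failed with no run record" symptom, you can query the Airflow stable REST API and check whether the failed task instances ever actually started. This is a minimal sketch: the base URL, credentials, and DAG id are placeholders, and I am assuming the /api/v1 REST API is reachable with basic auth on your instance; adjust the auth to whatever your deployment uses.

```python
import requests

BASE = "https://<your-airflow-endpoint>/api/v1"  # placeholder
AUTH = ("<user>", "<password>")                  # placeholder credentials
DAG_ID = "<your_dag_id>"                         # placeholder DAG id

# Fetch the most recent DAG runs.
runs = requests.get(
    f"{BASE}/dags/{DAG_ID}/dagRuns",
    params={"limit": 5, "order_by": "-execution_date"},
    auth=AUTH, timeout=30,
).json()["dag_runs"]

for run in runs:
    tis = requests.get(
        f"{BASE}/dags/{DAG_ID}/dagRuns/{run['dag_run_id']}/taskInstances",
        auth=AUTH, timeout=30,
    ).json()["task_instances"]
    for ti in tis:
        # A task marked "failed" with no start_date never reached a worker,
        # which points at the executor/worker layer rather than the task code.
        if ti["state"] == "failed" and ti["start_date"] is None:
            print(run["dag_run_id"], ti["task_id"], "failed without starting")
```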

    Below are recommended steps. Try each one to test and narrow down the cause.

    1. Consider increasing the number of nodes or changing the node size to a larger instance type to provide more resources for your tasks.
    2. Make sure that there are no dependencies between tasks that could lead to deadlocks or resource contention.
    3. Set up more detailed logging and monitoring for your Airflow instance. This can help identify patterns or specific times when failures occur. Use Azure Monitor or Application Insights to track performance metrics.
    4. Verify the configuration of your Airflow workers. Ensure they are set up to handle the expected load and that there are no misconfigurations that could lead to failures.
    5. Implement robust retry logic in your DAGs. If a task fails, ensure it has a proper retry mechanism before it is marked as failed (see the sketch after this list).
    6. As a temporary measure, restarting the Airflow instance can help clear any transient issues. However, this should not be a long-term solution.
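
    As a minimal sketch of point 5, here is a DAG with retry settings so that a transient worker loss does not immediately leave a task permanently failed. The DAG id, schedule, and timings are illustrative, not prescriptive.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,                             # retry up to 3 times
    "retry_delay": timedelta(minutes=5),      # wait between attempts
    "retry_exponential_backoff": True,        # back off on repeated failures
    "execution_timeout": timedelta(hours=1),  # fail hung tasks explicitly
}

with DAG(
    dag_id="example_with_retries",  # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="do_work",
        python_callable=lambda: print("work"),
    )
```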

    Please 'Upvote' (Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

