Hi Peter Yacoub,
Thanks for reaching out to Microsoft Q&A.
The issue you're experiencing with adf managed airflow tasks failing without running or logs can be challenging. I am listing out the potential causes that might occur, pls check these once.
- Executor/Worker Failures: As you suspected, if the Airflow workers or executors have died and were not replaced, this could lead to tasks failing without logs. This situation can occur if the workers are overwhelmed or if there's a resource allocation issue.
- Resource Constraints: Although you mentioned that the average CPU usage peaked at 50% and memory usage at 70%, these metrics can sometimes be misleading. If there are spikes in resource usage that exceed the limits during peak times, it could lead to worker failures.
- Auto-Scaling Configuration: Your auto-scaling configuration with a minimum of 3 and a maximum of 8 nodes might not be sufficient during high loads. If tasks are queued and the system cannot scale up quickly enough, it may lead to failures.
- Database Connection Issues: Even if the health check shows that the metadata database, scheduler, and trigger are healthy, intermittent connectivity issues could still cause task failures.
Below are the recommended solutions. You can try each one of these to test & narrow down if in case your issue is getting better.
- Consider increasing the number of nodes or changing the node size to a larger instance type to provide more resources for your tasks.
- Make sure that there are no dependencies between tasks that could lead to deadlocks or resource contention.
- Set up more detailed logging and monitoring for your Airflow instance. This can help identify patterns or specific times when failures occur. Use Azure Monitor or Application Insights to track performance metrics.
- Verify the configuration of your Airflow workers. Ensure they are set up to handle the expected load and that there are no misconfigurations that could lead to failures.
- Implement a robust retry logic in your DAGs. If a task fails, ensure that it has a proper retry mechanism before it is marked as failed.
- As a temporary measure, restarting the Airflow instance can help clear any transient issues. However, this should not be a long-term solution.
Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.