Azure ML jobs hanging if we use too many processes. (Python)

Nicolas Petitclerc 20 Reputation points
2025-01-29T09:03:38.4633333+00:00

We optimized a slow part of our job using multiprocessing.Pool (Python). We noticed that jobs are consistently getting stuck, with no activity, forever (many hours longer than it should take, until we kill the job).
We are running our jobs on a cluster of up to 4 nodes, with VMs:

Standard_D13_v2 (8 cores, 56 GB RAM, 400 GB disk)

We noticed that reducing the number of workers in the pool to 4 worked consistently (we initially assumed we could use all 8 cores of the VM). But recently this is job is also getting stuck and we now need to reduce to 2 workers for it to work!
Why aren't we able to use up all of the cores on our VM?

Azure Machine Learning
Azure Machine Learning
An Azure machine learning service for building and deploying models.
3,343 questions
0 comments No comments
{count} votes

Accepted answer
  1. santoshkc 15,355 Reputation points Microsoft External Staff Moderator
    2025-01-29T11:40:56.3866667+00:00

    Hi @Nicolas Petitclerc,

    We have noticed that you rated an answer as not helpful. We appreciate your feedback and are committed to improving your experience with the Q&A. I have worked on finding an alternative solution to address your issue.

    As you mentioned, you are not seeing increase in memory, it might be pointing towards deadlock condition (threads are not able to release and join properly). You can refer our Parallel job documentation which gives more robust way to handle job concurrency, timeout, no.of nodes, batch_size, etc and this is more efficient compared to python multiprocessing.pool. You can optimize the above parameters (batch_size, etc) to optimize inference time and overall performance.

    Please refer to: How to use parallel jobs in pipelines - Azure Machine Learning | Microsoft Learn

    User's image

    If the information is helpful, please click on "Upvote".

    If you have any further queries, please let us know. Thank you.


1 additional answer

Sort by: Most helpful
  1. Azar 29,520 Reputation points MVP Volunteer Moderator
    2025-01-29T09:12:12.5333333+00:00

    Hi there Nicolas Petitclerc

    thanks for using QandA platform

    Even though your VM has 8 cores, some may be reserved for system tasks, so try setting processes=n-2 to avoid CPU overload. monitor RAM with psutil and consider batching tasks. If multiprocessing.Pool is slowing down due to pickle serialization, try using Ray or joblib for better efficiency. Azure ML's auto-scaling or job scheduling might also be affecting worker distribution, so testing with a single node first can help. consider switching to Ray for distributed execution. Let us know if this helps

    If this helps kindly accept the answer thanks much.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.