Azure ML jobs hanging if we use too many processes. (Python)

Question

Azure ML jobs hanging if we use too many processes. (Python)

Nicolas Petitclerc 20

We optimized a slow part of our job using multiprocessing.Pool (Python). We noticed that jobs are consistently getting stuck, with no activity, forever (many hours longer than it should take, until we kill the job).
We are running our jobs on a cluster of up to 4 nodes, with VMs:

Standard_D13_v2 (8 cores, 56 GB RAM, 400 GB disk)

We noticed that reducing the number of workers in the pool to 4 worked consistently (we initially assumed we could use all 8 cores of the VM). But recently this is job is also getting stuck and we now need to reduce to 2 workers for it to work!
Why aren't we able to use up all of the cores on our VM?

Accepted answer

1 additional answer

Your answer

Answer 1

santoshkc 15,355 Microsoft External Staff Moderator

Hi @Nicolas Petitclerc,

We have noticed that you rated an answer as not helpful. We appreciate your feedback and are committed to improving your experience with the Q&A. I have worked on finding an alternative solution to address your issue.

As you mentioned, you are not seeing increase in memory, it might be pointing towards deadlock condition (threads are not able to release and join properly). You can refer our Parallel job documentation which gives more robust way to handle job concurrency, timeout, no.of nodes, batch_size, etc and this is more efficient compared to python multiprocessing.pool. You can optimize the above parameters (batch_size, etc) to optimize inference time and overall performance.

Please refer to: How to use parallel jobs in pipelines - Azure Machine Learning | Microsoft Learn

User's image

If the information is helpful, please click on "Upvote".

If you have any further queries, please let us know. Thank you.

Nicolas Petitclerc 20 Reputation points

2025-01-29T12:11:36.51+00:00

I see, it's a bit unfortunate that multiprocessing doesn't work out of the box, as this solution is a bit more complicated.
In the end I think we will use 'concurrent.future' and just run our functions asynchronously on the same processor, since our bottleneck was I/O, we are just downloading & preprocess different files.
santoshkc 15,355 Reputation points Microsoft External Staff Moderator

2025-01-29T13:29:05.55+00:00

Hi @Nicolas Petitclerc,

It's a good approach. Since your bottleneck is I/O, running tasks asynchronously on the same processor should help improve efficiency.

Thank you.

Answer 2

Azar 29,520 MVP Volunteer Moderator

Hi there Nicolas Petitclerc

thanks for using QandA platform

Even though your VM has 8 cores, some may be reserved for system tasks, so try setting processes=n-2 to avoid CPU overload. monitor RAM with psutil and consider batching tasks. If multiprocessing.Pool is slowing down due to pickle serialization, try using Ray or joblib for better efficiency. Azure ML's auto-scaling or job scheduling might also be affecting worker distribution, so testing with a single node first can help. consider switching to Ray for distributed execution. Let us know if this helps

If this helps kindly accept the answer thanks much.

Nicolas Petitclerc 20 Reputation points

2025-01-29T09:32:10.2833333+00:00

Thanks, but we are having issues running with 4 out of 8 cores already, so we are using processes=n-4 right now. So that can't be just system tasks. We see there is no ram issues from the monitoring tab.

Share via

Azure ML jobs hanging if we use too many processes. (Python)

1 additional answer

Your answer