Is Azure supporting distributed GPU?

Question

Is there any plan? Any date we can expect?

Accepted Answer

Hello @nam

I hope you are doing well. Azure Machine Learning SDK v1 offers multiple options for distributed GPU training, as listed below:
- Message Passing Interface (MPI)
  - Horovod
  - DeepSpeed
  - Environment variables from Open MPI
- PyTorch
  - Process group initialization
  - Launch options
  - DistributedDataParallel (per-process-launch)
  - Using torch.distributed.launch (per-node-launch)
  - PyTorch Lightning
  - Hugging Face Transformers
- TensorFlow
  - Environment variables for TensorFlow (TF_CONFIG)
- Accelerate GPU training with InfiniBand
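
To make the "Environment variables from Open MPI" and "Process group initialization" items a bit more concrete, here is a minimal sketch of how a training script launched through MPI might translate Open MPI's per-process variables into the names PyTorch's env:// initialization reads. The helper name torch_env_from_open_mpi is made up for illustration and is not part of the Azure ML SDK or PyTorch; the sketch also assumes MASTER_ADDR and MASTER_PORT are provided separately.

```python
import os

# Open MPI exports these for every process started by mpirun.
# PyTorch's env:// initialization reads RANK, WORLD_SIZE (and scripts
# conventionally use LOCAL_RANK to pick a GPU).
OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
}


def torch_env_from_open_mpi(environ=None):
    """Hypothetical helper: map Open MPI variables to PyTorch-style names.

    Returns a dict such as {"RANK": "0", "WORLD_SIZE": "4", "LOCAL_RANK": "0"}.
    Raises RuntimeError if the process was not launched by Open MPI.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in OMPI_TO_TORCH if name not in env]
    if missing:
        raise RuntimeError(
            "Not running under Open MPI; missing: " + ", ".join(missing)
        )
    return {
        torch_name: env[ompi_name]
        for ompi_name, torch_name in OMPI_TO_TORCH.items()
    }


# Typical use inside a training script (commented out so the sketch
# runs without PyTorch installed or an MPI launcher present):
#
#   os.environ.update(torch_env_from_open_mpi())
#   # MASTER_ADDR and MASTER_PORT must also be set for env:// to work.
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl", init_method="env://")
```

Passing a plain dict instead of os.environ makes the mapping easy to check locally before submitting a job to a cluster.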

For SDK v2, significant changes are expected. Please feel free to let us know if you run into any problems. Thanks.

Regards,
Yutong

Answer

Hi Nam,

Do you mean distributed GPU training for ML in general, or something specific to Azure? Here is the link for distributed GPU training on Azure: how-to-train-distributed-gpu

If it is something else, please reply so I can have a look.

