Hello @nam
I hope you are doing well. Azure Machine Learning SDK v1 offers multiple options for distributed GPU training, listed below:
- Message Passing Interface (MPI)
  - Horovod
  - DeepSpeed
  - Environment variables from Open MPI
- PyTorch
  - Process group initialization
  - Launch options
  - DistributedDataParallel (per-process-launch)
  - Using torch.distributed.launch (per-node-launch)
  - PyTorch Lightning
  - Hugging Face Transformers
- TensorFlow
  - Environment variables for TensorFlow (TF_CONFIG)
- Accelerate GPU training with InfiniBand
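To make the "environment variables" entries in the list more concrete, here is a minimal sketch of how a training script might read them. It is only an illustration, not the official Azure ML approach: the helper names `read_open_mpi_context` and `read_tf_config` are hypothetical, and it assumes the job was launched so that the standard Open MPI variables (`OMPI_COMM_WORLD_RANK` and friends) or a JSON `TF_CONFIG` are present in the process environment. When they are missing, it falls back to single-process values.

```python
import json
import os
from typing import Mapping, Optional


def read_open_mpi_context(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect rank information from Open MPI environment variables.

    Hypothetical helper: defaults to a single-process layout when the
    variables are not set (for example, when running locally).
    """
    env = os.environ if env is None else env
    return {
        "world_rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        "local_size": int(env.get("OMPI_COMM_WORLD_LOCAL_SIZE", 1)),
    }


def read_tf_config(env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Parse TF_CONFIG (a JSON string) into a small summary dict.

    Hypothetical helper: returns None when TF_CONFIG is absent or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("TF_CONFIG")
    if not raw:
        return None
    config = json.loads(raw)
    task = config.get("task", {})
    workers = config.get("cluster", {}).get("worker", [])
    return {
        "task_type": task.get("type"),
        "task_index": task.get("index"),
        "num_workers": len(workers),
    }


if __name__ == "__main__":
    # Print whatever the current process can see; useful as a first
    # debugging step inside a distributed job.
    print("Open MPI context:", read_open_mpi_context())
    print("TF_CONFIG summary:", read_tf_config())
```

Both helpers accept an optional mapping, so you can pass a plain dict to try them out locally without launching a distributed job.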
For SDK v2, there should be significant changes to these options. Please feel free to let us know if you run into any problems. Thanks.
Regards,
Yutong