Multi-GPU and multi-node distributed training

Important

This feature is in Beta.

This page provides notebook examples for multi-GPU and multi-node distributed training on Serverless GPU compute. The examples demonstrate how to scale training across multiple GPUs and nodes for improved performance.

Choose your parallelism technique

When scaling your model training across multiple GPUs, choosing the right parallelism technique depends on your model size, available GPU memory, and performance requirements.

| Technique | When to use |
|---|---|
| DDP (Distributed Data Parallel) | The full model fits in a single GPU's memory and you need to scale data throughput. |
| FSDP (Fully Sharded Data Parallel) | Very large models that don't fit in a single GPU's memory. |
| DeepSpeed ZeRO | Large models with advanced memory optimization needs. |

For detailed information about each technique, see DDP, FSDP, and DeepSpeed.
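The techniques differ mainly in how model state is placed on the GPUs. The sketch below, in plain PyTorch, wraps the same model for either DDP or FSDP. It assumes the process group has already been initialized by the launcher and that there is one process per GPU; the `wrap_model` helper and its `strategy` argument are illustrative names, not part of the Serverless GPU API.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: torch.nn.Module, strategy: str = "ddp") -> torch.nn.Module:
    # Assumes dist.init_process_group() has already been called by the launcher.
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    if strategy == "ddp":
        # DDP replicates the full model on every GPU and all-reduces gradients,
        # so the model must fit in a single GPU's memory.
        return DDP(model, device_ids=[local_rank])
    # FSDP shards parameters, gradients, and optimizer state across GPUs,
    # which allows training models that exceed a single GPU's memory.
    return FSDP(model, device_id=local_rank)
```

With either wrapper the training loop itself is unchanged; the choice mainly trades replication (DDP) for sharding (FSDP), as summarized in the table above.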

Example notebooks by technique and framework

The following table organizes example notebooks by the framework/library you're using and the parallelism technique applied. Multiple notebooks may appear in a single cell.

| Framework/Library | DDP examples | FSDP examples | DeepSpeed examples |
|---|---|---|---|
| PyTorch (native) | Simple MLP neural network<br>MLflow 3.0 integration (Iris classification) | 10M parameter transformer | |
| TRL + PyTorch | Fine-tune GPT-OSS | Fine-tune GPT-OSS | Fine-tune Llama 1B |
| Unsloth | Fine-tune Llama 3.2 3B | | |
| Ray Train | ResNet18 on FashionMNIST (computer vision) | | |
| PyTorch Lightning | Two-tower recommender system | | |

Get started

The following notebook provides a basic example of how to use the Serverless GPU Python API to launch distributed training across multiple A10 GPUs.

Serverless GPU API: A10 starter

Get notebook
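Regardless of how the GPUs are launched, the function handed to the launcher typically follows the standard PyTorch distributed pattern sketched below: initialize the process group, bind the process to its local GPU, wrap the model, train, and clean up. This is a minimal illustration under that assumption; the `train_fn` name, the toy model, and the hyperparameters are placeholders, and the Serverless GPU API call itself is shown in the starter notebook rather than here.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_fn():
    # The launcher sets the standard torch.distributed environment variables
    # (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK",
                                    dist.get_rank() % torch.cuda.device_count()))
    torch.cuda.set_device(local_rank)

    # Placeholder model and optimizer; replace with your own.
    model = torch.nn.Linear(128, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):
        inputs = torch.randn(32, 128, device=local_rank)
        targets = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across the GPUs
        optimizer.step()

    dist.destroy_process_group()
```

The starter notebook shows how to provision the A10 GPUs and invoke a function like this on every worker.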