Important
This feature is in Beta.
This page provides notebook examples for multi-node and multi-GPU distributed training using Serverless GPU compute. The examples demonstrate how to scale training across multiple GPUs and nodes to improve throughput.
Choose your parallelism technique
When scaling your model training across multiple GPUs, choosing the right parallelism technique depends on your model size, available GPU memory, and performance requirements.
| Technique | When to use |
|---|---|
| DDP (Distributed Data Parallel) | Full model fits in single GPU memory; need to scale data throughput |
| FSDP (Fully Sharded Data Parallel) | Very large models that don't fit in single GPU memory |
| DeepSpeed ZeRO | Large models with advanced memory optimization needs |
For detailed information about each technique, see DDP, FSDP, and DeepSpeed.
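To make the trade-offs in the table concrete, the following is a minimal, illustrative DDP sketch. It is not one of the example notebooks: it uses a placeholder model and toy dataset, and it assumes a launcher (such as `torchrun`) that sets the standard `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and master-address environment variables.

```python
# Minimal PyTorch DDP sketch (illustrative only; not one of the notebooks on this page).
# Assumes the launcher exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model as placeholders; replace with your own.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients sync automatically
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle data differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key point for the table above: DDP keeps a full copy of the model on every GPU, so it only helps when the model already fits in a single GPU's memory and you want to scale data throughput.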
Example notebooks by technique and framework
The following table organizes example notebooks by the framework/library you're using and the parallelism technique applied. Multiple notebooks may appear in a single cell.
| Framework/Library | DDP examples | FSDP examples | DeepSpeed examples |
|---|---|---|---|
| PyTorch (native) | Simple MLP neural network<br>MLflow 3.0 integration (Iris classification) | 10M parameter transformer | — |
| TRL + PyTorch | Fine-tune GPT-OSS | Fine-tune GPT-OSS | Fine-tune Llama 1B |
| Unsloth | Fine-tune Llama 3.2 3B | — | — |
| Ray Train | ResNet18 on FashionMNIST (computer vision) | — | — |
| PyTorch Lightning | Two-tower recommender system | — | — |
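Several of the FSDP entries above fine-tune models that don't fit on a single GPU. As a rough illustration of what the FSDP wrapping step looks like, independent of any particular notebook and using a placeholder model, the sketch below shards parameters across ranks with `FullyShardedDataParallel`. Like the DDP sketch earlier, it assumes a launcher that sets the standard `torch.distributed` environment variables.

```python
# Minimal FSDP wrapping sketch (illustrative only; not one of the notebooks on this page).
# Each rank holds only a shard of the parameters, so models larger than one GPU's
# memory can still be trained.
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; replace with your own.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
    ).cuda(local_rank)

    model = FSDP(
        model,
        # Wrap any submodule above ~1M parameters in its own shard.
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=1_000_000
        ),
        device_id=local_rank,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # The training loop itself is unchanged from DDP: forward, backward, step.
    for _ in range(3):
        x = torch.randn(8, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```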
Get started
The following notebook shows a basic example of how to use the Serverless GPU Python API to launch distributed training across multiple A10 GPUs.