Important
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.
The following examples are complete, end-to-end workloads that you submit from the `air` CLI
with `air run -f train.yaml`. Each shows a real distributed-training pattern on H100 GPUs
and includes the workload YAML, launcher script, and training code. If you haven't
submitted a run before, start with the quickstart.
| Example | Description |
|---|---|
| Multi-node LLM fine-tuning with FSDP | Supervised fine-tuning of Llama-3.1-8B across 16 H100 GPUs (2 nodes) using torchrun and PyTorch Fully Sharded Data Parallel (FSDP). Logs to MLflow and checkpoints to a Unity Catalog volume. |
| Distributed training with Ray Train | Distributed data-parallel fine-tuning with Ray Train's TorchTrainer across 8 H100 GPUs on a single node, with one worker per GPU. |
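
The GPU counts in the table can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not part of either example: `Topology` is a hypothetical helper, and it assumes a homogeneous layout (the same number of GPUs on every node, which gives 8 GPUs per node for the 16-GPU, 2-node FSDP example) and the contiguous per-node rank numbering that `torchrun` typically produces for such layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical helper describing a homogeneous multi-GPU job layout."""

    num_nodes: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # Total number of training processes, assuming one process per GPU.
        return self.num_nodes * self.gpus_per_node

    def global_rank(self, node_rank: int, local_rank: int) -> int:
        # Assumes ranks are numbered contiguously within each node.
        if not 0 <= node_rank < self.num_nodes:
            raise ValueError(f"node_rank {node_rank} out of range")
        if not 0 <= local_rank < self.gpus_per_node:
            raise ValueError(f"local_rank {local_rank} out of range")
        return node_rank * self.gpus_per_node + local_rank


# Layouts taken from the table above; 8 GPUs per node is inferred
# from "16 H100 GPUs (2 nodes)" and is an assumption.
fsdp_job = Topology(num_nodes=2, gpus_per_node=8)
ray_job = Topology(num_nodes=1, gpus_per_node=8)

print(fsdp_job.world_size)
print(ray_job.world_size)
print(fsdp_job.global_rank(node_rank=1, local_rank=0))
```

Under these assumptions, the FSDP example runs 16 processes and the Ray Train example runs 8 workers, one per GPU in both cases.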