Important
The AI Runtime CLI is in Beta.
This page is the reference for workload YAML configurations passed to air run --file.
Note
The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.
Minimal configuration
experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
Submit with:
air run --file train.yaml -p profile
Core concepts
Core fields
Most training configurations include five components:
- experiment_name: Required. Creates or appends to an MLflow experiment.
- environment: Optional. Python dependencies and base environment.
- compute: Required. GPU resources (type and count).
- command: Required. The bash command or commands used to launch training.
- code_source: Optional. Path to your training code, made available remotely.
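The required/optional split above can be checked before submitting. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dict; the helper name missing_required_fields is hypothetical and is not part of the air CLI, whose own validation may differ.

```python
from __future__ import annotations

# Top-level fields this page marks as Required / Optional.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")
OPTIONAL_FIELDS = ("environment", "code_source")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent or empty.

    Hypothetical helper for illustration only; not the CLI's validator.
    """
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


config = {
    "experiment_name": "my-training",
    "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
}
print(missing_required_fields(config))  # ['command']
```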
Your first training job
experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
In this configuration:
- experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
- environment installs dependencies from requirements.yaml.
- compute allocates one H100 node (8 H100 GPUs).
- code_source uploads the folder repo to the node, available at $CODE_SOURCE_PATH.
- command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.
Common use cases
Add environment variables
experiment_name: training-with-env
environment:
  dependencies: requirements.yaml
  env_variables:
    BATCH_SIZE: '32'
    LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
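Because the values under env_variables are quoted strings, a training script has to convert them before use. The sketch below assumes the variables are exposed to the command as ordinary process environment variables (the field name suggests this, but it is not confirmed here); the function name read_training_env and the fallback defaults are hypothetical.

```python
import os


def read_training_env(env=os.environ):
    """Convert string-valued environment variables into numeric types.

    Assumption: env_variables from the workload YAML appear in the
    process environment. The fallback defaults are illustrative only.
    """
    batch_size = int(env.get("BATCH_SIZE", "32"))
    learning_rate = float(env.get("LEARNING_RATE", "0.001"))
    return batch_size, learning_rate


batch_size, learning_rate = read_training_env(
    {"BATCH_SIZE": "32", "LEARNING_RATE": "0.001"}
)
print(batch_size, learning_rate)  # 32 0.001
```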
Use secrets (API keys, tokens)
experiment_name: training-with-secrets
environment:
  dependencies: requirements.yaml
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
    WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.
When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
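To catch malformed references in a shared template early, the scope/key format can be checked locally. This is a minimal sketch: parse_secret_ref is a hypothetical helper, and the rule that a reference contains exactly one slash is an assumption drawn from the scope/key format described above. It does not verify that the secret exists or that you have access to it.

```python
from __future__ import annotations


def parse_secret_ref(ref: str) -> tuple[str, str]:
    """Split a 'scope/key' secret reference into (scope, key).

    Assumption: exactly one '/' separates a non-empty scope and key.
    """
    scope, sep, key = ref.partition("/")
    if not sep or not scope or not key or "/" in key:
        raise ValueError(f"expected 'scope/key', got {ref!r}")
    return scope, key


print(parse_secret_ref("my_scope/hf_token"))  # ('my_scope', 'hf_token')
```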
Work with code sources
The code_source block uploads local code so the training job can run it.
- root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
- To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
- For large repositories, include_paths lets you snapshot a subset.
Minimal example
experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py
On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path — use it in your command rather than hard-coding the location.
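The mapping from root_path to the remote location can be illustrated as follows. This is only a sketch of the rule stated above, assuming root_path is already an absolute path; remote_code_path is a hypothetical name, and in a real job you should read $CODE_SOURCE_PATH rather than compute the path yourself.

```python
import posixpath

# Remote base directory as described on this page.
REMOTE_CODE_ROOT = "/databricks/code_source"


def remote_code_path(root_path: str) -> str:
    """Return where the snapshot of root_path is placed remotely.

    Uses the final path component of root_path as <directory_name>.
    Assumes root_path is absolute (not '.' or another relative path).
    """
    directory_name = posixpath.basename(root_path.rstrip("/"))
    return posixpath.join(REMOTE_CODE_ROOT, directory_name)


print(remote_code_path("/home/username/repo"))  # /databricks/code_source/repo
```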
Git repositories: pin by branch or commit
For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive — specify exactly one within the block.
Pin to a branch (uses the local HEAD of that branch):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main # Uses local HEAD of main (no remote fetch)
command: train.sh
Pin to a commit SHA (exact reproducibility):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567 # Pins specific commit
command: train.sh
Key fields:
- root_path (Required): Local path to the root of your git repository.
- git.branch (Optional): Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
- git.commit (Optional): Specific commit SHA. Mutually exclusive with git.branch.
- git.remote (Optional): Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.
If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes — no extra field is required.
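The constraints on the git: block can be expressed as a small check. The sketch below takes the block as a Python dict; git_block_errors is a hypothetical helper and the error messages are illustrative, not the CLI's actual output.

```python
from __future__ import annotations


def git_block_errors(git: dict) -> list[str]:
    """Check a git: block against the rules described on this page."""
    errors = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    # branch and commit are mutually exclusive; exactly one is expected.
    if has_branch == has_commit:
        errors.append("specify exactly one of git.branch or git.commit")
    # remote (true or a remote name) is only valid together with branch.
    if git.get("remote") and not has_branch:
        errors.append("git.remote is only valid with git.branch")
    return errors


print(git_block_errors({"branch": "main", "remote": "upstream"}))  # []
print(git_block_errors({"commit": "abc1234567", "remote": True}))
# ['git.remote is only valid with git.branch']
```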
Non-git directories
You can snapshot directories that aren't git repositories. Omit the git: block — it requires root_path to be a git repository. Without it, there is no version caching; a fresh tarball is uploaded for every run.
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: $CODE_SOURCE_PATH/train.py
Folder filtering with include_paths
For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py
Key points:
- The field is optional. If omitted, the entire repository is included by default.
- Paths must be relative to the repository root (no leading /).
- .. is not allowed; you cannot reference parent directories.
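The two path rules above can be checked before submitting a workload. This is a minimal sketch; include_path_errors is a hypothetical helper, and the CLI may apply additional checks that are not covered here.

```python
from __future__ import annotations


def include_path_errors(paths: list[str]) -> list[str]:
    """Return one message per include_paths entry that breaks a rule."""
    errors = []
    for path in paths:
        if path.startswith("/"):
            errors.append(f"{path}: must be relative (no leading /)")
        elif ".." in path.split("/"):
            errors.append(f"{path}: '..' is not allowed")
    return errors


print(include_path_errors(["research/models", "research/common"]))  # []
print(include_path_errors(["/research/models"]))
# ['/research/models: must be relative (no leading /)']
```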
Advanced features
Custom hyperparameters
Pass structured configuration to your training script via HYPERPARAMETERS_PATH:
experiment_name: parameterized-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001
Read them in your script:
import os

import yaml  # PyYAML

# Load the parameters block passed via HYPERPARAMETERS_PATH.
with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

learning_rate = params['training']['learning_rate']
model_name = params['model']['name']
Job reliability
experiment_name: reliable-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
max_retries: 2
timeout_minutes: 90
If the workload fails, it is retried up to two times. Each attempt has 90 minutes to complete, so the worst-case wall-clock budget is 90 × 3 = 270 minutes (one initial attempt plus two retries).
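The same arithmetic for other settings, as a small sketch. The function name worst_case_minutes is hypothetical, and the calculation ignores any queueing or startup time between attempts.

```python
def worst_case_minutes(max_retries: int, timeout_minutes: int) -> int:
    """Upper bound on run time: the initial attempt plus every retry."""
    attempts = 1 + max_retries
    return attempts * timeout_minutes


print(worst_case_minutes(max_retries=2, timeout_minutes=90))  # 270
```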
Cost attribution
Attach a workload to an existing budget policy via usage_policy_id. For setup, see Attribute usage with serverless usage policies.
experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_id: abcd123-25b8-3e87-9a2c-f86eb19d101c
Reference
Core fields
| Field | Type | Description | Example |
|---|---|---|---|
| experiment_name | string | Experiment name for MLflow. | "my-training-job" |
| environment.dependencies | string | Path to requirements.yaml. | "requirements.yaml" |
| compute.num_accelerators | int | Number of GPUs. | 1, 4, 8 |
| compute.accelerator_type | string | GPU type. | "GPU_1xA10", "GPU_8xH100" |
| code_source | dict | Code source configuration. | See Work with code sources. |
| command | string | Bash commands to launch training. | torchrun --nproc_per_node=8 train.py |
Supported GPU types
| accelerator_type | GPUs per node | Notes |
|---|---|---|
| GPU_1xA10 | 1 | Single A10; good for development and small workloads. |
| GPU_1xH100 | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node; typical for distributed training. |
For accelerator capabilities and recommended use cases, see Hardware options.
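The table can be used to sanity-check a compute block. The sketch below assumes that num_accelerators is the total GPU count and must be a whole multiple of the GPUs per node; this page shows 8 GPUs on GPU_8xH100 as one node but does not state the multiple rule, so treat it as an assumption. The names GPUS_PER_NODE and nodes_needed are hypothetical.

```python
# GPUs per node for each accelerator_type listed in the table above.
GPUS_PER_NODE = {"GPU_1xA10": 1, "GPU_1xH100": 1, "GPU_8xH100": 8}


def nodes_needed(accelerator_type: str, num_accelerators: int) -> int:
    """Return how many nodes the requested GPU count corresponds to.

    Assumption: num_accelerators is the total GPU count and must be a
    whole multiple of the GPUs per node.
    """
    per_node = GPUS_PER_NODE[accelerator_type]
    if num_accelerators <= 0 or num_accelerators % per_node:
        raise ValueError(
            f"{num_accelerators} is not a positive multiple of {per_node}"
        )
    return num_accelerators // per_node


print(nodes_needed("GPU_8xH100", 8))  # 1
```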
Optional fields
Environment configuration
environment:
  dependencies: requirements.yaml
  env_variables:
    BATCH_SIZE: '32'
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
For the dependencies file format, see requirements.yaml reference.
Code source configuration
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo # REQUIRED: local path to repo or directory
    git: # Optional (git repos only): pin to a branch or commit
      branch: main # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567 # Mutually exclusive with 'branch'
      remote: false # Optional: true to auto-detect remote HEAD, or a remote name string
    include_paths: # Optional: filter included paths
      - src/
      - configs/
Field constraints:
- git.branch and git.commit are mutually exclusive; specify exactly one within the git: block.
- git.remote requires git.branch (it has no effect with git.commit).
- If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.
Custom parameters
Passed to the workload via HYPERPARAMETERS_PATH:
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
MLflow run name
mlflow_run_name: 'experiment-001-baseline'
Path resolution
Relative paths in the workload YAML are resolved relative to the location of the workload YAML file. Absolute paths are used as-is.
Folder structure:
/home/username/my-project/
├── train.yaml
├── requirements.yaml
└── scripts/
    └── train.py
YAML configuration:
experiment_name: my-training
environment:
  dependencies: requirements.yaml # Relative to train.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: . # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
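The resolution rule can be sketched as follows, using the folder structure above. This illustrates the stated rule and is not the CLI's implementation; resolve_config_path is a hypothetical name, and posixpath is used so the example behaves the same on any platform.

```python
import posixpath


def resolve_config_path(yaml_file: str, value: str) -> str:
    """Resolve a path from the workload YAML.

    Absolute paths are returned unchanged; relative paths are resolved
    against the directory that contains the workload YAML file.
    """
    if posixpath.isabs(value):
        return value
    base_dir = posixpath.dirname(yaml_file)
    return posixpath.normpath(posixpath.join(base_dir, value))


yaml_file = "/home/username/my-project/train.yaml"
print(resolve_config_path(yaml_file, "requirements.yaml"))
# /home/username/my-project/requirements.yaml
print(resolve_config_path(yaml_file, "."))
# /home/username/my-project
```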