Important
The AI Runtime CLI is in Beta.
This page is the reference for workload YAML configurations passed to air run --file.
Note
The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.
Minimal configuration
experiment_name: my-training
environment:
  dependencies:
    - mlflow
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
Submit with:
air run --file train.yaml -p profile
Core concepts
Core fields
Most training configurations include five components:
- experiment_name: Required. Creates or appends to an MLflow experiment.
- environment: Optional. Python dependencies and base environment.
- compute: Required. GPU resources (type and count).
- command: Required. The bash command or commands used to launch training.
- code_source: Optional. Path to your training code, made available remotely.
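As a rough illustration of the required/optional split above, the following sketch checks a parsed configuration for the three required fields. The function name `missing_required_fields` and the idea of pre-checking a dict locally are assumptions for illustration; the CLI performs its own validation.

```python
# Required top-level fields, taken from the list above.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


# The minimal configuration from this page, as a Python dict.
minimal = {
    "experiment_name": "my-training",
    "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
    "command": 'echo "Hello World"',
}

print(missing_required_fields(minimal))                   # []
print(missing_required_fields({"experiment_name": "x"}))  # ['compute', 'command']
```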
Your first training job
experiment_name: simple-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
In this configuration:
- experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
- environment installs the listed Python dependencies (here, torch and transformers).
- compute allocates one H100 node (8 H100 GPUs).
- code_source uploads the folder repo to the node, available at $CODE_SOURCE_PATH.
- command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.
Common use cases
Add environment variables
experiment_name: training-with-env
environment:
  dependencies:
    - torch
    - transformers
  env_variables:
    BATCH_SIZE: '32'
    LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
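Inside the training script, these values would typically be read from the process environment. The sketch below assumes that env_variables entries are exported to the command's environment under the same names; the helper name `read_training_env` is hypothetical. Because the YAML values are quoted strings, the script converts them explicitly.

```python
import os


def read_training_env(env=os.environ):
    """Parse BATCH_SIZE and LEARNING_RATE, which arrive as strings."""
    batch_size = int(env["BATCH_SIZE"])
    learning_rate = float(env["LEARNING_RATE"])
    return batch_size, learning_rate


# Local dry run with the same values as the YAML above.
print(read_training_env({"BATCH_SIZE": "32", "LEARNING_RATE": "0.001"}))  # (32, 0.001)
```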
Use secrets (API keys, tokens)
experiment_name: training-with-secrets
environment:
  dependencies:
    - torch
    - transformers
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
    WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.
When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
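To catch a malformed reference before submitting, a template author could check the scope/key shape locally. This is a minimal sketch: `parse_secret_ref` is a hypothetical helper, and it assumes that neither the scope nor the key contains a slash. It does not check that the secret exists or that you can read it.

```python
def parse_secret_ref(ref: str) -> tuple[str, str]:
    """Split a 'scope/key' secret reference into (scope, key)."""
    scope, sep, key = ref.partition("/")
    # Assumption: exactly one slash, with non-empty text on both sides.
    if not sep or not scope or not key or "/" in key:
        raise ValueError(f"expected 'scope/key', got {ref!r}")
    return scope, key


print(parse_secret_ref("my_scope/hf_token"))  # ('my_scope', 'hf_token')
```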
Python dependencies
List your workload's Python dependencies as an inline list under environment.dependencies:
environment:
  version: '4'
  dependencies:
    - torch
    - transformers
environment.version selects the serverless GPU environment version. It is optional and defaults to "4".
Dependency format
The dependency list follows the Databricks Base Environment Specification. Each entry is a pip-style package spec (for example, my-library==6.1). The list also accepts the following entries:
- Requirements files: a reference to an existing requirements.txt using -r, for example -r '/Workspace/Shared/requirements.txt'. Environment variables such as $HOME are expanded.
- Wheels: an absolute path to a .whl file, for example /Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl.
- Index URLs: an index URL, for example --index-url https://pypi.org/simple.
environment:
  version: '4'
  dependencies:
    - --index-url https://pypi.org/simple
    - -r '/Workspace/Shared/requirements.txt'
    - my-library==6.1
    - /Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl
Supported install flags
Dependencies are installed with uv. The following pip-style flags are supported as list entries:
- Applied to the whole install: --index-url, --extra-index-url, and --find-links (-f) set or extend the package indexes.
- Applied to the dependency that follows them: --no-deps, --no-build-isolation, --no-cache-dir, and --force-reinstall. Place the flag on its own line (or before the spec), followed by the dependency it applies to.
For example, to install flash-attn against the already-installed torch (no build isolation) and without resolving its own dependencies:
environment:
  version: '4'
  dependencies:
    - torch
    - --no-build-isolation
    - --no-deps
    - flash-attn
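One way to read the flag-scoping rule is sketched below: whole-install flags are collected separately, and per-dependency flags accumulate until the next dependency spec claims them. This is only an illustration of the rule as described on this page, not the CLI's actual parser; `group_dependencies` and both constant names are hypothetical.

```python
GLOBAL_FLAGS = ("--index-url", "--extra-index-url", "--find-links", "-f")
PER_DEP_FLAGS = ("--no-deps", "--no-build-isolation", "--no-cache-dir", "--force-reinstall")


def group_dependencies(entries):
    """Split entries into whole-install options and (spec, flags) pairs."""
    global_opts, deps, pending = [], [], []
    for entry in entries:
        tokens = entry.split()
        # Accept both "--index-url URL" and "--index-url=URL".
        if tokens[0].split("=")[0] in GLOBAL_FLAGS:
            global_opts.append(entry)
            continue
        # Peel off per-dependency flags written on their own line or before a spec.
        while tokens and tokens[0] in PER_DEP_FLAGS:
            pending.append(tokens.pop(0))
        if tokens:
            # The remaining text is the dependency the pending flags apply to.
            deps.append((" ".join(tokens), pending))
            pending = []
    return global_opts, deps


entries = ["torch", "--no-build-isolation", "--no-deps", "flash-attn"]
print(group_dependencies(entries))
# ([], [('torch', []), ('flash-attn', ['--no-build-isolation', '--no-deps'])])
```

Under this reading, torch installs with no extra flags, while flash-attn receives both --no-build-isolation and --no-deps.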
Note
--trusted-host is not supported. Because uv configures trust per index URL, use --index-url or --extra-index-url instead.
Work with code sources
The code_source block uploads local code so the training job can run it.
- root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
- To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
- For large repositories, include_paths lets you snapshot a subset.
Minimal example
experiment_name: simple-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py
On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path — use it in your command rather than hard-coding the location.
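The mapping from root_path to the remote location can be sketched as follows. `remote_code_path` is a hypothetical helper that only mirrors the rule stated above; it assumes root_path has already been resolved to a path with a named final component (not "."). In a real command, prefer $CODE_SOURCE_PATH.

```python
from pathlib import PurePosixPath

# Remote parent directory, as stated in the paragraph above.
REMOTE_CODE_ROOT = PurePosixPath("/databricks/code_source")


def remote_code_path(root_path: str) -> str:
    """Return the remote directory for a given local root_path."""
    directory_name = PurePosixPath(root_path).name  # final path component
    return str(REMOTE_CODE_ROOT / directory_name)


print(remote_code_path("/home/username/repo"))  # /databricks/code_source/repo
```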
Git repositories: pin by branch or commit
For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive — specify exactly one within the block.
Pin to a branch (uses the local HEAD of that branch):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main  # Uses local HEAD of main (no remote fetch)
command: train.sh
Pin to a commit SHA (exact reproducibility):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567  # Pins specific commit
command: train.sh
Key fields:
- root_path (Required): Local path to the root of your git repository.
- git.branch (Optional): Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
- git.commit (Optional): Specific commit SHA. Mutually exclusive with git.branch.
- git.remote (Optional): Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.
If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes — no extra field is required.
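The constraints on the git: block can be expressed as a small local check. This is a sketch under the rules stated above; `validate_git_block` is a hypothetical helper and the error messages are illustrative, not the CLI's wording.

```python
def validate_git_block(git: dict) -> list[str]:
    """Return a list of constraint violations for a parsed git: block."""
    errors = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    # Exactly one of branch or commit must be present.
    if has_branch == has_commit:
        errors.append("specify exactly one of git.branch or git.commit")
    # remote (true or a remote name) only makes sense together with branch.
    if git.get("remote") and not has_branch:
        errors.append("git.remote is only valid with git.branch")
    return errors


print(validate_git_block({"branch": "main", "remote": "upstream"}))  # []
print(validate_git_block({"commit": "abc1234567", "remote": True}))
# ['git.remote is only valid with git.branch']
```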
Non-git directories
You can snapshot directories that aren't git repositories. Omit the git: block — it requires root_path to be a git repository. Without it, there is no version caching; a fresh tarball is uploaded for every run.
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: $CODE_SOURCE_PATH/train.py
Folder filtering with include_paths
For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py
Key points:
- The field is optional. If omitted, the entire repository is included by default.
- Paths must be relative to the repository root (no leading /).
- .. is not allowed; you cannot reference parent directories.
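The two path rules above can be checked locally before submitting. The sketch below is illustrative only: `invalid_include_paths` is a hypothetical helper, and it treats any `..` component as invalid, including one in the middle of a path.

```python
from pathlib import PurePosixPath


def invalid_include_paths(paths):
    """Return the entries that are absolute or contain a '..' component."""
    bad = []
    for path in paths:
        parts = PurePosixPath(path).parts  # '..' components are preserved
        if path.startswith("/") or ".." in parts:
            bad.append(path)
    return bad


print(invalid_include_paths(["research/models", "/research/common", "../configs"]))
# ['/research/common', '../configs']
```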
Advanced features
Custom hyperparameters
Pass structured configuration to your training script via HYPERPARAMETERS_PATH:
experiment_name: parameterized-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001
Read them in your script:
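import os
import yaml  # provided by the PyYAML package

# HYPERPARAMETERS_PATH points to a YAML file containing the parameters block.
with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

# Nested keys mirror the structure under parameters: in the workload YAML.
learning_rate = params['training']['learning_rate']
model_name = params['model']['name']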
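If the script passes these values to an API that expects a flat mapping (for example, logging them as run parameters), the nested structure can be flattened into dotted keys. The helper `flatten_params` below is a hypothetical convenience, not part of the CLI; it uses an inline dict in place of the loaded file so it runs on its own.

```python
def flatten_params(params: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Same structure as the parameters: block above.
params = {
    "model": {"name": "gpt2", "hidden_size": 768},
    "training": {"batch_size": 32, "learning_rate": 0.0001},
}
print(flatten_params(params)["training.learning_rate"])  # 0.0001
```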
Job reliability
experiment_name: reliable-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
max_retries: 2
timeout_minutes: 90
If the workload fails, it is retried up to two times. Each attempt has 90 minutes to complete, so the worst-case wall-clock budget is 90 × 3 = 270 minutes (one initial attempt plus two retries).
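The same arithmetic generalizes to other settings. A minimal sketch, with `worst_case_minutes` as a hypothetical helper name:

```python
def worst_case_minutes(max_retries: int, timeout_minutes: int) -> int:
    """Upper bound on total run time: the first attempt plus every retry."""
    attempts = 1 + max_retries
    return attempts * timeout_minutes


print(worst_case_minutes(max_retries=2, timeout_minutes=90))  # 270
```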
Cost attribution
Attach a workload to an existing budget policy via usage_policy_id. For setup, see Attribute usage with serverless usage policies.
experiment_name: my-training
environment:
  dependencies:
    - mlflow
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_id: abcd123-25b8-3e87-9a2c-f86eb19d101c
Reference
Core fields
| Field | Type | Description | Example |
|---|---|---|---|
| experiment_name | string | Experiment name for MLflow. | "my-training-job" |
| environment.dependencies | list | Inline list of pip dependency specs. | ["torch", "transformers"] |
| environment.version | string | Serverless GPU environment version. Optional. Defaults to "4". | "4" |
| compute.num_accelerators | int | Number of GPUs. | 1, 4, 8 |
| compute.accelerator_type | string | GPU type. | "GPU_1xA10", "GPU_8xH100" |
| code_source | dict | Code source configuration. | See Work with code sources. |
| command | string | Bash commands to launch training. | torchrun --nproc_per_node=8 train.py |
Supported GPU types
| accelerator_type | GPUs per node | Notes |
|---|---|---|
| GPU_1xA10 | 1 | Single A10; good for development and small workloads. |
| GPU_1xH100 | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node; typical for distributed training. |
For accelerator capabilities and recommended use cases, see Hardware options.
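The GPUs-per-node column is the value the examples on this page pass to torchrun as --nproc_per_node. The sketch below derives the launch command from the accelerator type; the lookup table copies the three rows above, and `torchrun_command` is a hypothetical helper.

```python
# GPUs per node, copied from the supported GPU types table.
GPUS_PER_NODE = {"GPU_1xA10": 1, "GPU_1xH100": 1, "GPU_8xH100": 8}


def torchrun_command(accelerator_type: str, script: str) -> str:
    """Build a single-node torchrun command with one process per GPU."""
    nproc = GPUS_PER_NODE[accelerator_type]
    return f"torchrun --nproc_per_node={nproc} {script}"


print(torchrun_command("GPU_8xH100", "$CODE_SOURCE_PATH/train.py"))
# torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
```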
Optional fields
Environment configuration
environment:
  version: '4'
  dependencies:
    - torch
    - transformers
  env_variables:
    BATCH_SIZE: '32'
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
For the dependency format, supported install flags, and environment.version, see Python dependencies.
Code source configuration
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo  # REQUIRED: local path to repo or directory
    git:                            # Optional (git repos only): pin to a branch or commit
      branch: main                  # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567          # Mutually exclusive with 'branch'
      remote: false                 # Optional: true to auto-detect remote HEAD, or a remote name string
    include_paths:                  # Optional: filter included paths
      - src/
      - configs/
Field constraints:
- git.branch and git.commit are mutually exclusive; specify exactly one within the git: block.
- git.remote requires git.branch (it has no effect with git.commit).
- If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.
Custom parameters
Passed to the workload via HYPERPARAMETERS_PATH:
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
MLflow run name
mlflow_run_name: 'experiment-001-baseline'
Path resolution
Relative paths in the workload YAML are resolved against the location of the workload YAML file, not the directory you run air from. Absolute paths are used as-is.
Folder structure:
/home/username/my-project/
├── train.yaml
└── scripts/
    └── train.py
YAML configuration:
experiment_name: my-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: .  # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
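The resolution rule can be sketched as follows, assuming that "relative to the workload YAML" means relative to the directory containing the YAML file. `resolve_workload_path` is a hypothetical helper that mirrors this reading; it is not the CLI's implementation.

```python
import posixpath


def resolve_workload_path(yaml_file: str, value: str) -> str:
    """Resolve a path from the YAML against the YAML file's directory."""
    base_dir = posixpath.dirname(yaml_file)
    # posixpath.join discards base_dir when value is already absolute.
    return posixpath.normpath(posixpath.join(base_dir, value))


yaml_file = "/home/username/my-project/train.yaml"
print(resolve_workload_path(yaml_file, "."))                    # /home/username/my-project
print(resolve_workload_path(yaml_file, "/home/username/repo"))  # /home/username/repo
```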