Workload YAML reference

Important

The AI Runtime CLI is in Beta.

This page is the reference for workload YAML configurations passed to air run --file.

Note

The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.

Minimal configuration

experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"

Submit with:

air run --file train.yaml -p profile

Core concepts

Core fields

Most training configurations include five components:

  1. experiment_name: Required. Creates or appends to an MLflow experiment.
  2. environment: Optional. Python dependencies and base environment.
  3. compute: Required. GPU resources (type and count).
  4. command: Required. The bash command or commands used to launch training.
  5. code_source: Optional. Path to your training code, made available remotely.
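
The required/optional split above can be checked locally before you submit. This is a minimal sketch and not part of the air CLI: the helper name missing_required_fields is hypothetical, and it assumes the YAML has already been parsed into a Python dict.

```python
# Hypothetical pre-flight check; not part of the air CLI.
# Required top-level fields, as listed above.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


config = {
    "experiment_name": "my-training",
    "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
}
print(missing_required_fields(config))  # ['command']
```

The in-CLI help remains the authoritative validator; a check like this only catches obvious omissions early.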

Your first training job

experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py

In this configuration:

  • experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
  • environment installs dependencies from requirements.yaml.
  • compute allocates one H100 node (8 H100 GPUs).
  • code_source uploads the repo folder to the node, where it is available at $CODE_SOURCE_PATH.
  • command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.

Common use cases

Add environment variables

experiment_name: training-with-env
environment:
  dependencies: requirements.yaml
env_variables:
  BATCH_SIZE: '32'
  LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
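
Environment variables always reach a process as strings, which is presumably why the values above are quoted. Below is a minimal sketch of how train.py might read them, assuming the names BATCH_SIZE and LEARNING_RATE from the example. The helper read_training_env and its default values are hypothetical.

```python
import os


def read_training_env(environ=os.environ) -> dict:
    """Convert string-valued environment variables to typed settings."""
    return {
        # Defaults are illustrative fallbacks for local runs.
        "batch_size": int(environ.get("BATCH_SIZE", "32")),
        "learning_rate": float(environ.get("LEARNING_RATE", "0.001")),
    }


settings = read_training_env({"BATCH_SIZE": "64", "LEARNING_RATE": "0.01"})
print(settings)  # {'batch_size': 64, 'learning_rate': 0.01}
```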

Use secrets (API keys, tokens)

experiment_name: training-with-secrets
environment:
  dependencies: requirements.yaml
secrets:
  HF_TOKEN: 'my_scope/hf_token'
  WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py

Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.

When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
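
The scope/key format can be checked before submission. This is an illustrative sketch only: parse_secret_ref is a hypothetical helper, and it assumes that neither the scope nor the key contains a slash.

```python
def parse_secret_ref(ref: str) -> tuple[str, str]:
    """Split a 'scope/key' reference into (scope, key)."""
    scope, sep, key = ref.partition("/")
    # Assumption: exactly one slash, with non-empty text on both sides.
    if not sep or not scope or not key or "/" in key:
        raise ValueError(f"expected 'scope/key', got {ref!r}")
    return scope, key


print(parse_secret_ref("my_scope/hf_token"))  # ('my_scope', 'hf_token')
```

This validates only the shape of the reference. It does not confirm that the secret exists or that you can read it.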

Work with code sources

The code_source block uploads local code so the training job can run it.

  • root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
  • To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
  • For large repositories, include_paths lets you snapshot a subset.

Minimal example

experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py

On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path — use it in your command rather than hard-coding the location.
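
The mapping from root_path to the remote location can be sketched as follows. The helper remote_code_path is hypothetical and assumes an absolute, POSIX-style root_path. Inside a job, read $CODE_SOURCE_PATH instead of recomputing the location.

```python
from pathlib import PurePosixPath

# Remote base directory, as described above.
REMOTE_CODE_ROOT = PurePosixPath("/databricks/code_source")


def remote_code_path(root_path: str) -> str:
    """Return the remote location for a snapshot of root_path."""
    # .name is the final path component; a trailing slash is ignored.
    directory_name = PurePosixPath(root_path).name
    return str(REMOTE_CODE_ROOT / directory_name)


print(remote_code_path("/home/username/repo"))  # /databricks/code_source/repo
```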

Git repositories: pin by branch or commit

For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive — specify exactly one within the block.

Pin to a branch (uses the local HEAD of that branch):

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main # Uses local HEAD of main (no remote fetch)
command: bash $CODE_SOURCE_PATH/train.sh

Pin to a commit SHA (exact reproducibility):

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567 # Pins specific commit
command: bash $CODE_SOURCE_PATH/train.sh

Key fields:

  • root_path (Required) — Local path to the root of your git repository.
  • git.branch (Optional) — Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
  • git.commit (Optional) — Specific commit SHA. Mutually exclusive with git.branch.
  • git.remote (Optional) — Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.

If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes — no extra field is required.
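
The constraints on the git: block can be expressed as a small check. This is a sketch, not the CLI's own validation: validate_git_block is a hypothetical helper that takes the block as a Python dict, and it treats remote: false the same as an absent field.

```python
def validate_git_block(git: dict) -> list[str]:
    """Return a list of constraint violations for a git: block."""
    errors = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    # Exactly one of branch or commit must be present.
    if has_branch == has_commit:
        errors.append("specify exactly one of git.branch or git.commit")
    # Assumption: remote: false is treated like an absent field.
    if git.get("remote") and not has_branch:
        errors.append("git.remote is only valid with git.branch")
    return errors


print(validate_git_block({"branch": "main", "remote": "upstream"}))  # []
print(validate_git_block({"commit": "abc1234567", "remote": True}))
# ['git.remote is only valid with git.branch']
```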

Non-git directories

You can snapshot directories that aren't git repositories. Omit the git: block — it requires root_path to be a git repository. Without it, there is no version caching; a fresh tarball is uploaded for every run.

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: python $CODE_SOURCE_PATH/train.py

Folder filtering with include_paths

For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py

Key points:

  • The field is optional. If omitted, the entire repository is included by default.
  • Paths must be relative to the repository root (no leading /).
  • .. is not allowed; you cannot reference parent directories.
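
The two path rules above can be checked locally. The helper invalid_include_paths is a hypothetical sketch, not the CLI's own validation, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def invalid_include_paths(paths: list[str]) -> list[str]:
    """Return entries that are absolute or reference a parent directory."""
    bad = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Reject a leading "/" and any ".." component.
        if path.is_absolute() or ".." in path.parts:
            bad.append(raw)
    return bad


print(invalid_include_paths(["research/models", "/etc", "../other"]))
# ['/etc', '../other']
```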

Advanced features

Custom hyperparameters

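Pass structured configuration to your training script with the parameters block. The job reads it from the YAML file whose path is stored in the HYPERPARAMETERS_PATH environment variable: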

experiment_name: parameterized-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001

Read them in your script:

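import os
import yaml  # Provided by the PyYAML package

# HYPERPARAMETERS_PATH points to a file containing the parameters block.
with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

# Keys mirror the nesting under parameters in the workload YAML.
learning_rate = params['training']['learning_rate']
model_name = params['model']['name']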

Job reliability

experiment_name: reliable-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
max_retries: 2
timeout_minutes: 90

If the workload fails, it is retried up to two times. Each attempt has 90 minutes to complete, so the worst-case wall-clock budget is 3 attempts × 90 minutes = 270 minutes.
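
The worst-case budget is the per-attempt timeout multiplied by the number of attempts (the initial attempt plus every retry). The sketch below uses a hypothetical helper, worst_case_minutes, and ignores any time spent between attempts.

```python
def worst_case_minutes(timeout_minutes: int, max_retries: int) -> int:
    """Upper bound on total run time: one initial attempt plus retries."""
    attempts = 1 + max_retries
    return timeout_minutes * attempts


print(worst_case_minutes(timeout_minutes=90, max_retries=2))  # 270
```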

Cost attribution

Attach a workload to an existing serverless usage policy with usage_policy_id. For setup, see Attribute usage with serverless usage policies.

experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_id: abcd123-25b8-3e87-9a2c-f86eb19d101c

Reference

Core fields

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| experiment_name | string | Experiment name for MLflow. | "my-training-job" |
| environment.dependencies | string | Path to requirements.yaml. | "requirements.yaml" |
| compute.num_accelerators | int | Number of GPUs. | 1, 4, 8 |
| compute.accelerator_type | string | GPU type. | "GPU_1xA10", "GPU_8xH100" |
| code_source | dict | Code source configuration. | See Work with code sources. |
| command | string | Bash commands to launch training. | torchrun --nproc_per_node=8 train.py |

Supported GPU types

| accelerator_type | GPUs per node | Notes |
| --- | --- | --- |
| GPU_1xA10 | 1 | Single A10; good for development and small workloads. |
| GPU_1xH100 | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node; typical for distributed training. |

For accelerator capabilities and recommended use cases, see Hardware options.
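
The GPUs-per-node column lets you work out how many nodes a request would occupy. The sketch below is illustrative only. The helper node_count is hypothetical, and the rule that num_accelerators must be a whole multiple of the per-node count is an assumption, not documented behavior.

```python
# GPUs per node, copied from the table above.
GPUS_PER_NODE = {"GPU_1xA10": 1, "GPU_1xH100": 1, "GPU_8xH100": 8}


def node_count(accelerator_type: str, num_accelerators: int) -> int:
    """Return how many whole nodes the requested GPU count would occupy."""
    per_node = GPUS_PER_NODE[accelerator_type]
    # Assumption (not from the documentation): requests fill whole nodes.
    if num_accelerators <= 0 or num_accelerators % per_node:
        raise ValueError(
            f"num_accelerators must be a positive multiple of {per_node}"
        )
    return num_accelerators // per_node


print(node_count("GPU_8xH100", 8))  # 1
```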

Optional fields

Environment configuration

environment:
  dependencies: requirements.yaml
env_variables:
  BATCH_SIZE: '32'
secrets:
  HF_TOKEN: 'my_scope/hf_token'

For the dependencies file format, see requirements.yaml reference.

Code source configuration

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo # REQUIRED — local path to repo or directory
    git: # Optional (git repos only) — pin to a branch or commit
      branch: main # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567 # Mutually exclusive with 'branch'
      remote: false # Optional — true to auto-detect remote HEAD, or a remote name string
    include_paths: # Optional — filter included paths
      - src/
      - configs/

Field constraints:

  • git.branch and git.commit are mutually exclusive — specify exactly one within the git: block.
  • git.remote requires git.branch (it has no effect with git.commit).
  • If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.

Custom parameters

The parameters block is passed to the workload as a file whose path is stored in HYPERPARAMETERS_PATH:

parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32

MLflow run name

mlflow_run_name: 'experiment-001-baseline'

Path resolution

Relative paths in the workload YAML are resolved against the directory that contains the YAML file. Absolute paths are used as-is.

Folder structure:

/home/username/my-project/
├── train.yaml
├── requirements.yaml
└── scripts/
    └── train.py

YAML configuration:

experiment_name: my-training
environment:
  dependencies: requirements.yaml # Relative to train.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: . # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
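
The resolution rule can be sketched as follows. The helper resolve_config_path is hypothetical and uses POSIX-style paths so that the result matches the folder structure above; it is not the CLI's own implementation.

```python
from pathlib import PurePosixPath


def resolve_config_path(yaml_file: str, value: str) -> str:
    """Resolve a path from the workload YAML against the YAML's directory."""
    path = PurePosixPath(value)
    if path.is_absolute():
        return str(path)  # Absolute paths are used as-is.
    return str(PurePosixPath(yaml_file).parent / path)


yaml_file = "/home/username/my-project/train.yaml"
print(resolve_config_path(yaml_file, "requirements.yaml"))
# /home/username/my-project/requirements.yaml
print(resolve_config_path(yaml_file, "."))
# /home/username/my-project
```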