Workload YAML reference

Important

The AI Runtime CLI is in Beta.

Define a training job's experiment name, compute, command, environment, and code source in the workload YAML config you pass to air run --file. This page documents every field.

Note

The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.

Minimal configuration

experiment_name: my-training
environment:
  dependencies:
    - mlflow
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"

Submit with:

air run --file train.yaml -p profile

Core concepts

Core fields

Most training configurations include five components:

  1. experiment_name: Required. Creates or appends to an MLflow experiment.
  2. environment: Optional. Python dependencies and base environment.
  3. compute: Required. GPU resources (type and count).
  4. command: Required. The bash command or commands used to launch training.
  5. code_source: Optional. Path to your training code, made available remotely.
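
As a rough illustration of the list above, the following sketch checks a parsed workload config for the three fields marked Required. The helper name missing_required_fields and the use of a plain dict (for example, the result of parsing the YAML) are assumptions made for this example; the CLI's own validation may behave differently.

```python
# Hypothetical helper, not part of the air CLI: checks a parsed workload
# config (a plain dict) for the fields this page marks as Required.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


config = {
    "experiment_name": "my-training",
    "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
    # "command" is intentionally left out to show the check failing.
}
print(missing_required_fields(config))
```

A check like this can run locally before submitting, so an incomplete file is caught without waiting on the CLI.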

Your first training job

experiment_name: simple-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py

In this configuration:

  • experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
  • environment installs the listed Python dependencies (here, torch and transformers).
  • compute allocates one H100 node (8 H100 GPUs).
  • code_source uploads the folder repo to the node, available at $CODE_SOURCE_PATH.
  • command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.

Common use cases

Add environment variables

experiment_name: training-with-env
environment:
  dependencies:
    - torch
    - transformers
env_variables:
  BATCH_SIZE: '32'
  LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
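
Inside the training script, these values would typically be read back and converted from strings. The sketch below assumes that env_variables entries are exposed to the command as ordinary process environment variables, which the field name suggests but this page does not spell out; read_training_settings is a hypothetical helper.

```python
import os


def read_training_settings(env=os.environ) -> dict:
    """Convert string-valued environment variables into typed settings."""
    return {
        "batch_size": int(env["BATCH_SIZE"]),
        "learning_rate": float(env["LEARNING_RATE"]),
    }


# Stand-in for the environment the job would see, using the values above.
example_env = {"BATCH_SIZE": "32", "LEARNING_RATE": "0.001"}
print(read_training_settings(example_env))
```

The values are quoted in the YAML because environment variables are always strings; the conversion to int and float happens in your code.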

Use secrets (API keys, tokens)

experiment_name: training-with-secrets
environment:
  dependencies:
    - torch
    - transformers
secrets:
  HF_TOKEN: 'my_scope/hf_token'
  WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py

Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.

When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
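
To make the scope/key format concrete, here is a minimal sketch that splits a reference into its two parts and rejects malformed values. The helper split_secret_reference is hypothetical, and the rule that a key contains no further slash is an assumption of this example, not something this page states.

```python
def split_secret_reference(reference: str) -> tuple[str, str]:
    """Split a 'scope/key' reference into (scope, key)."""
    scope, sep, key = reference.partition("/")
    # Assumption: exactly one slash, with non-empty text on both sides.
    if not sep or not scope or not key or "/" in key:
        raise ValueError(f"expected 'scope/key', got {reference!r}")
    return scope, key


secrets = {"HF_TOKEN": "my_scope/hf_token", "WANDB_API_KEY": "my_scope/wandb"}
for name, reference in secrets.items():
    print(name, split_secret_reference(reference))
```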

Python dependencies

List your workload's Python dependencies as an inline list under environment.dependencies:

environment:
  version: '4'
  dependencies:
    - torch
    - transformers

environment.version selects the serverless GPU environment version. It is optional and defaults to "4".

Dependency format

The dependency list follows the Databricks Base Environment Specification. Each entry is a pip-style package spec (for example, my-library==6.1). The list also accepts the following entries:

  • Requirements files: a reference to an existing requirements.txt using -r, for example -r '/Workspace/Shared/requirements.txt'. Environment variables such as $HOME are expanded.
  • Wheels: an absolute path to a .whl file, for example /Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl.
  • Index URLs: an index URL, for example --index-url https://pypi.org/simple.

environment:
  version: '4'
  dependencies:
    - --index-url https://pypi.org/simple
    - -r '/Workspace/Shared/requirements.txt'
    - my-library==6.1
    - /Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl
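
The four kinds of entry described above can be told apart by simple prefix and suffix checks. The following is only a sketch for reading a dependency list; classify_dependency is a hypothetical helper and does not reflect how the CLI or uv actually parses entries.

```python
def classify_dependency(entry: str) -> str:
    """Return a rough label for one environment.dependencies entry."""
    text = entry.strip()
    if text.startswith("-r "):
        return "requirements file"
    if text.startswith("--index-url "):
        return "index URL"
    if text.endswith(".whl"):
        return "wheel"
    # Anything else is treated as a pip-style package spec.
    return "package spec"


dependencies = [
    "--index-url https://pypi.org/simple",
    "-r '/Workspace/Shared/requirements.txt'",
    "my-library==6.1",
    "/Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl",
]
for entry in dependencies:
    print(f"{classify_dependency(entry)}: {entry}")
```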

Supported install flags

Dependencies are installed with uv. The following pip-style flags are supported as list entries:

  • Applied to the whole install: --index-url, --extra-index-url, and --find-links (-f) set or extend the package indexes.
  • Applied to the dependency that follows them: --no-deps, --no-build-isolation, --no-cache-dir, and --force-reinstall. Place each flag either as its own list entry directly before the dependency it applies to, or on the same line before the spec.

For example, to install flash-attn against the already-installed torch (no build isolation) and without resolving its own dependencies:

environment:
  version: '4'
  dependencies:
    - torch
    - --no-build-isolation
    - --no-deps
    - flash-attn
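
One way to picture the scoping rule is to walk the list and attach each per-dependency flag to the next spec that follows it. The sketch below does this for the flags named above; group_install_flags is a hypothetical helper that mirrors the described behavior under simplifying assumptions (flags and their values are separated by spaces, not =) and is not the installer's actual logic.

```python
GLOBAL_FLAGS = {"--index-url", "--extra-index-url", "--find-links", "-f"}
PER_DEPENDENCY_FLAGS = {
    "--no-deps", "--no-build-isolation", "--no-cache-dir", "--force-reinstall",
}


def group_install_flags(entries):
    """Return (global_entries, [(flags, dependency), ...])."""
    global_entries, grouped, pending = [], [], []
    for entry in entries:
        tokens = entry.split()
        if tokens and tokens[0] in GLOBAL_FLAGS:
            global_entries.append(entry)
            continue
        # Collect leading per-dependency flags; the rest is the dependency.
        while tokens and tokens[0] in PER_DEPENDENCY_FLAGS:
            pending.append(tokens.pop(0))
        if tokens:
            grouped.append((pending, " ".join(tokens)))
            pending = []
    return global_entries, grouped


entries = ["torch", "--no-build-isolation", "--no-deps", "flash-attn"]
print(group_install_flags(entries))
```

In this reading, torch is installed with no extra flags, while both flags apply only to flash-attn.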

Note

--trusted-host is not supported. Because uv configures trust per index URL, use --index-url or --extra-index-url instead.

Custom Docker images

As an alternative to environment.dependencies, you can specify a custom Docker container image with environment.docker_image.url. This field is mutually exclusive with both environment.dependencies and environment.version: a workload that sets environment.docker_image.url cannot set either of them.

experiment_name: my-dcs-training
environment:
  docker_image:
    url: myorg/myrepo:mytag
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: python /app/train.py

Before using a custom image, register it with air register image. For full details, including image requirements, Databricks base images, and Dockerfile patterns, see Use custom Docker images.
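
The mutual-exclusivity rule for custom images can be expressed as a small check on the parsed environment block. This is a sketch only; environment_conflicts is a hypothetical helper operating on a plain dict, and the CLI may report such conflicts differently.

```python
def environment_conflicts(environment: dict) -> list[str]:
    """List fields that conflict with environment.docker_image.url."""
    docker_image = environment.get("docker_image") or {}
    if not docker_image.get("url"):
        return []  # No custom image, so nothing can conflict.
    return [
        field for field in ("dependencies", "version") if field in environment
    ]


valid = {"docker_image": {"url": "myorg/myrepo:mytag"}}
invalid = {
    "docker_image": {"url": "myorg/myrepo:mytag"},
    "dependencies": ["torch"],
    "version": "4",
}
print(environment_conflicts(valid))
print(environment_conflicts(invalid))
```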

Work with code sources

The code_source block uploads local code so the training job can run it.

  • root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
  • To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
  • For large repositories, include_paths lets you snapshot a subset.

Minimal example

experiment_name: simple-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py

On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path, so use it in your command rather than hard-coding the location.
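
The mapping from root_path to the remote location can be sketched as follows. The helper remote_code_path is hypothetical and assumes root_path has already been resolved to an absolute path; in a real job, read $CODE_SOURCE_PATH rather than computing the location yourself.

```python
import posixpath

# Base directory as described on this page.
REMOTE_BASE = "/databricks/code_source"


def remote_code_path(root_path: str) -> str:
    """Map a local root_path to the remote directory described above."""
    directory_name = posixpath.basename(root_path.rstrip("/"))
    return posixpath.join(REMOTE_BASE, directory_name)


print(remote_code_path("/home/username/repo"))
```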

Git repositories: pin by branch or commit

For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive: specify exactly one within the block.

Pin to a branch (uses the local HEAD of that branch):

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main # Uses local HEAD of main (no remote fetch)
command: bash $CODE_SOURCE_PATH/train.sh

Pin to a commit SHA (exact reproducibility):

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567 # Pins specific commit
command: bash $CODE_SOURCE_PATH/train.sh

Key fields:

  • root_path (Required): Local path to the root of your git repository.
  • git.branch (Optional): Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
  • git.commit (Optional): Specific commit SHA. Mutually exclusive with git.branch.
  • git.remote (Optional): Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.

If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes. No extra field is required.
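
The constraints on the git: block can be summarized as a short validation routine. The sketch below encodes only the rules stated above; git_block_errors is a hypothetical helper that takes the parsed block as a plain dict, and the error wording is invented for illustration.

```python
def git_block_errors(git: dict) -> list[str]:
    """Check a code_source.snapshot.git block against the rules above."""
    errors = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    if has_branch == has_commit:
        errors.append("specify exactly one of git.branch or git.commit")
    # remote may be absent or false, true, or a remote name; only a truthy
    # value asks for a remote HEAD, which needs a branch.
    if git.get("remote") and not has_branch:
        errors.append("git.remote is only valid with git.branch")
    return errors


print(git_block_errors({"branch": "main", "remote": "upstream"}))
print(git_block_errors({"commit": "abc1234567", "remote": True}))
```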

Non-git directories

You can also snapshot directories that aren't git repositories. To do so, omit the git: block, because that block requires root_path to be a git repository. Without a git: block there is no version caching; a fresh tarball is uploaded for every run.

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: python $CODE_SOURCE_PATH/train.py

Folder filtering with include_paths

For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py

Key points:

  • The field is optional. If omitted, the entire repository is included by default.
  • Paths must be relative to the repository root (no leading /).
  • .. is not allowed; you cannot reference parent directories.
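
The two path rules above (relative only, no parent references) are easy to check before submitting. This is a minimal sketch; include_path_errors is a hypothetical helper and its messages are illustrative rather than the CLI's own.

```python
from pathlib import PurePosixPath


def include_path_errors(include_paths: list[str]) -> list[str]:
    """Return one message per include_paths entry that breaks a rule."""
    errors = []
    for raw in include_paths:
        path = PurePosixPath(raw)
        if path.is_absolute():
            errors.append(f"{raw}: must be relative to the repository root")
        elif ".." in path.parts:
            errors.append(f"{raw}: '..' is not allowed")
    return errors


print(include_path_errors(["research/models", "/research/common", "../configs"]))
```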

Advanced features

Custom hyperparameters

Pass structured configuration to your training script via HYPERPARAMETERS_PATH:

experiment_name: parameterized-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001

Read them in your script:

import os
import yaml

with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

learning_rate = params['training']['learning_rate']
model_name = params['model']['name']
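
Direct indexing raises KeyError when a parameter is missing. If your script should fall back to defaults instead, a small lookup helper is one option. The sketch below is a convenience pattern for your own code, not a feature of the CLI; get_param and the dotted-key convention are assumptions of this example, and the params dict stands in for the loaded file.

```python
def get_param(params: dict, dotted_key: str, default=None):
    """Look up 'section.name' in nested params, returning default if absent."""
    node = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Stand-in for the dict loaded from HYPERPARAMETERS_PATH.
params = {
    "model": {"name": "gpt2", "hidden_size": 768},
    "training": {"batch_size": 32, "learning_rate": 0.0001},
}
print(get_param(params, "training.learning_rate"))
print(get_param(params, "training.warmup_steps", default=0))
```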

Job reliability

experiment_name: reliable-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
max_retries: 2
timeout_minutes: 90

If the workload fails, it is retried up to two times. Each attempt has 90 minutes to complete, so with one initial attempt plus two retries the worst-case wall-clock budget is 90 × 3 = 270 minutes.
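
The retry arithmetic generalizes to any combination of the two fields. The sketch below assumes, as described above, that the timeout applies to each attempt separately; worst_case_minutes is a hypothetical helper for planning, not a CLI function.

```python
def worst_case_minutes(timeout_minutes: int, max_retries: int) -> int:
    """Upper bound on wall-clock time: one initial attempt plus retries."""
    attempts = 1 + max_retries
    return timeout_minutes * attempts


print(worst_case_minutes(timeout_minutes=90, max_retries=2))  # 270
```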

Cost attribution

Attach a workload to an existing usage policy via usage_policy_name. The name is resolved to the policy's ID when the workload launches. For setup, see Attribute usage with serverless usage policies.

experiment_name: my-training
environment:
  dependencies:
    - mlflow
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_name: my team policy

Reference

Core fields

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| experiment_name | string | Experiment name for MLflow. | "my-training-job" |
| environment.dependencies | list | Inline list of pip dependency specs. | ["torch", "transformers"] |
| environment.version | string | Serverless GPU environment version. Optional. Defaults to "4". | "4" |
| compute.num_accelerators | int | Number of GPUs. | 1, 4, 8 |
| compute.accelerator_type | string | GPU type. | "GPU_1xA10", "GPU_8xH100" |
| code_source | dict | Code source configuration. | See Work with code sources. |
| command | string | Bash commands to launch training. | torchrun --nproc_per_node=8 train.py |

Supported GPU types

| accelerator_type | GPUs per node | Notes |
| --- | --- | --- |
| GPU_1xA10 | 1 | Single A10, good for development and small workloads. |
| GPU_1xH100 | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node, typical for distributed training. |

For accelerator capabilities and recommended use cases, see Hardware options.
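
When launching with torchrun, the --nproc_per_node value in the examples on this page matches the GPUs-per-node figure for the chosen accelerator type. The sketch below derives the command from that table. The mapping is copied from the table above and may not be exhaustive, and torchrun_command is a hypothetical helper.

```python
# GPUs per node, as listed in the table above.
GPUS_PER_NODE = {"GPU_1xA10": 1, "GPU_1xH100": 1, "GPU_8xH100": 8}


def torchrun_command(accelerator_type: str, script: str) -> str:
    """Build a torchrun command with one process per GPU on the node."""
    gpus = GPUS_PER_NODE[accelerator_type]
    return f"torchrun --nproc_per_node={gpus} {script}"


print(torchrun_command("GPU_8xH100", "$CODE_SOURCE_PATH/train.py"))
```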

Optional fields

Environment configuration

environment:
  version: '4'
  dependencies:
    - torch
    - transformers
env_variables:
  BATCH_SIZE: '32'
secrets:
  HF_TOKEN: 'my_scope/hf_token'

For the dependency format, supported install flags, and environment.version, see Python dependencies.

Custom Docker image configuration

environment:
  docker_image:
    url: myorg/myrepo:mytag

Mutually exclusive with environment.dependencies and environment.version. Register the image with air register image before use. See Use custom Docker images.

Code source configuration

code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo # REQUIRED — local path to repo or directory
    git: # Optional (git repos only) — pin to a branch or commit
      branch: main # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567 # Mutually exclusive with 'branch'
      remote: false # Optional — true to auto-detect remote HEAD, or a remote name string
    include_paths: # Optional — filter included paths
      - src/
      - configs/

Field constraints:

  • git.branch and git.commit are mutually exclusive: specify exactly one within the git: block.
  • git.remote requires git.branch (it has no effect with git.commit).
  • If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.

Custom parameters

Passed to the workload via HYPERPARAMETERS_PATH:

parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32

MLflow run name

mlflow_run_name: 'experiment-001-baseline'

Path resolution

Relative paths in the workload YAML are resolved against the directory that contains the YAML file. Absolute paths are used as-is.

Folder structure:

/home/username/my-project/
├── train.yaml
└── scripts/
    └── train.py

YAML configuration:

experiment_name: my-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: . # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
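
The resolution rule for this example can be sketched as follows. The helper resolve_config_path is hypothetical and uses POSIX path handling; it shows the expected result for root_path: . rather than the CLI's actual implementation.

```python
import posixpath


def resolve_config_path(yaml_file: str, path: str) -> str:
    """Resolve a path from the YAML against the YAML file's directory."""
    if posixpath.isabs(path):
        return posixpath.normpath(path)
    base_dir = posixpath.dirname(yaml_file)
    return posixpath.normpath(posixpath.join(base_dir, path))


print(resolve_config_path("/home/username/my-project/train.yaml", "."))
```

Here root_path resolves to /home/username/my-project, so scripts/train.py sits directly under the snapshot root.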