Important
The AI Runtime CLI is in Beta.
This page walks through submitting your first training job with the AI Runtime CLI. Before starting, install the CLI and configure authentication.
Step 1: Write a YAML config
Create train.yaml describing the workload. The minimal config requires an experiment name, a compute spec, and a command. The command below runs without any local code, so you can submit your first run right away:
experiment_name: my-first-air-run
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "hello AIR!"
Run your own code
To run a local training script, add an environment block that lists your Python dependencies and a code_source block that uploads your local code. Place your script alongside train.yaml:
my-project/
├── train.yaml
└── train.py
experiment_name: my-first-air-run
environment:
  version: '4'
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
code_source:
  type: snapshot
  snapshot:
    root_path: .
command: python $CODE_SOURCE_PATH/train.py
This config installs the listed dependencies, uploads the current directory (root_path: .), and runs train.py on a single A10 GPU. $CODE_SOURCE_PATH resolves to the location of the uploaded code on the remote node. Databricks recommends using this variable rather than hardcoding a path.
The environment.version field selects the serverless GPU environment version. It is optional and defaults to '4'. For all available versions, see Serverless environment versions.
For the full field reference, see Workload YAML reference.
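The config above refers to a train.py that this page does not show. As a rough sketch, a placeholder script might look like the following. The helper name resolve_code_root is hypothetical, and the script assumes that CODE_SOURCE_PATH is exposed to the process as an environment variable, which its use in the command field suggests but this page does not state outright.

```python
# train.py: hypothetical placeholder script (illustrative, not from the AIR docs)
import os
from pathlib import Path


def resolve_code_root() -> Path:
    """Return the uploaded code directory, or the working directory locally.

    Assumption: CODE_SOURCE_PATH is available as an environment variable on
    the remote node. When it is unset (for example on a laptop), fall back
    to the current working directory.
    """
    env_value = os.environ.get("CODE_SOURCE_PATH")
    if env_value:
        return Path(env_value)
    return Path.cwd()


def main() -> None:
    print(f"code root: {resolve_code_root()}")
    try:
        import torch  # listed under environment.dependencies in train.yaml
    except ImportError:
        print("torch is not installed; skipping the GPU check")
        return
    # Confirms that the requested accelerator is visible to PyTorch.
    print(f"CUDA available: {torch.cuda.is_available()}")


if __name__ == "__main__":
    main()
```

Resolving paths through one helper keeps the script runnable both locally and on the remote node, so data or config files stored next to train.py can be located the same way in both places.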
Step 2: Submit the run
Submit the workload:
air run --file train.yaml
The CLI uploads your local code (if you configured a code_source), submits the job, and prints a run ID. Use that ID to inspect, watch, and cancel the run in later commands.
The submission creates a run in the MLflow experiment named in experiment_name (an experiment can hold many runs). That run captures the workload's metrics, parameters, artifacts, and logs, all viewable in the workspace MLflow UI. Logs are also available outside MLflow: stream them to your terminal or a file, or download them later with air logs (see Step 3).
To watch logs until completion, add --watch:
air run --file train.yaml --watch
Step 3: Inspect the run
Check status:
air get run <run-id>
The output includes clickable links to the run's MLflow experiment and MLflow run in the workspace UI.
Stream or download logs:
air logs <run-id>
air logs <run-id> --node 2
air logs <run-id> --download-to ./logs/
Distributed workloads run across multiple nodes. By default, air logs streams from node 0. To view logs from a specific node, pass --node. Use --download-to to write logs to a local directory instead of streaming them.
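For a multi-node run, it can be convenient to save each node's stream to its own file. The sketch below wraps the air logs command with the --node flag described above. The function names are hypothetical, and it assumes that air is on your PATH and that the command exits once a finished run's logs have been streamed.

```python
import subprocess
from pathlib import Path


def logs_command(run_id: str, node: int) -> list[str]:
    """Build the `air logs` invocation for one node of a run."""
    if node < 0:
        raise ValueError("node index must be non-negative")
    return ["air", "logs", run_id, "--node", str(node)]


def save_node_logs(run_id: str, num_nodes: int, out_dir: str = "logs") -> list[Path]:
    """Write each node's log stream to <out_dir>/node-<n>.log."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for node in range(num_nodes):
        log_path = target / f"node-{node}.log"
        with log_path.open("w") as handle:
            # Assumption: `air` is on PATH and exits after streaming
            # the logs of a completed run.
            subprocess.run(logs_command(run_id, node), stdout=handle, check=True)
        written.append(log_path)
    return written
```

Calling save_node_logs("<run-id>", num_nodes=4) would then produce logs/node-0.log through logs/node-3.log. The number of nodes is not discovered automatically here; you pass it in yourself.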
List recent runs:
air list runs --limit 10
air list runs --active
Cancel a run:
air cancel <run-id>
Common patterns
Override YAML fields from the command line:
air run --file train.yaml --override compute.num_accelerators=32 timeout_minutes=120
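When submissions are scripted, for example in a sweep over accelerator counts, the override arguments can be assembled from a dictionary. This is a minimal sketch with hypothetical helper names. It only reproduces the key=value form shown in the command above and makes no assumption about how the CLI parses values that contain spaces or quotes.

```python
from typing import Optional


def override_args(overrides: dict) -> list[str]:
    """Turn {"compute.num_accelerators": 32} into tokens for --override."""
    if not overrides:
        return []
    pairs = [f"{key}={value}" for key, value in overrides.items()]
    return ["--override", *pairs]


def run_command(config_file: str, overrides: Optional[dict] = None) -> list[str]:
    """Build an `air run` argument list suitable for subprocess.run."""
    return ["air", "run", "--file", config_file, *override_args(overrides or {})]


if __name__ == "__main__":
    command = run_command(
        "train.yaml",
        {"compute.num_accelerators": 32, "timeout_minutes": 120},
    )
    print(" ".join(command))
```

Building the command as a list rather than a single string avoids shell quoting issues when it is passed to subprocess.run.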
Validate the config without submitting:
air run --file train.yaml --dry-run
Make a submission safely retryable:
air run --file train.yaml --idempotency-key my-unique-key
If the key has been used before, the command returns the existing run instead of creating a new one.
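One way to choose a key is to derive it from the config file's contents, so that resubmitting an unchanged config after a network failure reuses the same key. The sketch below illustrates that idea with hypothetical helper names. It assumes the CLI accepts a short alphanumeric string with hyphens as a key; this page does not document any format or length limits.

```python
import hashlib
from pathlib import Path


def idempotency_key(config_text: str, prefix: str = "air") -> str:
    """Derive a stable key from the config text (same text, same key)."""
    digest = hashlib.sha256(config_text.encode("utf-8")).hexdigest()
    # Truncated for readability; the accepted key format is an assumption.
    return f"{prefix}-{digest[:16]}"


def retryable_run_command(config_file: str) -> list[str]:
    """Build an `air run` argument list with a content-derived key."""
    key = idempotency_key(Path(config_file).read_text(encoding="utf-8"))
    return ["air", "run", "--file", config_file, "--idempotency-key", key]
```

A key derived this way changes only when the YAML changes. If you edit train.py but leave train.yaml untouched, the key stays the same and the earlier run would be returned, so include a version or timestamp in the key when you intend to start a fresh run.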