Use custom Docker images

Important

Custom Docker images for AI Runtime CLI workloads are in Beta.

Docker Container Services (DCS) lets you bring your own Docker container image to air workloads. Use a custom image when you need:

  • Specific system library versions.
  • Complex dependencies that don't fit cleanly into environment.dependencies.
  • An exact environment to reproduce research results.
  • Standard images built by your organization's platform or security team.

Prerequisites

Register an image

Before running a workload with a custom image, register it with air register image. Registration pulls and caches the image in the Databricks platform. Each user must register an image once per image tag. Re-register only when you push a new tag or rotate credentials. Registration takes 2–6 minutes and blocks until the image is ready.

Public images

Register public images by providing the Docker image URL and your Databricks profile:

air register image docker.io/nvidia/cuda:12.9.0-devel-ubuntu24.04 -p my-databricks-profile

The short-form image reference also works, for example library/ubuntu:latest.
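Because only Docker Hub images are supported (see Requirements), it can be useful to check a reference before starting a multi-minute registration. The following is a minimal sketch, not part of the air CLI: the helper name is hypothetical, and it applies Docker's general reference-naming convention (the first path component is a registry host only if it contains "." or ":" or is "localhost"). Whether air accepts every form that Docker does is an assumption.

```python
# Hypothetical pre-flight helper; not part of the air CLI.
DOCKER_HUB_HOSTS = {"docker.io", "index.docker.io"}


def is_docker_hub_reference(ref: str) -> bool:
    """Return True if `ref` points at Docker Hub under Docker's naming rules."""
    first, sep, _rest = ref.partition("/")
    if not sep:
        # No slash at all, e.g. "ubuntu:latest": Docker treats this as Docker Hub.
        return True
    # Docker treats the first component as a registry host only if it
    # contains "." or ":" or is exactly "localhost".
    looks_like_host = "." in first or ":" in first or first == "localhost"
    if not looks_like_host:
        # e.g. "myorg/myrepo:mytag" or "library/ubuntu:latest"
        return True
    return first in DOCKER_HUB_HOSTS
```

For example, `is_docker_hub_reference("docker.io/nvidia/cuda:12.9.0-devel-ubuntu24.04")` and `is_docker_hub_reference("library/ubuntu:latest")` both return `True`, while a reference that starts with another registry host such as `ghcr.io/` returns `False`.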

Private Docker Hub images

To register a private Docker Hub image, generate a personal access token first. In your Docker Hub account settings, click Personal access tokens > Generate new token. Read-only access is sufficient.

Choose one of the following authentication methods:

Using docker login

Log in to Docker Hub at the terminal. You will be prompted for your Docker Hub username and personal access token:

docker login

This stores your credentials in ~/.docker/config.json. Then register the image — air reads the credentials automatically:

air register image myorg/myrepo:mytag -p my-databricks-profile
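If registration fails to authenticate, one thing to check is whether docker login actually wrote a Docker Hub entry to the config file. The sketch below is an illustration under assumptions: the helper name is hypothetical, it only looks for the `https://index.docker.io/v1/` key that docker login conventionally uses for Docker Hub, and it does not claim to mirror how air reads the file. When Docker is configured with a credential helper, the entry can exist while the secret itself lives outside the file.

```python
import json
from pathlib import Path

# Key that `docker login` conventionally writes for Docker Hub.
DOCKER_HUB_AUTH_KEY = "https://index.docker.io/v1/"


def has_docker_hub_login(config_path: Path) -> bool:
    """Return True if the Docker config file has a Docker Hub entry under "auths"."""
    try:
        config = json.loads(config_path.read_text())
    except (OSError, json.JSONDecodeError):
        # Missing, unreadable, or malformed file: treat as "not logged in".
        return False
    if not isinstance(config, dict):
        return False
    auths = config.get("auths")
    return isinstance(auths, dict) and DOCKER_HUB_AUTH_KEY in auths
```

A typical call would be `has_docker_hub_login(Path.home() / ".docker" / "config.json")`.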

Using interactive authentication

Authenticate and store credentials in a Databricks secret scope in one step:

air register image myorg/myrepo:mytag --interactive-authenticate -p my-databricks-profile

You will be prompted for your Docker Hub username and personal access token. Credentials are stored in your workspace secret scope for future registrations.

Using a Databricks secret

Store credentials in a Databricks secret and reference it directly:

air register image myorg/myrepo:mytag --scope my-secret-scope --key my-docker-key -p my-databricks-profile

Use a Docker image in a workload

Specify the Docker image in your workload YAML under environment.docker_image.url:

experiment_name: my-dcs-training
environment:
  docker_image:
    url: myorg/myrepo:mytag
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: python /app/train.py

When bringing your own Docker image, environment.dependencies and environment.version are not supported. Specifying environment.docker_image.url with either field triggers an error. If you have additional dependencies, install the packages in the Dockerfile instead.
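The conflict rule above can be caught locally before submitting. This is a minimal sketch with a hypothetical helper name; it assumes the workload YAML has already been loaded into a Python dict (for example with a YAML parser of your choice) and checks only the three fields named on this page.

```python
def docker_image_conflicts(workload: dict) -> list:
    """Return the environment fields that conflict with environment.docker_image.url.

    An empty list means no conflict was found.
    """
    environment = workload.get("environment") or {}
    docker_image = environment.get("docker_image") or {}
    if not docker_image.get("url"):
        # No custom image, so dependencies and version are allowed.
        return []
    return [
        f"environment.{field}"
        for field in ("dependencies", "version")
        if field in environment
    ]
```

A workload that sets both `environment.docker_image.url` and `environment.dependencies` yields `["environment.dependencies"]`; the fix is to move those packages into the Dockerfile.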

Submit the workload:

air run --file workload.yaml -p my-databricks-profile

Environment variables injected into your container

AI Runtime injects the following environment variables into every container at runtime:

  • NUM_NODES — total number of nodes.
  • LOCAL_WORLD_SIZE — GPUs per node.
  • WORLD_SIZE — total number of processes.
  • POD_RANK — current node rank (0-indexed). Also injected as NODE_RANK.
  • LOCAL_ADDR — local node IP (multi-node only).
  • MASTER_ADDR — rank-0 coordination address (multi-node only).
  • MASTER_PORT — rank-0 coordination port (multi-node only).
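A training script can read these variables once at startup. The sketch below is one possible way to do so; the `ClusterInfo` class and `read_cluster_info` function are hypothetical names, and the only assumption beyond the list above is that the multi-node-only variables may be absent on single-node runs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClusterInfo:
    num_nodes: int
    local_world_size: int
    world_size: int
    node_rank: int
    master_addr: Optional[str]  # None when not set (single-node runs)
    master_port: Optional[int]  # None when not set (single-node runs)


def read_cluster_info(env: Mapping[str, str] = os.environ) -> ClusterInfo:
    """Parse the injected variables into typed fields."""
    port = env.get("MASTER_PORT")
    return ClusterInfo(
        num_nodes=int(env["NUM_NODES"]),
        local_world_size=int(env["LOCAL_WORLD_SIZE"]),
        world_size=int(env["WORLD_SIZE"]),
        node_rank=int(env["POD_RANK"]),
        master_addr=env.get("MASTER_ADDR"),
        master_port=int(port) if port else None,
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise outside a container.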

Examples

Single-node A10

experiment_name: my-dcs-single-node
environment:
  docker_image:
    url: myorg/myrepo:mytag
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: python3 /app/train.py

Multi-node H100 with RDMA

For multi-node H100 jobs that need full network bandwidth on AWS p5 instances, base your image on one of the Databricks base images with NCCL and EFA preconfigured:

experiment_name: my-dcs-distributed
environment:
  docker_image:
    url: myorg/myrepo:mytag
compute:
  num_accelerators: 16 # 2 nodes × 8 H100
  accelerator_type: GPU_8xH100
command: |-
  torchrun \
    --nnodes="${NUM_NODES}" \
    --nproc_per_node="${LOCAL_WORLD_SIZE}" \
    --node_rank="${POD_RANK}" \
    --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
    /app/train.py
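The comment on num_accelerators encodes a small piece of arithmetic: 16 accelerators on 8-GPU nodes means 2 nodes. The sketch below makes that explicit. It is an illustration only: the helper name is hypothetical, and the parsing assumes accelerator types follow the `GPU_<count>x<model>` pattern seen in the two names on this page (GPU_1xA10 and GPU_8xH100), which may not hold for other types.

```python
import re


def node_count(num_accelerators: int, accelerator_type: str) -> int:
    """Derive the node count from num_accelerators and an accelerator type name."""
    # Assumed naming pattern, inferred from GPU_1xA10 and GPU_8xH100.
    match = re.fullmatch(r"GPU_(\d+)x\w+", accelerator_type)
    if match is None:
        raise ValueError(f"unrecognized accelerator type: {accelerator_type!r}")
    per_node = int(match.group(1))
    if per_node == 0 or num_accelerators % per_node != 0:
        raise ValueError(
            f"num_accelerators={num_accelerators} is not a multiple of {per_node}"
        )
    return num_accelerators // per_node
```

With the values from the example above, `node_count(16, "GPU_8xH100")` returns `2`, which is the value you would expect to see in NUM_NODES at runtime.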

Build your own image

Databricks base images

Databricks publishes base images on Docker Hub at databricksruntime/air with CUDA, NCCL, and cloud-specific networking (AWS EFA or Azure InfiniBand) preconfigured.

Tag                    | Cloud | Variant | Use when
-----------------------|-------|---------|------------------------------------------
dcs-base-aws-runtime   | AWS   | Runtime | Installing pre-built wheels only
dcs-base-aws-devel     | AWS   | Devel   | Compiling CUDA extensions (requires nvcc)
dcs-base-azure-runtime | Azure | Runtime | Installing pre-built wheels only
dcs-base-azure-devel   | Azure | Devel   | Compiling CUDA extensions (requires nvcc)

Use the runtime variant unless your Dockerfile compiles CUDA extensions such as flash-attn, apex, or custom kernels.
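The selection rule can be written as a small lookup, which may be handy in build tooling that generates the FROM line. This is a sketch with a hypothetical helper name; the tag format is taken directly from the four tags listed above.

```python
def base_image(cloud: str, compiles_cuda: bool) -> str:
    """Return the Databricks base image reference for a cloud and build need."""
    cloud = cloud.lower()
    if cloud not in ("aws", "azure"):
        raise ValueError(f"no base image listed for cloud: {cloud!r}")
    # Devel includes nvcc; runtime is enough for pre-built wheels.
    variant = "devel" if compiles_cuda else "runtime"
    return f"databricksruntime/air:dcs-base-{cloud}-{variant}"
```

For instance, `base_image("aws", compiles_cuda=False)` returns `"databricksruntime/air:dcs-base-aws-runtime"`.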

The following example Dockerfile adds PyTorch to a Databricks base image. The base images provide Python at /opt/venv, managed by uv. uv pip install targets that environment by default; to use a different environment, create and activate a venv before running uv pip install.

FROM databricksruntime/air:dcs-base-aws-runtime

RUN uv pip install --no-cache \
    torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0

RUN uv pip install --no-cache \
    transformers==4.45.0 \
    accelerate==0.34.0 \
    'mlflow>=3.6'

COPY ./train /app/train

Build, push, and register:

docker build -t myorg/myrepo:mytag .
docker push myorg/myrepo:mytag
air register image myorg/myrepo:mytag --interactive-authenticate -p my-databricks-profile

Requirements

  • Images must be hosted on Docker Hub. Amazon ECR, Google GCR, and GitHub GHCR are not supported.
  • Image size must be under 20 GB.
  • WORKDIR is not honored at runtime. Use absolute paths for files baked into the image. For example, use python /app/train.py, not python train.py.
  • You cannot use environment.dependencies or environment.version with environment.docker_image.url. If you need extra packages beyond what is in the image, you must add them to the Dockerfile.
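Because WORKDIR is not honored, code inside the image should also avoid relying on the current working directory when opening its own files. One common pattern, sketched here with a hypothetical helper name, is to resolve paths relative to the script's own location instead.

```python
from pathlib import Path


def asset_path(script_file: str, relative_name: str) -> Path:
    """Build an absolute path to a file that sits next to the given script."""
    # resolve() makes the script path absolute, so the result does not
    # depend on the process's current working directory.
    return Path(script_file).resolve().parent / relative_name
```

Inside a script such as /app/train.py, calling `asset_path(__file__, "config.json")` would point at /app/config.json regardless of where the process was started. The file name config.json is only an example.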
