Customize containers with Databricks Container Services
Databricks Container Services lets you specify a Docker image when you create a cluster. Some example use cases include:
- Library customization: you have full control over the system libraries you want installed.
- Golden container environment: your Docker image is a locked down environment that will never change.
- Docker CI/CD integration: you can integrate Azure Databricks with your Docker CI/CD pipelines.
You can also use Docker images to create custom deep learning environments on clusters with GPU devices. For additional information about using GPU clusters with Databricks Container Services, see Databricks Container Services on GPU clusters.
For tasks to be executed each time the container starts, use an init script.
Requirements
Note
Databricks Runtime for Machine Learning does not support Databricks Container Services.
- Your Azure Databricks workspace must have Databricks Container Services enabled.
- Your machine must be running a recent Docker daemon (one that is tested and works with Client/Server Version 18.03.0-ce), and the docker command must be available on your PATH.
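To confirm that your machine meets these requirements, you can run the standard Docker commands below; this is a quick sanity check rather than a Databricks-specific step:

docker version                # confirms the docker CLI is on your PATH and reports client/daemon versions
docker run --rm hello-world   # confirms the daemon can pull and run a container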
Step 1: Build your base
Databricks recommends that you build your Docker base from a base that Databricks has built and tested. It is also possible to build your Docker base from scratch. This section describes the two options.
Option 1. Use a base built by Databricks
This example uses the 9.x tag for an image that will target a cluster with runtime version Databricks Runtime 9.0 and above:
FROM databricksruntime/standard:9.x
...
To specify additional Python libraries, such as the latest version of pandas and urllib, use the container-specific version of pip. For the databricksruntime/standard:9.x container, include the following:
RUN /databricks/python3/bin/pip install pandas
RUN /databricks/python3/bin/pip install urllib3
For the databricksruntime/standard:8.x container or lower, include the following:
RUN /databricks/conda/envs/dcs-minimal/bin/pip install pandas
RUN /databricks/conda/envs/dcs-minimal/bin/pip install urllib3
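Putting these pieces together, a minimal complete Dockerfile targeting Databricks Runtime 9.x might look like the following sketch; the choice of libraries is illustrative only:

FROM databricksruntime/standard:9.x
# Install additional Python libraries into the container's Python environment
RUN /databricks/python3/bin/pip install pandas urllib3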
Base images are hosted on Docker Hub at https://hub.docker.com/u/databricksruntime. The Dockerfiles used to generate these bases are at https://github.com/databricks/containers.
Note
Images hosted on Docker Hub with tags that end in the “-LTS” suffix are patched. All other images are examples and are not patched regularly.
Note
The base images databricksruntime/standard and databricksruntime/minimal are not to be confused with the unrelated databricks-standard and databricks-minimal environments included in the no longer available Databricks Runtime with Conda (Beta).
Option 2. Build your own Docker base
You can also build your Docker base from scratch. The Docker image must meet these requirements:
- JDK 8u191 as Java on the system PATH
- bash
- iproute2 (ubuntu iproute)
- coreutils (ubuntu coreutils)
- procps (ubuntu procps)
- sudo (ubuntu sudo)
- Ubuntu Linux
To build your own image from scratch, you must create the virtual environment. You must also include packages that are built into Databricks clusters, such as Python and R. To get started, you can use the appropriate base image:
- For R: databricksruntime/rbase
- For Python: databricksruntime/python
- For the minimal image built by Databricks: databricksruntime/minimal
You can also refer to the example Dockerfiles in GitHub.
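For example, a minimal sketch of a Dockerfile that starts from the Python base image and adds an Ubuntu system library might look like this; the tag placeholder and the libpq-dev package are illustrative assumptions:

FROM databricksruntime/python:<tag>
# Add an extra Ubuntu system library on top of the Python base (illustrative)
RUN apt-get update && apt-get install --yes libpq-dev && rm -rf /var/lib/apt/lists/*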
Note
Databricks recommends using Ubuntu Linux; however, it is possible to use Alpine Linux. To use Alpine Linux, you must include these files:
In addition, you must set up Python, as shown in this example Dockerfile.
Warning
Test your custom container image thoroughly on an Azure Databricks cluster. Your container may work on a local or build machine, but when your container is launched on an Azure Databricks cluster, the cluster launch may fail, certain features may become disabled, or your container may stop working, even silently. In worst-case scenarios, it could corrupt your data or accidentally expose your data to external parties.
Step 2: Push your base image
Push your custom base image to a Docker registry. This process is supported with the following registries:
- Docker Hub with no auth or basic auth.
- Azure Container Registry with basic auth.
Other Docker registries that support no auth or basic auth are also expected to work.
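As a sketch, pushing a locally built image to Azure Container Registry might look like the following; the image, registry, and repository names are placeholders:

docker login <your-registry-name>.azurecr.io   # prompts for basic auth credentials
docker tag my-custom-image:latest <your-registry-name>.azurecr.io/<repository-name>:<tag>
docker push <your-registry-name>.azurecr.io/<repository-name>:<tag>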
Note
If you use Docker Hub for your Docker registry, be sure to check that rate limits accommodate the number of clusters that you expect to launch in a six-hour period. These rate limits are different for anonymous users, authenticated users without a paid subscription, and paid subscriptions. See the Docker documentation for details. If this limit is exceeded, you will get a “429 Too Many Requests” response.
Step 3: Launch your cluster
You can launch your cluster using the UI or the API.
Launch your cluster using the UI
1. On the Create Cluster page, specify a Databricks Runtime Version that supports Databricks Container Services.
2. Under Advanced options, select the Docker tab.
3. Select Use your own Docker container.
4. In the Docker Image URL field, enter your custom Docker image.
Docker image URL examples:

| Registry | Tag format |
|---|---|
| Docker Hub | <organization>/<repository>:<tag> (for example: databricksruntime/standard:latest) |
| Azure Container Registry | <your-registry-name>.azurecr.io/<repository-name>:<tag> |
5. Select the authentication type.
Launch your cluster using the API
Use the Clusters API 2.0 to launch a cluster with your custom Docker base.
curl -X POST -H "Authorization: Bearer <token>" https://<databricks-instance>/api/2.0/clusters/create -d '{
  "cluster_name": "<cluster-name>",
  "num_workers": 0,
  "node_type_id": "Standard_DS3_v2",
  "docker_image": {
    "url": "databricksruntime/standard:latest",
    "basic_auth": {
      "username": "<docker-registry-username>",
      "password": "<docker-registry-password>"
    }
  },
  "spark_version": "7.3.x-scala2.12"
}'
basic_auth requirements depend on your Docker image type:
- For public Docker images, do not include the basic_auth field.
- For private Docker images, you must include the basic_auth field, using a service principal ID and password as the username and password.
- For Azure Container Registry, you must set the basic_auth field to the ID and password for a service principal. See the Azure Container Registry service principal authentication documentation for information about creating the service principal.
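As an illustrative sketch of the Azure Container Registry case, creating a service principal with pull access using the Azure CLI might look like the following; the names are placeholders and your Azure CLI version may differ:

az acr show --name <your-registry-name> --query id --output tsv   # look up the registry's resource ID
# The output's appId is the basic_auth username; the generated password is the basic_auth password
az ad sp create-for-rbac --name <service-principal-name> --scopes <acr-resource-id> --role acrpull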
Use an init script
Databricks Container Services clusters enable customers to include init scripts in the Docker container. In most cases, you should avoid init scripts and instead make customizations through Docker directly (using the Dockerfile). However, certain tasks must be executed when the container starts, instead of when the container is built. Use an init script for these tasks.
For example, suppose you want to run a security daemon inside a custom container. Install and build the daemon in the Docker image through your image building pipeline. Then, add an init script that starts the daemon. In this example, the init script would include a line like systemctl start my-daemon.
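A minimal sketch of such an init script, assuming the my-daemon service was installed into the image at build time, might look like:

#!/bin/bash
# Runs each time the container starts; fail fast if the daemon cannot start
set -e
systemctl start my-daemon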
In the API, you can specify init scripts as part of the cluster spec as follows. For more information, see InitScriptInfo.
"init_scripts": [
{
"file": {
"destination": "file:/my/local/file.sh"
}
}
]
For Databricks Container Services images, you can also store init scripts in DBFS or cloud storage.
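For example, a sketch of the same spec fragment referencing an init script stored in DBFS might look like the following; the path is a placeholder:

"init_scripts": [
  {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/my-init.sh"
    }
  }
]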
The following steps take place when you launch a Databricks Container Services cluster:
- VMs are acquired from the cloud provider.
- The custom Docker image is downloaded from your repo.
- Azure Databricks creates a Docker container from the image.
- Databricks Runtime code is copied into the Docker container.
- The init scripts are executed. See Init script execution order.
Azure Databricks ignores the Docker CMD and ENTRYPOINT primitives.