This article describes the syntax for the Databricks Asset Bundle configuration file, which defines a Databricks Asset Bundle. See What are Databricks Asset Bundles?.
To create and work with bundles, see Develop Databricks Asset Bundles.
For the bundle configuration reference, see Configuration reference.
databricks.yml
A bundle must contain one (and only one) configuration file named databricks.yml at the root of the bundle project folder.
databricks.yml is the main configuration file that defines a bundle, but it can reference other configuration files, such as resource configuration files, in the include mapping. Bundle configuration is expressed in YAML. For more information about YAML, see the official YAML specification.
The simplest databricks.yml defines the bundle name, within the required top-level bundle mapping, and a target deployment.
bundle:
  name: my_bundle

targets:
  dev:
    default: true
For details about all of the top-level mappings, see Configuration reference.
Tip
Python support for Databricks Asset Bundles allows you to define resources in Python. See Bundle configuration in Python.
Specification
The following YAML specification provides the top-level configuration keys for Databricks Asset Bundles. For the configuration reference, see Configuration reference.
# This is the default bundle configuration if not otherwise overridden in
# the "targets" top-level mapping.
bundle: # Required.
  name: string # Required.
  databricks_cli_version: string
  cluster_id: string
  deployment: Map
  git:
    origin_url: string
    branch: string

# This is the identity to use to run the bundle
run_as:
  - user_name: <user-name>
  - service_principal_name: <service-principal-name>

# These are any additional configuration files to include.
include:
  - '<some-file-or-path-glob-to-include>'
  - '<another-file-or-path-glob-to-include>'

# These are any scripts that can be run.
scripts:
  <some-unique-script-name>:
    content: string

# These are any additional files or paths to include or exclude.
sync:
  include:
    - '<some-file-or-path-glob-to-include>'
    - '<another-file-or-path-glob-to-include>'
  exclude:
    - '<some-file-or-path-glob-to-exclude>'
    - '<another-file-or-path-glob-to-exclude>'
  paths:
    - '<some-file-or-path-to-synchronize>'

# These are the default artifact settings if not otherwise overridden in
# the targets top-level mapping.
artifacts:
  <some-unique-artifact-identifier>:
    build: string
    dynamic_version: boolean
    executable: string
    files:
      - source: string
    path: string
    type: string

# These are for any custom variables for use throughout the bundle.
variables:
  <some-unique-variable-name>:
    description: string
    default: string or complex
    lookup: Map
    type: string # The only valid value is "complex" if the variable is a complex variable, otherwise do not define this key.

# These are the default workspace settings if not otherwise overridden in
# the targets top-level mapping.
workspace:
  artifact_path: string
  auth_type: string
  azure_client_id: string # For Azure Databricks only.
  azure_environment: string # For Azure Databricks only.
  azure_login_app_id: string # For Azure Databricks only. Reserved for future use.
  azure_tenant_id: string # For Azure Databricks only.
  azure_use_msi: true | false # For Azure Databricks only.
  azure_workspace_resource_id: string # For Azure Databricks only.
  client_id: string # For Databricks on AWS only.
  file_path: string
  google_service_account: string # For Databricks on Google Cloud only.
  host: string
  profile: string
  resource_path: string
  root_path: string
  state_path: string

# These are the permissions to apply to resources defined
# in the resources mapping.
permissions:
  - level: <permission-level>
    group_name: <unique-group-name>
  - level: <permission-level>
    user_name: <unique-user-name>
  - level: <permission-level>
    service_principal_name: <unique-principal-name>

# These are the resource settings if not otherwise overridden in
# the targets top-level mapping.
resources:
  apps:
    <unique-app-name>:
      # See the REST API create request payload reference for apps.
  clusters:
    <unique-cluster-name>:
      # See the REST API create request payload reference for clusters.
  dashboards:
    <unique-dashboard-name>:
      # See the REST API create request payload reference for dashboards.
  experiments:
    <unique-experiment-name>:
      # See the REST API create request payload reference for experiments.
  jobs:
    <unique-job-name>:
      # See the REST API create request payload reference for jobs.
  model_serving_endpoints:
    <unique-model-serving-endpoint-name>:
      # See the model serving endpoint request payload reference.
  models:
    <unique-model-name>:
      # See the REST API create request payload reference for models (legacy).
  pipelines:
    <unique-pipeline-name>:
      # See the REST API create request payload reference for pipelines.
  quality_monitors:
    <unique-quality-monitor-name>:
      # See the quality monitor request payload reference.
  registered_models:
    <unique-registered-model-name>:
      # See the registered model request payload reference.
  schemas:
    <unique-schema-name>:
      # See the Unity Catalog schema request payload reference.
  secret_scopes:
    <unique-secret-scope-name>:
      # See the secret scope request payload reference.
  volumes:
    <unique-volume-name>:
      # See the Unity Catalog volume request payload reference.

# These are the targets to use for deployments and workflow runs. One and only one of these
# targets can be set to "default: true".
targets:
  <some-unique-programmatic-identifier-for-this-target>:
    artifacts:
      # See the preceding "artifacts" syntax.
    bundle:
      # See the preceding "bundle" syntax.
    default: boolean
    git: Map
    mode: string
    permissions:
      # See the preceding "permissions" syntax.
    presets:
      <preset>: <value>
    resources:
      # See the preceding "resources" syntax.
    sync:
      # See the preceding "sync" syntax.
    variables:
      <preceding-unique-variable-name>: <non-default-value>
    workspace:
      # See the preceding "workspace" syntax.
    run_as:
      # See the preceding "run_as" syntax.
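For example, a custom variable declared in the variables mapping can be referenced elsewhere in the configuration using ${var.<variable-name>} substitution and overridden per target. The following minimal sketch illustrates this; the bundle name, the variable name my_cluster_id, and the job shown are illustrative placeholders, not part of the specification above:

bundle:
  name: variable-example-bundle

variables:
  my_cluster_id:
    description: The ID of the existing cluster to run the job on.
    default: 1234-567890-abcde123

resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: my-task
          # Resolved from the variable's default, or from a target override.
          existing_cluster_id: ${var.my_cluster_id}
          notebook_task:
            notebook_path: ./my_notebook.py

targets:
  dev:
    default: true
  prod:
    variables:
      my_cluster_id: 2345-678901-fabcd456

Deploying with -t prod substitutes the prod value for ${var.my_cluster_id}; deploying to dev uses the default.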
Examples
This section contains some basic examples to help you understand how bundles work and how to structure your configuration.
Note
For configuration examples that demonstrate bundle features and common bundle use cases, see Bundle configuration examples and the bundle examples repository on GitHub.
The following example bundle configuration specifies a local file named hello.py that is located in the same directory as the bundle configuration file databricks.yml. It runs this notebook as a job using the remote cluster with the specified cluster ID. The remote workspace URL and workspace authentication credentials are read from the caller's local configuration profile named DEFAULT.
bundle:
  name: hello-bundle

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

targets:
  dev:
    default: true
The following example adds a target with the name prod that uses a different remote workspace URL and workspace authentication credentials, which are read from the host entry in the caller's .databrickscfg file that matches the specified workspace URL. This job runs the same notebook but uses a different remote cluster with the specified cluster ID.
Note
Databricks recommends that you use the host mapping instead of the default mapping wherever possible, as this makes your bundle configuration files more portable. Setting the host mapping instructs the Databricks CLI to find a matching profile in your .databrickscfg file and then use that profile's fields to determine which Databricks authentication type to use. If multiple profiles with a matching host field exist, you must use the --profile option on bundle commands to specify which profile to use.
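For example, suppose the caller's .databrickscfg file contains an entry whose host matches the prod workspace URL. The profile name prod-admin and the token placeholder below are illustrative only:

[prod-admin]
host  = https://<production-workspace-url>
token = <personal-access-token>

If more than one profile shares that host value, pass the --profile option on the bundle command to pick one explicitly:

databricks bundle deploy -t prod --profile prod-admin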
Note that you do not need to declare the notebook_task mapping within the prod mapping, because it falls back to using the notebook_task mapping within the top-level resources mapping if the notebook_task mapping is not explicitly overridden in the prod target's resources mapping.
bundle:
  name: hello-bundle

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

targets:
  dev:
    default: true
  prod:
    workspace:
      host: https://<production-workspace-url>
    resources:
      jobs:
        hello-job:
          name: hello-job
          tasks:
            - task_key: hello-task
              existing_cluster_id: 2345-678901-fabcd456
Use the following bundle commands to validate, deploy, and run this job within the dev target. For details about the bundle lifecycle, see Develop Databricks Asset Bundles.
# Because the "dev" target is set to "default: true",
# you do not need to specify "-t dev":
databricks bundle validate
databricks bundle deploy
databricks bundle run hello-job
# But you can still explicitly specify it, if you want or need to:
databricks bundle validate
databricks bundle deploy -t dev
databricks bundle run -t dev hello-job
To validate, deploy, and run this job within the prod target instead:
# You must specify "-t prod", because the "dev" target
# is already set to "default: true":
databricks bundle validate
databricks bundle deploy -t prod
databricks bundle run -t prod hello-job
For more modularization and better reuse of definitions and settings across bundles, split your bundle configuration into separate files:
# databricks.yml
bundle:
  name: hello-bundle

include:
  - '*.yml'

# hello-job.yml
resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

# targets.yml
targets:
  dev:
    default: true
  prod:
    workspace:
      host: https://<production-workspace-url>
    resources:
      jobs:
        hello-job:
          name: hello-job
          tasks:
            - task_key: hello-task
              existing_cluster_id: 2345-678901-fabcd456
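Because the include glob in databricks.yml pulls hello-job.yml and targets.yml into the bundle, the lifecycle commands shown earlier work unchanged with the split configuration, for example:

databricks bundle validate
databricks bundle deploy -t prod
databricks bundle run -t prod hello-job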