Databricks Asset Bundle configuration

This article describes the syntax for Databricks Asset Bundle configuration files, which define Databricks Asset Bundles. See What are Databricks Asset Bundles?

To create and work with bundles, see Develop Databricks Asset Bundles.

For the bundle configuration reference, see Configuration reference.

databricks.yml

A bundle must contain one (and only one) configuration file named databricks.yml at the root of the bundle project folder. databricks.yml is the main configuration file that defines the bundle, but it can reference other configuration files, such as resource configuration files, in the include mapping. Bundle configuration is expressed in YAML. For more information about YAML, see the official YAML specification.

A minimal databricks.yml defines the bundle name, inside the required top-level bundle mapping, and a target deployment.

bundle:
  name: my_bundle

targets:
  dev:
    default: true

For details about all top-level mappings, see Configuration reference.

Tip

With Python support for Databricks Asset Bundles, you can define resources in Python. See Bundle configuration in Python.

Specification

The following YAML specification provides the top-level configuration keys for a Databricks Asset Bundle. For the configuration reference, see Configuration reference.

# This is the default bundle configuration if not otherwise overridden in
# the "targets" top-level mapping.
bundle: # Required.
  name: string # Required.
  databricks_cli_version: string
  cluster_id: string
  deployment: Map
  git:
    origin_url: string
    branch: string

# This is the identity to use to run the bundle. Specify either
# user_name or service_principal_name, but not both.
run_as:
  user_name: <user-name>
  # service_principal_name: <service-principal-name>

# These are any additional configuration files to include.
include:
  - '<some-file-or-path-glob-to-include>'
  - '<another-file-or-path-glob-to-include>'

# These are any scripts that can be run.
scripts:
  <some-unique-script-name>:
    content: string

# These are any additional files or paths to include or exclude.
sync:
  include:
    - '<some-file-or-path-glob-to-include>'
    - '<another-file-or-path-glob-to-include>'
  exclude:
    - '<some-file-or-path-glob-to-exclude>'
    - '<another-file-or-path-glob-to-exclude>'
  paths:
    - '<some-file-or-path-to-synchronize>'

# These are the default artifact settings if not otherwise overridden in
# the targets top-level mapping.
artifacts:
  <some-unique-artifact-identifier>:
    build: string
    dynamic_version: boolean
    executable: string
    files:
      - source: string
    path: string
    type: string

# These are for any custom variables for use throughout the bundle.
variables:
  <some-unique-variable-name>:
    description: string
    default: string or complex
    lookup: Map
    type: string # The only valid value is "complex" if the variable is a complex variable, otherwise do not define this key.

# These are the default workspace settings if not otherwise overridden in
# the targets top-level mapping.
workspace:
  artifact_path: string
  auth_type: string
  azure_client_id: string # For Azure Databricks only.
  azure_environment: string # For Azure Databricks only.
  azure_login_app_id: string # For Azure Databricks only. Reserved for future use.
  azure_tenant_id: string # For Azure Databricks only.
  azure_use_msi: true | false # For Azure Databricks only.
  azure_workspace_resource_id: string # For Azure Databricks only.
  client_id: string # For Databricks on AWS only.
  file_path: string
  google_service_account: string # For Databricks on Google Cloud only.
  host: string
  profile: string
  resource_path: string
  root_path: string
  state_path: string

# These are the permissions to apply to resources defined
# in the resources mapping.
permissions:
  - level: <permission-level>
    group_name: <unique-group-name>
  - level: <permission-level>
    user_name: <unique-user-name>
  - level: <permission-level>
    service_principal_name: <unique-principal-name>

# These are the resource settings if not otherwise overridden in
# the targets top-level mapping.
resources:
  apps:
    <unique-app-name>:
      # See the REST API create request payload reference for apps.
  clusters:
    <unique-cluster-name>:
      # See the REST API create request payload reference for clusters.
  dashboards:
    <unique-dashboard-name>:
      # See the REST API create request payload reference for dashboards.
  experiments:
    <unique-experiment-name>:
      # See the REST API create request payload reference for experiments.
  jobs:
    <unique-job-name>:
      # See REST API create request payload reference for jobs.
  model_serving_endpoints:
    <unique-model-serving-endpoint-name>:
      # See the model serving endpoint request payload reference.
  models:
    <unique-model-name>:
      # See the REST API create request payload reference for models (legacy).
  pipelines:
    <unique-pipeline-name>:
      # See the REST API create request payload reference for Lakeflow Declarative Pipelines (pipelines).
  quality_monitors:
    <unique-quality-monitor-name>:
      # See the quality monitor request payload reference.
  registered_models:
    <unique-registered-model-name>:
      # See the registered model request payload reference.
  schemas:
    <unique-schema-name>:
      # See the Unity Catalog schema request payload reference.
  secret_scopes:
    <unique-secret-scope-name>:
      # See the secret scope request payload reference.
  volumes:
    <unique-volume-name>:
      # See the Unity Catalog volume request payload reference.

# These are the targets to use for deployments and workflow runs. One and only one of these
# targets can be set to "default: true".
targets:
  <some-unique-programmatic-identifier-for-this-target>:
    artifacts:
      # See the preceding "artifacts" syntax.
    bundle:
      # See the preceding "bundle" syntax.
    default: boolean
    git: Map
    mode: string
    permissions:
      # See the preceding "permissions" syntax.
    presets:
      <preset>: <value>
    resources:
      # See the preceding "resources" syntax.
    sync:
      # See the preceding "sync" syntax.
    variables:
      <preceding-unique-variable-name>: <non-default-value>
    workspace:
      # See the preceding "workspace" syntax.
    run_as:
      # See the preceding "run_as" syntax.
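The variables mapping in the specification above is used together with the ${var.<name>} substitution syntax. The following fragment is an illustrative sketch, not part of the official examples: the variable name default_cluster_id, the job name, and the cluster ID are all hypothetical.

```yaml
variables:
  default_cluster_id:
    description: The ID of an existing cluster to run the job on.
    default: 1234-567890-abcde123

resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: my-task
          # Replaced at deployment time with the variable's value.
          existing_cluster_id: ${var.default_cluster_id}
```

A target can override the default in its own variables mapping, or a value can be supplied on the command line, for example databricks bundle deploy --var="default_cluster_id=2345-678901-fabcd456".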

Examples

This section contains some basic examples to help you understand how bundles work and how to structure configuration.

Note

For configuration examples that demonstrate bundle features and common bundle use cases, see Bundle configuration examples and the bundle examples repository in GitHub.

The following example bundle configuration specifies a local file named hello.py that is located in the same directory as the bundle configuration file databricks.yml. It runs this notebook as a job, using the remote cluster with the specified cluster ID. The remote workspace URL and workspace authentication credentials are read from the caller's local configuration profile named DEFAULT.

bundle:
  name: hello-bundle

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

targets:
  dev:
    default: true

The following example adds a target named prod that uses a different remote workspace URL and different workspace authentication credentials, which are read from the entry in the caller's .databrickscfg file whose host field matches the specified workspace URL. This job runs the same notebook but uses a different remote cluster with the specified cluster ID.

Note

Databricks recommends using the host mapping instead of the default mapping wherever possible, as this makes your bundle configuration files more portable. Setting the host mapping instructs the Databricks CLI to find a matching profile in your .databrickscfg file and then use that profile's fields to determine which Databricks authentication type to use. If multiple profiles with a matching host field exist, then you must use the --profile option on bundle commands to specify the profile to use.
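To make the ambiguity concrete, here is a sketch of a caller's .databrickscfg file containing two profiles that point at the same workspace host; the profile names, host, and token placeholders are all hypothetical.

```ini
# ~/.databrickscfg
[PROD-ADMIN]
host  = https://my-prod-workspace.cloud.databricks.com
token = <admin-personal-access-token>

[PROD-DEPLOY]
host  = https://my-prod-workspace.cloud.databricks.com
token = <deploy-personal-access-token>
```

Because both profiles match the target's host, you would pick one explicitly, for example: databricks bundle deploy -t prod --profile PROD-DEPLOY.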

Notice that you do not need to declare the notebook_task mapping within the prod mapping, because it falls back to using the notebook_task mapping within the top-level resources mapping if the notebook_task mapping is not explicitly overridden within the prod mapping.

bundle:
  name: hello-bundle

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

targets:
  dev:
    default: true
  prod:
    workspace:
      host: https://<production-workspace-url>
    resources:
      jobs:
        hello-job:
          name: hello-job
          tasks:
            - task_key: hello-task
              existing_cluster_id: 2345-678901-fabcd456

Use the following bundle commands to validate, deploy, and run this job within the dev target. For details about the bundle lifecycle, see Develop Databricks Asset Bundles.

# Because the "dev" target is set to "default: true",
# you do not need to specify "-t dev":
databricks bundle validate
databricks bundle deploy
databricks bundle run hello-job

# But you can still explicitly specify it, if you want or need to:
databricks bundle validate
databricks bundle deploy -t dev
databricks bundle run -t dev hello-job

To validate, deploy, and run this job within the prod target instead:

# You must specify "-t prod", because the "dev" target
# is already set to "default: true":
databricks bundle validate
databricks bundle deploy -t prod
databricks bundle run -t prod hello-job

To modularize and to better reuse definitions and settings across bundles, split your bundle configuration into separate files:

# databricks.yml

bundle:
  name: hello-bundle

include:
  - '*.yml'

# hello-job.yml

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

# targets.yml

targets:
  dev:
    default: true
  prod:
    workspace:
      host: https://<production-workspace-url>
    resources:
      jobs:
        hello-job:
          name: hello-job
          tasks:
            - task_key: hello-task
              existing_cluster_id: 2345-678901-fabcd456