Bundle configuration examples

Article
10/15/2024

This article provides example configurations for Databricks Asset Bundles features and for common bundle use cases.

Tip

Some of the examples in this article, along with others, can be found in the bundle examples repository.

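Job that uses serverless compute

Databricks Asset Bundles support jobs that run on serverless compute. To configure this, either omit the clusters setting for the job or specify an environment, as shown in the examples below.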

# A serverless job (no cluster definition)
resources:
  jobs:
    serverless_job_no_cluster:
      name: serverless_job_no_cluster

      email_notifications:
        on_failure:
          - someone@example.com

      tasks:
        - task_key: notebook_task
          notebook_task:
            notebook_path: ../src/notebook.ipynb

# A serverless job (environment spec)
resources:
  jobs:
    serverless_job_environment:
      name: serverless_job_environment

      tasks:
        - task_key: task
          spark_python_task:
            python_file: ../src/main.py

          # The key that references an environment spec in a job.
          environment_key: default

      # A list of task execution environment specifications that can be referenced by tasks of this job.
      environments:
        - environment_key: default

          # Full documentation of this spec can be found at:
          # https://docs.databricks.com/api/workspace/jobs/create#environments-spec
          spec:
            client: "1"
            dependencies:
              - cowsay
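
The dependencies list in the environment spec takes pip-style requirement strings, so a version can be pinned there. The fragment below is a minimal sketch of only the environments block of the job above; the version number is an illustrative assumption, not something the original example specifies.

```yaml
# Sketch: the environments block of serverless_job_environment with a pinned dependency.
# This fragment belongs under the job definition, at the same level as "tasks".
environments:
  - environment_key: default
    spec:
      client: "1"
      dependencies:
        # Pinned requirement; the version shown here is illustrative only.
        - cowsay==6.1
```

Pinning versions keeps repeated runs of the job on the same library release rather than on whatever is newest at run time.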

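Pipeline that uses serverless compute

Databricks Asset Bundles support pipelines that run on serverless compute. To configure this, set the pipeline's serverless setting to true. The following example configuration defines a pipeline that runs on serverless compute and a job that triggers a refresh of the pipeline every hour.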

# A pipeline that runs on serverless compute
resources:
  pipelines:
    my_pipeline:
      name: my_pipeline
      target: ${bundle.environment}
      serverless: true
      catalog: users
      libraries:
        - notebook:
            path: ../src/my_pipeline.ipynb

      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src

# This defines a job to refresh a pipeline that is triggered every hour
resources:
  jobs:
    my_job:
      name: my_job

      # Run this job once an hour.
      trigger:
        periodic:
          interval: 1
          unit: HOURS

      email_notifications:
        on_failure:
          - someone@example.com

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
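
As an alternative to the periodic trigger, the same hourly refresh could be expressed with a cron-based schedule. The following is a sketch only: the job key my_job_cron and the UTC time zone are assumptions made for illustration, while schedule, quartz_cron_expression, and timezone_id are fields of the Jobs API job settings.

```yaml
# Sketch: an hourly pipeline refresh driven by a Quartz cron schedule
# instead of a periodic trigger. The job key "my_job_cron" is hypothetical.
resources:
  jobs:
    my_job_cron:
      name: my_job_cron

      # Fire at minute 0 of every hour (Quartz syntax: sec min hour day-of-month month day-of-week).
      schedule:
        quartz_cron_expression: "0 0 * * * ?"
        timezone_id: UTC

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
```

A cron schedule may be preferable when runs need to start at a fixed wall-clock time rather than at a fixed interval.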

Job with a SQL notebook

The following example configuration defines a job with a SQL notebook.

resources:
  jobs:
    job_with_sql_notebook:
      name: Job to demonstrate using a SQL notebook with a SQL warehouse

      tasks:
        - task_key: notebook
          notebook_task:
            notebook_path: ./select.sql
            warehouse_id: 799f096837fzzzz4
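
The warehouse ID above is hard-coded. One way to avoid that is a bundle variable, sketched below. The variable name warehouse_id and its description are assumptions for illustration; the default simply reuses the placeholder ID from the example above.

```yaml
# Sketch: the same job with the warehouse ID supplied through a bundle variable.
# The variable name "warehouse_id" is chosen for illustration.
variables:
  warehouse_id:
    description: ID of the SQL warehouse that runs the notebook
    default: 799f096837fzzzz4

resources:
  jobs:
    job_with_sql_notebook:
      name: Job to demonstrate using a SQL notebook with a SQL warehouse

      tasks:
        - task_key: notebook
          notebook_task:
            notebook_path: ./select.sql
            # Resolved from the variable declared above.
            warehouse_id: ${var.warehouse_id}
```

With this layout, each target or deployment can supply a different warehouse without editing the job definition.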

Job with multiple wheel files

The following example configuration defines a bundle that contains a job with multiple *.whl files.

# job.yml
resources:
  jobs:
    example_job:
      name: "Example with multiple wheels"
      tasks:
        - task_key: task

          spark_python_task:
            python_file: ../src/call_wheel.py

          libraries:
            - whl: ../my_custom_wheel1/dist/*.whl
            - whl: ../my_custom_wheel2/dist/*.whl

          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 0
            spark_version: 14.3.x-scala2.12
            spark_conf:
              "spark.databricks.cluster.profile": "singleNode"
              "spark.master": "local[*, 4]"
            custom_tags:
              "ResourceClass": "SingleNode"

# databricks.yml
bundle:
  name: job_with_multiple_wheels

include:
  - ./resources/job.yml

workspace:
  host: https://myworkspace.cloud.databricks.com

artifacts:
  my_custom_wheel1:
    type: whl
    build: poetry build
    path: ./my_custom_wheel1

  my_custom_wheel2:
    type: whl
    build: poetry build
    path: ./my_custom_wheel2

targets:
  dev:
    default: true
    mode: development

Job that uses a requirements.txt file

The following example configuration defines a job that uses a requirements.txt file.

resources:
  jobs:
    job_with_requirements_txt:
      name: Example job that uses a requirements.txt file

      tasks:
        - task_key: task
          # A job cluster with this key must be defined under this job's job_clusters mapping.
          job_cluster_key: default
          spark_python_task:
            python_file: ../src/main.py
          libraries:
            - requirements: /Workspace/${workspace.file_path}/requirements.txt
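
The task above refers to a job cluster through job_cluster_key: default, but the example does not include that cluster's definition. A minimal sketch of what it might look like follows. The cluster settings are assumptions that reuse the Spark version and node type from the other examples in this article, and the worker count is illustrative.

```yaml
# Sketch: a job_clusters entry that the task's job_cluster_key could refer to.
# Cluster settings are assumed values borrowed from other examples in this article.
resources:
  jobs:
    job_with_requirements_txt:
      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 1
```

In practice this mapping would sit in the same job definition as the tasks list shown above.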

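Bundle that uploads a JAR file to Unity Catalog

You can specify a Unity Catalog volume as the artifact path so that all artifacts, such as JAR files and wheel files, are uploaded to that volume. The following example bundle uploads a JAR file to Unity Catalog. For more information about the artifact_path mapping, see artifact_path.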

bundle:
  name: jar-bundle

workspace:
  host: https://myworkspace.cloud.databricks.com
  artifact_path: /Volumes/main/default/my_volume

artifacts:
  my_java_code:
    path: ./sample-java
    build: "javac PrintArgs.java && jar cvfm PrintArgs.jar META-INF/MANIFEST.MF PrintArgs.class"
    files:
      - source: ./sample-java/PrintArgs.jar

resources:
  jobs:
    jar_job:
      name: "Spark Jar Job"
      tasks:
        - task_key: SparkJarTask
          new_cluster:
            num_workers: 1
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
          spark_jar_task:
            main_class_name: PrintArgs
          libraries:
            - jar: ./sample-java/PrintArgs.jar
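
Because artifact_path is part of the workspace mapping, it can also be set per target, so that each deployment target uploads to its own volume. The fragment below is a sketch under that assumption; the target names and volume names are hypothetical and are not part of the original example.

```yaml
# Sketch: a different Unity Catalog volume per target.
# Target names and volume paths are hypothetical.
targets:
  dev:
    default: true
    workspace:
      artifact_path: /Volumes/main/default/my_volume_dev

  prod:
    workspace:
      artifact_path: /Volumes/main/default/my_volume_prod
```

Separate volumes keep development builds of the JAR from overwriting the artifact that a production job uses.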
