Create clusters, notebooks, and jobs with Terraform

Article
03/01/2024

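This article shows how to use the Databricks Terraform provider to create a cluster, a notebook, and a job in an existing Azure Databricks workspace.

This article is a companion to the following Azure Databricks getting-started articles:

- Tutorial: Run an end-to-end lakehouse analytics pipeline, which uses a cluster that works with Unity Catalog, a Python notebook, and a job to run the notebook.
- Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal, which uses a general-purpose cluster and a Python notebook.

You can also adapt the Terraform configurations in this article to create custom clusters, notebooks, and jobs in your workspaces.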

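Step 1: Create and configure the Terraform project

Create a Terraform project by following the instructions in the Requirements section of the Databricks Terraform provider overview article.

To create a cluster, create a file named cluster.tf, and add the following content to the file. This content creates a cluster with the smallest amount of resources allowed. This cluster uses the latest Databricks Runtime Long Term Support (LTS) version.

For a cluster that works with Unity Catalog: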

variable "cluster_name" {}
variable "cluster_autotermination_minutes" {}
variable "cluster_num_workers" {}
variable "cluster_data_security_mode" {}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
  data_security_mode      = var.cluster_data_security_mode
}

output "cluster_url" {
  value = databricks_cluster.this.url
}

For an all-purpose cluster:

variable "cluster_name" {
  description = "A name for the cluster."
  type        = string
  default     = "My Cluster"
}

variable "cluster_autotermination_minutes" {
  description = "How many minutes before automatically terminating due to inactivity."
  type        = number
  default     = 60
}

variable "cluster_num_workers" {
  description = "The number of workers."
  type        = number
  default     = 1
}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
}

output "cluster_url" {
  value = databricks_cluster.this.url
}

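To create a cluster, create another file named cluster.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the cluster. Replace the placeholder values with your own values.

For a cluster that works with Unity Catalog: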

cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
cluster_data_security_mode      = "SINGLE_USER"

For an all-purpose cluster:

cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1

To create a notebook, create another file named notebook.tf, and add the following content to the file:

variable "notebook_subdirectory" {
  description = "A name for the subdirectory to store the notebook."
  type        = string
  default     = "Terraform"
}

variable "notebook_filename" {
  description = "The notebook's filename."
  type        = string
}

variable "notebook_language" {
  description = "The language of the notebook."
  type        = string
}

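# This resource reads data.databricks_current_user.me, so the project must
# declare a databricks_current_user data source named "me". That data source
# is not declared in any file shown in this article.
resource "databricks_notebook" "this" {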
  path     = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
  language = var.notebook_language
  source   = "./${var.notebook_filename}"
}

output "notebook_url" {
  value = databricks_notebook.this.url
}

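If you are creating a notebook, save the following notebook code to a file in the same directory as the notebook.tf file:

For the Python notebook for Tutorial: Run an end-to-end lakehouse analytics pipeline, the file is named notebook-getting-started-lakehouse-e2e.py and has the following contents: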

# Databricks notebook source
external_location = "<your_external_location>"
catalog = "<your_catalog>"

dbutils.fs.put(f"{external_location}/foobar.txt", "Hello world!", True)
display(dbutils.fs.head(f"{external_location}/foobar.txt"))
dbutils.fs.rm(f"{external_location}/foobar.txt")

display(spark.sql(f"SHOW SCHEMAS IN {catalog}"))

# COMMAND ----------

from pyspark.sql.functions import col

# Set parameters for isolation in workspace and reset demo
username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
database = f"{catalog}.e2e_lakehouse_{username}_db"
source = f"{external_location}/e2e-lakehouse-source"
table = f"{database}.target_table"
checkpoint_path = f"{external_location}/_checkpoint/e2e-lakehouse-demo"

spark.sql(f"SET c.username='{username}'")
spark.sql(f"SET c.database={database}")
spark.sql(f"SET c.source='{source}'")

spark.sql("DROP DATABASE IF EXISTS ${c.database} CASCADE")
spark.sql("CREATE DATABASE ${c.database}")
spark.sql("USE ${c.database}")

# Clear out data from previous demo execution
dbutils.fs.rm(source, True)
dbutils.fs.rm(checkpoint_path, True)

# Define a class to load batches of data to source
class LoadData:

  def __init__(self, source):
    self.source = source

  def get_date(self):
    # Fall back to the first batch date when no source data exists yet.
    try:
      df = spark.read.format("json").load(self.source)
    except Exception:
      return "2016-01-01"
    batch_date = df.selectExpr("max(distinct(date(tpep_pickup_datetime))) + 1 day").first()[0]
    if batch_date.month == 3:
      raise Exception("Source data exhausted")
    return batch_date

  def get_batch(self, batch_date):
    return (
      spark.table("samples.nyctaxi.trips")
        .filter(col("tpep_pickup_datetime").cast("date") == batch_date)
    )

  def write_batch(self, batch):
    batch.write.format("json").mode("append").save(self.source)

  def land_batch(self):
    batch_date = self.get_date()
    batch = self.get_batch(batch_date)
    self.write_batch(batch)

RawData = LoadData(source)

# COMMAND ----------

RawData.land_batch()

# COMMAND ----------

# Import functions
from pyspark.sql.functions import col, current_timestamp

# Configure Auto Loader to ingest JSON data to a Delta table
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(source)
  .select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .option("mergeSchema", "true")
  .toTable(table))

# COMMAND ----------

df = spark.read.table(table)

# COMMAND ----------

display(df)
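
The batch-date rule in LoadData.get_date can be tried out without a Spark session. The following is a minimal sketch in plain Python, not part of the notebook. The name next_batch_date and the constant FIRST_BATCH_DATE are hypothetical, and the sketch uses date objects throughout, whereas the notebook returns the string "2016-01-01" for the first batch.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical constant: the same starting date the notebook falls back to.
FIRST_BATCH_DATE = date(2016, 1, 1)


def next_batch_date(latest_landed: Optional[date]) -> date:
    """Return the date of the next batch to land.

    latest_landed is the newest pickup date already written to the source
    directory, or None when nothing has been landed yet.
    """
    if latest_landed is None:
        return FIRST_BATCH_DATE
    candidate = latest_landed + timedelta(days=1)
    # Mirror the notebook: stop once the next batch would fall in March.
    if candidate.month == 3:
        raise Exception("Source data exhausted")
    return candidate


print(next_batch_date(None))
print(next_batch_date(date(2016, 1, 31)))
```

Each call to land_batch in the notebook therefore lands one more day of trips, until the January and February data has been written.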

For the Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal, the file is named notebook-quickstart-create-databricks-workspace-portal.py and has the following contents:

# Databricks notebook source
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Seattle"
blob_sas_token = r""

# COMMAND ----------

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name,blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

# COMMAND ----------

df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')

# COMMAND ----------

print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
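
To see which path and configuration key the quickstart notebook produces, the string formatting can be reproduced in plain Python. This is a sketch only: the helper names build_wasbs_path and sas_conf_key are hypothetical and do not appear in the notebook, and nothing here contacts Azure storage.

```python
def build_wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Build the wasbs:// URL in the same shape the notebook uses."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def sas_conf_key(container: str, account: str) -> str:
    """Build the Spark configuration key under which the notebook stores the SAS token."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


# Same values as the notebook's first cell.
path = build_wasbs_path("citydatacontainer", "azureopendatastorage", "Safety/Release/city=Seattle")
key = sas_conf_key("citydatacontainer", "azureopendatastorage")
print(path)
print(key)
```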

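If you are creating a notebook, create another file named notebook.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the notebook configuration.

For the Python notebook for Tutorial: Run an end-to-end lakehouse analytics pipeline: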
```
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-getting-started-lakehouse-e2e.py"
notebook_language     = "PYTHON"
```
For the Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal:
```
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-quickstart-create-databricks-workspace-portal.py"
notebook_language     = "PYTHON"
```
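If you are creating a notebook in your Azure Databricks workspace, be sure to set up any requirements for the notebook to run successfully, by referring to the following instructions:
- The Python notebook for Tutorial: Run an end-to-end lakehouse analytics pipeline
- The Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal

To create a job, create another file named job.tf, and add the following content to the file. This content creates a job to run the notebook.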

variable "job_name" {
  description = "A name for the job."
  type        = string
  default     = "My Job"
}

resource "databricks_job" "this" {
  name = var.job_name
  existing_cluster_id = databricks_cluster.this.cluster_id
  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
  email_notifications {
    on_success = [ data.databricks_current_user.me.user_name ]
    on_failure = [ data.databricks_current_user.me.user_name ]
  }
}

output "job_url" {
  value = databricks_job.this.url
}

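If you are creating a job, create another file named job.auto.tfvars, and add the following content to the file. This file contains a variable value for customizing the job configuration.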
```
job_name = "My Job"
```

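Step 2: Run the configurations

In this step, you run the Terraform configurations to deploy the cluster, the notebook, and the job into your Azure Databricks workspace.

Run the terraform validate command to check whether your Terraform configurations are valid. If any errors are reported, fix them, and run the command again.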
```
terraform validate
```
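
When validation runs in a script rather than interactively, the machine-readable report from terraform validate -json can be summarized. The following is a minimal sketch, assuming the report has the fields valid, error_count, and diagnostics (each diagnostic with severity and summary). The function name summarize_validation is hypothetical, and the sample reports are illustrative, not captured output.

```python
import json


def summarize_validation(json_text: str) -> str:
    """Summarize the JSON text printed by `terraform validate -json`."""
    report = json.loads(json_text)
    if report.get("valid"):
        return "configuration is valid"
    lines = [f"{report.get('error_count', 0)} error(s):"]
    for diag in report.get("diagnostics", []):
        if diag.get("severity") == "error":
            lines.append(f"- {diag.get('summary', '')}")
    return "\n".join(lines)


# Illustrative reports only.
ok_report = '{"valid": true, "error_count": 0, "warning_count": 0, "diagnostics": []}'
bad_report = """
{
  "valid": false,
  "error_count": 1,
  "warning_count": 0,
  "diagnostics": [
    {"severity": "error", "summary": "Reference to undeclared resource", "detail": ""}
  ]
}
"""
print(summarize_validation(ok_report))
print(summarize_validation(bad_report))
```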
Run the terraform plan command to see what Terraform will do in your workspace, before Terraform actually does it.
```
terraform plan
```
Run the terraform apply command to deploy the cluster, the notebook, and the job into your workspace. When prompted to deploy, type yes and press Enter.
```
terraform apply
```
Terraform deploys the resources that are specified in your project. Deploying these resources (especially a cluster) can take several minutes.

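Step 3: Explore the results

If you created a cluster, in the output of the terraform apply command, copy the link next to cluster_url, and paste it into your web browser's address bar.
If you created a notebook, in the output of the terraform apply command, copy the link next to notebook_url, and paste it into your web browser's address bar.

Note

Before you use the notebook, you might need to customize its contents. See the related documentation about how to customize the notebook.

If you created a job, in the output of the terraform apply command, copy the link next to job_url, and paste it into your web browser's address bar.

Note

Before you run the notebook, you might need to customize its contents. For related documentation about how to customize the notebook, see the links at the beginning of this article.

If you created a job, run the job as follows:
1. Click Run now on the job page.
2. After the job finishes running, to view the job run's results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook's code.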
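
The three URLs can also be read programmatically instead of being copied from the terminal. The following is a minimal sketch that parses the output of terraform output -json, which maps each output name to an object with a value field. The function names are hypothetical, read_terraform_outputs assumes the terraform executable is on the PATH and the project has already been applied, and the sample payload uses placeholder URLs.

```python
import json
import subprocess


def parse_terraform_outputs(json_text: str) -> dict:
    """Map each output name to its value from `terraform output -json` text."""
    raw = json.loads(json_text)
    return {name: entry["value"] for name, entry in raw.items()}


def read_terraform_outputs(project_dir: str = ".") -> dict:
    """Run `terraform output -json` in project_dir and return name -> value."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=project_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_terraform_outputs(result.stdout)


# Illustrative payload only; the URLs are placeholders, not real workspace links.
sample = """
{
  "cluster_url": {"sensitive": false, "type": "string", "value": "https://example.invalid/cluster"},
  "notebook_url": {"sensitive": false, "type": "string", "value": "https://example.invalid/notebook"},
  "job_url": {"sensitive": false, "type": "string", "value": "https://example.invalid/job"}
}
"""
urls = parse_terraform_outputs(sample)
for name in ("cluster_url", "notebook_url", "job_url"):
    print(f"{name}: {urls[name]}")
```

In a deployed project, calling read_terraform_outputs() from the project directory would return the same mapping with the real URLs.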

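Step 4: Clean up

In this step, you delete the preceding resources from your workspace.

Run the terraform plan command to see what Terraform will do in your workspace, before Terraform actually does it.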
```
terraform plan
```
Run the terraform destroy command to delete the cluster, the notebook, and the job from your workspace. When prompted to delete, type yes and press Enter.
```
terraform destroy
```
Terraform deletes the resources that are specified in your project.
