Create clusters, notebooks, and jobs with Terraform

Article
10/04/2024

This article shows how to use the Databricks Terraform provider to create a cluster, a notebook, and a job in an existing Azure Databricks workspace.

This article is a companion to the following Azure Databricks getting started articles:

- Tutorial: Run an end-to-end lakehouse analytics pipeline, which uses a cluster that works with Unity Catalog, a Python notebook, and a job to run the notebook.
- Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal, which uses an all-purpose cluster and a Python notebook.

You can also adapt the Terraform configurations in this article to create custom clusters, notebooks, and jobs in your workspaces.

Step 1: Create and configure the Terraform project

Create a Terraform project by following the instructions in the Requirements section of the Databricks Terraform provider overview article.

To create a cluster, create a file named cluster.tf, and add the following content to the file. This content creates a cluster that uses the smallest available node type with a local disk and the latest Databricks Runtime Long Term Support (LTS) version.

For a cluster that works with Unity Catalog:

variable "cluster_name" {}
variable "cluster_autotermination_minutes" {}
variable "cluster_num_workers" {}
variable "cluster_data_security_mode" {}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
  data_security_mode      = var.cluster_data_security_mode
}

output "cluster_url" {
  value = databricks_cluster.this.url
}
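
The untyped cluster_data_security_mode declaration above accepts any value. As a hedged sketch, a typed declaration with a validation block could take its place, so that a mistyped value is reported by Terraform before it reaches the workspace. The list of allowed values below is an assumption for illustration, limited to two access modes commonly used with Unity Catalog; check the provider documentation for the full set.

```hcl
# Hypothetical replacement for the untyped declaration in cluster.tf.
variable "cluster_data_security_mode" {
  description = "The access mode for the cluster."
  type        = string

  validation {
    # Assumed allow-list for illustration only; extend it as needed.
    condition     = contains(["SINGLE_USER", "USER_ISOLATION"], var.cluster_data_security_mode)
    error_message = "The cluster_data_security_mode value must be SINGLE_USER or USER_ISOLATION."
  }
}
```

Use this in place of the original declaration rather than alongside it, because Terraform rejects two declarations of the same variable.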

For an all-purpose cluster:

variable "cluster_name" {
  description = "A name for the cluster."
  type        = string
  default     = "My Cluster"
}

variable "cluster_autotermination_minutes" {
  description = "How many minutes before automatically terminating due to inactivity."
  type        = number
  default     = 60
}

variable "cluster_num_workers" {
  description = "The number of workers."
  type        = number
  default     = 1
}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
}

output "cluster_url" {
  value = databricks_cluster.this.url
}
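
As one way to adapt the all-purpose configuration above, the fixed worker count could be swapped for autoscaling. The following is a minimal sketch, not the configuration that the companion articles use. The variable names cluster_min_workers and cluster_max_workers and their defaults are hypothetical.

```hcl
# Hypothetical variables; these names are not part of the original configuration.
variable "cluster_min_workers" {
  description = "The minimum number of workers when autoscaling."
  type        = number
  default     = 1
}

variable "cluster_max_workers" {
  description = "The maximum number of workers when autoscaling."
  type        = number
  default     = 3
}

# Replaces the databricks_cluster.this block shown above.
resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes

  # Use an autoscale block instead of a fixed num_workers value.
  autoscale {
    min_workers = var.cluster_min_workers
    max_workers = var.cluster_max_workers
  }
}
```

With this variant, the cluster_num_workers variable is no longer referenced by the cluster resource.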

To create a cluster, create another file named cluster.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the cluster. Replace the placeholder values with your own values.

For a cluster that works with Unity Catalog:
```
cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
cluster_data_security_mode      = "SINGLE_USER"
```
For an all-purpose cluster:
```
cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
```

To create a notebook, create another file named notebook.tf, and add the following content to the file:

variable "notebook_subdirectory" {
  description = "A name for the subdirectory to store the notebook."
  type        = string
  default     = "Terraform"
}

variable "notebook_filename" {
  description = "The notebook's filename."
  type        = string
}

variable "notebook_language" {
  description = "The language of the notebook."
  type        = string
}

resource "databricks_notebook" "this" {
  path     = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
  language = var.notebook_language
  source   = "./${var.notebook_filename}"
}

output "notebook_url" {
  value = databricks_notebook.this.url
}
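
The notebook path above references data.databricks_current_user.me, which is not declared in any of the files shown in this article. It may already be part of the project that you created from the Requirements section. If it is not, a minimal sketch of the declaration, placed in a file with the hypothetical name me.tf, could look like this:

```hcl
# Retrieve information about the user that Terraform authenticates as.
# Declare this only once per project; a duplicate declaration is an error.
data "databricks_current_user" "me" {}
```

The home attribute of this data source supplies the user's workspace home directory for the notebook path, and the user_name attribute supplies the address used for the job's email notifications.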

If you are creating a notebook, save one of the following notebook code samples to a file in the same directory as the notebook.tf file:

For the Python notebook for Tutorial: Run an end-to-end lakehouse analytics pipeline, a file named notebook-getting-started-lakehouse-e2e.py with the following contents:

# Databricks notebook source
external_location = "<your_external_location>"
catalog = "<your_catalog>"

dbutils.fs.put(f"{external_location}/foobar.txt", "Hello world!", True)
display(dbutils.fs.head(f"{external_location}/foobar.txt"))
dbutils.fs.rm(f"{external_location}/foobar.txt")

display(spark.sql(f"SHOW SCHEMAS IN {catalog}"))

# COMMAND ----------

from pyspark.sql.functions import col

# Set parameters for isolation in workspace and reset demo
username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
database = f"{catalog}.e2e_lakehouse_{username}_db"
source = f"{external_location}/e2e-lakehouse-source"
table = f"{database}.target_table"
checkpoint_path = f"{external_location}/_checkpoint/e2e-lakehouse-demo"

spark.sql(f"SET c.username='{username}'")
spark.sql(f"SET c.database={database}")
spark.sql(f"SET c.source='{source}'")

spark.sql("DROP DATABASE IF EXISTS ${c.database} CASCADE")
spark.sql("CREATE DATABASE ${c.database}")
spark.sql("USE ${c.database}")

# Clear out data from previous demo execution
dbutils.fs.rm(source, True)
dbutils.fs.rm(checkpoint_path, True)


# Define a class to load batches of data to source
class LoadData:

  def __init__(self, source):
    self.source = source

  def get_date(self):
    try:
      df = spark.read.format("json").load(self.source)
    except Exception:
      # No data has landed in the source yet, so start from the first date.
      return "2016-01-01"
    batch_date = df.selectExpr("max(distinct(date(tpep_pickup_datetime))) + 1 day").first()[0]
    if batch_date.month == 3:
      raise Exception("Source data exhausted")
    return batch_date

  def get_batch(self, batch_date):
    return (
      spark.table("samples.nyctaxi.trips")
        .filter(col("tpep_pickup_datetime").cast("date") == batch_date)
    )

  def write_batch(self, batch):
    batch.write.format("json").mode("append").save(self.source)

  def land_batch(self):
    batch_date = self.get_date()
    batch = self.get_batch(batch_date)
    self.write_batch(batch)

RawData = LoadData(source)

# COMMAND ----------

RawData.land_batch()

# COMMAND ----------

# Import functions
from pyspark.sql.functions import col, current_timestamp

# Configure Auto Loader to ingest JSON data to a Delta table
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(source)
  .select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .option("mergeSchema", "true")
  .toTable(table))

# COMMAND ----------

df = spark.read.table(table)

# COMMAND ----------

display(df)

For the Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal, a file named notebook-quickstart-create-databricks-workspace-portal.py with the following contents:

# Databricks notebook source
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Seattle"
blob_sas_token = r""

# COMMAND ----------

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

# COMMAND ----------

df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')

# COMMAND ----------

print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))

If you are creating a notebook, create another file named notebook.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the notebook configuration.

For the Python notebook for Tutorial: Run an end-to-end lakehouse analytics pipeline:
```
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-getting-started-lakehouse-e2e.py"
notebook_language     = "PYTHON"
```
For the Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal:
```
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-quickstart-create-databricks-workspace-portal.py"
notebook_language     = "PYTHON"
```
If you are creating a notebook, set up any requirements that the notebook needs to run successfully in your Azure Databricks workspace. See the instructions for the notebook that you are using:
- The Python notebook for Tutorial: Run an end-to-end lakehouse analytics pipeline
- The Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal

To create the job, create another file named job.tf, and add the following content to the file. This content creates a job to run the notebook.

variable "job_name" {
  description = "A name for the job."
  type        = string
  default     = "My Job"
}

variable "task_key" {
  description = "A name for the task."
  type        = string
  default     = "my_task"
}

resource "databricks_job" "this" {
  name = var.job_name
  task {
    task_key            = var.task_key
    existing_cluster_id = databricks_cluster.this.cluster_id
    notebook_task {
      notebook_path = databricks_notebook.this.path
    }
  }
  email_notifications {
    on_success = [ data.databricks_current_user.me.user_name ]
    on_failure = [ data.databricks_current_user.me.user_name ]
  }
}

output "job_url" {
  value = databricks_job.this.url
}
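
The job above runs on the all-purpose cluster created in cluster.tf. As a hedged alternative, the task could create its own job cluster for each run by using a new_cluster block instead of existing_cluster_id. This sketch reuses the data sources and variables from cluster.tf and is an illustration, not the configuration that the companion articles use.

```hcl
# Replaces the databricks_job.this block shown above.
resource "databricks_job" "this" {
  name = var.job_name

  task {
    task_key = var.task_key

    # Create a cluster for each run instead of using existing_cluster_id.
    new_cluster {
      num_workers   = var.cluster_num_workers
      spark_version = data.databricks_spark_version.latest_lts.id
      node_type_id  = data.databricks_node_type.smallest.id
    }

    notebook_task {
      notebook_path = databricks_notebook.this.path
    }
  }

  email_notifications {
    on_success = [data.databricks_current_user.me.user_name]
    on_failure = [data.databricks_current_user.me.user_name]
  }
}
```

For the notebook that works with Unity Catalog, the job cluster would likely also need data_security_mode set inside the new_cluster block, mirroring the setting in cluster.tf.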

If you are creating a job, create another file named job.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the job configuration.
```
job_name = "My Job"
task_key = "my_task"
```

Step 2: Run the configurations

In this step, you run the Terraform configurations to deploy the cluster, the notebook, and the job into your Azure Databricks workspace.

Check to see whether your Terraform configurations are valid by running the terraform validate command. If any errors are reported, fix them, and run the command again.
```
terraform validate
```
Check to see what Terraform will do in your workspace, before Terraform actually does it, by running the terraform plan command.
```
terraform plan
```
Deploy the cluster, the notebook, and the job into your workspace by running the terraform apply command. When prompted to deploy, type yes and press Enter.
```
terraform apply
```
Terraform deploys the resources that are specified in your project. Deploying these resources (especially a cluster) can take several minutes.

Step 3: Explore the results

If you created a cluster, in the output of the terraform apply command, copy the link next to cluster_url, and paste it into your web browser’s address bar.

If you created a notebook, in the output of the terraform apply command, copy the link next to notebook_url, and paste it into your web browser’s address bar.

Note

Before you use the notebook, you might need to customize its contents. See the links at the beginning of this article for related documentation about how to customize the notebook.

If you created a job, in the output of the terraform apply command, copy the link next to job_url, and paste it into your web browser’s address bar.

Note

Before you run the job, you might need to customize the notebook’s contents. See the links at the beginning of this article for related documentation about how to customize the notebook.

If you created a job, run the job as follows:
1. Click Run now on the job page.
2. After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code.

Step 4: Clean up

In this step, you delete the preceding resources from your workspace.

Check to see what Terraform will delete from your workspace, before Terraform actually deletes it, by running the terraform plan command with the -destroy option.
```
terraform plan -destroy
```
Delete the cluster, the notebook, and the job from your workspace by running the terraform destroy command. When prompted to delete, type yes and press Enter.
```
terraform destroy
```
Terraform deletes the resources that are specified in your project.
