Tips, recommendations, and features for developing and testing Delta Live Tables pipelines

Artikel
10/14/2024

This article describes patterns you can use to develop and test Delta Live Tables pipelines. Through the pipeline settings, Delta Live Tables allows you to specify configurations to isolate pipelines in developing, testing, and production environments. This article’s recommendations apply to SQL and Python code development.

Databricks recommends using parameters to write extensible Delta Live Tables code. This is especially useful for writing portable code between dev, test, and prod environments. For example, you can point your pipelines at test data with known issues to test code resiliency. See Control data sources with parameters.

Use development mode to run pipeline updates

Delta Live Tables has a UI toggle to control whether your pipeline updates run in development or production mode. This mode controls how pipeline updates are processed, including:

Development mode does not immediately terminate compute resources after an update succeeds or fails. You can reuse the same compute resources to run multiple pipeline updates without waiting for a cluster to start.
Development mode does not automatically retry on task failure, allowing you to detect and fix logical or syntactic errors in your pipeline immediately.

Databricks recommends using development mode during development and testing. However, when deploying to a production environment, always switch to production mode.

See Development and production modes.

Test pipeline source code without waiting for tables to update

To check for problems with your pipeline source code, such as syntax and analysis errors, during development and testing, you can run a Validate update. Because a Validate update only verifies the correctness of pipeline source code without running an actual update on any tables, you can more quickly identify and fix issues before running an actual pipeline update.

Specify a target schema during all development lifecycle phases

All datasets in a Delta Live Tables pipeline reference the LIVE virtual schema, which is inaccessible outside the pipeline. If a target schema is specified, the LIVE virtual schema points to the target schema. You must specify a target schema to review the results written out to each table during an update.

You must specify a target schema that is unique to your environment. Each table in a given schema can only be updated by a single pipeline.

You can isolate these environments by creating separate pipelines for development, testing, and production with different targets. Using the target schema parameter allows you to remove logic that uses string interpolation or other widgets or parameters to control data sources and targets.

See Use Unity Catalog with your Delta Live Tables pipelines and Use Delta Live Tables pipelines with legacy Hive metastore.

Use Databricks Git folders to manage Delta Live Tables pipelines

Databricks recommends using Git folders during Delta Live Tables pipeline development, testing, and deployment to production. Git folders enables the following:

Keeping track of how code is changing over time.
Merging changes that multiple developers are making.
Software development practices such as code reviews.

Databricks recommends configuring a single Git repository for all code related to a pipeline.

Each developer should have their own Databricks Git folder configured for development. During development, the user configures their own pipeline from their Databricks Git folder and tests new logic using development datasets and isolated schema and locations. After completing development work, the user commits and pushes changes back to their branch in the central Git repository and opens a pull request against the testing or QA branch.

The resulting branch should be checked out in a Databricks Git folder, and a pipeline should be configured using test datasets and a development schema. Assuming logic runs as expected, a pull request or release branch should be prepared to push the changes to production.

While Git folders can be used to synchronize code across environments, pipeline settings must be kept up to date manually or using tools like Terraform.

This workflow is similar to using Git folders for CI/CD in all Databricks jobs. See CI/CD techniques with Git and Databricks Git folders (Repos).

Segment source code for ingestion and transformation steps

Databricks recommends isolating queries that ingest data from transformation logic that enriches and validates data. If you organize source code for ingesting data from development or testing data sources in a directory separate from production data ingestion logic, you can configure pipelines for various environments with datasets specific to those environments. For example, you can accelerate development by using smaller datasets for testing. See Create sample datasets for development and testing.

You can also use parameters to control data sources for development, testing, and production. See Use parameters with Delta Live Tables pipelines.

Because Delta Live Tables pipelines use the LIVE virtual schema for managing all dataset relationships, by configuring development and testing pipelines with ingestion source code that loads sample data, you can substitute sample datasets using production table names to test code. The same transformation logic can be used in all environments.

Create sample datasets for development and testing

Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:

Select a subset of data from a production dataset.
Use anonymized or artificially generated data for sources containing PII.
Create test data with well-defined outcomes based on downstream transformation logic.
Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.

For example, if you have a notebook that defines a dataset using the following code:

CREATE OR REFRESH STREAMING TABLE input_data AS SELECT * FROM read_files("/production/data", "json")

You could create a sample dataset containing specific records using a query like the following:

CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 as sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 as sensor_reading

The following example demonstrates filtering published data to create a subset of the production data for development or testing:

CREATE OR REFRESH MATERIALIZED VIEW input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY

To use these different datasets, create multiple pipelines with the notebooks implementing the transformation logic. Each pipeline can read data from the LIVE.input_data dataset but is configured to include the notebook that creates the dataset specific to the environment.

Del via