Migrate ETL pipelines to Azure Databricks

Article
08/09/2024

This article provides an overview of options for migrating extract, transform, load (ETL) pipelines running on other data systems to Azure Databricks. If you are migrating Apache Spark code, see Adapt your exisiting Apache Spark code for Azure Databricks.

For general information about moving from an enterprise data warehouse to a lakehouse, see Migrate your data warehouse to the Databricks lakehouse. For information about moving from Parquet to Delta Lake, see Migrate a Parquet data lake to Delta Lake.

Can you run Hive pipelines on Azure Databricks?

Most Hive workloads can run on Azure Databricks with minimal refactoring. The version of Spark SQL supported by Databricks Runtime allows many HiveQL constructs. See Apache Hive compatibility. Azure Databricks includes a Hive metastore by default. Most Hive migrations need to address a few primary concerns:

Hive SerDe need to be updated to use Azure Databricks-native file codecs. (Change DDL from STORED AS to USING to use Azure Databricks SerDe.)
Hive UDFs must either be installed to clusters as libraries or refactored to native Spark. Because Hive UDFs are already in the JVM, they might provide sufficient performance for many workloads. See Performance considerations.
The directory structure for tables should be altered, as Azure Databricks uses partitions differently than Hive. See When to partition tables on Azure Databricks.

If you choose to update your tables to Delta Lake during your initial migration, a number of DDL and DML statements are unsupported. These include:

ROWFORMAT
SERDE
OUTPUTFORMAT
INPUTFORMAT
COMPRESSION
STORED AS
ANALYZE TABLE PARTITION
ALTER TABLE [ADD|DROP] PARTITION
ALTER TABLE RECOVER PARTITIONS
ALTER TABLE SET SERDEPROPERTIES
CREATE TABLE LIKE
INSERT OVERWRITE DIRECTORY
LOAD DATA
Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE

Can you run SQL ETL pipelines on Azure Databricks?

Migrating SQL workloads from other systems to Azure Databricks usually requires very little refactoring, depending on the extent to which system specific protocols were used in the source code. Azure Databricks uses Delta Lake as the default table format, so tables are created with transactional guarantees by default.

Spark SQL is mostly ANSI-compliant, but some differences in behavior might exist. See How is the Databricks Data Intelligence Platform different than an enterprise data warehouse?.

Because data systems tend to configure access to external data differently, much of the work refactoring SQL ETL pipelines might be configuring access to these data sources and then updating your logic to use these new connections. Azure Databricks provides options for connecting to many data sources for ingestion.

Can you run dbt ETL pipelines on Azure Databricks?

Azure Databricks provides a native integration with dbt, allowing you to leverage existing dbt scripts with very little refactoring.

Delta Live Tables provides an optimized Databricks-native declarative SQL syntax for creating, testing, and deploying pipelines. While you can leverage dbt on Azure Databricks, a light refactor of code to Delta Live Tables might lower your total cost to operate your pipelines on Azure Databricks. See What is Delta Live Tables?.

Can you migrate serverless cloud functions to Azure Databricks?

The extensibility and versatility of custom serverless cloud functions makes it difficult to provide a common recommendation, but one of the most common use cases for these functions is waiting for files or data to appear in a location or message queue and then performing some action as a result. While Azure Databricks does not support complex logic for triggering workloads based on cloud conditions, you can use Structured Streaming in conjunction with Jobs to process data incrementally.

Use Auto Loader for optimized data ingestion from cloud object storage. Structured Streaming can process data from streaming sources in near-real time.

Can you run syntax from other data systems on Azure Databricks?

ETL pipelines defined in languages other than SQL, Apache Spark, or Hive might need to be heavily refactored before running on Azure Databricks. Azure Databricks has experience helping customers migrate from most of the data systems in use today, and might have resources available to jumpstart your migration efforts.

Share via