Migrate your data warehouse to the Databricks lakehouse
This article describes some of the considerations and caveats to keep in mind as you replace your enterprise data warehouse with the Databricks lakehouse. Most workloads, queries, and dashboards defined in enterprise data warehouses can run with minimal code refactoring once admins have completed the initial data migration and governance configuration. Migrating your data warehousing workloads to Azure Databricks is not about eliminating data warehousing, but rather unifying your data ecosystem. For more on data warehousing on Databricks, see What is data warehousing on Azure Databricks?.
Many Apache Spark workloads extract, transform, and load (ETL) data from source systems into data warehouses to power downstream analytics. Replacing your enterprise data warehouse with a lakehouse enables analysts, data scientists, and data engineers to work against the same tables in the same platform, reducing the overall complexity, maintenance requirements, and total cost of ownership. See What is a data lakehouse?.
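For example, the following sketch shows an ETL write and a SQL aggregation running against the same Delta table in the same platform. It is only an illustration: the storage path, catalog, schema, table, and column names are hypothetical, and in a Databricks notebook the `spark` session is already available.

```python
from pyspark.sql import SparkSession, functions as F

# `spark` is predefined in Databricks notebooks; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# ETL step: read exported source data, apply a light transformation, and load
# it into a governed Delta table. Paths and names are placeholders.
raw_orders = spark.read.parquet(
    "abfss://staging@examplestorage.dfs.core.windows.net/exports/orders/"
)
(
    raw_orders
    .filter(F.col("order_status").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
    .write.mode("append")
    .saveAsTable("main.sales.orders")
)

# Analytics step: the same table answers warehouse-style SQL in the same
# platform, with no copy into a separate system.
spark.sql("""
    SELECT order_date, SUM(order_total) AS revenue
    FROM main.sales.orders
    GROUP BY order_date
""").show()
```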
Load data into the lakehouse
Azure Databricks provides a number of tools and capabilities to make it easy to migrate data to the lakehouse and configure ETL jobs to load data from diverse data sources. The following articles introduce these tools and options:
- Migrate a Parquet data lake to Delta Lake
- What is Lakehouse Federation?
- What is Databricks Partner Connect?
- Ingest data into a Databricks lakehouse
- What is Delta Live Tables?
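As a minimal sketch of two of the options above, the following example converts an existing Parquet directory to Delta Lake in place and then loads newly arriving files into a managed table. The storage paths and table names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# `spark` is predefined in Databricks notebooks; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Convert an existing Parquet directory to Delta Lake in place. The storage
# path is a placeholder for wherever the legacy data lake lives.
spark.sql("""
    CONVERT TO DELTA
    parquet.`abfss://data@examplestorage.dfs.core.windows.net/legacy/events/`
""")

# Incrementally load newly arriving files into an existing managed table.
# COPY INTO skips files it has already ingested on subsequent runs.
spark.sql("""
    COPY INTO main.analytics.events
    FROM 'abfss://data@examplestorage.dfs.core.windows.net/landing/events/'
    FILEFORMAT = PARQUET
""")
```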
How is the Databricks Data Intelligence Platform different from an enterprise data warehouse?
The Databricks Data Intelligence Platform is built on top of Apache Spark, Unity Catalog, and Delta Lake, providing native support for big data workloads across analytics, ML, and data engineering. All enterprise data systems have slightly different transactional guarantees, indexing and optimization patterns, and SQL syntax. Some of the biggest differences you might discover include the following:
- All transactions are table-level. There are no database-level transactions, locks, or guarantees.
- There are no `BEGIN` and `END` constructs, meaning each statement or query runs as a separate transaction.
- Three-tier namespacing uses the `catalog.schema.table` pattern. The terms `database` and `schema` are synonymous due to legacy Apache Spark syntax.
- Primary key and foreign key constraints are informational only. Constraints can only be enforced at a table level. See Constraints on Azure Databricks.
- Native data types supported in Azure Databricks and Delta Lake might differ slightly from source systems. Determine the required precision for numeric types before choosing target types; the sketch after this list shows one approach.
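The following sketch makes the namespacing, constraint, and numeric precision points concrete with a single table definition. The catalog, schema, table, and column names are hypothetical, and the primary key is recorded but not enforced.

```python
from pyspark.sql import SparkSession

# `spark` is predefined in Databricks notebooks; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Each statement runs as its own transaction; there is no BEGIN/END block to
# open or commit. Catalog, schema, and column names are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id    BIGINT NOT NULL,
        customer_id BIGINT,
        -- Fix precision and scale explicitly instead of inheriting whatever
        -- the source system used.
        order_total DECIMAL(18, 2),
        order_date  DATE,
        -- Informational only: recorded in the catalog but not enforced.
        CONSTRAINT orders_pk PRIMARY KEY (order_id)
    )
""")
```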
The following articles provide additional context on important considerations: