What is the Databricks Lakehouse?
The Databricks Lakehouse combines the ACID transactions and data governance of data warehouses with the flexibility and cost-efficiency of data lakes to enable business intelligence (BI) and machine learning (ML) on all data. The Databricks Lakehouse keeps your data in your massively scalable cloud object storage in open source data standards, allowing you to use your data however and wherever you want.
- What are ACID guarantees on Azure Databricks?
- What is the medallion lakehouse architecture?
- What does it mean to build a single source of truth?
- Data discovery and collaboration in the lakehouse
- Data objects in the Databricks Lakehouse
Components of the Databricks Lakehouse
The primary components of the Databricks Lakehouse are Delta Lake and Unity Catalog.
By storing data with Delta Lake, you enable downstream data scientists, analysts, and machine learning engineers to leverage the same production data supporting your core ETL workloads as soon as data is processed.
Unity Catalog ensures that you have complete control over who gains access to which data and provides a centralized mechanism for managing all data governance and access controls without needing to replicate your data.
Tables created on Azure Databricks use the Delta Lake protocol by default. When you create a new Delta table:
- Metadata used to reference the table is added to the metastore in the declared schema or database.
- Data and table metadata are saved to a directory in cloud object storage.
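The two steps above can be sketched with a toy model, using a plain dict as a stand-in for the metastore and a local directory as a stand-in for cloud object storage. The names `register_table` and `metastore` are illustrative only, not Databricks APIs:

```python
from pathlib import Path
import tempfile

# Toy stand-in for the metastore: maps "schema.table" -> storage location.
metastore = {}

def register_table(schema: str, table: str, storage_root: Path) -> Path:
    """Record a metastore entry and create the table's storage directory."""
    location = storage_root / schema / table
    location.mkdir(parents=True, exist_ok=True)      # data directory in "object storage"
    metastore[f"{schema}.{table}"] = str(location)   # metadata reference in the "metastore"
    return location

storage = Path(tempfile.mkdtemp())
loc = register_table("sales", "orders", storage)
print(metastore["sales.orders"] == str(loc))  # True
```

The key point the sketch captures is that the metastore holds only a reference; the table's data lives in the storage directory itself.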
The metastore reference to a Delta table is technically optional; you can create Delta tables by directly interacting with directory paths using Spark APIs. Some new features that build upon Delta Lake will store additional metadata in the table directory, but all Delta tables have:
- A directory containing table data in the Parquet file format.
- A sub-directory `/_delta_log` that contains metadata about table versions in JSON and Parquet format.
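The on-disk layout described above can be mocked up with the standard library. The file names below follow the general Delta Lake pattern, but the Parquet file is an empty placeholder and the commit JSON is simplified:

```python
import json
import tempfile
from pathlib import Path

# Mock up a Delta table directory: Parquet data files at the root,
# plus a _delta_log sub-directory holding JSON commit entries.
table = Path(tempfile.mkdtemp()) / "orders"
(table / "_delta_log").mkdir(parents=True)

# A data file (empty placeholder standing in for real Parquet bytes).
(table / "part-00000-aaaa.snappy.parquet").touch()

# Commit 0: a simplified transaction-log entry that "adds" the data file.
commit = {"add": {"path": "part-00000-aaaa.snappy.parquet"}}
(table / "_delta_log" / "00000000000000000000.json").write_text(json.dumps(commit))

for p in sorted(table.rglob("*")):
    print(p.relative_to(table))
```

Because each table version is a new numbered JSON entry in `_delta_log`, readers can reconstruct any version of the table from the log alone.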
Learn more about Data objects in the Databricks Lakehouse.
Unity Catalog unifies data governance and discovery on Azure Databricks. Available in notebooks, jobs, and Databricks SQL, Unity Catalog provides features and UIs that support workloads and users from both data lake and data warehouse backgrounds.
- Account-level management of the Unity Catalog metastore means databases, data objects, and permissions can be shared across Azure Databricks workspaces.
- You can leverage three-tier namespacing (`<catalog>.<database>.<table>`) to organize and grant access to data.
- External locations and storage credentials are also securable objects, with ACL settings similar to other data objects.
- The Data Explorer provides a graphical user interface to explore databases and manage permissions.
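The three-tier namespace in the list above can be illustrated with a small parsing sketch. The `parse_table_name` helper is hypothetical, not part of any Databricks API:

```python
from typing import NamedTuple

class TableName(NamedTuple):
    catalog: str
    database: str
    table: str

def parse_table_name(qualified: str) -> TableName:
    """Split a fully qualified <catalog>.<database>.<table> name into its tiers."""
    parts = qualified.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected <catalog>.<database>.<table>, got {qualified!r}")
    return TableName(*parts)

name = parse_table_name("prod.sales.orders")
print(name.catalog, name.database, name.table)  # prod sales orders
```

Because the catalog is the top tier, an administrator can grant access at the catalog or database level and have it apply to everything beneath.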
Data lakehouse vs. data warehouse vs. data lake
Data warehouses have powered business intelligence (BI) decisions for about 30 years, having evolved as a set of design guidelines for systems controlling the flow of data. Data warehouses optimize queries for BI reports, but can take minutes or even hours to generate results. Designed for data that is unlikely to change with high frequency, data warehouses seek to prevent conflicts between concurrently running queries. Many data warehouses rely on proprietary formats, which often limit support for machine learning.
Powered by technological advances in data storage and driven by exponential increases in the types and volume of data, data lakes have come into widespread use over the last decade. Data lakes store and process data cheaply and efficiently. Data lakes are often defined in opposition to data warehouses: a data warehouse delivers clean, structured data for BI analytics, while a data lake permanently and cheaply stores data of any nature in any format. Many organizations use data lakes for data science and machine learning, but not for BI reporting, because the data in a lake is unvalidated.
The data lakehouse replaces the separate dependencies on data lakes and data warehouses for modern data companies that desire:
- Open, direct access to data stored in standard data formats.
- Indexing protocols optimized for machine learning and data science.
- Low query latency and high reliability for BI and advanced analytics.
By combining an optimized metadata layer with validated data stored in standard formats in cloud object storage, the data lakehouse allows data scientists and ML engineers to build models from the same data driving BI reports.