Understanding Azure Databricks data storage
When working on analytics projects, Azure Databricks is used to implement the ingestion and transformation layers. Understanding the underlying storage and its settings is therefore key to improving performance and efficiency.
This section provides a brief summary of the available storage options and offers some important considerations for designing and setting up a data lake.
Using Delta Lake storage
Delta Lake is the default storage layer in Databricks for data and tables. Partitioning in Delta means that data is split into separate groups, each stored as a folder in blob storage. When the store is queried, only the data in the relevant folders is loaded. Data in Delta Lake is stored as Parquet, a columnar file format. Columnar file formats are the recommended choice for analytics workloads because they allow a query to retrieve only the columns it references. Column-level compression is also typically more effective than row-level compression, because the values within a column tend to be similar.
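As a rough illustration of partition pruning, the sketch below takes a listing of files stored under Hive-style partition folders (the column=value folder naming that Spark uses for partitioned tables) and keeps only the files under the folder a filter needs. The table name, file names, and the partition_value and prune helpers are hypothetical. A real Delta table resolves its files through the transaction log rather than by listing folders, so this shows the idea only.

```python
# Hypothetical file listing for a table partitioned by event_date.
FILES = [
    "events/event_date=2024-01-01/part-0000.parquet",
    "events/event_date=2024-01-01/part-0001.parquet",
    "events/event_date=2024-01-02/part-0000.parquet",
    "events/event_date=2024-01-03/part-0000.parquet",
]


def partition_value(path, column):
    """Return the value of `column` encoded in a Hive-style path, or None."""
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == column:
            return value
    return None


def prune(paths, column, wanted):
    """Keep only the files whose partition folder matches the filter value."""
    return [p for p in paths if partition_value(p, column) == wanted]


# A filter on event_date only needs the files in one folder.
selected = prune(FILES, "event_date", "2024-01-02")
print(selected)
```

Only one of the four files is selected for the filter above. The other folders are never opened.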
Databricks uses Spark as its engine. Ideally, data in Spark is stored in a small number of large files, each between 128 MB and 1 GB, which allows the driver and worker nodes to operate efficiently. When the data is spread over many small files, the driver uses much of its memory loading all the file metadata at once. That load stresses the driver and slows down reads.
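To get a feel for the numbers, the following arithmetic sketch counts how many files, and therefore how many file metadata entries, a dataset produces at different file sizes. The 1 TB dataset size and the file_count helper are illustrative assumptions. The 128 MB and 1 GB values are the bounds mentioned above.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def file_count(dataset_bytes, file_bytes):
    """Number of files needed to hold the dataset (integer ceiling division)."""
    return -(-dataset_bytes // file_bytes)


dataset = 1 * TB  # illustrative dataset size
for label, size in [("1 MB", MB), ("128 MB", 128 * MB), ("1 GB", GB)]:
    print(f"{label:>6} files: {file_count(dataset, size):>9,}")
```

At 1 MB per file the driver has to track 1,048,576 entries, compared with 8,192 at 128 MB and 1,024 at 1 GB.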
Further, the Databricks Delta autoOptimize features (for example, delta.autoOptimize.optimizeWrite and delta.autoOptimize.autoCompact) aim to maximize the throughput of data being written to storage. The autoCompact feature compacts small files toward a target file size after a write. The standard OPTIMIZE command targets 1 GB by default, while auto compaction generally targets smaller files. The compaction (bin-packing) operation is idempotent: running compaction again on the same dataset has no effect.
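The idempotence of bin-packing can be seen in a small simulation. The sketch below is a generic first-fit-decreasing packer written for illustration, not the algorithm Databricks actually uses. The compact function, the file sizes, and the 1,024 MB target are assumptions.

```python
def compact(file_sizes_mb, target_mb=1024):
    """Pack file sizes into bins of at most target_mb (first-fit decreasing).

    Each bin stands for one output file. A file that is already larger than
    the target is left in a bin of its own.
    """
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for i, total in enumerate(bins):
            if total + size <= target_mb:
                bins[i] = total + size  # merge into an existing output file
                break
        else:
            bins.append(size)  # no bin has room, so start a new output file
    return sorted(bins, reverse=True)


small_files = [700, 400, 300, 200, 100, 50]  # sizes in MB, illustrative
once = compact(small_files)
twice = compact(once)
print(once, twice)
```

Six input files become two output files. A second run returns the same two files, because no pair of them fits within the target together.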
Databricks Delta automatically collects statistics (minimum and maximum values for each column) as metadata when it saves data as Delta files. Databricks uses this metadata to speed up query execution by skipping files that can't contain matching data.
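A minimal sketch of how minimum and maximum statistics allow files to be skipped: a file needs to be read only if its value range overlaps the range the query filters on. The FileStats class, the file names, and the values are hypothetical and stand in for the metadata Delta keeps per file.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def files_to_read(files, low, high):
    """Return paths of files whose [min, max] range overlaps [low, high]."""
    return [f.path for f in files if f.max_value >= low and f.min_value <= high]


files = [
    FileStats("part-0000.parquet", 1, 100),
    FileStats("part-0001.parquet", 101, 200),
    FileStats("part-0002.parquet", 150, 300),
]

# Equivalent to a filter such as: WHERE id BETWEEN 120 AND 160
print(files_to_read(files, 120, 160))
```

The first file is skipped without being opened, because its maximum value is below the lower bound of the filter.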
Lastly, the Z-Ordering (multi-dimensional clustering) feature of Databricks Delta can speed up the retrieval of data. Z-Ordering co-locates related information in the same set of files.
Delta automatically uses this co-locality during data skipping, which reduces the amount of data Delta needs to read to find the relevant rows.
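The effect of Z-Ordering on data skipping can be sketched with a classic Z-order (Morton) curve, which interleaves the bits of two column values into a single sort key. This is a textbook illustration under simplifying assumptions (two small integer columns, four rows per file) and not a description of Delta's internal implementation. The names z_value, split_into_files, and files_matching are hypothetical.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def split_into_files(rows, rows_per_file):
    """Cut an ordered list of rows into consecutive fixed-size files."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_matching(files, column, value):
    """Count files whose min/max range on a column (0 = x, 1 = y) contains value."""
    count = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


points = [(x, y) for x in range(4) for y in range(4)]
linear = split_into_files(sorted(points), 4)  # sorted by x, then y
z_ordered = split_into_files(sorted(points, key=lambda p: z_value(*p)), 4)

for name, layout in [("linear", linear), ("z-order", z_ordered)]:
    print(name,
          "x == 3:", files_matching(layout, 0, 3), "files,",
          "y == 0:", files_matching(layout, 1, 0), "files")
```

With a plain sort on x, a filter on x reads one file but a filter on y reads all four. With the Z-ordered layout, either filter reads two files, so skipping works reasonably well on both columns.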
Using Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 provides a mechanism that allows collected objects/files to be organized into a hierarchy of directories. This feature enhances performance, management, and security. For more information, see data lake storage introduction.
Atomic directory manipulation improves performance and data consistency for operations like moving directories, or deleting expired data for a given date range.
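As an illustration of why a directory hierarchy helps with expiring data, the sketch below generates one year/month/day directory path per day in a date range. With a hierarchical namespace, each of these paths can be removed as a single directory operation rather than file by file. The container name, account name, folder layout, and the daily_directories helper are placeholders. Only the abfss:// URI scheme comes from Azure Data Lake Storage Gen2 itself, and the actual delete call is left out.

```python
from datetime import date, timedelta


def daily_directories(root, start, end):
    """Yield one year/month/day directory path per day in [start, end]."""
    day = start
    while day <= end:
        yield f"{root}/{day:%Y/%m/%d}"
        day += timedelta(days=1)


# Placeholder container and account names.
ROOT = "abfss://container@account.dfs.core.windows.net/raw/events"

expired = list(daily_directories(ROOT, date(2024, 1, 30), date(2024, 2, 1)))
for path in expired:
    print(path)
```

Three days of data map to three directories, regardless of how many files each directory contains.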
Azure Data Lake Storage Gen2 implements an access control model that supports both Azure role-based access control (Azure RBAC) and POSIX-like access control lists (ACLs).