Frequently asked questions (FAQ)

This article provides answers to frequently asked questions about Delta Lake.

What is Delta Lake?

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake on Azure Databricks allows you to configure Delta Lake based on your workload patterns and provides optimized layouts and indexes for fast interactive queries.

Delta Lake sits on top of Apache Spark. The format and the compute layer together help simplify building big data pipelines and increase the overall efficiency of your pipelines.

What format does Delta Lake use to store data?

Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versioned data files, Delta Lake stores a transaction log that tracks all commits made to the table or blob store directory, which is how it provides ACID transactions.
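
To make the mechanism concrete, here is a simplified, hypothetical sketch of how a transaction log of add/remove actions determines which data files are part of the current table. It mimics the layout of a _delta_log directory but is an illustration only, not an implementation of the Delta Transaction Protocol.

```python
import json
import pathlib
import tempfile

# Simplified sketch: a Delta-style table directory holds Parquet data files
# plus a _delta_log directory of ordered JSON commit files. Replaying the
# "add"/"remove" actions in commit order yields the current table state.
# (Real Delta commits carry much more metadata than shown here.)

def replay_log(log_dir: pathlib.Path) -> set:
    """Replay commit files in version order to compute the active data files."""
    active = set()
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active

root = pathlib.Path(tempfile.mkdtemp())
log = root / "_delta_log"
log.mkdir()

# Commit 0 adds two files; commit 1 rewrites one of them.
(log / "00000000000000000000.json").write_text(
    json.dumps({"add": {"path": "part-0000.parquet"}}) + "\n"
    + json.dumps({"add": {"path": "part-0001.parquet"}})
)
(log / "00000000000000000001.json").write_text(
    json.dumps({"remove": {"path": "part-0001.parquet"}}) + "\n"
    + json.dumps({"add": {"path": "part-0002.parquet"}})
)

print(sorted(replay_log(log)))  # ['part-0000.parquet', 'part-0002.parquet']
```

Because readers always reconstruct the table from the log rather than by listing the directory, a stale or partially written data file never appears in query results.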

How can I read and write data with Delta Lake?

You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and Write to a table.

Where does Delta Lake store the data?

When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location in Parquet format.

Why isn’t OPTIMIZE automatic?

The OPTIMIZE operation starts many Spark jobs to optimize file sizing via compaction (and, optionally, to perform Z-Ordering). Because much of what OPTIMIZE does is compact small files, many small files must accumulate before the operation has an effect. Therefore, the OPTIMIZE operation is not run automatically.
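
The compaction idea can be sketched as a simple bin-packing problem: group many small files into batches whose combined size is close to a target output file size. This is an illustration of the concept only, not Databricks' actual algorithm; the 1 GB target below is an assumption for the example.

```python
# Hypothetical sketch of the bin-packing behind compaction: greedily pack
# small files into batches of roughly a target output size, so each batch
# can be rewritten as one large file. Illustration only, not the actual
# OPTIMIZE implementation.

TARGET = 1024  # assumed target output file size in MB

def plan_compaction(file_sizes_mb, target=TARGET):
    """Greedily pack file sizes (largest first) into batches near the target."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Twenty 100 MB files compact into two batches of roughly 1 GB each.
plan = plan_compaction([100] * 20)
print(len(plan))  # 2
```

Only once enough small files exist does a batch fill up, which is why running the operation before small files accumulate accomplishes little.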

Moreover, running OPTIMIZE, especially with ZORDER, is expensive in both time and resources. If Databricks ran OPTIMIZE automatically or waited to write out data in batches, it would remove the ability to run low-latency Delta Lake streams (where a Delta table is the source). Many customers have Delta tables that are never optimized because they only stream data from these tables, so they would see none of the query benefits that OPTIMIZE provides.

Lastly, Delta Lake automatically collects statistics about the files that are written to the table (whether through an OPTIMIZE operation or not). This means that reads from Delta tables leverage this information whether or not the table or a partition has had the OPTIMIZE operation run on it.
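
The way per-file statistics speed up reads can be illustrated with a small data-skipping sketch. The flat min/max structure below is a simplification invented for this example; Delta Lake records its statistics in the transaction log in a richer form.

```python
# Sketch of data skipping: if each file's min/max values for a column are
# known, any file whose [min, max] range cannot contain the queried value
# can be skipped without being opened. Simplified illustration only.

files = [
    {"path": "part-0000.parquet", "min_id": 0,    "max_id": 499},
    {"path": "part-0001.parquet", "min_id": 500,  "max_id": 999},
    {"path": "part-0002.parquet", "min_id": 1000, "max_id": 1499},
]

def files_to_scan(files, value):
    """Return only the files whose statistics range can contain the value."""
    return [f["path"] for f in files if f["min_id"] <= value <= f["max_id"]]

# A point lookup for id = 750 touches one file out of three.
print(files_to_scan(files, 750))  # ['part-0001.parquet']
```

Because the statistics are collected on every write, this pruning applies to all Delta tables, optimized or not.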

How often should I run OPTIMIZE?

When you choose how often to run OPTIMIZE, there is a trade-off between performance and cost. You should run OPTIMIZE more often if you want better end-user query performance (necessarily at a higher cost because of resource usage). You should run it less often if you want to optimize cost.

We recommend you start by running OPTIMIZE on a daily basis. Then modify your job from there.

What’s the best instance type to run OPTIMIZE (bin-packing and Z-Ordering) on?

Both are CPU-intensive operations that perform large amounts of Parquet decoding and encoding.

We recommend Compute Optimized instance types. OPTIMIZE also benefits from attached SSDs.

Can I copy my Delta Lake table to another location?

Yes, you can copy your Delta Lake table to another location. Copy your files without changing the timestamps to ensure that the time travel with timestamps is consistent.
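
Whether timestamps survive a copy depends on the tool you use. As one illustration, in Python's standard library shutil.copytree (which uses shutil.copy2 under the hood) preserves modification times, while shutil.copyfile does not; the paths below are made up for the example.

```python
import pathlib
import shutil
import tempfile

# Demonstrates a metadata-preserving copy: shutil.copytree/copy2 keep each
# file's modification time, which is what timestamp-based time travel
# relies on. The table layout here is a stand-in, not a real Delta table.

src = pathlib.Path(tempfile.mkdtemp()) / "table"
dst = pathlib.Path(tempfile.mkdtemp()) / "table_copy"
src.mkdir()
(src / "part-0000.parquet").write_bytes(b"data")

shutil.copytree(src, dst)  # copy2 under the hood preserves timestamps

orig = (src / "part-0000.parquet").stat().st_mtime
copy = (dst / "part-0000.parquet").stat().st_mtime
print(orig == copy)
```

Cloud storage copy tools generally have an equivalent option; check that whichever tool you use carries modification times over rather than stamping the copy with the current time.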

Can I stream data directly into and from Delta tables?

Yes, you can use Structured Streaming to directly write data into Delta tables and read from Delta tables. See Stream data into Delta tables and Stream data from Delta tables.

Does Delta Lake support writes or reads using the Spark Streaming DStream API?

Delta Lake does not support the DStream API. We recommend using Structured Streaming instead; see Table streaming reads and writes.

When I use Delta Lake, will I be able to port my code to other Spark platforms easily?

Yes. Delta Lake uses open Apache Spark APIs, so you can easily port your code to other Spark platforms. To port your code, replace the delta format with the parquet format.

How do Delta tables compare to Hive SerDe tables?

Delta Lake manages the following Hive SerDe parameters on your behalf. Azure Databricks recommends that you don’t manually specify these values.


What DDL and DML features does Delta Lake not support?

  • Unsupported DDL features:
  • Unsupported DML features:
    • INSERT INTO [OVERWRITE] table with static partitions
    • INSERT OVERWRITE TABLE for table with dynamic partitions
    • Bucketing
    • Specifying a schema when reading from a table
    • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE

Does Delta Lake support multi-table transactions?

Delta Lake does not support multi-table transactions. Delta Lake supports transactions at the table level.

Primary key and foreign key relationship on Azure Databricks are informational and not enforced. See Declare primary key and foreign key relationships.

How can I change the type of a column?

Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change column type.

How do I find the last commit’s version in the Spark session?

To get the version number of the last commit written by the current SparkSession across all threads and all tables, query the SQL configuration key spark.databricks.delta.lastCommitVersionInSession.

If no commits have been made by the SparkSession, querying the key returns an empty value.


If you share the same SparkSession across multiple threads, it’s similar to sharing a variable across multiple threads; you may hit race conditions as the configuration value is updated concurrently.

What does it mean that Delta Lake supports multi-cluster writes?

It means that Delta Lake uses locking to make sure that queries writing to a table from multiple clusters at the same time won't corrupt the table. However, it does not mean that conflicting writes (for example, an update and a delete of the same record) will both succeed. Instead, one of the writes will fail atomically, and the error will tell you to retry the operation.
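
The pattern behind this is optimistic concurrency control: each writer commits against the table version it read, and if another writer committed first, the commit fails atomically and the caller retries against the new version. The toy model below illustrates the retry loop; it is not the Delta protocol itself, and the class and exception names are invented for the example.

```python
# Toy model of optimistic concurrency control, the idea behind Delta Lake's
# multi-cluster writes. A commit succeeds only if the table version has not
# advanced since the writer read it; otherwise the writer retries.

class Table:
    def __init__(self):
        self.version = 0

    def commit(self, read_version):
        """Atomically commit only if no other writer committed in between."""
        if read_version != self.version:
            raise RuntimeError("ConcurrentModificationException: retry")
        self.version += 1
        return self.version

def write_with_retry(table, attempts=5):
    """Re-read the latest version and retry the commit on conflict."""
    for _ in range(attempts):
        read_version = table.version   # start from the latest snapshot
        try:
            return table.commit(read_version)
        except RuntimeError:
            continue                   # conflict: re-read and try again
    raise RuntimeError("gave up after repeated conflicts")

t = Table()
stale = t.version
t.commit(t.version)         # another cluster commits first -> version 1
try:
    t.commit(stale)         # our stale commit fails atomically
except RuntimeError as e:
    print(e)                # ConcurrentModificationException: retry
print(write_with_retry(t))  # retry against the new version succeeds -> 2
```

Neither writer ever observes a half-applied change: a conflicting commit is rejected in full, which is why the table cannot be corrupted by concurrent writers.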

Can I modify a Delta table from different workspaces?

Yes, you can concurrently modify the same Delta table from different workspaces. Moreover, if one process is writing from a workspace, readers in other workspaces will see a consistent view.

Can I access Delta tables outside of Databricks Runtime?

There are two cases to consider: external reads and external writes.

  • External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools that understand this format to read the data. However, since other tools do not support the Delta Lake transaction log, it is likely that they will incorrectly read stale deleted data, uncommitted data, or the partial results of failed transactions.

    In cases where the data is static (that is, there are no active jobs writing to the table), you can use VACUUM with a retention of ZERO HOURS to clean up any stale Parquet files that are not currently part of the table. This operation puts the Parquet files present in DBFS into a consistent state such that they can now be read by external tools.

    However, Delta Lake relies on stale snapshots for the following functionality, which will fail when using VACUUM with zero retention allowance:

    • Snapshot isolation for readers: Long running jobs continue to read a consistent snapshot from the moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention less than the length of these jobs can cause them to fail with a FileNotFoundException.
    • Streaming from Delta tables: Streams read from the original files written into a table in order to ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can remove these files before the stream has time to process them, causing it to fail.

    For these reasons, Databricks recommends using this technique only on static data sets that must be read by external tools.

  • External writes: Delta Lake maintains additional metadata in a transaction log to enable ACID transactions and snapshot isolation for readers. To ensure that the transaction log is updated correctly and the proper validations are performed, writer implementations must strictly adhere to the Delta Transaction Protocol. Delta Lake in Databricks Runtime ensures ACID guarantees based on the Delta Transaction Protocol. Whether non-Spark Delta connectors that write to Delta tables can write with ACID guarantees depends on the connector implementation.