Where does Azure Databricks write data?

This article details the locations where Azure Databricks writes data during common operations and configurations. Because Azure Databricks provides a suite of tools that span many technologies and interact with cloud resources in a shared-responsibility model, the default locations used to store data vary based on the execution environment, configurations, and libraries.

The information in this article is meant to help you understand default paths for various operations and how configurations might alter these defaults. Data stewards and administrators looking for guidance on configuring and controlling access to data should see Data governance with Unity Catalog.

To learn about configuring object storage and other data sources, see Connect to data sources.

What is object storage?

In cloud computing, object storage or blob storage refers to storage containers that maintain data as objects, with each object consisting of data, metadata, and a globally unique identifier such as a uniform resource identifier (URI). Data manipulation operations in object storage are often limited to create, read, update, and delete (CRUD) through a REST API. Some object storage offerings include features like versioning and lifecycle management. Object storage has the following benefits:

  • High availability, durability, and reliability.
  • Lower cost for storage compared to most other storage options.
  • Near-limitless scalability (bounded only by the total amount of storage available in a given region of the cloud).

Most cloud-based data lakes are built on top of open source data formats in cloud object storage.

How does Azure Databricks use object storage?

Object storage is the main form of storage used by Azure Databricks for most operations. The Databricks Filesystem (DBFS) allows Azure Databricks users to interact with files in object storage much as they would with files in any other file system. Unless you specifically configure a table against an external data system, all tables created in Azure Databricks store data in cloud object storage.

Delta Lake files stored in cloud object storage provide the data foundation for the Databricks lakehouse.

What is block storage?

In cloud computing, block storage or disk storage refers to storage volumes that correspond to traditional hard disk drives (HDDs) or solid state drives (SSDs), also known simply as “hard drives”. When block storage is deployed in a cloud computing environment, typically a logical partition of one or more physical drives is deployed. Implementations vary slightly between product offerings and cloud vendors, but the following characteristics are typically found across implementations:

  • All virtual machines (VMs) require an attached block storage volume.
  • Files and programs installed to a block storage volume persist as long as the block storage volume persists.
  • Block storage volumes are often used for temporary data storage.
  • Block storage volumes attached to VMs are usually deleted alongside VMs.

How does Azure Databricks use block storage?

When you turn on compute resources, Azure Databricks configures and deploys VMs and attaches block storage volumes. This block storage is used for storing ephemeral data files for the lifetime of the compute. These files include the operating system and installed libraries, in addition to data used by the disk cache. While Apache Spark uses block storage in the background for efficient parallelization and data loading, most code run on Azure Databricks does not directly save or load data to block storage.

You can run arbitrary code such as Python or Bash commands that use the block storage attached to your driver node. See Work with files in ephemeral storage attached to the driver node.

Where does Unity Catalog store data files?

Unity Catalog relies on administrators to configure relationships between cloud storage and relational objects. The exact location where data resides depends on how administrators have configured relations.

Data written or uploaded to objects governed by Unity Catalog is stored in one of the following locations:

  • A managed storage location associated with a metastore, catalog, or schema. Data written or uploaded to managed tables and managed volumes uses managed storage. See Managed storage.
  • An external location configured with storage credentials. Data written or uploaded to external tables and external volumes uses external storage. See Connect to cloud object storage using Unity Catalog.

Where does Databricks SQL store data backing tables?

When you run a CREATE TABLE statement with Databricks SQL configured with Unity Catalog, the default behavior is to store data files in a managed storage location configured with Unity Catalog. See Where does Unity Catalog store data files?.

The legacy hive_metastore catalog follows different rules. See Work with Unity Catalog and the legacy Hive metastore.

Where does Delta Live Tables store data files?

Databricks recommends using Unity Catalog when creating Delta Live Tables (DLT) pipelines. Data is stored in directories within the managed storage location associated with the target schema.

You can optionally configure DLT pipelines using Hive metastore. When configured with Hive metastore, you can specify a storage location on DBFS or cloud object storage. If you do not specify a location, a location on the DBFS root is assigned to your pipeline.

Where does Apache Spark write data files?

Databricks recommends using object names with Unity Catalog for reading and writing data. You can also write files to Unity Catalog volumes using the following pattern: /Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>. You must have sufficient privileges to upload, create, update, or insert data to Unity Catalog-governed objects.
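The volume path pattern above can be assembled programmatically. The following is a minimal sketch, not an official API: the helper name volume_file_path and the catalog, schema, and volume names (main, sales, raw_files) are hypothetical and used only for illustration.

```python
import posixpath


def volume_file_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a path of the form /Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>."""
    for name in (catalog, schema, volume):
        # Each of the three levels must be a single, non-empty path segment.
        if not name or "/" in name:
            raise ValueError(f"invalid Unity Catalog name: {name!r}")
    # Strip surrounding slashes so no part resets the join to an absolute path.
    cleaned = [part.strip("/") for part in parts if part.strip("/")]
    return posixpath.join("/Volumes", catalog, schema, volume, *cleaned)


# Hypothetical catalog, schema, and volume names.
example_path = volume_file_path("main", "sales", "raw_files", "2024/01", "orders.csv")
print(example_path)
```

The resulting string can be passed wherever a file path is accepted, provided you hold the required privileges on the volume.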

You can optionally use uniform resource identifiers (URIs) to specify paths to data files. URIs vary depending on the cloud provider. Your current compute must also have write permissions configured to write to cloud object storage.
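As an illustration of what such a URI can look like, the sketch below builds and parses a path using the abfss scheme of Azure Data Lake Storage Gen2. This article does not prescribe a particular scheme, so treat the scheme as an assumption; the helper name abfss_uri and the container and storage account names are hypothetical.

```python
from urllib.parse import urlsplit


def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Format an ADLS Gen2 style URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


# Hypothetical container and storage account names.
uri = abfss_uri("landing", "examplestorageacct", "raw/orders/")
parts = urlsplit(uri)

# The container appears in the user-info position, the account in the host.
print(parts.scheme, parts.username, parts.hostname, parts.path)
```

Writing to such a location still requires that the compute running the code has write permissions on the storage.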

Azure Databricks uses the Databricks Filesystem to map Apache Spark read and write commands back to cloud object storage. Each Azure Databricks workspace comes with a DBFS root storage location configured in the cloud account allocated for the workspace, which all users can access for reading and writing data. Databricks does not recommend using the DBFS root for storing any production data. See What is the Databricks File System (DBFS)? and Recommendations for working with DBFS root.

Where does pandas write data files on Azure Databricks?

In Databricks Runtime 14.0 and above, the default current working directory (CWD) for all local Python read and write operations is the directory containing the notebook. If you provide only a filename when saving a data file, pandas saves that data file as a workspace file parallel to your currently running notebook.

Not all Databricks Runtime versions support workspace files, and some Databricks Runtime versions have differing behavior depending on whether you use notebooks or Git folders. See What is the default current working directory?.
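pandas resolves a bare filename the same way as any other local Python write: relative to the current working directory. The sketch below demonstrates that rule using only the standard library so it runs anywhere; a temporary directory stands in for the notebook directory, and the filename report.csv is hypothetical. On a runtime where the CWD is the notebook directory, a call such as df.to_csv("report.csv") would resolve in the same manner.

```python
import csv
import os
import tempfile
from pathlib import Path


def resolve_output(filename: str) -> Path:
    """Return the absolute path that a relative filename resolves to."""
    return (Path.cwd() / filename).resolve()


# A temporary directory stands in for the directory containing the notebook.
with tempfile.TemporaryDirectory() as notebook_dir:
    previous_cwd = os.getcwd()
    os.chdir(notebook_dir)
    try:
        # Writing with only a filename places the file in the CWD.
        with open("report.csv", "w", newline="") as handle:
            csv.writer(handle).writerows([["id", "value"], [1, "a"]])
        written_to = resolve_output("report.csv")
        landed_in_cwd = (
            written_to.exists()
            and written_to.parent == Path(notebook_dir).resolve()
        )
    finally:
        # Restore the original CWD before the temporary directory is removed.
        os.chdir(previous_cwd)

print(landed_in_cwd)
```

Printing os.getcwd() before saving is a quick way to confirm where a file will land on a given runtime.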

Where should I write temporary files on Azure Databricks?

If you must write temporary files that you do not want to keep after the cluster is shut down, writing them to $TEMPDIR yields better performance than writing to the current working directory (CWD) if the CWD is in the workspace filesystem. Writing to $TEMPDIR also avoids exceeding branch size limits if the code runs in a Repo. For more information, see File and repo size limits.

Write to /local_disk0 if the amount of data to be written is very large and you want the storage to autoscale.
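The temporary-file guidance above can be sketched in Python as follows. This is a minimal example under stated assumptions: it reads the TEMPDIR environment variable explicitly, because Python's tempfile module does not consult that variable by itself, and it falls back to the system default temporary directory when TEMPDIR is unset. The helper name scratch_dir is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def scratch_dir() -> Path:
    """Prefer $TEMPDIR when it points at a directory, else the system default."""
    candidate = os.environ.get("TEMPDIR")
    if candidate and Path(candidate).is_dir():
        return Path(candidate)
    return Path(tempfile.gettempdir())


# Create the temporary file outside the current working directory.
with tempfile.NamedTemporaryFile(
    "w", dir=scratch_dir(), suffix=".tmp", delete=False
) as handle:
    handle.write("intermediate results\n")
    temp_path = Path(handle.name)

content = temp_path.read_text()

# Remove the file once it is no longer needed.
temp_path.unlink()
```

Keeping scratch files out of the CWD this way also keeps them out of the workspace filesystem and any Repo the code runs in.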