What is the Databricks File System (DBFS)?
The Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage API calls.
Azure Databricks workspaces deploy with a DBFS root volume, accessible to all users by default. Databricks recommends against storing production data in this location.
What can you do with DBFS?
DBFS provides convenience by mapping cloud object storage URIs to relative paths. DBFS:
- Allows you to interact with object storage using directory and file semantics instead of cloud-specific API commands.
- Allows you to mount cloud object storage locations so that you can map storage credentials to paths in the Azure Databricks workspace.
- Simplifies the process of persisting files to object storage, allowing virtual machines and attached volume storage to be safely deleted on cluster termination.
- Provides a convenient location for storing init scripts, JARs, libraries, and configurations for cluster initialization.
- Provides a convenient location for checkpoint files created during model training with OSS deep learning libraries.
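The path mapping described above can be illustrated with a small helper. This is a minimal sketch, not part of any Databricks API: the function name `to_local_path` is hypothetical, and it assumes that DBFS is exposed to local file APIs on cluster nodes under a `/dbfs` mount point, which you should confirm for your cluster configuration.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"
FUSE_ROOT = "/dbfs"  # assumption: local mount point for DBFS on cluster nodes


def to_local_path(dbfs_uri: str) -> str:
    """Translate a 'dbfs:/a/b' URI into the local-style path '/dbfs/a/b'."""
    if not dbfs_uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError(f"expected a dbfs:/ URI, got {dbfs_uri!r}")
    # Drop the scheme and leading slashes, then re-root under the mount point.
    relative = dbfs_uri[len(DBFS_SCHEME):].lstrip("/")
    return str(PurePosixPath(FUSE_ROOT) / relative)
```

A library that only understands local paths, such as a deep learning framework writing checkpoints, could then be handed `to_local_path("dbfs:/tmp/checkpoints/epoch-1.pt")` instead of a cloud-specific URI.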
DBFS provides many options for interacting with files in cloud object storage:
- How to work with files on Azure Databricks
- List, move, copy, and delete files with Databricks Utilities
- Browse files in DBFS
- Upload files to DBFS with the UI
- Interact with DBFS files using the Databricks CLI
- Interact with DBFS files using the Databricks REST API
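As an illustration of the Databricks Utilities option, the sketch below copies one file into a target directory and lists the result. It relies on `dbutils.fs.mkdirs`, `dbutils.fs.cp`, and `dbutils.fs.ls`; the wrapper name `stage_file` and the example paths are hypothetical. The `dbutils` object is passed in as a parameter because it is only predefined inside Databricks notebooks.

```python
from typing import List


def stage_file(dbutils, source: str, target_dir: str) -> List[str]:
    """Copy one file into target_dir and return the file names found there."""
    # Create the target directory if it does not exist yet.
    dbutils.fs.mkdirs(target_dir)

    # Reuse the source file name under the target directory.
    file_name = source.rstrip("/").rsplit("/", 1)[-1]
    target = f"{target_dir.rstrip('/')}/{file_name}"
    dbutils.fs.cp(source, target)

    # Each entry returned by ls exposes attributes such as name and path.
    return sorted(info.name for info in dbutils.fs.ls(target_dir))
```

In a notebook, a call might look like `stage_file(dbutils, "dbfs:/tmp/raw/events.json", "dbfs:/tmp/staged/")`, where both paths are placeholders.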
Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system. Mounts store Hadoop configurations necessary for accessing storage, so you do not need to specify these settings in code or during cluster configuration.
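The following sketch shows one way a mount might be created for an Azure Data Lake Storage Gen2 container using a service principal. It is an illustration under stated assumptions rather than a recommended setup: the helper names `adls_oauth_configs` and `mount_container` are hypothetical, the container, account, and credential values are placeholders you must supply, and the Hadoop configuration keys are the ones commonly used for OAuth with the ABFS driver. In practice, the client secret would typically come from a secret scope rather than a literal string.

```python
def adls_oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the Hadoop settings that the mount stores for later access."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_container(dbutils, container: str, account: str,
                    mount_point: str, configs: dict) -> bool:
    """Mount the container unless mount_point is already in use.

    Returns True if a new mount was created, False if one already existed.
    """
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    source = f"abfss://{container}@{account}.dfs.core.windows.net/"
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True
```

After a mount such as `/mnt/sales` (a placeholder) exists, code can read `dbfs:/mnt/sales/...` without repeating any of these settings, because the mount holds them.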
For more information, see Mounting cloud object storage on Azure Databricks.
The DBFS root is the default storage location for an Azure Databricks workspace, provisioned as part of workspace creation in the cloud account containing the Azure Databricks workspace. For details on DBFS root configuration and deployment, see the Azure Databricks quickstart.
Some users of Azure Databricks refer to the DBFS root as “DBFS” or “the DBFS”, but the two are distinct concepts. DBFS is a file system used for interacting with data in cloud object storage, whereas the DBFS root is a specific cloud object storage location. You use DBFS to interact with the DBFS root, and DBFS has many applications beyond the DBFS root.
The DBFS root contains a number of special locations that serve as defaults for various actions performed by users in the workspace. For details, see What directories are in DBFS root by default?
How does DBFS work with Unity Catalog?
Unity Catalog adds the concepts of external locations and managed storage credentials to help organizations provide least-privilege access to data in cloud object storage. Unity Catalog also provides a new default storage location for managed tables. Some security configurations provide direct access to both Unity Catalog-managed resources and DBFS. Databricks has compiled recommendations for using DBFS and Unity Catalog.