Recommendations for files in volumes and workspace files

When you upload or save data or files to Azure Databricks, you can choose to store these files using Unity Catalog volumes or workspace files. This article contains recommendations and requirements for using these locations. For more details on volumes and workspace files, see What are Unity Catalog volumes? and What are workspace files?.

Databricks recommends using Unity Catalog volumes to store data, libraries, and build artifacts. Store notebooks, SQL queries, and code files as workspace files. You can configure workspace file directories as Git folders to sync with remote Git repositories. See Git integration with Databricks Git folders. Small data files used for test scenarios can also be stored as workspace files.
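
As a rough illustration of where each kind of file ends up, the sketch below builds paths for the two locations. It assumes the usual Databricks path conventions (`/Volumes/<catalog>/<schema>/<volume>/...` for volumes and `/Workspace/Users/<user>/...` for a user's workspace files), which this article does not spell out. The helper names and the example catalog, schema, volume, and user values are hypothetical.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a path to a file inside a Unity Catalog volume.

    Assumes the /Volumes/<catalog>/<schema>/<volume>/<path> convention.
    """
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


def workspace_path(user: str, *parts: str) -> str:
    """Build a path to a workspace file in a user's home directory.

    Assumes the /Workspace/Users/<user>/<path> convention.
    """
    return str(PurePosixPath("/Workspace/Users", user, *parts))


# Hypothetical names, for illustration only.
data_file = volume_path("main", "default", "landing", "sales", "2024.csv")
code_file = workspace_path("someone@example.com", "project", "etl.py")
print(data_file)  # /Volumes/main/default/landing/sales/2024.csv
print(code_file)  # /Workspace/Users/someone@example.com/project/etl.py
```

Building paths in one place like this may make it easier to keep data and libraries in volumes while code stays alongside notebooks as workspace files.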

The tables below provide specific recommendations based on file type and on the features you need.

Important

The Databricks File System (DBFS) is also available for file storage, but is not recommended, as all workspace users have access to files in DBFS. See DBFS.

File types

The following table provides storage recommendations for file types. The file formats listed are examples only; Databricks supports many others.

| File type | Recommendation |
| --- | --- |
| Databricks objects, such as notebooks and queries | Store as workspace files |
| Structured data files, such as Parquet files and ORC files | Store in Unity Catalog volumes |
| Semi-structured data files, such as text files (.csv, .txt) and JSON files (.json) | Store in Unity Catalog volumes |
| Unstructured data files, such as image files (.png, .svg), audio files (.mp3), and document files (.pdf, .docx) | Store in Unity Catalog volumes |
| Raw data files used for ad hoc or early data exploration | Store in Unity Catalog volumes |
| Operational data, such as log files | Store in Unity Catalog volumes |
| Large archive files, such as ZIP files (.zip) | Store in Unity Catalog volumes |
| Source code files, such as Python files (.py), Java files (.java), and Scala files (.scala) | Store as workspace files, if applicable, with other related objects, such as notebooks and queries. Databricks recommends managing these files in a Git folder for version control and change tracking. |
| Build artifacts and libraries, such as Python wheels (.whl) and JAR files (.jar) | Store in Unity Catalog volumes |
| Configuration files | Store configuration files needed across workspaces in Unity Catalog volumes, but store them as workspace files if they are project files in a Git folder. |
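
The file type recommendations above can be sketched as a small lookup by file extension. This is only an illustrative helper, not part of any Databricks API: the function name `recommend_storage` is hypothetical, and the `.parquet`, `.orc`, and `.log` extensions are assumptions standing in for the Parquet, ORC, and log files the table mentions without naming an extension. Configuration files are deliberately left out because their recommendation depends on how they are used.

```python
from pathlib import PurePosixPath
from typing import Optional

WORKSPACE = "workspace files"
VOLUME = "Unity Catalog volume"

# Source code extensions listed in the table.
_WORKSPACE_EXTENSIONS = {".py", ".java", ".scala"}

# Data, archive, and build artifact extensions listed in the table.
# .parquet, .orc, and .log are assumed extensions (not spelled out in the table).
_VOLUME_EXTENSIONS = {
    ".parquet", ".orc", ".csv", ".txt", ".json",
    ".png", ".svg", ".mp3", ".pdf", ".docx",
    ".log", ".zip", ".whl", ".jar",
}


def recommend_storage(filename: str) -> Optional[str]:
    """Return the recommended location, or None if the extension is not covered."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in _WORKSPACE_EXTENSIONS:
        return WORKSPACE
    if suffix in _VOLUME_EXTENSIONS:
        return VOLUME
    return None


for name in ["etl/job.py", "dist/pkg-1.0-py3-none-any.whl", "settings.yml"]:
    print(name, "->", recommend_storage(name))
```

Returning `None` for unlisted extensions keeps the helper from guessing in cases such as configuration files, where the table gives a conditional recommendation.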

Feature comparison

The following table compares the feature offerings of workspace files and Unity Catalog volumes.

| Feature | Workspace files | Unity Catalog volumes |
| --- | --- | --- |
| File access | Workspace files are accessible to each other only within the same workspace. | Files are globally accessible across workspaces. |
| Programmatic access | Files can be accessed using Spark APIs, FUSE, dbutils, the REST API, Databricks SDKs, and the Databricks CLI. | Files can be accessed using Spark APIs, FUSE, dbutils, the REST API, Databricks SDKs, Databricks SQL Connectors, the Databricks CLI, and the Databricks Terraform Provider. |
| Databricks Asset Bundles | By default, all files in a bundle, which includes libraries and Databricks objects such as notebooks and queries, are deployed securely as workspace files. Permissions are defined in the bundle configuration. | Bundles can be customized to include libraries already in volumes when the libraries exceed the size limit of workspace files. See Databricks Asset Bundles library dependencies. |
| File permission level | Permissions are at the Git-folder level if the file is in a Git folder; otherwise, permissions are set at the file level. | Permissions are at the volume level. |
| Permissions management | Permissions are managed by workspace ACLs and are limited to the containing workspace. | Metadata and permissions are managed by Unity Catalog. These permissions apply across all workspaces that have access to the catalog. |
| External storage mount | Does not support mounting external storage. | Provides the option to point to pre-existing datasets on external storage by creating an external volume. See What are Unity Catalog volumes?. |
| UDF support | Not supported. | Writing from UDFs is supported using Volumes FUSE. |
| File size | Store smaller files of less than 500MB, such as source code files (.py, .md, .yml) needed alongside notebooks. | Store very large data files, up to limits determined by cloud service providers. |
| Upload & download | Support for upload and download up to 10MB. | Support for upload and download up to 5GB. |
| Table creation support | Tables cannot be created with workspace files as the location. | Tables can be created from files in a volume by running COPY INTO, Auto Loader, or other options described in Ingest data into a Databricks lakehouse. |
| Directory structure & file paths | Files are organized in nested directories, each with its own permission model: user home directories (one for each user and service principal in the workspace), Git folders, and Shared. | Files are organized in nested directories inside a volume. See How can you access data in Unity Catalog?. |
| File history | Use Git folders within workspaces to track file changes. | Audit logs are available. |
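
The size limits in the feature comparison can be turned into a quick pre-check before choosing a location. The sketch below is a hypothetical helper, not a Databricks API. It assumes the limits are binary megabytes and gigabytes, and it treats "less than 500MB" as a strict bound and "up to" as inclusive. Volume storage itself is not checked, because the article only says that its limit is set by the cloud provider.

```python
MB = 1024 ** 2  # assumption: binary units; the article does not say which
GB = 1024 ** 3

WORKSPACE_FILE_LIMIT = 500 * MB      # workspace files: less than 500MB
WORKSPACE_TRANSFER_LIMIT = 10 * MB   # workspace upload and download: up to 10MB
VOLUME_TRANSFER_LIMIT = 5 * GB       # volume upload and download: up to 5GB


def storage_options(size_bytes: int) -> dict:
    """Report which documented limits a file of the given size fits within."""
    return {
        "workspace_file": size_bytes < WORKSPACE_FILE_LIMIT,
        "workspace_upload_download": size_bytes <= WORKSPACE_TRANSFER_LIMIT,
        "volume_upload_download": size_bytes <= VOLUME_TRANSFER_LIMIT,
    }


# A 600MB library is too large for workspace files but within the volume transfer limit.
print(storage_options(600 * MB))
```

A check like this may be useful in a deployment script that decides whether a library should be referenced from a volume instead of being bundled as a workspace file.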