How to work with files on Azure Databricks
You can work with files on DBFS, the local driver node of the cluster, cloud object storage, external locations, and in Databricks Repos. You can integrate other systems, but many of these do not provide direct file access to Azure Databricks.
This article focuses on understanding the differences between interacting with files stored in the ephemeral volume storage attached to a running cluster and files stored in the DBFS root. You can directly apply the concepts shown for the DBFS root to mounted cloud object storage, because the /mnt directory is under the DBFS root. Most examples can also be applied to direct interactions with cloud object storage and external locations if you have the required privileges.
What is the root path for Azure Databricks?
The root path on Azure Databricks depends on the code executed.
The DBFS root is the root path for Spark and DBFS commands. These include:
- Spark SQL
- DataFrames (spark.read and df.write)
- %fs magic commands
- Databricks Utilities (dbutils.fs)
The block storage volume attached to the driver is the root path for code executed locally. This includes:
- %sh magic commands
- Most Python code (not PySpark)
- Most Scala code (not Spark)
In Databricks Runtime 14.0 and above, the default current working directory (CWD) for code executed locally is the directory containing the notebook or script being run. This is a change in behavior from Databricks Runtime 13.3 LTS and below. See What is the default current working directory in Databricks Runtime 14.0 and above?.
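To illustrate why the CWD matters, here is a minimal plain-Python sketch (not Databricks-specific; the directory and file names are made up) showing how relative paths resolve against the current working directory:

```python
import os
import tempfile

# Create a scratch directory standing in for the directory that contains
# the notebook or script being run.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "notes.txt"), "w") as f:
    f.write("hello")

# Emulate the Databricks Runtime 14.0+ behavior: the CWD is the directory
# containing the code. Relative paths now resolve against it.
os.chdir(workdir)
with open("notes.txt") as f:  # relative path, resolved against the CWD
    print(f.read())  # -> hello
```

Under Databricks Runtime 13.3 LTS and below, the same relative path would have resolved against a different default directory, so code relying on relative paths may behave differently across runtime versions.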
Access files on the DBFS root
When using commands that default to the DBFS root, you can use the relative path or include dbfs:/:

```sql
SELECT * FROM parquet.`<path>`;
SELECT * FROM parquet.`dbfs:/<path>`;
```

```python
df = spark.read.load("<path>")
df.write.save("<path>")
```

```
%fs <command> /<path>
```
When using commands that default to the driver volume, you must use /dbfs before the path:

```
%sh <command> /dbfs/<path>/
```

```python
import os
os.<command>('/dbfs/<path>')
```
When using commands that default to the driver storage, you can provide a relative or absolute path:

```
%sh <command> /<path>
```

```python
import os
os.<command>('/<path>')
```
When using commands that default to the DBFS root, you must use file:/ to read from the local filesystem:

```
%fs <command> file:/<path>
```
Because these files live on the attached driver volumes and Spark is a distributed processing engine, not all operations can directly access data here. If you need to move data from the driver filesystem to DBFS, you can copy files using magic commands or the Databricks utilities.
```python
dbutils.fs.cp("file:/<path>", "dbfs:/<path>")
```

```
%sh cp /<path> /dbfs/<path>
```

```
%fs cp file:/<path> /<path>
```
Understand default locations with examples
The following table summarizes the commands described in this section and when to use each syntax.
Commands leveraging open source or driver-only execution use FUSE to access data in cloud object storage. Adding /dbfs to the file path automatically uses the DBFS implementation of FUSE.
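To make the dbfs:/-to-/dbfs mapping concrete, here is a sketch of a hypothetical helper (to_fuse_path is not part of any Databricks library; it just applies the path convention described above):

```python
def to_fuse_path(dbfs_path: str) -> str:
    """Map a DBFS URI such as 'dbfs:/tmp/f.txt' to the FUSE path
    '/dbfs/tmp/f.txt' that local file APIs (%sh, os, open) can use.

    Hypothetical illustration only: it simply rewrites the prefix per the
    convention described above.
    """
    prefix = "dbfs:/"
    if dbfs_path.startswith(prefix):
        dbfs_path = dbfs_path[len(prefix):]
    return "/dbfs/" + dbfs_path.lstrip("/")

print(to_fuse_path("dbfs:/tmp/test_dbfs.txt"))  # -> /dbfs/tmp/test_dbfs.txt
print(to_fuse_path("/tmp/file_b.txt"))          # -> /dbfs/tmp/file_b.txt
```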
|Command|Default location|To read from DBFS root|To read from local filesystem|
|---|---|---|---|
|%fs|DBFS root|Use the path directly|Add file:/ to the path|
|dbutils.fs|DBFS root|Use the path directly|Add file:/ to the path|
|%sh|Local driver node|Add /dbfs to the path|Use the path directly|
|os.<command>|Local driver node|Add /dbfs to the path|Use the path directly|
|Spark APIs|DBFS root|Use the path directly|Not supported|
```
# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt
```

```python
# Default location for dbutils.fs is root
dbutils.fs.ls("/tmp/")
dbutils.fs.put("/tmp/my_new_file", "This is a file in cloud storage.")
```

```
# Default location for %sh is the local filesystem
%sh ls /dbfs/tmp/
```

```python
# Default location for os commands is the local filesystem
import os
os.listdir('/dbfs/tmp')
```

```
# With %fs and dbutils.fs, you must use file:/ to read from the local filesystem
%fs ls file:/tmp
%fs mkdirs file:/tmp/my_local_dir
dbutils.fs.ls("file:/tmp/")
dbutils.fs.put("file:/tmp/my_new_file", "This is a file on the local driver node.")
```

```
# %sh reads from the local filesystem by default
%sh ls /tmp
```
Access files on mounted object storage
Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.
```python
dbutils.fs.ls("/mnt/mymount")
df = spark.read.format("text").load("dbfs:/mnt/mymount/my_file.txt")
```
The following list describes the limitations of local file API usage with the DBFS root and mounts in Databricks Runtime.
- Does not support credential passthrough.
- Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs. For example:
```python
# python
import xlsxwriter
from shutil import copyfile

workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Key")
worksheet.write(0, 1, "Value")
workbook.close()

copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
```
- No sparse files. To copy sparse files, use cp --sparse=never:
```
$ cp sparse.file /dbfs/sparse.file
error writing '/dbfs/sparse.file': Operation not supported
$ cp --sparse=never sparse.file /dbfs/sparse.file
```
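For context on what makes a file sparse, the following plain-Python sketch creates one locally (the file name is illustrative; whether the hole actually skips disk allocation depends on the underlying filesystem):

```python
import os
import tempfile

# Create a sparse file: seek far past the end and write a single byte.
# The skipped range becomes a "hole" that most Linux filesystems do not
# back with physical disk blocks.
path = os.path.join(tempfile.mkdtemp(), "sparse.file")
with open(path, "wb") as f:
    f.seek(10 * 1024 * 1024 - 1)  # 10 MiB logical size
    f.write(b"\0")

st = os.stat(path)
print(st.st_size)  # logical size: 10485760 bytes
# st.st_blocks * 512 is the physical allocation, which is typically far
# smaller for a sparse file. Copying such a file into /dbfs requires
# materializing the holes, hence cp --sparse=never.
```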