How to work with files on Azure Databricks
You can work with files on DBFS, the local driver node of the cluster, cloud object storage, external locations, and in Databricks Repos. You can integrate other systems, but many of these do not provide direct file access to Azure Databricks.
This article focuses on understanding the differences between interacting with files stored in the ephemeral volume storage attached to a running cluster and files stored in the DBFS root. You can directly apply the concepts shown for the DBFS root to mounted cloud object storage, because the /mnt directory is under the DBFS root. Most examples can also be applied to direct interactions with cloud object storage and external locations if you have the required privileges.
What is the root path for Azure Databricks?
The root path on Azure Databricks depends on the code executed.
The DBFS root is the root path for Spark and DBFS commands. These include:
- Spark SQL
- DataFrames (spark.read and df.write)
- %fs magic commands
- Databricks Utilities (dbutils.fs)
The block storage volume attached to the driver is the root path for code executed locally. This includes:
- %sh magic commands
- Most Python code (not PySpark)
- Most Scala code (not Spark)
In Databricks Runtime 14.0 and above, the default current working directory (CWD) for code executed locally is the directory containing the notebook or script being run. This is a change in behavior from Databricks Runtime 13.3 LTS and below. See What is the default current working directory in Databricks Runtime 14.0 and above?.
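To illustrate why the CWD matters, here is a minimal plain-Python sketch (not Databricks-specific; the directory and file names are made up) showing how relative paths resolve against the current working directory:

```python
import os
import tempfile

# Create a scratch directory standing in for the directory that contains
# the notebook or script being run.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "notes.txt"), "w") as f:
    f.write("hello")

# Emulate the Databricks Runtime 14.0+ behavior: the CWD is the directory
# containing the code. Relative paths now resolve against it.
os.chdir(workdir)
with open("notes.txt") as f:  # relative path, resolved against the CWD
    print(f.read())  # -> hello
```

Under Databricks Runtime 13.3 LTS and below, the same relative path would have resolved against a different default directory, so code relying on relative paths may behave differently across runtime versions.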
Access files on the DBFS root
When using commands that default to the DBFS root, you can use the relative path or include dbfs:/:

```sql
SELECT * FROM parquet.`<path>`;
SELECT * FROM parquet.`dbfs:/<path>`;
```

```python
df = spark.read.load("<path>")
df.write.save("<path>")
```

```
%fs <command> /<path>
```
When using commands that default to the driver volume, you must use /dbfs before the path:

```
%sh <command> /dbfs/<path>/
```

```python
import os
os.<command>('/dbfs/<path>')
```
When using commands that default to the driver storage, you can provide a relative or absolute path:

```
%sh <command> /<path>
```

```python
import os
os.<command>('/<path>')
```
When using commands that default to the DBFS root, you must use file:/ to read from the local filesystem:

```
%fs <command> file:/<path>
```
Because these files live on the attached driver volumes and Spark is a distributed processing engine, not all operations can directly access data here. If you need to move data from the driver filesystem to DBFS, you can copy files using magic commands or the Databricks utilities.
```python
dbutils.fs.cp("file:/<path>", "dbfs:/<path>")
```

```
%sh cp /<path> /dbfs/<path>
```

```
%fs cp file:/<path> /<path>
```
Understand default locations with examples
The following table summarizes the commands described in this section and when to use each syntax.
Commands leveraging open source or driver-only execution use FUSE to access data in cloud object storage. Adding /dbfs to the file path automatically uses the DBFS implementation of FUSE.
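To make the dbfs:/-to-/dbfs mapping concrete, here is a sketch of a hypothetical helper (to_fuse_path is not part of any Databricks library; it just applies the path convention described above):

```python
def to_fuse_path(dbfs_path: str) -> str:
    """Map a DBFS URI such as 'dbfs:/tmp/f.txt' to the FUSE path
    '/dbfs/tmp/f.txt' that local file APIs (%sh, os, open) can use.

    Hypothetical illustration only: it simply rewrites the prefix per the
    convention described above.
    """
    prefix = "dbfs:/"
    if dbfs_path.startswith(prefix):
        dbfs_path = dbfs_path[len(prefix):]
    return "/dbfs/" + dbfs_path.lstrip("/")

print(to_fuse_path("dbfs:/tmp/test_dbfs.txt"))  # -> /dbfs/tmp/test_dbfs.txt
print(to_fuse_path("/tmp/file_b.txt"))          # -> /dbfs/tmp/file_b.txt
```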
|Command|Default location|To read from DBFS root|To read from local filesystem|
|---|---|---|---|
|%fs|DBFS root|Use the path directly|Add file:/ to the path|
|dbutils.fs|DBFS root|Use the path directly|Add file:/ to the path|
|%sh|Local driver node|Add /dbfs to the path|Use the path directly|
|os.<command>|Local driver node|Add /dbfs to the path|Use the path directly|
|Spark APIs|DBFS root|Use the path directly|Not supported|
```
# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt
```

```python
# Default location for dbutils.fs is root
dbutils.fs.ls("/tmp/")
dbutils.fs.put("/tmp/my_new_file", "This is a file in cloud storage.")
```

```
# Default location for %sh is the local filesystem
%sh ls /dbfs/tmp/
```

```python
# Default location for os commands is the local filesystem
import os
os.listdir('/dbfs/tmp')
```

```
# With %fs and dbutils.fs, you must use file:/ to read from the local filesystem
%fs ls file:/tmp
%fs mkdirs file:/tmp/my_local_dir
dbutils.fs.ls("file:/tmp/")
dbutils.fs.put("file:/tmp/my_new_file", "This is a file on the local driver node.")
```

```
# %sh reads from the local filesystem by default
%sh ls /tmp
```
Access files on mounted object storage
Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.
```python
dbutils.fs.ls("/mnt/mymount")
df = spark.read.format("text").load("dbfs:/mnt/mymount/my_file.txt")
```
The following list describes the limitations of local file API usage with the DBFS root and mounts in Databricks Runtime.
- Does not support credential passthrough.
- Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs. For example:
```python
# python
import xlsxwriter
from shutil import copyfile

workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Key")
worksheet.write(0, 1, "Value")
workbook.close()

copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
```
- No sparse files. To copy sparse files, use cp --sparse=never:
```
$ cp sparse.file /dbfs/sparse.file
error writing '/dbfs/sparse.file': Operation not supported
$ cp --sparse=never sparse.file /dbfs/sparse.file
```
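For context on what makes a file sparse, the following plain-Python sketch creates one locally (the file name is illustrative; whether the hole actually skips disk allocation depends on the underlying filesystem):

```python
import os
import tempfile

# Create a sparse file: seek far past the end and write a single byte.
# The skipped range becomes a "hole" that most Linux filesystems do not
# back with physical disk blocks.
path = os.path.join(tempfile.mkdtemp(), "sparse.file")
with open(path, "wb") as f:
    f.seek(10 * 1024 * 1024 - 1)  # 10 MiB logical size
    f.write(b"\0")

st = os.stat(path)
print(st.st_size)  # logical size: 10485760 bytes
# st.st_blocks * 512 is the physical allocation, which is typically far
# smaller for a sparse file. Copying such a file into /dbfs requires
# materializing the holes, hence cp --sparse=never.
```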