notebookutils.fs provides utilities for working with various file systems, including Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage. Make sure you configure access to Azure Data Lake Storage Gen2 and Azure Blob Storage appropriately.
Run the following command for an overview of the available methods:
notebookutils.fs.help()
The following table lists the available file system methods:
| Method | Signature | Description |
|---|---|---|
| `ls` | `ls(path: String): Array` | Lists the contents of a directory. |
| `mkdirs` | `mkdirs(path: String): Boolean` | Creates the given directory if it doesn't exist, also creating any necessary parent directories. |
| `cp` | `cp(src: String, dest: String, recurse: Boolean = false): Boolean` | Copies a file or directory, possibly across file systems. |
| `fastcp` | `fastcp(src: String, dest: String, recurse: Boolean = true, extraConfigs: Map = None): Boolean` | Copies a file or directory via azcopy for better performance with large data volumes. |
| `mv` | `mv(src: String, dest: String, create_path: Boolean, overwrite: Boolean = false): Boolean` | Moves a file or directory, possibly across file systems. |
| `put` | `put(file: String, content: String, overwrite: Boolean = false): Boolean` | Writes the given string out to a file, encoded in UTF-8. |
| `head` | `head(file: String, max_bytes: int = 1024 * 100): String` | Returns up to the first max_bytes bytes of the given file as a String encoded in UTF-8. |
| `append` | `append(file: String, content: String, createFileIfNotExists: Boolean = false): Boolean` | Appends the content to a file. |
| `rm` | `rm(path: String, recurse: Boolean = false): Boolean` | Removes a file or directory. |
| `exists` | `exists(path: String): Boolean` | Checks if a file or directory exists. |
| `getProperties` | `getProperties(path: String): Map` | Gets the properties of the given path. Available in Python notebooks only (not supported in PySpark, Scala, or R). |
Note
All file system methods are available in Python, PySpark, Scala, and R notebooks unless otherwise noted. Scala uses camelCase parameter names (for example, createPath instead of create_path, maxBytes instead of max_bytes).
For mount and unmount operations, see File mount and unmount.
Note
Keep the following constraints and considerations in mind when you work with notebookutils.fs:
- Path behavior varies by notebook type: In Spark notebooks, relative paths resolve to the default Lakehouse ABFSS path. In Python notebooks, relative paths resolve to the local file system working directory (`/home/trusted-service-user/work`).
- Concurrent write limitations: `notebookutils.fs.append()` and `notebookutils.fs.put()` don't support concurrent writes to the same file due to a lack of atomicity guarantees.
- Append loop delay: When using `notebookutils.fs.append()` in loops, add a 0.5 to 1 second sleep between writes for data integrity.
- OneLake shortcut limitations: For S3/GCS type shortcuts, use mounted paths instead of ABFS paths for `cp()` and `fastcp()` operations.
- Cross-region limitations: `fastcp()` doesn't support copying files in OneLake across regions. Use `cp()` instead.
- Runtime version: NotebookUtils is designed to work with Spark 3.4 (Runtime v1.2) and above.
- `cp()` behavior in Python notebooks: In Python notebooks, `cp()` internally uses the same azcopy-based mechanism as `fastcp()`, so both methods behave identically.
NotebookUtils works with the file system in the same way as Spark APIs. Take `notebookutils.fs.mkdirs()` and Lakehouse usage as an example:
| Usage | Relative path from HDFS root | Absolute path for ABFS file system | Absolute path for local file system in driver node |
|---|---|---|---|
| Non-default Lakehouse | Not supported | `notebookutils.fs.mkdirs("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<new_dir>")` | `notebookutils.fs.mkdirs("file:/<new_dir>")` |
| Default Lakehouse | Directory under 'Files' or 'Tables': `notebookutils.fs.mkdirs("Files/<new_dir>")` | `notebookutils.fs.mkdirs("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<new_dir>")` | `notebookutils.fs.mkdirs("file:/<new_dir>")` |
For the default Lakehouse, file paths are mounted in your notebook with a default file cache timeout of 120 seconds. This means that files are cached in the notebook's local temporary folder for 120 seconds, even if they're removed from the Lakehouse. If you want to change the timeout rule, you can unmount the default Lakehouse file paths and mount them again with a different `fileCacheTimeout` value.

For non-default Lakehouse configurations, you can set the appropriate `fileCacheTimeout` parameter when you mount the Lakehouse paths. Setting the timeout to 0 ensures that the latest file is always fetched from the Lakehouse server.
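A remount with a different cache timeout might be sketched as follows. This is a hedged sketch: it assumes an `fs` object that exposes `mount` and `unmount` in the shape described in File mount and unmount, and `remount_with_timeout` is a hypothetical helper name, not a NotebookUtils API.

```python
def remount_with_timeout(fs, source: str, mount_point: str,
                         file_cache_timeout: int) -> None:
    """Unmount a path and remount it with a different fileCacheTimeout.

    `fs` is assumed to behave like notebookutils.fs; the mount signature
    here is an assumption based on the mount/unmount documentation.
    """
    fs.unmount(mount_point)
    # A timeout of 0 means the latest file is always fetched from the
    # Lakehouse server instead of the local cache.
    fs.mount(source, mount_point, {"fileCacheTimeout": file_cache_timeout})
```

The helper takes `fs` as a parameter so the remount logic can be exercised against a stub outside a Fabric runtime.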
List files
To list the content of a directory, use notebookutils.fs.ls('Your directory path'). For example:
notebookutils.fs.ls("Files/tmp") # Relative path works with different base paths depending on notebook type
notebookutils.fs.ls("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>") # Absolute path using ABFS file system
notebookutils.fs.ls("file:/tmp") # Full path of the local file system of driver node
The notebookutils.fs.ls() API behaves differently when using a relative path, depending on the type of notebook.
In a Spark notebook: The relative path is relative to the default Lakehouse's ABFSS path. For example, `notebookutils.fs.ls("Files")` points to the `Files` directory in the default Lakehouse:

notebookutils.fs.ls("Files/sample_datasets/public_holidays.parquet")

In a Python notebook: The relative path is relative to the local file system's working directory, which by default is `/home/trusted-service-user/work`. To access the `Files` directory in the default Lakehouse, use the full path `notebookutils.fs.ls("/lakehouse/default/Files")` instead of a relative path:

notebookutils.fs.ls("/lakehouse/default/Files/sample_datasets/public_holidays.parquet")
View file properties
Use notebookutils.fs.ls() to inspect file properties such as file name, file path, file size, and whether an item is a file or directory.
files = notebookutils.fs.ls('Your directory path')
for file in files:
print(file.name, file.isDir, file.isFile, file.path, file.size)
Use f-strings if you want more readable output:
files = notebookutils.fs.ls("Files/data")
for file in files:
print(f"Name: {file.name}, Size: {file.size}, IsDir: {file.isDir}, Path: {file.path}")
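Building on the entry fields shown above (`path`, `size`, `isDir`), a recursive helper can total the bytes under a directory. This is a sketch: `total_size` is a hypothetical helper, and the `fs` parameter stands in for `notebookutils.fs` so the logic can be tested with a stub.

```python
def total_size(fs, path: str) -> int:
    """Recursively sum file sizes under `path` using ls() entries."""
    size = 0
    for entry in fs.ls(path):
        if entry.isDir:
            # Descend into subdirectories and accumulate their totals.
            size += total_size(fs, entry.path)
        else:
            size += entry.size
    return size
```

In a notebook you would call it as `total_size(notebookutils.fs, "Files/data")`, assuming the directory is not so deep that recursion depth becomes a concern.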
Create new directory
Create a directory if it doesn't exist, including any necessary parent directories.
notebookutils.fs.mkdirs('new directory name')
notebookutils.fs.mkdirs("Files/<new_dir>") # Works with the default Lakehouse files using relative path
notebookutils.fs.mkdirs("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<new_dir>") # Based on ABFS file system
notebookutils.fs.mkdirs("file:/<new_dir>") # Based on local file system of driver node
Copy file
Copy a file or directory across file systems. Set recurse=True to copy directories recursively.
notebookutils.fs.cp('source file or directory', 'destination file or directory', recurse=True)
Note
In Python notebooks, `cp()` internally uses the same azcopy-based mechanism as `fastcp()`, so both methods perform efficiently.

Due to OneLake shortcut limitations, when you use `notebookutils.fs.cp()` to copy data from an S3/GCS type shortcut, use a mounted path instead of an ABFS path.
Tip
Always check the Boolean return value to verify whether the operation succeeded. Use notebookutils.fs.exists() to verify the source path before you start a copy operation.
The following example shows a cross-storage copy from the default Lakehouse to an ADLS Gen2 account:
notebookutils.fs.cp(
"Files/local_data",
"abfss://<container>@<account>.dfs.core.windows.net/remote_data",
recurse=True
)
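Following the tip above, a copy can be wrapped so that the source is verified first and the Boolean return value is checked. This is a sketch under the assumption that `fs` behaves like `notebookutils.fs`; `checked_copy` is a hypothetical helper name, not part of the API.

```python
def checked_copy(fs, src: str, dest: str, recurse: bool = True) -> bool:
    """Copy src to dest only if src exists, and verify the result flag."""
    if not fs.exists(src):
        raise FileNotFoundError(f"source path not found: {src}")
    ok = fs.cp(src, dest, recurse)
    if not ok:
        # cp() signals failure via its Boolean return value.
        raise IOError(f"copy failed: {src} -> {dest}")
    return ok
```

Raising on failure turns a silently ignored `False` into an explicit error you can catch or let surface in the notebook output.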
Performant copy file
Use fastcp for more efficient copy operations, especially with large data volumes. The recurse parameter defaults to True.
notebookutils.fs.fastcp('source file or directory', 'destination file or directory', recurse=True)
Tip
Use fastcp() instead of cp() for large data transfers. The fastcp method uses azcopy under the hood, which provides significantly better throughput for bulk file operations. In Python notebooks, both cp() and fastcp() use the same underlying mechanism.
Keep these considerations in mind:
- `notebookutils.fs.fastcp()` doesn't support copying files in OneLake across regions. In this case, use `notebookutils.fs.cp()` instead.
- Due to OneLake shortcut limitations, when you use `notebookutils.fs.fastcp()` to copy data from an S3/GCS type shortcut, use a mounted path instead of an ABFS path.
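The cross-region limitation can be handled with a small wrapper that falls back from `fastcp()` to `cp()`. This is a sketch: `copy_fast_or_fallback` is a hypothetical helper, and the caller must supply the `cross_region` flag, because NotebookUtils doesn't detect cross-region copies for you.

```python
def copy_fast_or_fallback(fs, src: str, dest: str,
                          cross_region: bool = False) -> bool:
    """Prefer fastcp() for throughput, but fall back to cp() for
    cross-region OneLake copies, which fastcp() doesn't support."""
    if cross_region:
        return fs.cp(src, dest, True)
    return fs.fastcp(src, dest, True)
```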
Preview file content
Return up to the first max_bytes bytes of a file as a UTF-8 string.
notebookutils.fs.head('file path', max_bytes)
Tip
For large files, use head() with an appropriate max_bytes value to avoid memory issues. The default value is 100 KB (1024 * 100).
The following example reads the first 1,000 bytes of a file:
content = notebookutils.fs.head("Files/data/sample.txt", 1000)
print(content)
Note
The default value for max_bytes differs across languages: Python and Scala notebooks use 102400 (100 KB), while R notebooks use 65535 (64 KB). In Scala, this parameter is named maxBytes.
Move file
Move a file or directory across file systems.
notebookutils.fs.mv('source file or directory', 'destination directory', create_path=True, overwrite=True)
Important
The `create_path` parameter default varies by runtime:

- Spark notebooks (PySpark, Scala, R): defaults to `False` (`false` in Scala, `FALSE` in R). The parent directory must exist before the move operation.
- Python notebooks: defaults to `True`. The parent directory is automatically created if it doesn't exist.

To ensure consistent behavior across runtimes, explicitly set the `create_path` parameter in your code. In Scala, this parameter is named `createPath`.
Use named parameters if you want clearer code:
notebookutils.fs.mv("Files/source.csv", "Files/new_folder/dest.csv", create_path=True, overwrite=True)
Write file
Write a UTF-8 string to a file.
notebookutils.fs.put("file path", "content to write", True) # Set the last parameter as True to overwrite the file if it already exists
Append content to a file
Append a UTF-8 string to a file.
notebookutils.fs.append("file path", "content to append", True) # Set the last parameter as True to create the file if it doesn't exist
Important
notebookutils.fs.append() and notebookutils.fs.put() don't support concurrent writing to the same file due to a lack of atomicity guarantees.
When you use the notebookutils.fs.append API in a loop to write to the same file, add a sleep of about 0.5 to 1 second between writes. The API's internal flush operation is asynchronous, so a short delay helps ensure data integrity.
import time
for i in range(100):
notebookutils.fs.append("Files/output/data.txt", f"Line {i}\n", True)
time.sleep(0.5) # Prevent data integrity issues
Delete file or directory
Remove a file or directory. Set recurse=True to remove directories recursively.
notebookutils.fs.rm('file path', recurse=True)
Check if a file or directory exists
Check whether a file or directory exists at the specified path. It returns True if the path exists; otherwise, it returns False.
notebookutils.fs.exists("Files/data/input.csv")
Tip
Use exists() before performing file operations to prevent errors. For example, check that a source file exists before you try to copy or move it.
if notebookutils.fs.exists("Files/data/input.csv"):
notebookutils.fs.cp("Files/data/input.csv", "Files/backup/input.csv")
print("File copied successfully.")
else:
print("Source file not found.")
Get file properties
Get properties for a path as a map of name-value pairs. It's only supported for Azure Blob Storage paths.
Note
The getProperties method is available only in Python notebooks. It isn't supported in Spark notebooks (PySpark, Scala, or R).
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `path` | String | Yes | ABFS path to the file or directory. |
Returns: A dictionary (map) containing metadata properties such as file size, creation time, last modified time, and content type.
properties = notebookutils.fs.getProperties("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>")
print(properties)