Use Python to manage directories and files in Azure Data Lake Storage
This article shows you how to use Python to create and manage directories and files in storage accounts that have a hierarchical namespace.
To learn about how to get, set, and update the access control lists (ACL) of directories and files, see Use Python to manage ACLs in Azure Data Lake Storage.
Prerequisites
An Azure subscription. See Get Azure free trial.
A storage account that has hierarchical namespace enabled. Follow these instructions to create one.
Set up your project
This section walks you through preparing a project to work with the Azure Data Lake Storage client library for Python.
From your project directory, install packages for the Azure Data Lake Storage and Azure Identity client libraries using the pip install command. The azure-identity package is needed for passwordless connections to Azure services.
pip install azure-storage-file-datalake azure-identity
Then open your code file and add the necessary import statements. In this example, we add the following to our .py file:
import os
from azure.storage.filedatalake import (
    DataLakeServiceClient,
    DataLakeDirectoryClient,
    FileSystemClient
)
from azure.identity import DefaultAzureCredential
Note
Multi-protocol access on Data Lake Storage enables applications to use both Blob APIs and Data Lake Storage Gen2 APIs to work with data in storage accounts with hierarchical namespace (HNS) enabled. When working with capabilities unique to Data Lake Storage Gen2, such as directory operations and ACLs, use the Data Lake Storage Gen2 APIs, as shown in this article.
When choosing which APIs to use in a given scenario, consider the workload and the needs of your application, along with the known issues and impact of HNS on workloads and applications.
Authorize access and connect to data resources
To work with the code examples in this article, you need to create an authorized DataLakeServiceClient instance that represents the storage account. You can authorize a DataLakeServiceClient object using Microsoft Entra ID, an account access key, or a shared access signature (SAS).
You can use the Azure identity client library for Python to authenticate your application with Microsoft Entra ID.
Create an instance of the DataLakeServiceClient class and pass in a DefaultAzureCredential object.
def get_service_client_token_credential(self, account_name) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"
    token_credential = DefaultAzureCredential()
    service_client = DataLakeServiceClient(account_url, credential=token_credential)
    return service_client
To learn more about using DefaultAzureCredential to authorize access to data, see Overview: Authenticate Python apps to Azure using the Azure SDK.
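The other two authorization options mentioned above, an account access key or a SAS token, follow the same pattern as the token credential example. The following is a minimal sketch rather than the library's prescribed approach: the helper names build_account_url and get_service_client_with_secret are hypothetical, and it assumes you pass the key or token in as a string that you load from a secure location instead of hard-coding it.

```python
def build_account_url(account_name: str) -> str:
    """Return the Data Lake Storage (dfs) endpoint for a storage account."""
    return f"https://{account_name}.dfs.core.windows.net"


def get_service_client_with_secret(account_name: str, secret: str):
    """Create a DataLakeServiceClient from an account key or SAS token string.

    Hypothetical helper: `secret` is either an account access key or a SAS token.
    """
    # Imported here so build_account_url can be used without the SDK installed.
    from azure.storage.filedatalake import DataLakeServiceClient

    return DataLakeServiceClient(build_account_url(account_name), credential=secret)
```

For example, get_service_client_with_secret("mystorageaccount", os.environ["STORAGE_ACCOUNT_KEY"]) would read the key from an environment variable; the variable name here is only an illustration.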
Create a container
A container acts as a file system for your files. You can create a container by using the DataLakeServiceClient.create_file_system method.
The following code example creates a container and returns a FileSystemClient object for later use:
def create_file_system(self, service_client: DataLakeServiceClient, file_system_name: str) -> FileSystemClient:
    file_system_client = service_client.create_file_system(file_system=file_system_name)
    return file_system_client
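If the container might already exist, you may prefer to check before creating it. The sketch below is one way to do that, assuming a library version that provides FileSystemClient.exists; the function name get_or_create_file_system is hypothetical. The check and the creation are separate requests, so another process could still create the container in between.

```python
def get_or_create_file_system(service_client, file_system_name: str):
    """Return a client for the container, creating the container if needed."""
    # get_file_system_client only builds a client object for the named container.
    file_system_client = service_client.get_file_system_client(file_system_name)
    if not file_system_client.exists():
        # Create the container that this client refers to.
        file_system_client.create_file_system()
    return file_system_client
```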
Create a directory
You can create a directory reference in the container by using the FileSystemClient.create_directory method.
The following code example adds a directory to a container and returns a DataLakeDirectoryClient object for later use:
def create_directory(self, file_system_client: FileSystemClient, directory_name: str) -> DataLakeDirectoryClient:
    directory_client = file_system_client.create_directory(directory_name)
    return directory_client
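A returned DataLakeDirectoryClient can in turn create directories beneath itself. The following sketch, with the hypothetical name create_nested_directory, creates a parent directory and then a child inside it by using the create_sub_directory method of the directory client.

```python
def create_nested_directory(file_system_client, parent_name: str, child_name: str):
    """Create parent_name, then child_name inside it; return the child's client."""
    parent_client = file_system_client.create_directory(parent_name)
    # create_sub_directory creates the child relative to the parent directory.
    child_client = parent_client.create_sub_directory(child_name)
    return child_client
```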
Rename or move a directory
You can rename or move a directory by using the DataLakeDirectoryClient.rename_directory method.
Pass the path with the new directory name in the new_name argument. The value must have the following format: {filesystem}/{directory}/{subdirectory}.
The following code example shows how to rename a subdirectory:
def rename_directory(self, directory_client: DataLakeDirectoryClient, new_dir_name: str):
    directory_client.rename_directory(
        new_name=f"{directory_client.file_system_name}/{new_dir_name}")
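Because the new_name value must start with the container name, it can help to build it in one place. The helper below is a small illustration of the {filesystem}/{directory}/{subdirectory} format; the name build_rename_target is hypothetical, and trimming slashes is a choice made for this sketch rather than a requirement stated by the library.

```python
def build_rename_target(file_system_name: str, new_dir_path: str) -> str:
    """Build the new_name value: {filesystem}/{directory}/{subdirectory}."""
    # Trim leading and trailing slashes so the two parts join with exactly one "/".
    return f"{file_system_name.strip('/')}/{new_dir_path.strip('/')}"
```

The result can then be passed as new_name, for example build_rename_target(directory_client.file_system_name, "my-directory/renamed-subdirectory").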
Upload a file to a directory
You can upload content to a new or existing file by using the DataLakeFileClient.upload_data method.
The following code example shows how to upload a file to a directory using the upload_data method:
def upload_file_to_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    with open(file=os.path.join(local_path, file_name), mode="rb") as data:
        file_client.upload_data(data, overwrite=True)
You can use this method to create and upload content to a new file, or you can set the overwrite argument to True to overwrite an existing file.
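The same upload_data call can be repeated to upload several files. The following sketch uploads every regular file found directly inside a local folder; the function name upload_local_files is hypothetical, and the sketch deliberately skips subfolders instead of recursing into them.

```python
import os


def upload_local_files(directory_client, local_path: str) -> list:
    """Upload every regular file directly inside local_path; return the names uploaded."""
    uploaded = []
    for name in sorted(os.listdir(local_path)):
        full_path = os.path.join(local_path, name)
        if not os.path.isfile(full_path):
            continue  # skip subfolders; this sketch is not recursive
        file_client = directory_client.get_file_client(name)
        with open(full_path, mode="rb") as data:
            # overwrite=True replaces any existing file with the same name.
            file_client.upload_data(data, overwrite=True)
        uploaded.append(name)
    return uploaded
```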
Append data to a file
You can upload data to be appended to a file by using the following method:
- DataLakeFileClient.append_data method.
The following code example shows how to append data to the end of a file using these steps:
- Create a DataLakeFileClient object to represent the file resource you're working with.
- Upload data to the file using the append_data method.
- Complete the upload by calling the flush_data method to write the previously uploaded data to the file.
def append_data_to_file(self, directory_client: DataLakeDirectoryClient, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    file_size = file_client.get_file_properties().size
    data = b"Data to append to end of file"
    file_client.append_data(data, offset=file_size, length=len(data))
    file_client.flush_data(file_size + len(data))
With this method, data can only be appended to a file and the operation is limited to 4000 MiB per request.
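When the data is larger than you want to send in a single request, the append can be split into several append_data calls followed by one flush_data call. The sketch below shows the offset arithmetic under that assumption; the function name append_in_chunks and the chunk_size parameter are hypothetical, and start_offset is expected to be the current file size.

```python
def append_in_chunks(file_client, data: bytes, start_offset: int, chunk_size: int) -> int:
    """Append data in chunk_size pieces starting at start_offset, then flush once.

    Returns the file length that was passed to flush_data.
    """
    offset = start_offset
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Each chunk is uploaded at the offset where the previous one ended.
        file_client.append_data(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)
    # flush_data takes the total file length after all appended data.
    file_client.flush_data(offset)
    return offset
```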
Download from a directory
The following code example shows how to download a file from a directory to a local file using these steps:
- Create a DataLakeFileClient object to represent the file you want to download.
- Open a local file for writing.
- Call the DataLakeFileClient.download_file method to read from the file, then write the data to the local file.
def download_file_from_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    # The with statement closes the local file automatically when the block exits.
    with open(file=os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        local_file.write(download.readall())
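The readall call loads the whole file into memory before writing it. For larger files, a possible alternative is to let the downloader write directly into the open local file. The following sketch assumes that the object returned by download_file provides a readinto method in your library version; the function name download_file_streaming is hypothetical.

```python
import os


def download_file_streaming(directory_client, local_path: str, file_name: str) -> None:
    """Download file_name into local_path by writing directly to the local file."""
    file_client = directory_client.get_file_client(file_name)
    with open(os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        # readinto writes the downloaded content into the open stream.
        download.readinto(local_file)
```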
List directory contents
You can list directory contents by using the FileSystemClient.get_paths method and enumerating the result.
Enumerating the paths in the result may make multiple requests to the service while fetching the values.
The following code example prints the path of each subdirectory and file that is located in a directory:
def list_directory_contents(self, file_system_client: FileSystemClient, directory_name: str):
    paths = file_system_client.get_paths(path=directory_name)
    for path in paths:
        print(path.name + '\n')
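Each item returned by get_paths also indicates whether it is a directory, which makes it possible to separate subdirectories from files. The sketch below does that with the is_directory attribute and uses the recursive argument of get_paths to limit the listing to the immediate children by default; the function name split_directory_contents is hypothetical.

```python
def split_directory_contents(file_system_client, directory_name: str, recursive: bool = False):
    """Return (directories, files) as two lists of path names under directory_name."""
    directories, files = [], []
    for path in file_system_client.get_paths(path=directory_name, recursive=recursive):
        # is_directory distinguishes subdirectories from files in each result.
        if path.is_directory:
            directories.append(path.name)
        else:
            files.append(path.name)
    return directories, files
```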
Delete a directory
You can delete a directory by using the DataLakeDirectoryClient.delete_directory method.
The following code example shows how to delete a directory:
def delete_directory(self, directory_client: DataLakeDirectoryClient):
    directory_client.delete_directory()