Use Python to manage directories and files in Azure Data Lake Storage

This article shows you how to use Python to create and manage directories and files in storage accounts that have a hierarchical namespace.

To learn about how to get, set, and update the access control lists (ACL) of directories and files, see Use Python to manage ACLs in Azure Data Lake Storage.

Package (PyPI) | Samples | API reference | Gen1 to Gen2 mapping

Prerequisites

Set up your project

This section walks you through preparing a project to work with the Azure Data Lake Storage client library for Python.

From your project directory, install packages for the Azure Data Lake Storage and Azure Identity client libraries using the pip install command. The azure-identity package is needed for passwordless connections to Azure services.

pip install azure-storage-file-datalake azure-identity

Then open your code file and add the necessary import statements. In this example, we add the following to our .py file:

import os
from azure.storage.filedatalake import (
    DataLakeServiceClient,
    DataLakeDirectoryClient,
    FileSystemClient
)
from azure.identity import DefaultAzureCredential

Note

Multi-protocol access on Data Lake Storage enables applications to use both Blob APIs and Data Lake Storage Gen2 APIs to work with data in storage accounts with hierarchical namespace (HNS) enabled. When working with capabilities unique to Data Lake Storage Gen2, such as directory operations and ACLs, use the Data Lake Storage Gen2 APIs, as shown in this article.

When choosing which APIs to use in a given scenario, consider the workload and the needs of your application, along with the known issues and impact of HNS on workloads and applications.

Authorize access and connect to data resources

To work with the code examples in this article, you need to create an authorized DataLakeServiceClient instance that represents the storage account. You can authorize a DataLakeServiceClient object using Microsoft Entra ID, an account access key, or a shared access signature (SAS).

You can use the Azure identity client library for Python to authenticate your application with Microsoft Entra ID.

Create an instance of the DataLakeServiceClient class and pass in a DefaultAzureCredential object.

def get_service_client_token_credential(self, account_name) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"
    token_credential = DefaultAzureCredential()

    service_client = DataLakeServiceClient(account_url, credential=token_credential)

    return service_client

To learn more about using DefaultAzureCredential to authorize access to data, see Overview: Authenticate Python apps to Azure using the Azure SDK.
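The account URL in the example above depends only on the account name. As a minimal sketch, the URL construction can be separated from client creation so that it can be checked without network access. The helper names `build_account_url` and `make_service_client` are hypothetical and not part of the SDK. The Azure imports are deferred into the function body so the URL helper can be used on its own.

```python
def build_account_url(account_name: str) -> str:
    """Return the Data Lake Storage (dfs) endpoint URL for a storage account."""
    if not account_name:
        raise ValueError("account_name must not be empty")
    return f"https://{account_name}.dfs.core.windows.net"


def make_service_client(account_name: str):
    """Create a DataLakeServiceClient authorized with DefaultAzureCredential."""
    # Imported lazily so build_account_url works without the Azure packages.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    return DataLakeServiceClient(
        build_account_url(account_name),
        credential=DefaultAzureCredential(),
    )
```

Calling `make_service_client("<storage-account-name>")` requires the two packages installed earlier and a signed-in identity that `DefaultAzureCredential` can discover.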

Create a container

A container acts as a file system for your files. You can create a container by using the DataLakeServiceClient.create_file_system method.

The following code example creates a container and returns a FileSystemClient object for later use:

def create_file_system(self, service_client: DataLakeServiceClient, file_system_name: str) -> FileSystemClient:
    file_system_client = service_client.create_file_system(file_system=file_system_name)

    return file_system_client
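Creating a container that already exists fails, so scripts that run more than once often check first. The following is a minimal sketch of that pattern, assuming the `get_file_system_client`, `exists`, and `create_file_system` methods of the client library. The function name `get_or_create_file_system` is hypothetical. The check and the create are separate requests, so another process could still create the container in between.

```python
def get_or_create_file_system(service_client, file_system_name: str):
    """Return a client for the container, creating the container if needed."""
    file_system_client = service_client.get_file_system_client(file_system_name)

    # Only issue the create request when the container is missing.
    if not file_system_client.exists():
        file_system_client.create_file_system()

    return file_system_client
```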

Create a directory

You can create a directory reference in the container by using the FileSystemClient.create_directory method.

The following code example adds a directory to a container and returns a DataLakeDirectoryClient object for later use:

def create_directory(self, file_system_client: FileSystemClient, directory_name: str) -> DataLakeDirectoryClient:
    directory_client = file_system_client.create_directory(directory_name)

    return directory_client

Rename or move a directory

You can rename or move a directory by using the DataLakeDirectoryClient.rename_directory method.

Pass the path with the new directory name in the new_name argument. The value must have the following format: {filesystem}/{directory}/{subdirectory}.

The following code example shows how to rename a subdirectory:

def rename_directory(self, directory_client: DataLakeDirectoryClient, new_dir_name: str):
    directory_client.rename_directory(
        new_name=f"{directory_client.file_system_name}/{new_dir_name}")
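The `new_name` format is easy to get wrong when the directory is nested. As a rough sketch, the value can be assembled from its parts. The helpers `build_new_name` and `rename_leaf` are hypothetical names, and `rename_leaf` assumes the directory client exposes its current path through a `path_name` attribute alongside `file_system_name`.

```python
def build_new_name(file_system_name: str, *segments: str) -> str:
    """Join a container name and path segments into a new_name value."""
    parts = [file_system_name, *segments]
    # Drop empty segments and stray slashes so parts join with one separator.
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "/".join(cleaned)


def rename_leaf(directory_client, new_leaf: str):
    """Rename a directory in place, keeping it under the same parent."""
    # Everything before the last "/" is the parent path ("" at the top level).
    parent = directory_client.path_name.rpartition("/")[0]
    new_name = build_new_name(directory_client.file_system_name, parent, new_leaf)
    return directory_client.rename_directory(new_name=new_name)
```

For a directory at `my-directory/old-name` in the container `my-container`, `rename_leaf(directory_client, "new-name")` passes `my-container/my-directory/new-name` as `new_name`.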

Upload a file to a directory

You can upload content to a new or existing file by using the DataLakeFileClient.upload_data method.

The following code example shows how to upload a file to a directory using the upload_data method:

def upload_file_to_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    with open(file=os.path.join(local_path, file_name), mode="rb") as data:
        file_client.upload_data(data, overwrite=True)

You can use this method to create and upload content to a new file, or you can set the overwrite argument to True to overwrite an existing file.
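The same call can be repeated for every file in a local folder. The following sketch is one possible way to do that and is not part of the client library. The function name `upload_local_files` is hypothetical. It skips subfolders and overwrites remote files that have the same name.

```python
import os


def upload_local_files(directory_client, local_dir: str) -> list:
    """Upload each regular file in local_dir; return the uploaded file names."""
    uploaded = []
    for name in sorted(os.listdir(local_dir)):
        local_file = os.path.join(local_dir, name)
        if not os.path.isfile(local_file):
            continue  # Skip subfolders and other non-file entries.

        file_client = directory_client.get_file_client(name)
        with open(local_file, mode="rb") as data:
            # overwrite=True replaces any existing remote file of that name.
            file_client.upload_data(data, overwrite=True)
        uploaded.append(name)
    return uploaded
```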

Append data to a file

You can upload data to be appended to a file by using the DataLakeFileClient.append_data method.

The following code example shows how to append data to the end of a file using these steps:

  • Create a DataLakeFileClient object to represent the file resource you're working with.
  • Upload data to the file using the append_data method.
  • Complete the upload by calling the flush_data method to write the previously uploaded data to the file.

def append_data_to_file(self, directory_client: DataLakeDirectoryClient, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    file_size = file_client.get_file_properties().size
    
    data = b"Data to append to end of file"
    file_client.append_data(data, offset=file_size, length=len(data))

    file_client.flush_data(file_size + len(data))

With this method, data can only be appended to a file and the operation is limited to 4000 MiB per request.
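When a payload is larger than you want to send in one request, the append can be split into several calls, each at the next offset, followed by a single flush. The following is a minimal sketch of that idea. The function name `append_in_chunks` and the 4 MiB default chunk size are illustrative assumptions, not recommendations from the library.

```python
def append_in_chunks(file_client, data: bytes, chunk_size: int = 4 * 1024 * 1024) -> int:
    """Append data to the end of an existing file; return the new file length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Start writing at the current end of the file.
    offset = file_client.get_file_properties().size
    if not data:
        return offset

    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        file_client.append_data(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)

    # A single flush commits all of the appended chunks.
    file_client.flush_data(offset)
    return offset
```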

Download from a directory

The following code example shows how to download a file from a directory to a local file using these steps:

  • Create a DataLakeFileClient object to represent the file you want to download.
  • Open a local file for writing.
  • Call the DataLakeFileClient.download_file method to read from the file, then write the data to the local file.

def download_file_from_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    # The with statement closes the local file automatically on exit.
    with open(file=os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        local_file.write(download.readall())
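Calling `readall` loads the whole file into memory before writing it. For large files, one alternative is to iterate over the download in chunks. The following sketch assumes the object returned by `download_file` provides a `chunks()` iterator that yields bytes. The function name `download_file_in_chunks` is hypothetical.

```python
import os


def download_file_in_chunks(directory_client, local_path: str, file_name: str) -> int:
    """Stream a remote file to local_path; return the number of bytes written."""
    file_client = directory_client.get_file_client(file_name)
    download = file_client.download_file()

    written = 0
    with open(os.path.join(local_path, file_name), mode="wb") as local_file:
        # Write each chunk as it arrives instead of buffering the whole file.
        for chunk in download.chunks():
            local_file.write(chunk)
            written += len(chunk)
    return written
```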

List directory contents

You can list directory contents by calling the FileSystemClient.get_paths method and enumerating the result.

Enumerating the paths in the result may make multiple requests to the service while fetching the values.

The following code example prints the path of each subdirectory and file that is located in a directory:

def list_directory_contents(self, file_system_client: FileSystemClient, directory_name: str):
    paths = file_system_client.get_paths(path=directory_name)

    for path in paths:
        print(path.name + '\n')
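Each item in the result describes either a subdirectory or a file. As a small sketch, the listing can be split into two groups. It assumes each returned item has `name` and `is_directory` attributes and that `get_paths` accepts a `recursive` argument. The function name `split_paths` is hypothetical.

```python
def split_paths(file_system_client, directory_name: str):
    """Return (directories, files) found under directory_name, recursively."""
    directories, files = [], []
    for path in file_system_client.get_paths(path=directory_name, recursive=True):
        # is_directory distinguishes subdirectories from files.
        if path.is_directory:
            directories.append(path.name)
        else:
            files.append(path.name)
    return directories, files
```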

Delete a directory

You can delete a directory by using the DataLakeDirectoryClient.delete_directory method.

The following code example shows how to delete a directory:

def delete_directory(self, directory_client: DataLakeDirectoryClient):
    directory_client.delete_directory()
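Deleting a directory that doesn't exist raises an error, which can be inconvenient in cleanup scripts. The following sketch checks for the directory first, assuming the directory client provides an `exists` method. The function name `delete_directory_if_exists` is hypothetical. Because the check and the delete are separate requests, the directory could still be removed by another process in between.

```python
def delete_directory_if_exists(directory_client) -> bool:
    """Delete the directory if present; return True when a delete was issued."""
    if not directory_client.exists():
        return False

    directory_client.delete_directory()
    return True
```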

See also