Use Python to manage directories and files in Azure Data Lake Storage Gen2

Article
04/10/2024

This article shows how to use Python to create and manage directories and files in storage accounts that have a hierarchical namespace.

To learn how to get, set, and update the access control lists (ACLs) of directories and files, see Use Python to manage ACLs in Azure Data Lake Storage Gen2.

Package (PyPI) | Samples | API reference | Gen1 to Gen2 mapping | Give feedback

Prerequisites

An Azure subscription. See Get Azure free trial.
A storage account that has hierarchical namespace enabled. Follow these instructions to create one.

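Set up your project

This section walks through preparing a project that uses the Azure Data Lake Storage client library for Python.

From your project directory, use the pip install command to install the packages for the Azure Data Lake Storage and Azure Identity client libraries. The azure-identity package is needed for passwordless connections to Azure services.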

pip install azure-storage-file-datalake azure-identity

Then open your code file and add the necessary import statements. In this example, add the following to your .py file:

import os
from azure.storage.filedatalake import (
    DataLakeServiceClient,
    DataLakeDirectoryClient,
    FileSystemClient
)
from azure.identity import DefaultAzureCredential

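Note

Multi-protocol access on Data Lake Storage lets applications use both Blob APIs and Data Lake Storage Gen2 APIs to work with data in storage accounts that have hierarchical namespace (HNS) enabled. When you work with capabilities unique to Data Lake Storage Gen2, such as directory operations and ACLs, use the Data Lake Storage Gen2 APIs, as shown in this article.

When choosing which APIs to use in a given scenario, consider the workload and needs of your application, along with the known issues and the impact of HNS on workloads and applications.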

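Authorize access and connect to data resources

To work with the code examples in this article, you need to create an authorized DataLakeServiceClient instance that represents the storage account. You can authorize a DataLakeServiceClient object by using Microsoft Entra ID, an account access key, or a shared access signature (SAS).

You can use the Azure Identity client library for Python to authenticate your application with Microsoft Entra ID.

Create an instance of the DataLakeServiceClient class and pass in a DefaultAzureCredential object.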

def get_service_client_token_credential(self, account_name) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"
    token_credential = DefaultAzureCredential()

    service_client = DataLakeServiceClient(account_url, credential=token_credential)

    return service_client

To learn more about using DefaultAzureCredential to authorize access to data, see Overview: Authenticate Python apps to Azure using the Azure SDK.

To use a shared access signature (SAS) token, provide the token as a string and initialize a DataLakeServiceClient object. If your account URL already includes the SAS token, omit the credential parameter.

def get_service_client_sas(self, account_name: str, sas_token: str) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"

    # The SAS token string can be passed in as credential param or appended to the account URL
    service_client = DataLakeServiceClient(account_url, credential=sas_token)

    return service_client
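As a minimal sketch of the two SAS options described above, the helper below builds the account URL and, when a SAS token is supplied, appends it as a query string so that the resulting URL could be passed to DataLakeServiceClient without a credential argument. The function name build_account_url and the name check (3 to 24 lowercase letters and digits, per the storage account naming rule) are illustrative additions and not part of the client library.

```python
import re
from typing import Optional

# Storage account names are 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def build_account_url(account_name: str, sas_token: Optional[str] = None) -> str:
    """Return the Data Lake (dfs) endpoint URL, optionally with a SAS token appended.

    Hypothetical helper for illustration only.
    """
    if not _ACCOUNT_NAME.fullmatch(account_name):
        raise ValueError(f"Invalid storage account name: {account_name!r}")

    account_url = f"https://{account_name}.dfs.core.windows.net"
    if sas_token:
        # Accept tokens with or without a leading '?'.
        account_url = f"{account_url}?{sas_token.lstrip('?')}"
    return account_url
```

With a URL built this way, the client would be created as DataLakeServiceClient(account_url) with the credential parameter left out.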

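To learn more about generating and managing SAS tokens, see the following article:

Grant limited access to Azure Storage resources using shared access signatures (SAS)

You can also authorize access to data by using your account access key (Shared Key). The following code example creates a DataLakeServiceClient instance that is authorized with the account key.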

def get_service_client_account_key(self, account_name, account_key) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"
    service_client = DataLakeServiceClient(account_url, credential=account_key)

    return service_client

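Caution

Authorization with Shared Key is not recommended because it may be less secure. For optimal security, disable authorization via Shared Key for your storage account, as described in Prevent Shared Key authorization for an Azure Storage account.

Limit the use of access keys and connection strings to initial proof-of-concept apps or development prototypes that don't access production or sensitive data. In all other cases, prefer the token-based authentication classes available in the Azure SDK when authenticating to Azure resources.

Microsoft recommends that clients use either Microsoft Entra ID or a shared access signature (SAS) to authorize access to data in Azure Storage. For more information, see Authorize operations for data access.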

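Create a container

A container acts as a file system for your files. You can create a container by using the following method:

DataLakeServiceClient.create_file_system

The following code example creates a container and returns a FileSystemClient object for later use: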

def create_file_system(self, service_client: DataLakeServiceClient, file_system_name: str) -> FileSystemClient:
    file_system_client = service_client.create_file_system(file_system=file_system_name)

    return file_system_client
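Calling create_file_system for a container that already exists is expected to fail, so a script that may run more than once might prefer a get-or-create pattern. The following is a minimal sketch under that assumption; the function name get_or_create_file_system is hypothetical, and it is written as a standalone function (no self parameter) that relies only on get_file_system_client, exists, and create_file_system.

```python
def get_or_create_file_system(service_client, file_system_name: str):
    """Return a FileSystemClient for the container, creating the container if needed.

    Hypothetical helper for illustration only. service_client is expected to be
    a DataLakeServiceClient.
    """
    file_system_client = service_client.get_file_system_client(file_system=file_system_name)

    # exists() issues a request to the service; create only when it reports False.
    if not file_system_client.exists():
        file_system_client.create_file_system()

    return file_system_client
```

The check and the create are separate requests, so another process could still create the container in between; treat this as a convenience rather than a guarantee.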

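Create a directory

You can create a directory reference in the container by using the following method:

FileSystemClient.create_directory

The following code example adds a directory to a container and returns a DataLakeDirectoryClient object for later use: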

def create_directory(self, file_system_client: FileSystemClient, directory_name: str) -> DataLakeDirectoryClient:
    directory_client = file_system_client.create_directory(directory_name)

    return directory_client

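Rename or move a directory

You can rename or move a directory by using the following method:

DataLakeDirectoryClient.rename_directory

Pass the path that contains the new directory name in the new_name argument. The value must have the format {filesystem}/{directory}/{subdirectory}.

The following code example shows how to rename a subdirectory: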

def rename_directory(self, directory_client: DataLakeDirectoryClient, new_dir_name: str):
    directory_client.rename_directory(
        new_name=f"{directory_client.file_system_name}/{new_dir_name}")
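Because new_name must include the container name followed by the full destination path, moving a directory under a different parent uses the same call with a longer path. The sketch below illustrates one way to assemble that value; build_rename_target and move_directory are hypothetical helper names, written as standalone functions.

```python
def build_rename_target(file_system_name: str, *path_segments: str) -> str:
    """Join a container name and path segments into '{filesystem}/{directory}/{subdirectory}'."""
    parts = [file_system_name.strip("/")]
    # Drop empty segments and stray slashes so the result has single separators.
    parts += [segment.strip("/") for segment in path_segments if segment.strip("/")]
    return "/".join(parts)


def move_directory(directory_client, new_parent: str, new_dir_name: str):
    """Move (and optionally rename) a directory under new_parent in the same container.

    Hypothetical helper for illustration only. directory_client is expected to be
    a DataLakeDirectoryClient.
    """
    new_name = build_rename_target(
        directory_client.file_system_name, new_parent, new_dir_name)
    return directory_client.rename_directory(new_name=new_name)
```

For example, with a container named my-container, move_directory(directory_client, "archive/2024", "logs") would pass my-container/archive/2024/logs as new_name.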

Upload a file to a directory

You can upload content to a new or existing file by using the following method:

DataLakeFileClient.upload_data

The following code example shows how to upload a file to a directory by using the upload_data method:

def upload_file_to_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    with open(file=os.path.join(local_path, file_name), mode="rb") as data:
        file_client.upload_data(data, overwrite=True)

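You can use this method to create a new file and upload content to it, or set the overwrite argument to True to overwrite an existing file.

Append data to a file

You can upload data to be appended to a file by using the following method:

DataLakeFileClient.append_data

The following code example shows how to append data to the end of a file by using these steps:

Create a DataLakeFileClient object that represents the file resource you're working with.
Upload data to the file by using the append_data method.
Complete the upload by calling the flush_data method, which writes the previously uploaded data to the file.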

def append_data_to_file(self, directory_client: DataLakeDirectoryClient, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    file_size = file_client.get_file_properties().size
    
    data = b"Data to append to end of file"
    file_client.append_data(data, offset=file_size, length=len(data))

    file_client.flush_data(file_size + len(data))

With this method, data can only be appended to a file, and the operation is limited to 4000 MiB per request.
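Since each append request is size-limited, a larger payload can be sent as several append_data calls followed by a single flush_data call. The following is a minimal sketch of that pattern; the function name append_stream_in_chunks and the 4 MiB default chunk size are illustrative assumptions, not values prescribed by the library.

```python
from typing import BinaryIO


def append_stream_in_chunks(file_client, stream: BinaryIO, start_offset: int,
                            chunk_size: int = 4 * 1024 * 1024) -> int:
    """Append stream contents to a file in chunks and flush once at the end.

    Hypothetical helper for illustration only. file_client is expected to be a
    DataLakeFileClient, and start_offset is the current file size.
    Returns the file length after the append.
    """
    offset = start_offset
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Each chunk is staged at the running offset; nothing is committed yet.
        file_client.append_data(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)

    # Commit the staged data only if something was appended.
    if offset > start_offset:
        file_client.flush_data(offset)
    return offset
```

The value passed to flush_data is the total file length after the append, which matches the file_size + len(data) expression in the example above.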

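Download from a directory

The following code example shows how to download a file from a directory to a local file by using these steps:

Create a DataLakeFileClient object that represents the file you want to download.
Open a local file for writing.
Call the DataLakeFileClient.download_file method to read from the file, then write the data to the local file.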

def download_file_from_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    with open(file=os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        # The with block closes local_file automatically on exit
        local_file.write(download.readall())
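The readall call loads the whole file into memory before it is written to disk. For larger files, one alternative is to let the downloader write directly into the open local file with readinto. The following is a minimal sketch of that variant; the function name download_file_streaming is hypothetical, and it is written as a standalone function.

```python
import os


def download_file_streaming(directory_client, local_path: str, file_name: str) -> int:
    """Download a file into local_path without holding it all in memory.

    Hypothetical helper for illustration only. directory_client is expected to be
    a DataLakeDirectoryClient. Returns the number of bytes reported by readinto.
    """
    file_client = directory_client.get_file_client(file_name)

    with open(os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        # readinto writes the downloaded content to the stream it is given.
        return download.readinto(local_file)
```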

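List directory contents

You can list directory contents by using the following method and enumerating the result:

FileSystemClient.get_paths

Enumerating the paths in the result may make multiple requests to the service while fetching the values.

The following code example prints the path of each subdirectory and file located in a directory: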

def list_directory_contents(self, file_system_client: FileSystemClient, directory_name: str):
    paths = file_system_client.get_paths(path=directory_name)

    for path in paths:
        print(path.name + '\n')
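Each item returned by get_paths also carries an is_directory flag, and the recursive argument controls whether nested subdirectories are walked. The following is a minimal sketch that uses both to separate subdirectories from files; the function name split_directory_contents is hypothetical, and it is written as a standalone function.

```python
from typing import List, Tuple


def split_directory_contents(file_system_client, directory_name: str,
                             recursive: bool = False) -> Tuple[List[str], List[str]]:
    """Return (subdirectory paths, file paths) found under directory_name.

    Hypothetical helper for illustration only. file_system_client is expected
    to be a FileSystemClient.
    """
    directories: List[str] = []
    files: List[str] = []

    # Iterating the result may trigger several requests to the service.
    for path in file_system_client.get_paths(path=directory_name, recursive=recursive):
        if path.is_directory:
            directories.append(path.name)
        else:
            files.append(path.name)

    return directories, files
```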

Delete a directory

You can delete a directory by using the following method:

DataLakeDirectoryClient.delete_directory

The following code example shows how to delete a directory:

def delete_directory(self, directory_client: DataLakeDirectoryClient):
    directory_client.delete_directory()
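A cleanup script may not know whether the directory is still present. The following is a minimal sketch that checks with exists before deleting; the function name delete_directory_if_exists is hypothetical, and it is written as a standalone function.

```python
def delete_directory_if_exists(directory_client) -> bool:
    """Delete the directory if the service reports that it exists.

    Hypothetical helper for illustration only. directory_client is expected to be
    a DataLakeDirectoryClient. Returns True when a delete was issued.
    """
    if not directory_client.exists():
        return False

    directory_client.delete_directory()
    return True
```

The existence check and the delete are separate requests, so the directory can still disappear between them; callers that need stronger guarantees would also handle the resulting error.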
