Filesystem operations on Azure Data Lake Storage Gen1 using Python
In this article, you learn how to use the Python SDK to perform filesystem operations on Azure Data Lake Storage Gen1. For instructions on performing account management operations on Data Lake Storage Gen1 using Python, see Account management operations on Data Lake Storage Gen1 using Python.
Prerequisites
Python. You can download Python from python.org. This article uses Python 3.6.2.
An Azure subscription. See Get Azure free trial.
Azure Data Lake Storage Gen1 account. Follow the instructions at Get started with Azure Data Lake Storage Gen1 using the Azure portal.
Install the modules
To work with Data Lake Storage Gen1 using Python, you need to install three modules.
- The azure-mgmt-resource module, which includes Azure modules for Active Directory, etc.
- The azure-mgmt-datalake-store module, which includes the Azure Data Lake Storage Gen1 account management operations. For more information on this module, see the azure-mgmt-datalake-store module reference.
- The azure-datalake-store module, which includes the Azure Data Lake Storage Gen1 filesystem operations. For more information on this module, see the azure-datalake-store file-system module reference.
Use the following commands to install the modules.
pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store
Create a new Python application
In the IDE of your choice, create a new Python application, for example, mysample.py.
Add the following lines to import the required modules:
## Use this only for Azure AD service-to-service authentication
from azure.common.credentials import ServicePrincipalCredentials

## Use this only for Azure AD end-user authentication
from azure.common.credentials import UserPassCredentials

## Use this only for Azure AD multi-factor authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Storage Gen1 account management
from azure.mgmt.datalake.store import DataLakeStoreAccountManagementClient
from azure.mgmt.datalake.store.models import DataLakeStoreAccount

## Required for Azure Data Lake Storage Gen1 filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time
Save changes to mysample.py.
Authentication
In this section, we talk about the different ways to authenticate with Microsoft Entra ID. The options available are:
- For end-user authentication for your application, see End-user authentication with Data Lake Storage Gen1 using Python.
- For service-to-service authentication for your application, see Service-to-service authentication with Data Lake Storage Gen1 using Python.
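The linked articles produce the credential object that the rest of this article refers to as adlCreds. As a rough sketch of the service-to-service case, the helper below collects the service principal settings from environment variables and passes them to lib.auth. The environment variable names and both helper function names are hypothetical, chosen only for illustration; the linked articles remain the authoritative description of each authentication flow.

```python
import os


def service_principal_settings(env=os.environ):
    """Collect keyword arguments for lib.auth from environment variables.

    The variable names below are hypothetical; use whatever suits your setup.
    """
    required = {
        "tenant_id": "ADLS_TENANT_ID",
        "client_id": "ADLS_CLIENT_ID",
        "client_secret": "ADLS_CLIENT_SECRET",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    settings = {arg: env[var] for arg, var in required.items()}
    # Resource URI used when requesting a token for Data Lake Storage Gen1.
    settings["resource"] = "https://datalake.azure.net/"
    return settings


def get_adl_creds(env=os.environ):
    """Return a credential object to pass to core.AzureDLFileSystem."""
    # Imported lazily so the settings helper works without the SDK installed.
    from azure.datalake.store import lib
    return lib.auth(**service_principal_settings(env))
```

With the three variables set, adlCreds = get_adl_creds() would yield the object used in the next section. Keeping the secret out of the source file is the main reason for reading it from the environment here.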
Create filesystem client
The following snippet declares the subscription ID and the Data Lake Storage Gen1 account name, then creates a filesystem client object. It assumes that adlCreds holds the credential object obtained by following one of the articles in the Authentication section.
## Declare variables
subscriptionId = 'FILL-IN-HERE'
adlsAccountName = 'FILL-IN-HERE'
## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)
Create a directory
## Create a directory
adlsFileSystemClient.mkdir('/mysampledirectory')
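If the script may run more than once, it can help to check whether the directory is already there before creating it. The following is a minimal sketch, assuming the client exposes exists and mkdir as core.AzureDLFileSystem does; the helper name ensure_directory is hypothetical.

```python
def ensure_directory(fs, path):
    """Create `path` on the store unless it already exists.

    `fs` is expected to behave like core.AzureDLFileSystem (exists, mkdir).
    Returns True if the directory was created, False if it was already there.
    """
    if fs.exists(path):
        return False
    fs.mkdir(path)
    return True
```

For example, ensure_directory(adlsFileSystemClient, '/mysampledirectory') creates the directory on the first run and returns False on later runs.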
Upload a file
## Upload a file
multithread.ADLUploader(adlsFileSystemClient, lpath='C:\\data\\mysamplefile.txt', rpath='/mysampledirectory/mysamplefile.txt', nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)
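One simple way to sanity-check an upload is to compare the size of the local file with the size the store reports. The sketch below assumes that the client's info method returns a dictionary with a 'length' key holding the size in bytes; treat that key, and the helper name upload_matches_local, as assumptions to verify against the module reference.

```python
import os


def upload_matches_local(fs, local_path, remote_path):
    """Return True if the remote file's size equals the local file's size.

    `fs` is expected to behave like core.AzureDLFileSystem. The 'length' key
    in the dictionary returned by fs.info is an assumption of this sketch.
    """
    remote_info = fs.info(remote_path)
    return remote_info["length"] == os.path.getsize(local_path)
```

A size match does not prove the contents are identical, but it catches truncated or empty uploads cheaply.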
Download a file
## Download a file
multithread.ADLDownloader(adlsFileSystemClient, lpath='C:\\data\\mysamplefile.txt.out', rpath='/mysampledirectory/mysamplefile.txt', nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)
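After an upload followed by a download, the original file and the downloaded copy should be byte-for-byte identical. The following sketch, which uses only the Python standard library, hashes a local file in 4 MB chunks (the same 4194304 bytes used for buffersize above) so the two files can be compared; the helper name file_sha256 is hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=4 * 1024 * 1024):
    """Return the SHA-256 hex digest of a local file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the paths used in this article, the check would be file_sha256('C:\\data\\mysamplefile.txt') == file_sha256('C:\\data\\mysamplefile.txt.out').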
Delete a directory
## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)
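Because a recursive delete removes everything under the given path, a small guard around the call can prevent accidents. The following is a minimal sketch, assuming the client exposes exists and rm as core.AzureDLFileSystem does; the helper name remove_directory and its refusal to delete the root are choices made for this illustration.

```python
def remove_directory(fs, path):
    """Recursively delete `path` from the store if it exists.

    `fs` is expected to behave like core.AzureDLFileSystem (exists, rm).
    Returns True if something was deleted, False if the path was absent.
    """
    if path.strip("/") == "":
        raise ValueError("Refusing to delete the root of the store")
    if not fs.exists(path):
        return False
    fs.rm(path, recursive=True)
    return True
```

For example, remove_directory(adlsFileSystemClient, '/mysampledirectory') deletes the sample directory and returns False if it has already been removed.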