Accessing a CSV file from ADLS Gen2 in a Jupyter notebook

Jayesh Dave 296 Reputation points
2021-02-14T07:08:46.483+00:00

Hello:

After following the link below, I tried to access an ADLS Gen2 container but am getting this error.

I understand that the article is old and will not work with Gen2.

Can you recommend a better way to connect to ADLS Gen2?

Link = https://medium.com/azure-data-lake/using-jupyter-notebooks-and-pandas-with-azure-data-lake-store-48737fbad305

Error:
DatalakeRESTException: HTTP error: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='xxxxxx.azuredatalakestore.net', port=443): Max retries exceeded with url: /webhdfs/v1/xxx/banking-dataset-marketing-targets-train.csv?OP=GETFILESTATUS&api-version=2018-09-01 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f68385719d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))


Accepted answer
  1. Jayesh Dave 296 Reputation points
    2021-02-19T22:49:14.183+00:00

    Hello All:

    Thank you all for your help & replies.

    The problem was not resolved by anything I tried.

    I ended up creating a Python virtual environment, installing just "azure-storage-blob", and running Jupyter Notebook inside that virtual environment. That eliminated the errors and all of the other issues I had.

    I was already aware of virtual environments, but now I strongly believe they are the way to go for the Azure SDKs.

    Thank you again.
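
    For reference, a minimal sketch of what that setup ends up looking like, assuming a fresh virtual environment with only azure-storage-blob and pandas installed (e.g. python -m venv .venv, then pip install azure-storage-blob pandas); the account, key, container, and blob names are placeholders:

    from io import BytesIO

    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    # An account key (or SAS token) can be passed directly as the credential.
    service = BlobServiceClient(
        account_url="https://<account-name>.blob.core.windows.net",
        credential="<account-key>",
    )

    # Download the CSV blob into memory and load it into a DataFrame.
    blob = service.get_blob_client(container="<container>", blob="<path>/train.csv")
    df = pd.read_csv(BytesIO(blob.download_blob().readall()))
    print(df.head())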


8 additional answers

  1. PRADEEPCHEEKATLA 90,261 Reputation points
    2021-02-16T07:14:24.967+00:00

    Hello @Jayesh Dave,

    Note: the AzureDLFileSystem library supports only ADLS Gen1. For Gen2, please use the azure-storage-file-datalake package instead.
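
    A minimal sketch of reading a CSV with that package, assuming azure-storage-file-datalake and pandas are installed; the account, key, file-system, and path names are placeholders:

    from io import BytesIO

    import pandas as pd
    from azure.storage.filedatalake import DataLakeServiceClient

    # Gen2 accounts are addressed through the dfs endpoint.
    service_client = DataLakeServiceClient(
        account_url="https://<account-name>.dfs.core.windows.net",
        credential="<account-key>",
    )

    # Read a CSV from a file system (container) into pandas.
    file_client = service_client.get_file_client(
        file_system="<file-system>", file_path="<dir>/train.csv"
    )
    df = pd.read_csv(BytesIO(file_client.download_file().readall()))
    print(df.head())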

    Hope this helps. Do let us know if you have any further queries.


  2. Jayesh Dave 296 Reputation points
    2021-02-18T01:44:43.373+00:00

    Hello Pradeep.

    Below is my code as per your suggestion, and the error I am getting.

    Code:

    import os
    import random
    import uuid

    from azure.storage.filedatalake import DataLakeServiceClient

    account_name = 'kxxxxxxxxx'
    account_key = '/xxxxxxxx'

    service_client = DataLakeServiceClient("https://xxxxxxx.dfs.core.windows.net", credential=account_key)

    # List file systems
    file_systems = service_client.list_file_systems()
    for file_system in file_systems:
        print(file_system.name)

    # Get a DataLakeDirectoryClient from the FileSystemClient to interact with a specific directory
    file_system_client = service_client.get_file_system_client("xxxxxxxxxx")
    directory_client = file_system_client.get_directory_client("xxxxxxxxxx")

    Error:

    ImportError Traceback (most recent call last)
    <ipython-input-10-99dd9c21fe89> in <module>
    3 import uuid
    4
    ----> 5 from azure.storage.filedatalake import DataLakeServiceClient
    6
    7 account_name='xxxxxxxx'

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/__init__.py in <module>
    5 # --------------------------------------------------------------------------
    6
    ----> 7 from ._download import StorageStreamDownloader
    8 from ._data_lake_file_client import DataLakeFileClient
    9 from ._data_lake_directory_client import DataLakeDirectoryClient

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/_download.py in <module>
    4 # license information.
    5 # --------------------------------------------------------------------------
    ----> 6 from ._deserialize import from_blob_properties
    7
    8

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/_deserialize.py in <module>
    13 from azure.core.exceptions import HttpResponseError, DecodeError, ResourceModifiedError, ClientAuthenticationError, \
    14 ResourceNotFoundError, ResourceExistsError
    ---> 15 from ._models import FileProperties, DirectoryProperties, LeaseProperties, PathProperties
    16 from ._shared.models import StorageErrorCode
    17

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/_models.py in <module>
    9 from enum import Enum
    10
    ---> 11 from azure.storage.blob import LeaseProperties as BlobLeaseProperties
    12 from azure.storage.blob import AccountSasPermissions as BlobAccountSasPermissions
    13 from azure.storage.blob import ResourceTypes as BlobResourceTypes

    ImportError: cannot import name 'LeaseProperties' from 'azure.storage.blob' (unknown location)
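
    This ImportError usually means an older azure-storage-blob (pre-12.x) is on the path; azure-storage-file-datalake 12.x is built on top of the 12.x blob package, whose LeaseProperties it imports. A quick sketch for checking the installed versions, using the standard-library importlib.metadata (Python 3.8+):

    from importlib.metadata import version

    # filedatalake 12.x requires azure-storage-blob 12.x; an older blob
    # package does not expose LeaseProperties, which triggers the ImportError.
    print(version("azure-storage-blob"))
    print(version("azure-storage-file-datalake"))

    Upgrading both packages in the same environment, or, as in the accepted answer, starting from a clean virtual environment, avoids the mismatch.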


  3. Jayesh Dave 296 Reputation points
    2021-02-15T18:17:29.413+00:00

    Hello Pradeep:

    Thanks for your reply.

    The Jupyter notebook is local on my laptop. I am a group admin on Azure and have the required permissions on the ADLS Gen2 container.

    Below you will see that I tried three different options to access ADLS Gen2 from the Jupyter notebook on my laptop and encountered similar and different errors.

    Again, any help with a proper document link is greatly appreciated.

    Option - 1

    from azure.datalake.store import lib

    AccountName = 'xxxxxxxxxx'
    AccountKey = '/xxxxxxxxxx'
    tenant_id = 'xxxxxxxxxx'
    client_secret = 'xxxxxxxxxx'
    client_id = 'xxxxxxxxxx'
    input_blobpath = 'https://xxxxxxxxxx'

    token = lib.auth(tenant_id=tenant_id, client_secret=client_secret, client_id=client_id,
                     require_2fa=False, resource='https://xxxxxxxxxx')

    Get Token request returned http error: 400 and server response: {"error":"invalid_resource","error_description":"AADSTS500011: The resource principal named https://storageAccountName was not found in the tenant named xxxxxxxxxx. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You might have sent your authentication request to the wrong tenant.\r\nTrace ID: 1ac78efb-95a9-4317-b18d-1b3272036800\r\nCorrelation ID: xxxxxxxxxx\r\nTimestamp: 2021-02-15 17:18:57Z","error_codes":[500011],"timestamp":"2021-02-15 17:18:57Z","trace_id":"1ac78efb-95a9-4317-b18d-1b3272036800","correlation_id":"xxxxxxxxxx","error_uri":"https://login.microsoftonline.com/error?code=500011"}
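
    (The AADSTS500011 error above means the resource value passed to lib.auth must be a resource principal that exists in the tenant; a storage-account URL is not one. For Gen1 Data Lake the token audience is typically 'https://datalake.azure.net/', which is also lib.auth's default, as sketched below; note that this library still cannot reach a Gen2 account.)

    # Hedged variant: 'https://datalake.azure.net/' is the Gen1 Data Lake
    # resource URI; it will not make this library work against Gen2.
    token = lib.auth(tenant_id=tenant_id, client_id=client_id,
                     client_secret=client_secret,
                     resource='https://datalake.azure.net/')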

    Option - 2

    from azure.datalake.store import core
    import pandas as pd

    adlsFileSystemClient = core.AzureDLFileSystem(token, store_name='xxxxxxxxxxxx')

    # Read a file into a pandas dataframe
    with adlsFileSystemClient.open('xxxxxxxxxxxxxxxxxxxx', 'rb') as f:
        df = pd.read_csv(f)

    # Show the dataframe
    df

    DatalakeRESTException: HTTP error: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='kxxxxx.azuredatalakestore.net', port=443): Max retries exceeded with url: /webhdfs/v1/xxxxxxxxxxxxxx.csv?OP=GETFILESTATUS&api-version=2018-09-01 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f68387dafa0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))

    Option - 3

    from azure.common.credentials import ServicePrincipalCredentials
    from azure.storage.blob import BlobServiceClient

    tenant_id = 'xxxxxxxxxxxxxx'
    client_secret = 'xxxxxxxxxxxxxx'
    client_id = 'xxxxxxxxxxxxxx'
    container_name = 'xxxxxxxxxxxxxx'
    blob_name = 'xxxxxxxxxxxxxx.csv'

    credentials = ServicePrincipalCredentials(client_id=client_id, secret=client_secret, tenant=tenant_id)

    service = BlobServiceClient(account_url="https://xxxxxxxxx/", credential=credentials)

    block_blob_service = BlobServiceClient(account_url='https://xxxxxxxxx', account_name=account_name, account_key=account_key)

    csv_content = service.get_blob_to_text(container_name, blob_name).content
    print(csv_content)

    TypeError: Unsupported credential: <msrestazure.azure_active_directory.ServicePrincipalCredentials object at 0x7f683853f610>
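
    (The TypeError above is because the v12 BlobServiceClient does not accept the legacy msrestazure ServicePrincipalCredentials; v12 clients take an account key, a SAS token, or an azure.identity credential. A minimal sketch of the service-principal route, assuming azure-identity is installed and reusing the placeholder variables above:)

    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    # ClientSecretCredential is the v12-era replacement for
    # ServicePrincipalCredentials; arguments are tenant, client id, secret.
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    service = BlobServiceClient(
        account_url="https://<account-name>.blob.core.windows.net",
        credential=credential,
    )

    blob = service.get_blob_client(container=container_name, blob=blob_name)
    print(blob.download_blob().content_as_text())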


  4. Jayesh Dave 296 Reputation points
    2021-02-15T20:36:49.49+00:00

    Hello Pradeep:

    Below is Option 4, which didn't help either.

    Option - 4

    from azure.datalake.store import core, lib, multithread
    import pandas as pd

    tenant_id = 'xxxxxxxxxxxxxx'
    username = 'xxxxxxxxxxxxxx'
    password = 'xxxxxxxxxxxxxx'
    store_name = 'xxxxxxxxxxxxxx'
    token = lib.auth(tenant_id, username, password)

    adls = core.AzureDLFileSystem(token, store_name=store_name)

    # Read a file into a pandas dataframe
    with adls.open('/xxxxxx/xyz.csv', 'rb') as f:
        df = pd.read_csv(f)

    # Show the dataframe
    df

    DatalakeRESTException: HTTP error: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='xxxxxxxxx.azuredatalakestore.net', port=443): Max retries exceeded with url: /webhdfs/v1/xxxxxxxx/xyz.csv?OP=GETFILESTATUS&api-version=2018-09-01 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6bbdd55c10>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))
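
    Options 2 and 4 both use azure-datalake-store, which only speaks to Gen1 endpoints (<account>.azuredatalakestore.net). A Gen2 account has no such hostname, which is exactly why DNS resolution fails with "Name or service not known". For pandas against Gen2, one alternative sketch uses the fsspec/adlfs integration (assuming pip install adlfs pandas, with pandas >= 1.2; the file-system and path names are placeholders):

    import pandas as pd

    # pandas resolves abfs:// URLs through fsspec + adlfs; credentials are
    # passed via storage_options.
    df = pd.read_csv(
        "abfs://<file-system>/<dir>/xyz.csv",
        storage_options={"account_name": "<account-name>",
                         "account_key": "<account-key>"},
    )
    df.head()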

