Accessing a CSV file from ADLS Gen2 in a Jupyter notebook

Jayesh Dave 296 Reputation points
2021-02-14T07:08:46.483+00:00

Hello:

After following the link below, I tried to access an ADLS Gen2 container but am getting this error.

I understand that the article is old and will not work with Gen2.

Can you recommend a better way to connect to ADLS Gen2?

Link = https://medium.com/azure-data-lake/using-jupyter-notebooks-and-pandas-with-azure-data-lake-store-48737fbad305

Error:
DatalakeRESTException: HTTP error: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='xxxxxx.azuredatalakestore.net', port=443): Max retries exceeded with url: /webhdfs/v1/xxx/banking-dataset-marketing-targets-train.csv?OP=GETFILESTATUS&api-version=2018-09-01 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f68385719d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))


Accepted answer
  1. Jayesh Dave 296 Reputation points
    2021-02-19T22:49:14.183+00:00

    Hello All:

    Thank you all for your help & replies.

    The problem was not resolved by anything I tried.

    I ended up creating a Python virtual environment, installing just "azure-storage-blob", and running Jupyter Notebook inside that virtual environment. That eliminated the errors and all of the other issues I had.

    I was already aware of virtual environments, but now I strongly believe they are the way to go for the Azure SDKs.

    Thank you again.
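
    For reference, a minimal sketch of what that setup ends up looking like, assuming a fresh virtual environment with only azure-storage-blob and pandas installed (e.g. python -m venv .venv, then pip install azure-storage-blob pandas); the account, key, container, and blob names are placeholders:

    from io import BytesIO

    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    # An account key (or SAS token) can be passed directly as the credential.
    service = BlobServiceClient(
        account_url="https://<account-name>.blob.core.windows.net",
        credential="<account-key>",
    )

    # Download the CSV blob into memory and load it into a DataFrame.
    blob = service.get_blob_client(container="<container>", blob="<path>/train.csv")
    df = pd.read_csv(BytesIO(blob.download_blob().readall()))
    print(df.head())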


8 additional answers

  1. PRADEEPCHEEKATLA 90,261 Reputation points
    2021-02-16T07:14:24.967+00:00

    Hello @Jayesh Dave,

    Note: the AzureDLFileSystem library supports only ADLS Gen1. For Gen2, please use the azure-storage-file-datalake package instead.
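
    A minimal sketch of reading a CSV with that package, assuming azure-storage-file-datalake and pandas are installed; the account, key, file-system, and path names are placeholders:

    from io import BytesIO

    import pandas as pd
    from azure.storage.filedatalake import DataLakeServiceClient

    # Gen2 accounts are addressed through the dfs endpoint.
    service_client = DataLakeServiceClient(
        account_url="https://<account-name>.dfs.core.windows.net",
        credential="<account-key>",
    )

    # Read a CSV from a file system (container) into pandas.
    file_client = service_client.get_file_client(
        file_system="<file-system>", file_path="<dir>/train.csv"
    )
    df = pd.read_csv(BytesIO(file_client.download_file().readall()))
    print(df.head())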

    Hope this helps. Do let us know if you have any further queries.


  2. Jayesh Dave 296 Reputation points
    2021-02-18T01:44:43.373+00:00

    Hello Pradeep.

    Below is my code as per your suggestion, and the error I am getting.

    Code:

    import os
    import random
    import uuid

    from azure.storage.filedatalake import DataLakeServiceClient

    account_name = 'kxxxxxxxxx'
    account_key = '/xxxxxxxx'

    service_client = DataLakeServiceClient("https://xxxxxxx.dfs.core.windows.net", credential=account_key)

    # List file systems
    file_systems = service_client.list_file_systems()
    for file_system in file_systems:
        print(file_system.name)

    # Get a DataLakeDirectoryClient from the FileSystemClient to interact with a specific directory
    file_system_client = service_client.get_file_system_client("xxxxxxxxxx")
    directory_client = file_system_client.get_directory_client("xxxxxxxxxx")

    Error:

    ImportError Traceback (most recent call last)
    <ipython-input-10-99dd9c21fe89> in <module>
    3 import uuid
    4
    ----> 5 from azure.storage.filedatalake import DataLakeServiceClient
    6
    7 account_name='xxxxxxxx'

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/__init__.py in <module>
    5 # --------------------------------------------------------------------------
    6
    ----> 7 from ._download import StorageStreamDownloader
    8 from ._data_lake_file_client import DataLakeFileClient
    9 from ._data_lake_directory_client import DataLakeDirectoryClient

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/_download.py in <module>
    4 # license information.
    5 # --------------------------------------------------------------------------
    ----> 6 from ._deserialize import from_blob_properties
    7
    8

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/_deserialize.py in <module>
    13 from azure.core.exceptions import HttpResponseError, DecodeError, ResourceModifiedError, ClientAuthenticationError, \
    14 ResourceNotFoundError, ResourceExistsError
    ---> 15 from ._models import FileProperties, DirectoryProperties, LeaseProperties, PathProperties
    16 from ._shared.models import StorageErrorCode
    17

    ~/.local/lib/python3.8/site-packages/azure/storage/filedatalake/_models.py in <module>
    9 from enum import Enum
    10
    ---> 11 from azure.storage.blob import LeaseProperties as BlobLeaseProperties
    12 from azure.storage.blob import AccountSasPermissions as BlobAccountSasPermissions
    13 from azure.storage.blob import ResourceTypes as BlobResourceTypes

    ImportError: cannot import name 'LeaseProperties' from 'azure.storage.blob' (unknown location)
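
    This ImportError usually means an older azure-storage-blob (pre-12.x) is on the path; azure-storage-file-datalake 12.x is built on top of the 12.x blob package, whose LeaseProperties it imports. A quick sketch for checking the installed versions, using the standard-library importlib.metadata (Python 3.8+):

    from importlib.metadata import version

    # filedatalake 12.x requires azure-storage-blob 12.x; an older blob
    # package does not expose LeaseProperties, which triggers the ImportError.
    print(version("azure-storage-blob"))
    print(version("azure-storage-file-datalake"))

    Upgrading both packages in the same environment, or, as in the accepted answer, starting from a clean virtual environment, avoids the mismatch.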


  3. Jayesh Dave 296 Reputation points
    2021-02-15T18:17:29.413+00:00

    Hello Pradeep:

    Thanks for your reply.

    The Jupyter notebook is local on my laptop. I am a group admin on Azure and have the required permissions on the ADLS Gen2 container.

    Below you will see that I tried three different options to access ADLS Gen2 from the Jupyter notebook on my laptop and encountered similar and different errors.

    Again, any help with a proper document link is greatly appreciated.

    Option - 1

    from azure.datalake.store import lib

    AccountName = 'xxxxxxxxxx'
    AccountKey = '/xxxxxxxxxx'
    tenant_id = 'xxxxxxxxxx'
    client_secret = 'xxxxxxxxxx'
    client_id = 'xxxxxxxxxx'
    input_blobpath = 'https://xxxxxxxxxx'

    token = lib.auth(tenant_id=tenant_id, client_secret=client_secret, client_id=client_id,
                     require_2fa=False, resource='https://xxxxxxxxxx')

    Get Token request returned http error: 400 and server response: {"error":"invalid_resource","error_description":"AADSTS500011: The resource principal named https://storageAccountName was not found in the tenant named xxxxxxxxxx. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You might have sent your authentication request to the wrong tenant.\r\nTrace ID: 1ac78efb-95a9-4317-b18d-1b3272036800\r\nCorrelation ID: xxxxxxxxxx\r\nTimestamp: 2021-02-15 17:18:57Z","error_codes":[500011],"timestamp":"2021-02-15 17:18:57Z","trace_id":"1ac78efb-95a9-4317-b18d-1b3272036800","correlation_id":"xxxxxxxxxx","error_uri":"https://login.microsoftonline.com/error?code=500011"}
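
    (The AADSTS500011 error above means the resource value passed to lib.auth must be a resource principal that exists in the tenant; a storage-account URL is not one. For Gen1 Data Lake the token audience is typically 'https://datalake.azure.net/', which is also lib.auth's default, as sketched below; note that this library still cannot reach a Gen2 account.)

    # Hedged variant: 'https://datalake.azure.net/' is the Gen1 Data Lake
    # resource URI; it will not make this library work against Gen2.
    token = lib.auth(tenant_id=tenant_id, client_id=client_id,
                     client_secret=client_secret,
                     resource='https://datalake.azure.net/')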

    Option - 2

    from azure.datalake.store import core
    import pandas as pd

    adlsFileSystemClient = core.AzureDLFileSystem(token, store_name='xxxxxxxxxxxx')

    # Read a file into a pandas dataframe
    with adlsFileSystemClient.open('xxxxxxxxxxxxxxxxxxxx', 'rb') as f:
        df = pd.read_csv(f)

    # Show the dataframe
    df

    DatalakeRESTException: HTTP error: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='kxxxxx.azuredatalakestore.net', port=443): Max retries exceeded with url: /webhdfs/v1/xxxxxxxxxxxxxx.csv?OP=GETFILESTATUS&api-version=2018-09-01 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f68387dafa0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))

    Option - 3

    from azure.common.credentials import ServicePrincipalCredentials
    from azure.storage.blob import BlobServiceClient

    tenant_id = 'xxxxxxxxxxxxxx'
    client_secret = 'xxxxxxxxxxxxxx'
    client_id = 'xxxxxxxxxxxxxx'
    container_name = 'xxxxxxxxxxxxxx'
    blob_name = 'xxxxxxxxxxxxxx.csv'

    credentials = ServicePrincipalCredentials(client_id=client_id, secret=client_secret, tenant=tenant_id)

    service = BlobServiceClient(account_url="https://xxxxxxxxx/", credential=credentials)

    block_blob_service = BlobServiceClient(account_url='https://xxxxxxxxx', account_name=account_name, account_key=account_key)

    csv_content = service.get_blob_to_text(container_name, blob_name).content
    print(csv_content)

    TypeError: Unsupported credential: <msrestazure.azure_active_directory.ServicePrincipalCredentials object at 0x7f683853f610>
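
    (The TypeError above is because the v12 BlobServiceClient does not accept the legacy msrestazure ServicePrincipalCredentials; v12 clients take an account key, a SAS token, or an azure.identity credential. A minimal sketch of the service-principal route, assuming azure-identity is installed and reusing the placeholder variables above:)

    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    # ClientSecretCredential is the v12-era replacement for
    # ServicePrincipalCredentials; arguments are tenant, client id, secret.
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    service = BlobServiceClient(
        account_url="https://<account-name>.blob.core.windows.net",
        credential=credential,
    )

    blob = service.get_blob_client(container=container_name, blob=blob_name)
    print(blob.download_blob().content_as_text())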


  4. Jayesh Dave 296 Reputation points
    2021-02-15T20:36:49.49+00:00

    Hello Pradeep:

    Below is Option 4, which didn't help either.

    Option - 4

    from azure.datalake.store import core, lib, multithread
    import pandas as pd

    tenant_id = 'xxxxxxxxxxxxxx'
    username = 'xxxxxxxxxxxxxx'
    password = 'xxxxxxxxxxxxxx'
    store_name = 'xxxxxxxxxxxxxx'
    token = lib.auth(tenant_id, username, password)

    adls = core.AzureDLFileSystem(token, store_name=store_name)

    # Read a file into a pandas dataframe
    with adls.open('/xxxxxx/xyz.csv', 'rb') as f:
        df = pd.read_csv(f)

    # Show the dataframe
    df

    DatalakeRESTException: HTTP error: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='xxxxxxxxx.azuredatalakestore.net', port=443): Max retries exceeded with url: /webhdfs/v1/xxxxxxxx/xyz.csv?OP=GETFILESTATUS&api-version=2018-09-01 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6bbdd55c10>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))
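
    Options 2 and 4 both use azure-datalake-store, which only speaks to Gen1 endpoints (<account>.azuredatalakestore.net). A Gen2 account has no such hostname, which is exactly why DNS resolution fails with "Name or service not known". For pandas against Gen2, one alternative sketch uses the fsspec/adlfs integration (assuming pip install adlfs pandas, with pandas >= 1.2; the file-system and path names are placeholders):

    import pandas as pd

    # pandas resolves abfs:// URLs through fsspec + adlfs; credentials are
    # passed via storage_options.
    df = pd.read_csv(
        "abfs://<file-system>/<dir>/xyz.csv",
        storage_options={"account_name": "<account-name>",
                         "account_key": "<account-key>"},
    )
    df.head()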

