Changes to file ingestion methodology Pyspark Azure Synapse from Data Lake.

Question

Hi, over the last few days we have been experiencing some peculiarities when reading files from the connected storage of Synapse Analytics. These occurred while using PySpark in notebooks.
Our initial issue involved the pandas method read_excel. We used this method to read an Excel sheet with the following code, which worked on Monday 2022-02-21.

import pandas as pd  
adslpath ='abfss://STORAGEACCOUNT.dfs.core.windows.net/CONTAINER/AssetManagement/Bronze/Raw/NOJV/FILE.xlsx'  
pdf = pd.read_excel(adslpath, sheet_name='Page1', nrows=60,usecols='A:T')

This stopped working on Wednesday (2022-02-23) giving the error:

FileNotFoundError: [Errno 2] No such file or directory: 'abfss://STORAGEACCOUNT@MetContainer .dfs.core.windows.net/AssetManagement/Bronze/Raw/NOJV/FILE.xlsx'

We found we could connect using an HTTPS link to the same source, e.g.:

pdf = pd.read_excel(  
    r'https://CONTAINER.dfs.core.windows.net/dafdlfsd01/AssetManagement/Bronze/Raw/NOJV//FILE.xlsx?sv=2020-08-04&ss=bfqt&srt=sco&sp=rwdlacupx&se=2024-12-01T17:44:05Z&st=2022-02-24T09:44:05Z&spr=https&sig=NnT1Qv8uXMkgh3RQKBOvJ%2Bch%2BVtI6GF7r4ZSdTMEOg8%3D', sheet_name='Page1', nrows=60,usecols='A:T')

Note the use of a Shared Access Signature, which adds complexity because it will need to be refreshed periodically.
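One way to keep track of when the signature needs refreshing is to read the expiry from the URL itself. This is only a minimal sketch: sas_expiry and sas_is_expired are hypothetical helper names, the URL in the usage lines is a placeholder, and it assumes the `se` (signed expiry) parameter uses the full UTC timestamp form seen in the link above.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse, parse_qs


def sas_expiry(url: str) -> datetime:
    """Return the expiry time carried in the `se` query parameter of a SAS URL.

    Assumption: the timestamp has the form YYYY-MM-DDTHH:MM:SSZ. Other
    formats that a SAS token may use are not handled in this sketch.
    """
    params = parse_qs(urlparse(url).query)
    if "se" not in params:
        raise ValueError("URL has no 'se' (signed expiry) parameter")
    expiry = datetime.strptime(params["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return expiry.replace(tzinfo=timezone.utc)


def sas_is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """Return True if the SAS expiry is at or before `now` (default: current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return now >= sas_expiry(url)


# Placeholder URL for illustration only; not a real account or signature.
example_url = (
    "https://ACCOUNT.dfs.core.windows.net/CONTAINER/FILE.xlsx"
    "?se=2024-12-01T17:44:05Z&sig=PLACEHOLDER"
)
print(sas_expiry(example_url).isoformat())
```

A check like this could run before each read so that an expired signature is reported clearly rather than surfacing as an opaque read failure.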

While content that we had an approach that would work, we continued our development, which gets the files we want to read by looping over the files returned by the method mssparkutils.fs.ls("Your directory path"). As of yesterday (2022-02-24) we started experiencing issues with this method. It turns out it stopped accepting URLs but would still accept abfss links.

from notebookutils import mssparkutils  
adslpath ='abfss://CONTAINER.dfs.core.windows.net/STORAGEACCOUNT/AssetManagement/Bronze/Raw/NOJV//'  
# abfsspath = [file for file in  mssparkutils.fs.ls(adslpath)]  
# folder = [file for file in  mssparkutils.fs.ls('https://CONTAINER.dfs.core.windows.net/STORAGEACCOUNT/AssetManagement/Bronze/Raw/NOJV//')]  
folder = [file for file in  mssparkutils.fs.ls('https://CONTAINER.blob.core.windows.net/STORAGEACCOUNT/AssetManagement/Bronze/Raw/NOJV//')]
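For context, the loop over the listing can be sketched as below. This assumes the entries returned by mssparkutils.fs.ls expose name, path and isDir attributes; select_excel_paths is a hypothetical helper, and the sample entries are stand-ins so the sketch runs outside Synapse.

```python
from types import SimpleNamespace


def select_excel_paths(entries):
    """Return the .path of every non-directory entry whose name ends in .xlsx."""
    return [
        entry.path
        for entry in entries
        if not entry.isDir and entry.name.lower().endswith(".xlsx")
    ]


# Stand-in listing with made-up paths, mimicking the assumed entry attributes.
sample_listing = [
    SimpleNamespace(name="FILE.xlsx", path="abfss://c@a.dfs.core.windows.net/NOJV/FILE.xlsx", isDir=False),
    SimpleNamespace(name="archive", path="abfss://c@a.dfs.core.windows.net/NOJV/archive", isDir=True),
    SimpleNamespace(name="notes.txt", path="abfss://c@a.dfs.core.windows.net/NOJV/notes.txt", isDir=False),
]
excel_paths = select_excel_paths(sample_listing)
print(excel_paths)

# In a Synapse notebook the real listing would be passed in instead, e.g.:
#   from notebookutils import mssparkutils
#   excel_paths = select_excel_paths(mssparkutils.fs.ls(adslpath))
```

Keeping the filter separate from the listing call means it can be tested without access to the storage account.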

Since this morning we now find each implementation of this method fails.

We have also noticed some changes to how the file explorer is displayed within Synapse, though only for some users.

This has been replaced by this:

The big change here seems to be the introduction of the URI and the loss of the ABFSS Path and URL. This points to a shift from the older approach to a new one.

We have an implementation with a range of clients which may now be in the process of breaking, and we have not seen any indication that these changes were coming. The loss of the functionality in Microsoft Spark Utilities in particular is troubling. I have a number of questions.

Is there some update we have missed and guidance about how to proceed?
Will URI work using managed identity or will we need to use SAS to access files?
Will pandas and notebookutils update to new versions or are there actions we will need to take or are they irrevocably broken?

Many thanks.
Stephen Connell.

Answer

Hi @Stephen Connell ,

Thank you for posting your query on the Microsoft Q&A Platform.

As I understand it, the issue is that pandas read_excel and the MS Spark file system utilities are not working as expected. When you use an abfss path, pandas fails, and the MS Spark file system utilities also throw errors. Please correct me if my understanding is incorrect.

I tried to reproduce the scenarios, and in my case everything works fine. Please check the details and screenshots below.

I used Spark version 3.1, and I could see it includes pandas version 1.2.3.

I also tried the MS Spark file system utilities, and they also worked fine for me.

Below are the errors you mentioned, with my comments on each.

Error: FileNotFoundError: [Errno 2] No such file or directory: 'abfss://STORAGEACCOUNT@CONTAINER.dfs.core.windows.net/AssetManagement/Bronze/Raw/NOJV/FILE.xlsx'
From the error message, it seems the file or directory you are trying to access may not be available. Kindly re-check the path.

Error: Py4JJavaError: An error occurred while calling z:mssparkutils.fs.ls. : abfss://STORAGEACCOUNT.dfs.core.windows.net has invalid authority.
From the above message, it seems we may be missing permissions here. Could you please make sure the Synapse MSI and the AAD account connected to the Synapse workspace have the Storage Blob Data Contributor role on the storage account?
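
Separately from permissions, it may be worth checking the path format. ABFSS URIs normally take the form abfss://CONTAINER@STORAGEACCOUNT.dfs.core.windows.net/PATH, and the path in the error above has no "CONTAINER@" part in its authority. The following is a minimal sketch of such a check; check_abfss_uri is a hypothetical helper and not part of mssparkutils or pandas.

```python
from urllib.parse import urlparse


def check_abfss_uri(uri: str) -> dict:
    """Split an ABFS(S) URI into container, account and path, or raise ValueError.

    Expected form: abfss://<container>@<account>.dfs.core.windows.net/<path>
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS(S) URI: {uri!r}")
    # The authority must look like "<container>@<host>".
    container, sep, host = parsed.netloc.partition("@")
    if not sep or not container or not host:
        raise ValueError(
            f"authority {parsed.netloc!r} is missing the 'container@' part"
        )
    account = host.split(".", 1)[0]
    return {"container": container, "account": account, "path": parsed.path}


# Placeholder names, following the anonymised names used in this thread.
good = check_abfss_uri(
    "abfss://CONTAINER@STORAGEACCOUNT.dfs.core.windows.net/AssetManagement/FILE.xlsx"
)
print(good)

try:
    check_abfss_uri("abfss://STORAGEACCOUNT.dfs.core.windows.net/CONTAINER/FILE.xlsx")
except ValueError as err:
    print("rejected:", err)
```

If the check rejects a path, rewriting it into the container@account form before retrying mssparkutils.fs.ls would help separate a formatting problem from a permissions problem.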
