How to connect to ADLS using spark-shell (not through databricks)

Bhaskar 0 Reputation points
2024-02-12T21:39:05.1666667+00:00

I have downloaded Apache Spark and am using spark-shell (bundled with the Apache Spark binaries) to set up a Spark session and read/write to ADLS. I have a storage account set up with a client secret, and a single file sitting in blob storage that I would like to read using the abfss protocol. So, per the documentation at https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage, I have set the Spark configuration as below, where the values for storageAccountName, clientId, clientSecret, and tenantId are held in variables:

spark.conf.set("fs.azure.account.auth.type."+storageAccountName+".dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type."+storageAccountName+".dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id."+storageAccountName+".dfs.core.windows.net", clientId)
spark.conf.set("fs.azure.account.oauth2.client.secret."+storageAccountName+".dfs.core.windows.net", clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint."+storageAccountName+".dfs.core.windows.net", "https://login.microsoftonline.com/"+tenantId+"/oauth2/token")

When I try to read my file from storage with

val df = spark.read.parquet("abfss://<<mycontainer>>@<<storageAccountName>>.dfs.core.windows.net/wordcount.txt")

I get the following error:

WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: abfss://<<mycontainer>>@<<storageAccountName>>.dfs.core.windows.net/wordcount.txt
Invalid configuration value detected for fs.azure.account.key

Did anyone run into this type of problem? I am guessing that, Databricks or not, a Spark session should be the same. I am not sure what I am missing and would appreciate any insights into this problem and how to get around it. Thanks in advance.

Azure Data Lake Storage
Azure Blob Storage

1 answer

  1. PRADEEPCHEEKATLA-MSFT 85,586 Reputation points Microsoft Employee
    2024-02-13T04:24:43.6866667+00:00

    @Bhaskar - Thanks for the question and for using the MS Q&A platform.

    It seems like you are trying to read a file from ADLS using spark-shell. Based on the error message you provided, there is likely an issue with the configuration values you have set for the storage account.

    Here are a few things you can check to resolve the issue:

    Make sure that the values you have set for storageAccountName, clientId, clientSecret, and tenantId are correct and correspond to your ADLS account.

    Check that you have set the correct configuration values for the storage account. You can try setting them using the following code:

    spark.conf.set("fs.azure.account.auth.type.<your-storage-account-name>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<your-storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<your-storage-account-name>.dfs.core.windows.net", "<your-client-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net", "<your-client-secret>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<your-storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<your-tenant-id>/oauth2/token")
    
    

    Make sure that the service principal has the correct permissions to access the file in ADLS. You can check this in the Azure portal under the access control (IAM) settings for the storage account or container.
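
    Note that for OAuth access through ABFS, the service principal needs a data-plane role such as Storage Blob Data Reader or Storage Blob Data Contributor on the storage account or container; a control-plane role like Contributor is not sufficient by itself. As a quick way to confirm that authentication and permissions are working, you can list the container root from spark-shell with the Hadoop FileSystem API. This sketch assumes the configuration above has been applied and <your-container> is replaced with your container name:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Resolve the ABFS filesystem for the container and list its root.
    // Auth or permission problems surface here with a clear error.
    val uri = new URI(s"abfss://<your-container>@$storageAccountName.dfs.core.windows.net/")
    val fs = FileSystem.get(uri, spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))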

    For more details, refer to the links below:
    https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Configuring_ABFS
    https://deep.data.blog/2019/07/12/diy-apache-spark-and-adls-gen-2-support/

    If you have checked all of the above and still face the issue, please let me know and I can help you further.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And, if you have any further query, do let us know.