Error importing Parquet and Excel files into a PySpark notebook

Quentin Chiffoleau 21 Reputation points
2022-05-24T12:57:03.567+00:00

Hi everyone,

I'm trying to import Parquet and Excel files into a PySpark notebook, but I encounter the same problem with both types of files. Here is the code I was trying to run:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark=SparkSession.builder.appName("PySpark Read Parquet").getOrCreate()

# Primary storage info

account_name = '' # fill in your primary account name
container_name = '' # fill in your container name
relative_path = '' # fill in your relative folder path

adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
print('Primary storage account path: ' + adls_path )

The code works up to this point.
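As an aside, building the `abfss://` URI by string concatenation is fragile: if `relative_path` does not end with `/`, appending a file name later (as done further down with `adls_path + 'NDFGSE/NDFGSE.parquet'`) silently produces a broken path. A minimal sketch of safer helpers (the helper names are mine, not from the original post):

```python
def build_adls_path(container_name: str, account_name: str, relative_path: str = "") -> str:
    """Build an abfss:// URI with exactly one '/' before the relative path."""
    base = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/"
    return base + relative_path.lstrip("/")

def join_adls(folder_path: str, name: str) -> str:
    """Join a folder URI and a file name, normalizing the separator."""
    return folder_path.rstrip("/") + "/" + name.lstrip("/")

path = build_adls_path("mycontainer", "myaccount", "folder/sub")
file_path = join_adls(path, "NDFGSE/NDFGSE.parquet")
```

This way the read call can take `file_path` directly instead of relying on the placeholder variables happening to end with a slash.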

spark.conf.set("fs.azure.account.auth.type.%s.dfs.core.windows.net" %account_name, "SharedKey")
spark.conf.set("fs.azure.account.key.%s.dfs.core.windows.net" %account_name ,"Your ADLS Gen2 Primary Key")
df1 = spark.read.parquet(adls_path + 'NDFGSE/NDFGSE.parquet')

Then I get this error message, where only the object number changes:

Py4JJavaError: An error occurred while calling o636.parquet.
: Failure to initialize configuration

I suspect it's coming from the Java configuration or the Spark version, but I'm not sure.
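One common cause of "Failure to initialize configuration" is that the ABFS driver is handed an empty or placeholder account key, for example the literal string "Your ADLS Gen2 Primary Key" left in the `spark.conf.set` call. A small hedged sanity check that could be run before the read (the helper and the placeholder list are mine, not part of any Spark API):

```python
# Values that look like unfilled template placeholders rather than real settings.
PLACEHOLDERS = {"", "Your ADLS Gen2 Primary Key", "accesskey"}

def unfilled_settings(settings: dict) -> list:
    """Return the names of settings still holding an empty or placeholder value."""
    return [name for name, value in settings.items() if value in PLACEHOLDERS]

settings = {
    "account_name": "mystorageaccount",
    "container_name": "data",
    "account_key": "Your ADLS Gen2 Primary Key",  # left at its placeholder
}
print(unfilled_settings(settings))  # flags 'account_key'
```

If any setting is flagged, fix it before calling `spark.read.parquet`; the Java-side error message does not say which configuration value was bad.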

Thanks for the help,

Kind regards

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. HimanshuSinha-msft 19,491 Reputation points Microsoft Employee Moderator
    2022-05-25T23:42:36.667+00:00

    Hello @Quentin Chiffoleau ,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is to read a Parquet file stored in ADLS Gen2; please let us know if that's not accurate.

    I have tried the code below and it works fine for me. I am not sure whether you are using a SAS key or an access key; I am using the access key below.
    You will have to update the account key, account name, container name, and relative path for it to work.

    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.sql.types import *
    spark=SparkSession.builder.appName("PySpark Read Parquet").getOrCreate()
    account_name = 'accountName' # fill in your primary account name
    container_name = 'himanshu' # fill in your container name
    relative_path = 'NYCTaxi/PassengerCountStats.parquet/part-00000-21161a2b-1c65-4a76-9999-0b2403785f46-c000.snappy.parquet' # fill in your relative folder path
    adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
    spark.conf.set('fs.azure.account.key.%s.dfs.core.windows.net' %(account_name) ,"accesskey")
    df1 = spark.read.parquet(adls_path)

    (screenshot: the Parquet file is read successfully)

    Please do let me know if you have any queries.
    Thanks
    Himanshu


