Excel file in Synapse PySpark notebook: I am reading the Excel file with pandas and then converting to a PySpark DataFrame, and I am getting many errors during the pandas-to-PySpark DataFrame conversion.

Prashant Panaskar 25 Reputation points
2024-02-24T18:28:38.57+00:00

I am reading an Excel file from a Synapse PySpark notebook. I am using the pandas read_excel() method, as I was not able to find Excel-capable methods in PySpark. When I convert the pandas DataFrame to a PySpark DataFrame I get data type errors. Error:

/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Expected a string or bytes dtype, got int64
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_9124/2948791008.py in <module>
     48
     49
---> 50 df = spark.createDataFrame(pdf2, schema=spark_schema)
     51 display(df)
     52 # df.columns

TypeError: field Zip: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>

This is due to PyArrow. As per this link, the MS docs say that not all data types are supported. Is there any workaround available for these issues?


import pandas as pd

# pdf1 is the pandas DataFrame produced by pd.read_excel()
# display(pdf1)
df = spark.createDataFrame(pdf1)
display(df)
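
To make the failure concrete: the 'Zip' field named in the traceback is inferred by pandas as int64, while the target schema declares it as a string. A minimal sketch of that check (only the column name comes from the error; the sample values are hypothetical):

import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_excel(...)
pdf2 = pd.DataFrame({"Zip": [411001, 411002]})
print(pdf2.dtypes)  # Zip    int64  -> clashes with a StringType field in spark_schema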

Accepted answer
  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2024-02-29T03:53:34.1266667+00:00

    Hi @Prashant

    Welcome to Microsoft Q&A platform and thanks for posting your question.

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    Ask: I am reading an Excel file from a Synapse PySpark notebook. I am using the pandas read_excel() method, as I was not able to find Excel-capable methods in PySpark. When I convert the pandas DataFrame to a PySpark DataFrame I get data type errors. Error:

    /opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
      Expected a string or bytes dtype, got int64
    Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_9124/2948791008.py in <module>
         48
         49
    ---> 50 df = spark.createDataFrame(pdf2, schema=spark_schema)
         51 display(df)
         52 # df.columns

    TypeError: field Zip: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>

    This is due to PyArrow. As per this link, the MS docs say that not all data types are supported. Is there any workaround available for these issues?

    

    Solution: As per the MS docs (this link), the pandas-to-PySpark data type conversion was causing the issues. I was able to read the file successfully by following the steps below.

    1. Utilized pd.read_excel() to read the Excel file into a pandas DataFrame.


    # Read one worksheet of the Excel file into a pandas DataFrame
    pdf = pd.read_excel(file_path, sheet_name='Sheet', skiprows=0, skipfooter=0)
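
    As an aside (not part of the original solution), read_excel can also be told to skip numeric inference entirely by reading every column as text, which can then be cast explicitly in the later steps:

    # Alternative sketch: force every column to be read as a string up front
    pdf = pd.read_excel(file_path, sheet_name='Sheet', dtype=str)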
    
    2. To bring homogeneity across data types, I first cast all columns to the object data type. pandas also assigns NaN to null cells, so I handled NaN values by replacing them with None, aligning with PySpark's expectations for null values.


    pdf = pdf.astype(object).where(pd.notnull(pdf), None)
    pdf.info()
    

    I checked the data types of the pandas DataFrame using pdf.info() after this; all columns were now reported as the object data type.
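
    As a toy illustration (values are made up, not from the thread) of what that cast does: NaN in a numeric column becomes None once the column is object-typed, which PySpark then treats as null.

    import pandas as pd

    toy = pd.DataFrame({'Qty': [1, None], 'City': ['Pune', None]})
    toy = toy.astype(object).where(pd.notnull(toy), None)
    print(toy.dtypes)      # both columns now report 'object'
    print(toy.iloc[1, 0])  # None rather than NaN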

    3. Then I set the data types in the pandas DataFrame itself. I had integer, float, and date columns. As per the pandas docs, pd.to_datetime and pd.to_numeric are used to cast columns.


    numeric_columns = ['OREDER NO']
    for col in numeric_columns:
        pdf[col] = pd.to_numeric(pdf[col])
    
    date_columns = ['Date Shipped']
    for col in date_columns:
        pdf[col] = pd.to_datetime(pdf[col])
    
    float_columns = ['Selling Price']
    for col in float_columns:
        pdf[col] = pdf[col].astype(float)
    
    
    4. If null values are present, pandas infers an integer column as float. As per the pandas docs and the Stack Overflow community, I understood that we can instead cast to the nullable 'Int64' dtype, which supports null records. A few columns were giving me errors during the float-to-int cast, so I handled those columns by casting to 'Int64' (a short illustration follows the snippet below). This was only done for the files where I ran into that error.


    pdf['Qty'] = pdf['Qty'].astype('Int64')
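
    A quick illustration of the difference (toy values, not from the thread): a plain int cast fails on the missing value, while the nullable 'Int64' dtype keeps it.

    import pandas as pd

    qty = pd.Series([10, None])   # the null makes pandas infer float64
    # qty.astype(int)             # raises: cannot convert non-finite values (NA or inf) to integer
    qty = qty.astype('Int64')     # the nullable integer dtype keeps the missing value
    print(qty.dtype)              # Int64
    print(qty.tolist())           # [10, <NA>]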
    
    5. After this, I validated the data types using pdf.info() to ensure alignment with the desired types. During type casting a few columns ended up with NaN / NA again, so I converted those nulls to None, consistent with PySpark's expectations.


    pdf = pdf.where(pd.notnull(pdf), None)
    pdf.replace(pd.NA, None, inplace=True)
    
    6. At this stage, my pandas DataFrame was ready, with data types similar to what I was expecting in PySpark. However, in my case I wanted different column names in the PySpark DataFrame (column names without spaces), so I created spark_schema as a StructType using pyspark.sql.types.


    from pyspark.sql.types import StructType, StructField, IntegerType, DateType, FloatType

    spark_schema = StructType([
        StructField("order_no", IntegerType(), nullable=True),
        StructField("date_shipped", DateType(), nullable=True),
        StructField("selling_price", FloatType(), nullable=True)
    ])
    
    7. Finally, I created the PySpark DataFrame from the pandas DataFrame with spark.createDataFrame, passing the desired schema.


    df = spark.createDataFrame(pdf,schema=spark_schema)
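
    One caveat worth noting (my own observation, not part of the original answer): when a pandas DataFrame is passed together with an explicit schema, createDataFrame pairs the pandas columns with the StructType fields by position rather than by name, which is also what makes the renaming above work. A small sketch using the column names from the earlier steps:

    # Select/reorder the pandas columns so they line up with the StructType field order
    pdf = pdf[['OREDER NO', 'Date Shipped', 'Selling Price']]
    df = spark.createDataFrame(pdf, schema=spark_schema)
    df.printSchema()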
    

    I have written the steps out in detail in case someone faces a similar issue while converting a pandas DataFrame to a PySpark DataFrame.

    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.


    Please don’t forget to Accept Answer and select Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.


2 additional answers

  1. Prashant Panaskar 25 Reputation points
    2024-02-28T20:16:27.79+00:00

    As per the MS docs (this link), the pandas-to-PySpark data type conversion was causing the issues. I was able to read the file successfully by following the steps below.

    1. Utilized pd.read_excel() to read the Excel file into a pandas DataFrame.
    pdf = pd.read_excel(file_path, sheet_name='Sheet', skiprows=0, skipfooter=0)
    
    2. To bring homogeneity across data types, I first cast all columns to the object data type. pandas also assigns NaN to null cells, so I handled NaN values by replacing them with None, aligning with PySpark's expectations for null values.
    pdf = pdf.astype(object).where(pd.notnull(pdf), None)
    pdf.info()
    

    I checked the data types of the pandas DataFrame using pdf.info() after this; all columns were now reported as the object data type.

    3. Then I set the data types in the pandas DataFrame itself. I had integer, float, and date columns. As per the pandas docs, pd.to_datetime and pd.to_numeric are used to cast columns.
    numeric_columns = ['OREDER NO']
    for col in numeric_columns:
        pdf[col] = pd.to_numeric(pdf[col])
    
    date_columns = ['Date Shipped']
    for col in date_columns:
        pdf[col] = pd.to_datetime(pdf[col])
    
    float_columns = ['Selling Price']
    for col in float_columns:
        pdf[col] = pdf[col].astype(float)
    
    
    4. If null values are present, pandas infers an integer column as float. As per the pandas docs and the Stack Overflow community, I understood that we can instead cast to the nullable 'Int64' dtype, which supports null records. A few columns were giving me errors during the float-to-int cast, so I handled those columns by casting to 'Int64'. This was only done for the files where I ran into that error.
    pdf['Qty'] = pdf['Qty'].astype('Int64')
    
    5. After this, I validated the data types using pdf.info() to ensure alignment with the desired types. During type casting a few columns ended up with NaN / NA again, so I converted those nulls to None, consistent with PySpark's expectations.
    pdf = pdf.where(pd.notnull(pdf), None)
    pdf.replace(pd.NA, None, inplace=True)
    
    6. At this stage, my pandas DataFrame was ready, with data types similar to what I was expecting in PySpark. However, in my case I wanted different column names in the PySpark DataFrame (column names without spaces), so I created spark_schema as a StructType using pyspark.sql.types.
    from pyspark.sql.types import StructType, StructField, IntegerType, DateType, FloatType
    spark_schema = StructType([
        StructField("order_no", IntegerType(), nullable=True),
        StructField("date_shipped", DateType(), nullable=True),
        StructField("selling_price", FloatType(), nullable=True)
    ])
    
    7. Finally, I created the PySpark DataFrame from the pandas DataFrame with spark.createDataFrame, passing the desired schema.
    df = spark.createDataFrame(pdf, schema=spark_schema)
    

    I have written the steps out in detail in case someone faces a similar issue while converting a pandas DataFrame to a PySpark DataFrame.

    1 person found this answer helpful.

  2. Andrzej Strzała 0 Reputation points
    2024-02-25T21:35:40.8433333+00:00

    You can try the following:

    Disable arrow optimization - as a quick workaround, you can disable Arrow-based conversion to see if that helps. This can be done by setting the configuration spark.sql.execution.arrow.pyspark.enabled to false. However, keep in mind that this may lead to less efficient data conversion.

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
    df = spark.createDataFrame(pdf2, schema=spark_schema)

    Check data types - make sure that the data types in your pandas DataFrame match the schema you are defining for the PySpark DataFrame. For instance, if the Spark schema expects a string and the pandas DataFrame has integers (as suggested by the error message for the field 'Zip'), you'll need to convert the column to a string dtype in pandas first:

    pdf2['Zip'] = pdf2['Zip'].astype(str)
    df = spark.createDataFrame(pdf2, schema=spark_schema)
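
    A rough generalization of that idea (a sketch, assuming the schema field names match the pandas column names): cast every column that the schema declares as a string before calling createDataFrame, so Arrow sees a consistent type per column.

    from pyspark.sql.types import StringType

    for field in spark_schema.fields:
        if isinstance(field.dataType, StringType) and field.name in pdf2.columns:
            # note: astype(str) turns NaN into the literal string 'nan'; handle nulls first if needed
            pdf2[field.name] = pdf2[field.name].astype(str)

    df = spark.createDataFrame(pdf2, schema=spark_schema)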

