Excel file in Synapse PySpark notebook: I am reading the Excel file with pandas and then converting to a PySpark DataFrame, and I am getting many errors during the pandas-to-PySpark DataFrame conversion.

Prashant Panaskar 25 Reputation points
2024-02-24T18:28:38.57+00:00

I am reading an Excel file from a Synapse PySpark notebook. I am using the pandas read_excel() method, as I was not able to find Excel-capable methods in PySpark. When I convert the pandas DataFrame to a PySpark DataFrame I get data type errors. Error:

/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Expected a string or bytes dtype, got int64
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_9124/2948791008.py in <module>
     48
     49
---> 50 df = spark.createDataFrame(pdf2, schema=spark_schema)
     51 display(df)
     52 # df.columns

TypeError: field Zip: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>

This is due to PyArrow. As per this link, the MS docs say that not all data types are supported. Is there any workaround available for these issues?


import pandas as pd

# pdf1 is the pandas DataFrame produced by pd.read_excel()
# display(pdf1)
df = spark.createDataFrame(pdf1)
display(df)
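
To make the failure concrete: the 'Zip' field named in the traceback is inferred by pandas as int64, while the target schema declares it as a string. A minimal sketch of that check (only the column name comes from the error; the sample values are hypothetical):

import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_excel(...)
pdf2 = pd.DataFrame({"Zip": [411001, 411002]})
print(pdf2.dtypes)  # Zip    int64  -> clashes with a StringType field in spark_schema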

Accepted answer
  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2024-02-29T03:53:34.1266667+00:00

    Hi @Prashant

    Welcome to Microsoft Q&A platform and thanks for posting your question.

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    Ask: I am reading an Excel file from a Synapse PySpark notebook. I am using the pandas read_excel() method, as I was not able to find Excel-capable methods in PySpark. When I convert the pandas DataFrame to a PySpark DataFrame I get data type errors. Error:

    /opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
      Expected a string or bytes dtype, got int64
    Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_9124/2948791008.py in <module>
         48
         49
    ---> 50 df = spark.createDataFrame(pdf2, schema=spark_schema)
         51 display(df)
         52 # df.columns

    TypeError: field Zip: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>

    This is due to PyArrow. As per this link, the MS docs say that not all data types are supported. Is there any workaround available for these issues?

    

    Solution: As per the MS docs (this link), the pandas-to-PySpark data type conversion was causing the issues. I was able to read the file successfully by following the steps below.

    1. Utilized pd.read_excel() to read the Excel file into a pandas DataFrame.


    # Read one worksheet of the Excel file into a pandas DataFrame
    pdf = pd.read_excel(file_path, sheet_name='Sheet', skiprows=0, skipfooter=0)
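
    As an aside (not part of the original solution), read_excel can also be told to skip numeric inference entirely by reading every column as text, which can then be cast explicitly in the later steps:

    # Alternative sketch: force every column to be read as a string up front
    pdf = pd.read_excel(file_path, sheet_name='Sheet', dtype=str)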
    
    2. To bring homogeneity across data types, I first cast all columns to the object data type. pandas also assigns NaN to null cells, so I handled NaN values by replacing them with None, aligning with PySpark's expectations for null values.


    pdf = pdf.astype(object).where(pd.notnull(pdf), None)
    pdf.info()
    

    I checked the data types of the pandas DataFrame using pdf.info() after this; all columns were now reported as the object data type.
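
    As a toy illustration (values are made up, not from the thread) of what that cast does: NaN in a numeric column becomes None once the column is object-typed, which PySpark then treats as null.

    import pandas as pd

    toy = pd.DataFrame({'Qty': [1, None], 'City': ['Pune', None]})
    toy = toy.astype(object).where(pd.notnull(toy), None)
    print(toy.dtypes)      # both columns now report 'object'
    print(toy.iloc[1, 0])  # None rather than NaN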

    3. Then I set the data types in the pandas DataFrame itself. I had integer, float, and date columns. As per the pandas docs, pd.to_datetime and pd.to_numeric are used to cast columns.


    numeric_columns = ['OREDER NO']
    for col in numeric_columns:
        pdf[col] = pd.to_numeric(pdf[col])
    
    date_columns = ['Date Shipped']
    for col in date_columns:
        pdf[col] = pd.to_datetime(pdf[col])
    
    float_columns = ['Selling Price']
    for col in float_columns:
        pdf[col] = pdf[col].astype(float)
    
    
    4. If null values are present, pandas infers an integer column as float. As per the pandas docs and the Stack Overflow community, I understood that we can instead cast to the nullable 'Int64' dtype, which supports null records. A few columns were giving me errors during the float-to-int cast, so I handled those columns by casting to 'Int64' (a short illustration follows the snippet below). This was only done for the files where I ran into that error.


    pdf['Qty'] = pdf['Qty'].astype('Int64')
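
    A quick illustration of the difference (toy values, not from the thread): a plain int cast fails on the missing value, while the nullable 'Int64' dtype keeps it.

    import pandas as pd

    qty = pd.Series([10, None])   # the null makes pandas infer float64
    # qty.astype(int)             # raises: cannot convert non-finite values (NA or inf) to integer
    qty = qty.astype('Int64')     # the nullable integer dtype keeps the missing value
    print(qty.dtype)              # Int64
    print(qty.tolist())           # [10, <NA>]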
    
    5. After this, I validated the data types using pdf.info() to ensure alignment with the desired types. During type casting a few columns ended up with NaN / NA again, so I converted those nulls to None, consistent with PySpark's expectations.


    pdf = pdf.where(pd.notnull(pdf), None)
    pdf.replace(pd.NA, None, inplace=True)
    
    6. At this stage, my pandas DataFrame was ready, with data types similar to what I was expecting in PySpark. However, in my case I wanted different column names in the PySpark DataFrame (column names without spaces), so I created spark_schema as a StructType using pyspark.sql.types.


    from pyspark.sql.types import StructType, StructField, IntegerType, DateType, FloatType

    spark_schema = StructType([
        StructField("order_no", IntegerType(), nullable=True),
        StructField("date_shipped", DateType(), nullable=True),
        StructField("selling_price", FloatType(), nullable=True)
    ])
    
    7. Finally, I created the PySpark DataFrame from the pandas DataFrame with spark.createDataFrame, passing the desired schema.


    df = spark.createDataFrame(pdf,schema=spark_schema)
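
    One caveat worth noting (my own observation, not part of the original answer): when a pandas DataFrame is passed together with an explicit schema, createDataFrame pairs the pandas columns with the StructType fields by position rather than by name, which is also what makes the renaming above work. A small sketch using the column names from the earlier steps:

    # Select/reorder the pandas columns so they line up with the StructType field order
    pdf = pdf[['OREDER NO', 'Date Shipped', 'Selling Price']]
    df = spark.createDataFrame(pdf, schema=spark_schema)
    df.printSchema()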
    

    I have written the steps out in detail in case someone faces a similar issue while converting a pandas DataFrame to a PySpark DataFrame.

    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.


    Please don’t forget to Accept Answer and select Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.


2 additional answers

  1. Prashant Panaskar 25 Reputation points
    2024-02-28T20:16:27.79+00:00

    As per the MS docs (this link), the pandas-to-PySpark data type conversion was causing the issues. I was able to read the file successfully by following the steps below.

    1. Utilized pd.read_excel() to read the Excel file into a pandas DataFrame.
    pdf = pd.read_excel(file_path, sheet_name='Sheet', skiprows=0, skipfooter=0)
    
    2. To bring homogeneity across data types, I first cast all columns to the object data type. pandas also assigns NaN to null cells, so I handled NaN values by replacing them with None, aligning with PySpark's expectations for null values.
    pdf = pdf.astype(object).where(pd.notnull(pdf), None)
    pdf.info()
    

    I checked the data types of the pandas DataFrame using pdf.info() after this; all columns were now reported as the object data type.

    3. Then I set the data types in the pandas DataFrame itself. I had integer, float, and date columns. As per the pandas docs, pd.to_datetime and pd.to_numeric are used to cast columns.
    numeric_columns = ['OREDER NO']
    for col in numeric_columns:
        pdf[col] = pd.to_numeric(pdf[col])
    
    date_columns = ['Date Shipped']
    for col in date_columns:
        pdf[col] = pd.to_datetime(pdf[col])
    
    float_columns = ['Selling Price']
    for col in float_columns:
        pdf[col] = pdf[col].astype(float)
    
    
    4. If null values are present, pandas infers an integer column as float. As per the pandas docs and the Stack Overflow community, I understood that we can instead cast to the nullable 'Int64' dtype, which supports null records. A few columns were giving me errors during the float-to-int cast, so I handled those columns by casting to 'Int64'. This was only done for the files where I ran into that error.
    pdf['Qty'] = pdf['Qty'].astype('Int64')
    
    5. After this, I validated the data types using pdf.info() to ensure alignment with the desired types. During type casting a few columns ended up with NaN / NA again, so I converted those nulls to None, consistent with PySpark's expectations.
    pdf = pdf.where(pd.notnull(pdf), None)
    pdf.replace(pd.NA, None, inplace=True)
    
    6. At this stage, my pandas DataFrame was ready, with data types similar to what I was expecting in PySpark. However, in my case I wanted different column names in the PySpark DataFrame (column names without spaces), so I created spark_schema as a StructType using pyspark.sql.types.
    from pyspark.sql.types import StructType, StructField, IntegerType, DateType, FloatType
    spark_schema = StructType([
        StructField("order_no", IntegerType(), nullable=True),
        StructField("date_shipped", DateType(), nullable=True),
        StructField("selling_price", FloatType(), nullable=True)
    ])
    
    7. Finally, I created the PySpark DataFrame from the pandas DataFrame with spark.createDataFrame, passing the desired schema.
    df = spark.createDataFrame(pdf, schema=spark_schema)
    

    I have written the steps out in detail in case someone faces a similar issue while converting a pandas DataFrame to a PySpark DataFrame.

    1 person found this answer helpful.

  2. Andrzej Strzała 0 Reputation points
    2024-02-25T21:35:40.8433333+00:00

    You can try the following:

    Disable arrow optimization - as a quick workaround, you can disable Arrow-based conversion to see if that helps. This can be done by setting the configuration spark.sql.execution.arrow.pyspark.enabled to false. However, keep in mind that this may lead to less efficient data conversion.

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
    df = spark.createDataFrame(pdf2, schema=spark_schema)

    Check data types - make sure that the data types in your pandas DataFrame match the schema you are defining for the PySpark DataFrame. For instance, if the Spark schema expects a string and the pandas DataFrame has integers (as suggested by the error message for the field 'Zip'), you'll need to convert the column to a string dtype in pandas first:

    pdf2['Zip'] = pdf2['Zip'].astype(str)
    df = spark.createDataFrame(pdf2, schema=spark_schema)
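
    A rough generalization of that idea (a sketch, assuming the schema field names match the pandas column names): cast every column that the schema declares as a string before calling createDataFrame, so Arrow sees a consistent type per column.

    from pyspark.sql.types import StringType

    for field in spark_schema.fields:
        if isinstance(field.dataType, StringType) and field.name in pdf2.columns:
            # note: astype(str) turns NaN into the literal string 'nan'; handle nulls first if needed
            pdf2[field.name] = pdf2[field.name].astype(str)

    df = spark.createDataFrame(pdf2, schema=spark_schema)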

