Hi @Prashant
Welcome to the Microsoft Q&A platform, and thank you for posting your question.
I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others," I'll repost your solution in case you'd like to accept the answer.
Ask: I am reading an Excel file from a Synapse PySpark notebook. I am using the pandas read_excel() method, as I was not able to find Excel-supported methods in PySpark. When I convert the pandas DataFrame to a PySpark DataFrame, I get data type errors.
Error:
/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Expected a string or bytes dtype, got int64
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
TypeError Traceback (most recent call last)
/tmp/ipykernel_9124/2948791008.py in <module>
---> 50 df = spark.createDataFrame(pdf2, schema=spark_schema)
TypeError: field Zip: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
This is due to PyArrow. As per this link, the MS docs say that not all data types are supported. Is there any workaround available for these issues?
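For context on the warning above (a side note from me, not part of the original question or solution): the Arrow-based conversion and its fallback are controlled by the two Spark settings named in the message, and they can be inspected or toggled from the notebook. The key names below are taken directly from the warning text:
spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")             # check whether Arrow optimization is on
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")    # optionally turn Arrow optimization off
spark.conf.get("spark.sql.execution.arrow.pyspark.fallback.enabled")    # non-Arrow fallback when Arrow conversion fails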
Solution: As per the MS docs (this link), the pandas-to-PySpark data type conversion was causing the issues. I was able to read the file successfully by following the steps below.
- Utilized pd.read_excel() to read the Excel file into a pandas DataFrame.
import pandas as pd

pdf = pd.read_excel(file_path, sheet_name='Sheet', skiprows=0, skipfooter=0)
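As a side note (an assumption on my part, not part of the original solution): for .xlsx files pandas typically reads through the openpyxl engine, and the engine can be pinned explicitly if needed:
pdf = pd.read_excel(file_path, sheet_name='Sheet', engine='openpyxl')  # engine pinned explicitly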
- To bring homogeneity across data types, I cast all columns to the object data type first. Also, pandas assigns NaN to null cells, so I handled NaN values by assigning None, aligning with PySpark's expectations for null values.
pdf = pdf.astype(object).where(pd.notnull(pdf), None)
pdf.info()
I checked the data types of the pandas DataFrame using pdf.info() after this. All columns were reported as the object data type.
- Then I set the data types in the pandas DataFrame itself. I had integer, float, and date columns. As per the pandas docs, pd.to_numeric and pd.to_datetime are used to cast columns.
numeric_columns = ['OREDER NO']
for col in numeric_columns:
    pdf[col] = pd.to_numeric(pdf[col])

date_columns = ['Date Shipped']
for col in date_columns:
    pdf[col] = pd.to_datetime(pdf[col])

float_columns = ['Selling Price']
for col in float_columns:
    pdf[col] = pdf[col].astype(float)
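If some cells cannot be parsed, both converters also accept errors='coerce', which turns unparseable values into NaN/NaT instead of raising. A small sketch (my addition, not from the original answer), reusing the column names above:
pdf['OREDER NO'] = pd.to_numeric(pdf['OREDER NO'], errors='coerce')        # unparseable values become NaN
pdf['Date Shipped'] = pd.to_datetime(pdf['Date Shipped'], errors='coerce')  # unparseable values become NaT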
- If null values are present, pandas infers an int column as float. As per the pandas docs and the Stack Overflow community, I understood that such columns can be cast to the nullable 'Int64' dtype, which supports null records. A few columns were giving me errors during the float-to-int conversion, so I handled those columns by casting to 'Int64'. This was only needed for a few files, whenever I ran into the error.
pdf['Qty'] = pdf['Qty'].astype('Int64')
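To illustrate why the nullable dtype helps (a minimal sketch I am adding, not part of the original answer): a plain integer column containing a null is upcast to float64, while 'Int64' keeps integer values and represents the null as <NA>.
s = pd.Series([1, 2, None])      # inferred as float64 because of the null
s_int = s.astype('Int64')        # stays integer-typed; the null becomes <NA>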
- After this, I validated the data types using pdf.info() to ensure alignment with the desired types. During type casting, a few columns again contained NaN / NA, so I addressed those nulls as None, consistent with PySpark's expectations.
pdf = pdf.where(pd.notnull(pdf), None)
pdf.replace(pd.NA, None, inplace=True)
- At this stage, my pandas DataFrame was ready with data types similar to what I was expecting in PySpark. However, in my case I wanted different column names in the PySpark DataFrame (column names without spaces), so I created spark_schema as a StructType using pyspark.sql.types.
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, FloatType

spark_schema = StructType([
    StructField("order_no", IntegerType(), nullable=True),
    StructField("date_shipped", DateType(), nullable=True),
    StructField("selling_price", FloatType(), nullable=True)
])
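A related sketch (my suggestion, not part of the original solution): if the goal is only to remove spaces from the column names, they can also be normalized on the pandas side before the conversion:
pdf.columns = [c.strip().lower().replace(' ', '_') for c in pdf.columns]  # e.g. 'Date Shipped' -> 'date_shipped'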
- Finally, I created the PySpark DataFrame from the pandas DataFrame with spark.createDataFrame, passing the desired schema.
df = spark.createDataFrame(pdf, schema=spark_schema)
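Optionally (my addition), the result can be verified right after the conversion:
df.printSchema()   # confirm the new field names and PySpark types
df.show(5)         # spot-check a few rows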
I have written the steps out in detail in case someone faces a similar issue while converting a pandas DataFrame to a PySpark DataFrame.
If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.
If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.
Please don’t forget to Accept Answer and mark Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.