Pandas dataframe to_parquet stops working in Databricks runtime 10.2 (Apache Spark 3.2.0, Scala 2.12)

Joseph Chen 21 Reputation points
2022-02-14T17:50:34.5+00:00

The following code was working fine earlier but stopped working today. It creates no parquet file and causes an error when I try to read it back. As a workaround, I convert the pandas DataFrame to a Spark DataFrame to export the data. Does anybody know why it stopped working? Was there a recent pandas update that caused this to break?

df.to_parquet('/mnt/inbox/test.parquet')

Attached is sample test code to show the problem: 174174-missing-pandas-parquet-output-databricks.pdf

The code is also enclosed below to replicate the problem.

# Databricks notebook source  
# Import pandas library  
import pandas as pd  
  
# initialize list of lists  
data = [['tom', 10], ['nick', 15], ['juli', 14]]  
  
# Create the pandas DataFrame  
df = pd.DataFrame(data, columns = ['Name', 'Age'])  
  
# print dataframe.  
df  
  
# COMMAND ----------  
  
# no test.parquet present before writing the pandas dataframe  
dbutils.fs.ls('/mnt/inbox')  
  
# COMMAND ----------  
  
# write pandas dataframe  
df.to_parquet('/dbfs/mnt/inbox/test.parquet')  
  
# COMMAND ----------  
  
# found no output file  
dbutils.fs.ls('/mnt/inbox')  
  
# COMMAND ----------  
  
# found no output file  
dbutils.fs.ls('/dbfs/mnt/inbox')  
  
# COMMAND ----------  
  
# convert pandas dataframe to spark dataframe and export to file  
sparkDF = spark.createDataFrame(df)  
sparkDF.write.parquet('/mnt/inbox/test.parquet')  
  
# COMMAND ----------  
  
# found output file  
dbutils.fs.ls('/mnt/inbox')  
  
# COMMAND ----------  
  
# clean up testing  
dbutils.fs.rm('/mnt/inbox/test.parquet', recurse=True)  
dbutils.fs.ls('/mnt/inbox')  
  
  
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator
    2022-02-16T00:56:16.967+00:00

    Hello @Joseph Chen ,
    Thanks for the ask and for using the Microsoft Q&A platform.
    As we understand, the ask here is why you are getting an error while running the command:

    df.to_parquet('/dbfs/mnt/inbox/test.parquet')

    I am assuming that you are getting the error below:

    FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/inbox/test.parquet'

    Please do let me know if that is not accurate.

    When I tried to repro the issue, I got the above error. When checking DBFS, I did not have the path "/dbfs/mnt/inbox/test.parquet", but I did have the path "/dbfs/mnt/test.parquet".

    So the code below works fine for me:

    # Databricks notebook source

    # Import pandas library
    import pandas as pd

    # initialize list of lists
    data = [['tom', 10], ['nick', 15], ['juli', 14]]
    df = pd.DataFrame(data, columns = ['Name', 'Age'])
    df
    df.to_parquet('/dbfs/mnt/test.parquet')

    I am confident that the path to which you are referring does not exist.

    Please do let me know if you have any queries.
    Thanks
    Himanshu

