Pandas dataframe to_parquet stops working in Databricks runtime 10.2 (Apache Spark 3.2.0, Scala 2.12)

Joseph Chen 21 Reputation points
2022-02-14T17:50:34.5+00:00

The following code was working fine earlier but stopped working today. It creates no parquet file and causes an error when I try to read it back. As a workaround, I convert the pandas DataFrame to a Spark DataFrame to export the data. Does anybody know why it stopped working? Was there a recent pandas update that caused this to break?

df.to_parquet('/mnt/inbox/test.parquet')

Attached is sample test code to show the problem: 174174-missing-pandas-parquet-output-databricks.pdf

The code is also enclosed below to replicate the problem.

# Databricks notebook source  
# Import pandas library  
import pandas as pd  
  
# initialize list of lists  
data = [['tom', 10], ['nick', 15], ['juli', 14]]  
  
# Create the pandas DataFrame  
df = pd.DataFrame(data, columns = ['Name', 'Age'])  
  
# print dataframe.  
df  
  
# COMMAND ----------  
  
# no test.parquet present before writing the pandas dataframe  
dbutils.fs.ls('/mnt/inbox')  
  
# COMMAND ----------  
  
# write pandas dataframe  
df.to_parquet('/dbfs/mnt/inbox/test.parquet')  
  
# COMMAND ----------  
  
# found no output file  
dbutils.fs.ls('/mnt/inbox')  
  
# COMMAND ----------  
  
# found no output file  
dbutils.fs.ls('/dbfs/mnt/inbox')  
  
# COMMAND ----------  
  
# convert pandas dataframe to spark dataframe and export to file  
sparkDF = spark.createDataFrame(df)  
sparkDF.write.parquet('/mnt/inbox/test.parquet')  
  
# COMMAND ----------  
  
# found output file  
dbutils.fs.ls('/mnt/inbox')  
  
# COMMAND ----------  
  
# clean up testing  
dbutils.fs.rm('/mnt/inbox/test.parquet', recurse=True)  
dbutils.fs.ls('/mnt/inbox')  
  
  
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator
    2022-02-16T00:56:16.967+00:00

    Hello @Joseph Chen ,
    Thanks for the ask and for using the Microsoft Q&A platform.
    As we understand, the ask here is why you are getting an error while running the command:

    df.to_parquet('/dbfs/mnt/inbox/test.parquet')

    I am assuming that you are getting the error below:

    FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/inbox/test.parquet'

    Please do let me know if that is not accurate.

    When I tried to repro the issue, I got the above error. When checking DBFS, I did not have the path "/dbfs/mnt/inbox/test.parquet", but I did have the path "/dbfs/mnt/test.parquet".

    So the code below works fine for me:

    # Databricks notebook source

    # Import pandas library
    import pandas as pd

    # initialize list of lists
    data = [['tom', 10], ['nick', 15], ['juli', 14]]
    df = pd.DataFrame(data, columns = ['Name', 'Age'])
    df
    df.to_parquet('/dbfs/mnt/test.parquet')

    I am confident that the path to which you are referring does not exist.

    Please do let me know if you have any queries.
    Thanks
    Himanshu

