Does spark df maintain row order after transforming to pandas df?

Dipesh Yogi 1 Reputation point
2022-12-14T04:04:21.877+00:00

Hi All,

I have a code snippet that converts a Spark DataFrame to a pandas DataFrame, extracts one column, and then converts it to a list of values.
e.g.:
embed_col= list(df.select('embed_col').toPandas()['embed_col'])
embed_col= np.float32(np.stack(embed_col))

What I observed is that the order of rows is sometimes not maintained. This causes inconsistent results, because later calculations depend on the indexes of the elements.

I also observed that this does not happen every time; it appears to occur randomly.

Any suggestions or remarks on this behavior would be really helpful.

Azure Databricks

1 answer

  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2022-12-14T22:56:21.087+00:00

    Hello @Dipesh Yogi ,

    Thanks for the question and for using the MS Q&A platform.

    Unlike a pandas DataFrame, a Spark DataFrame does not have a row index. It is distributed, so when you extract a column from it and convert that column to a list, the order of rows is not guaranteed.

    To get a stable order, first create a Spark DataFrame that contains the desired column and an index column (use an existing index column, or create one if none exists). Convert that Spark DataFrame to a pandas DataFrame, sort it on the index column, and set the pandas index to that column. The list you then build from it should be in the expected order.
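A minimal sketch of the pandas-side step described above, not a definitive implementation. It assumes the Spark DataFrame already has a unique key column, here given the hypothetical name `row_id`, and that both columns were collected together. The helper name `ordered_embeddings` is also made up for illustration. A small hand-built pandas frame stands in for the `toPandas()` result so the example runs without a cluster.

```python
import numpy as np
import pandas as pd


def ordered_embeddings(pdf: pd.DataFrame, index_col: str, value_col: str) -> np.ndarray:
    """Sort a collected pandas frame by index_col, then stack value_col as float32."""
    ordered = pdf.sort_values(index_col).set_index(index_col)
    return np.stack(ordered[value_col].to_list()).astype(np.float32)


# With Spark, the input would come from something like (not run here):
#   pdf = df.select("row_id", "embed_col").toPandas()
# where "row_id" is an assumed, pre-existing unique key column.

# Stand-in for a toPandas() result whose rows arrived out of order.
pdf = pd.DataFrame({
    "row_id": [2, 0, 1],
    "embed_col": [[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]],
})

embed_col = ordered_embeddings(pdf, "row_id", "embed_col")
```

Because the key travels with each row, the final array order depends only on the key values and not on the order in which rows were collected.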

    Hope this will help.

