pyspark dataframe methods: count, show ... duration?

Jon Z 140 Reputation points
2024-05-28T10:52:37.5833333+00:00

Hi,

I am testing some PySpark methods on a DataFrame that I created from a table in the dedicated SQL pool; it is about 32 million rows long.

When running for example:

"df1.show()"

or

"df1.count()"

both commands take about two minutes or more to complete.

Is this duration normal behaviour for a notebook? (I am used to Jupyter notebooks working with pandas DataFrames instead.)

Thanks,

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. phemanth 15,755 Reputation points Microsoft External Staff Moderator
    2024-05-28T11:00:59.82+00:00

    @Jon Z

    Thanks for using the MS Q&A platform and posting your query.

    Yes, the duration you're experiencing can be considered normal when working with large datasets in PySpark, especially compared to operations on pandas DataFrames. PySpark operates in a distributed manner: it spreads the data across multiple nodes and performs operations in parallel. The scheduling and data-movement overhead this involves can lead to longer execution times, particularly on larger datasets.
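
    As a rough illustration: PySpark transformations are lazy, and only actions such as count() or show() actually launch a distributed job, which is why those two calls take minutes on ~32 million rows. The sketch below assumes a Synapse Spark notebook with the built-in spark session and the dedicated pool connector (synapsesql, availability depends on your Synapse runtime); the database, table, and column names are placeholders.

        # Minimal sketch: lazy transformations vs. actions (table/column names are placeholders)
        df1 = spark.read.synapsesql("mydb.dbo.mytable")   # ~32 million rows read from the dedicated pool

        filtered = df1.filter(df1["amount"] > 0)   # transformation: returns immediately, nothing is computed yet
        filtered.count()                           # action: scans every partition across the cluster
        filtered.show(5)                           # action: runs another job, though only a few rows are returned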

    However, there are several ways to optimize the performance of your PySpark operations.

    1. Use DataFrame/Dataset over RDD: DataFrames and Datasets benefit from Spark's built-in optimization modules, which improve the performance of Spark workloads.
    2. Avoid UDFs (user-defined functions): these bypass Spark's optimizations and can be expensive operations.
    3. Cache data in memory: Spark SQL can cache tables using an in-memory columnar format, so repeated actions do not re-read the source (see the sketch after this list).
    4. Reduce expensive shuffle operations: moving data across the network between stages is costly.
    5. Use coalesce() over repartition(): coalesce() reduces the number of partitions without a full shuffle (also shown in the sketch below).
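
    As a rough sketch of points 3 and 5 (the cache/coalesce calls are standard PySpark; the column name and output path below are hypothetical):

        from pyspark.sql import functions as F

        # Point 3: cache the DataFrame so repeated actions reuse the in-memory copy
        df1.cache()     # or df1.persist(); lazy until the next action runs
        df1.count()     # first action pays the full read cost and populates the cache
        df1.show(5)     # subsequent actions are served from memory and return much faster

        # Point 5: coalesce() merges existing partitions without a full shuffle,
        # so prefer it over repartition() when you only need fewer partitions
        df_small = df1.filter(F.col("status") == "active")                # hypothetical column
        df_small.coalesce(8).write.mode("overwrite").parquet("/tmp/out")  # hypothetical output path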

    Remember, the performance can also depend on the resources available in your Spark environment, such as the number of cores and the amount of memory.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.

