Thanks for using the Microsoft Q&A platform and posting your query.
Yes, the duration you're experiencing can be normal when working with large datasets in PySpark, especially compared to operations on pandas DataFrames. PySpark distributes the data across multiple nodes and runs operations in parallel, and the overhead of scheduling tasks, serializing data, and shuffling it across the network can make individual operations take longer than their in-memory pandas equivalents, particularly on larger datasets.
However, there are several ways to optimize the performance of your PySpark operations:
- Use DataFrame/Dataset over RDD: the DataFrame and Dataset APIs go through Spark's Catalyst optimizer and Tungsten execution engine, which significantly improve performance compared to raw RDD code.
- Avoid UDFs (User-Defined Functions): Python UDFs serialize every row out of the JVM into Python and back, and their logic is opaque to the optimizer; prefer the built-in functions in pyspark.sql.functions wherever possible (see the first sketch after this list).
- Cache data in memory: Spark SQL can cache a table or DataFrame that is reused by several actions in an in-memory columnar format (example below).
- Reduce expensive shuffle operations: moving data across the network is costly; broadcasting a small lookup table in a join is one common way to avoid a shuffle (example below).
- Use coalesce() over repartition(): coalesce() reduces the number of partitions without a full shuffle, whereas repartition() always shuffles the entire DataFrame (comparison below).
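For illustration, here is a minimal sketch of replacing a Python UDF with a built-in column expression. The SparkSession, the sample data, and the amount column are just placeholders for your own setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "amount")  # sample data

# Python UDF: each row is serialized out of the JVM into Python and back,
# and the logic is invisible to the Catalyst optimizer.
add_tax_udf = F.udf(lambda x: x * 1.2, DoubleType())
slow = df.withColumn("amount_with_tax", add_tax_udf("amount"))

# Built-in column expression: stays inside the JVM and is fully optimizable.
fast = df.withColumn("amount_with_tax", F.col("amount") * 1.2)
```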
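If the same DataFrame feeds several downstream actions, caching it once avoids recomputing its whole lineage each time. A short sketch, reusing the df from the example above:

```python
df.cache()            # or df.persist() for memory-and-disk storage
df.count()            # the first action materializes the cache
df.groupBy("amount").count().show()   # later actions read from memory
df.unpersist()        # release the memory when you are done
```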
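One common way to avoid a shuffle is to broadcast a small lookup table in a join instead of shuffling both sides. The DataFrames below are made-up stand-ins for your own tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.range(10_000_000).withColumnRenamed("id", "key")          # big fact table
small_df = spark.createDataFrame([(i, f"label_{i}") for i in range(100)],  # small lookup
                                 ["key", "label"])

# Broadcasting ships the small table to every executor,
# so the large table never has to be shuffled for the join.
result = large_df.join(broadcast(small_df), on="key", how="left")
result.explain()   # the plan should show a broadcast hash join instead of a sort-merge join
```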
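And the difference between coalesce() and repartition() in a nutshell; the partition counts here are arbitrary and the sketch reuses the same SparkSession:

```python
df = spark.range(1_000_000).repartition(200)   # pretend we ended up with 200 partitions

# coalesce() merges existing partitions without a full shuffle (narrow dependency),
# so it is the cheaper option when you only need to reduce the partition count.
fewer = df.coalesce(8)

# repartition() always triggers a full shuffle, but produces evenly sized
# partitions and can also increase the partition count.
evenly = df.repartition(8)

print(fewer.rdd.getNumPartitions(), evenly.rdd.getNumPartitions())   # 8 8
```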
Remember, the performance can also depend on the resources available in your Spark environment, such as the number of cores and the amount of memory.
Hope this helps. Do let us know if you have any further queries.
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".