Thanks for using the Microsoft Q&A platform and posting your query.
Yes, the duration you're experiencing can be normal when working with large datasets in PySpark, especially compared to operations on pandas DataFrames. PySpark distributes the data across multiple nodes and runs operations in parallel, and the overhead of scheduling tasks, serializing data, and shuffling it across the network can make individual operations take longer than their in-memory pandas equivalents, particularly on larger datasets.
However, there are several ways to optimize the performance of your PySpark operations:
- Use DataFrame/Dataset over RDD: the DataFrame and Dataset APIs go through Spark's Catalyst optimizer and Tungsten execution engine, which significantly improve performance compared to raw RDD code.
- Avoid UDFs (User-Defined Functions): Python UDFs serialize every row out of the JVM into Python and back, and their logic is opaque to the optimizer; prefer the built-in functions in pyspark.sql.functions wherever possible (see the first sketch after this list).
- Cache data in memory: Spark SQL can cache a table or DataFrame that is reused by several actions in an in-memory columnar format (example below).
- Reduce expensive shuffle operations: moving data across the network is costly; broadcasting a small lookup table in a join is one common way to avoid a shuffle (example below).
- Use coalesce() over repartition(): coalesce() reduces the number of partitions without a full shuffle, whereas repartition() always shuffles the entire DataFrame (comparison below).
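For illustration, here is a minimal sketch of replacing a Python UDF with a built-in column expression. The SparkSession, the sample data, and the amount column are just placeholders for your own setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "amount")  # sample data

# Python UDF: each row is serialized out of the JVM into Python and back,
# and the logic is invisible to the Catalyst optimizer.
add_tax_udf = F.udf(lambda x: x * 1.2, DoubleType())
slow = df.withColumn("amount_with_tax", add_tax_udf("amount"))

# Built-in column expression: stays inside the JVM and is fully optimizable.
fast = df.withColumn("amount_with_tax", F.col("amount") * 1.2)
```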
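If the same DataFrame feeds several downstream actions, caching it once avoids recomputing its whole lineage each time. A short sketch, reusing the df from the example above:

```python
df.cache()            # or df.persist() for memory-and-disk storage
df.count()            # the first action materializes the cache
df.groupBy("amount").count().show()   # later actions read from memory
df.unpersist()        # release the memory when you are done
```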
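One common way to avoid a shuffle is to broadcast a small lookup table in a join instead of shuffling both sides. The DataFrames below are made-up stand-ins for your own tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.range(10_000_000).withColumnRenamed("id", "key")          # big fact table
small_df = spark.createDataFrame([(i, f"label_{i}") for i in range(100)],  # small lookup
                                 ["key", "label"])

# Broadcasting ships the small table to every executor,
# so the large table never has to be shuffled for the join.
result = large_df.join(broadcast(small_df), on="key", how="left")
result.explain()   # the plan should show a broadcast hash join instead of a sort-merge join
```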
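And the difference between coalesce() and repartition() in a nutshell; the partition counts here are arbitrary and the sketch reuses the same SparkSession:

```python
df = spark.range(1_000_000).repartition(200)   # pretend we ended up with 200 partitions

# coalesce() merges existing partitions without a full shuffle (narrow dependency),
# so it is the cheaper option when you only need to reduce the partition count.
fewer = df.coalesce(8)

# repartition() always triggers a full shuffle, but produces evenly sized
# partitions and can also increase the partition count.
evenly = df.repartition(8)

print(fewer.rdd.getNumPartitions(), evenly.rdd.getNumPartitions())   # 8 8
```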
Remember, the performance can also depend on the resources available in your Spark environment, such as the number of cores and the amount of memory.
Hope this helps. Do let us know if you have any further queries.
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".