Indexing a PySpark DataFrame
Hey guys,
I have a very large dataset stored as multiple parquet files (around 20,000 small files) which I'm reading into a PySpark DataFrame. I want to add an index column to this DataFrame and then do some data profiling and data quality checks. I'm sharing a portion of the code.
I've tried both monotonically_increasing_id and zipWithIndex. Every forum I've seen says zipWithIndex is best for performance, but for me it's the other way around. Here are my benchmarks for indexing the table both ways:
Parquet Size: 1.3 GB (around 15 GB if it's in CSV format)
Total Row Count: 1466764
Total Column Count: 900
Time taken for mono_id: 0.104 seconds
mono_df.rdd.getNumPartitions() = 1
Time taken for zip_id: 250.91 seconds
zip_df.rdd.getNumPartitions() = 350
You can see that monotonically_increasing_id finished in a split second while zipWithIndex took more than 4 minutes. However, the number of partitions after monotonically_increasing_id came down to just one; the original DataFrame had 350 partitions when read, and zipWithIndex maintained that.
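One thing I suspect with the mono_id timing: Spark transformations are lazy, so a timer around just the withColumn call may only measure building the query plan, not executing it. A pure-Python analogy of that lazy-vs-eager difference (my own illustration, not Spark code):

```python
import time

# Pure-Python analogy (not Spark): building a lazy pipeline is near-instant,
# while actually executing it takes real time. A generator stands in for a
# lazy transformation; sum() stands in for an action that forces execution.
start = time.time()
lazy = (x + 1 for x in range(1_000_000))   # "transformation": nothing computed yet
t_build = time.time() - start

start = time.time()
total = sum(lazy)                          # "action": iterates all million elements
t_run = time.time() - start

print(f"build: {t_build:.6f}s, run: {t_run:.6f}s")
```

By the same logic, a fair benchmark would force an action (e.g. .count()) inside the timed region for both approaches.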
Now, after indexing, my profiling code takes an average of 15 seconds per column on mono_df but an average of 30 minutes per column on zip_df.
I'm a newbie to PySpark and Databricks. What am I doing wrong here, and how can I improve the performance?
import time

from pyspark.sql import Row, SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession.builder.appName("Example").getOrCreate()

# spark.sql.files.maxPartitionBytes comes back as a string like "134217728b"
partition_size = spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b", "")
print(f"Partition Size: {int(partition_size) / 1024 / 1024} MB")

df_no_schema = spark.read.parquet('dbfs:parquet_folder/')
print(f"Number of Partitions: {df_no_schema.rdd.getNumPartitions()}")
print(df_no_schema.count())

columns = df_no_schema.columns
row_with_index = Row(*columns, "index")

def create_new_schema(df_no_schema):
    # Append a non-nullable LongType "index" field to the existing schema
    new_schema = StructType(df_no_schema.schema.fields[:] + [StructField("index", LongType(), False)])
    return new_schema

def zip_rdd(df_no_schema, new_schema):
    # zipWithIndex pairs each row with a contiguous 0-based index
    zipped_rdd = df_no_schema.rdd.zipWithIndex()
    df = zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema)
    return df

def mono_id(df_no_schema):
    # A global window (no partitionBy), ordered by monotonically_increasing_id
    window_spec = Window().orderBy(F.monotonically_increasing_id())
    df = df_no_schema.withColumn("index", F.row_number().over(window_spec))
    return df

new_schema = create_new_schema(df_no_schema)

mono_df = mono_id(df_no_schema)
print(f"Number of Partitions: {mono_df.rdd.getNumPartitions()}")

zip_df = zip_rdd(df_no_schema, new_schema)
print(f"Number of Partitions: {zip_df.rdd.getNumPartitions()}")
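As an aside, the row shape that zip_rdd's map lambda consumes can be sanity-checked in plain Python: enumerate is roughly the analogue of zipWithIndex, except it yields (index, row) rather than (row, index). A toy sketch with made-up rows:

```python
# Toy stand-in for the RDD's rows (hypothetical data, not from my dataset)
rows = [("alice", 30), ("bob", 25)]

# rdd.zipWithIndex() yields (row, index); enumerate yields (index, row),
# so the tuple order is swapped here. The map in zip_rdd then appends the
# index as a trailing column, mirroring list(ri[0]) + [ri[1]].
zipped = list(enumerate(rows))
with_index = [list(row) + [i] for i, row in zipped]
print(with_index)  # each row gains a trailing index column
```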
And this is the profiling code that follows:
# df here is the indexed DataFrame (mono_df or zip_df)
total_rows = df.count()

for column in df.columns:
    start_time = time.time()
    unique_count = df.select(column).distinct().count()
    unique_percentage = (unique_count / total_rows) * 100
    duplicate_count = total_rows - unique_count
    duplicate_percentage = (duplicate_count / total_rows) * 100 if duplicate_count > 0 else 0
    null_count = df.filter(df[column].isNull()).count()
    null_percentage = (null_count / total_rows) * 100
    quality = int(unique_percentage / 10)
    quality_string = f"{quality}/10"
    unique_payload = {"count": unique_count, "percentage": unique_percentage}
    duplicate_payload = {"count": duplicate_count, "percentage": duplicate_percentage}
    quality_payload = {"count": "", "percentage": quality_string}
    null_payload = {"count": null_count, "percentage": null_percentage}
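For clarity, here is the per-column arithmetic from that loop on its own, using my actual row count and a made-up unique_count (the quality score is just the uniqueness percentage divided by ten and truncated):

```python
# Standalone check of the loop's arithmetic. total_rows is from my benchmark;
# unique_count is hypothetical (exactly half the rows unique).
total_rows = 1466764
unique_count = 733382

unique_percentage = (unique_count / total_rows) * 100
duplicate_count = total_rows - unique_count
duplicate_percentage = (duplicate_count / total_rows) * 100 if duplicate_count > 0 else 0

quality = int(unique_percentage / 10)   # truncate, not round
quality_string = f"{quality}/10"
print(unique_percentage, duplicate_count, quality_string)
```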