Auto Compaction of delta table in Synapse

DeGuy 40 Reputation points
2023-09-14T13:48:00.5366667+00:00

Hello,

I have a Delta table on which I run inserts, deletes, and updates. At the moment the table in Azure Synapse holds 43 records. I ran the compaction logic below, assuming that, because the data is so small, Spark would compact everything into a single Parquet file (or at most two). Instead I see around 20 Parquet files, each smaller than 5 KiB. Can someone please help me understand why Spark is not compacting the data into fewer files?

from delta.tables import *
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite","true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact","true")
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.optimize().executeCompaction()
deltaTable.vacuum()

Here is the image from the storage container. [User's image]


Accepted answer
  1. Amira Bedhiafi 14,251 Reputation points
    2023-09-15T09:55:13.2133333+00:00

    Delta's OPTIMIZE command bin-packs small files into larger ones up to a target size, which defaults to 1 GB and can be changed through the spark.databricks.delta.optimize.maxFileSize configuration. Bin-packing happens within each partition, so a partitioned table keeps at least one file per partition however small the data is. With only 43 records, far below 1 GB, the current version of an unpartitioned table should end up in a single file after compaction.
    If you Z-Order the table, OPTIMIZE may produce multiple files to respect the Z-Order layout; executeCompaction() itself only bin-packs. Remember that Z-Ordering trades some storage efficiency for query performance.
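As a rough illustration of bin-packing against a target size, the sketch below groups file sizes with a greedy first-fit strategy. It is a simplified model and not Delta's actual implementation; the function name `bin_pack` and the sample sizes are assumptions made for this example.

```python
KIB = 1024
GIB = 1024 ** 3


def bin_pack(file_sizes, max_file_size):
    """Greedy first-fit: group sizes so each group's total stays within max_file_size."""
    bins = []
    # Placing the largest files first tends to give tighter packing.
    for size in sorted(file_sizes, reverse=True):
        for group in bins:
            if sum(group) + size <= max_file_size:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            bins.append([size])
    return bins


# Hypothetical input: twenty files of 5 KiB each, packed against a 1 GiB target.
groups = bin_pack([5 * KIB] * 20, GIB)
print(len(groups))
```

Under this model, twenty 5 KiB files fit into one group, which suggests that for a single partition the target size alone would not keep that many tiny files separate.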

    The properties you have set (spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite and spark.databricks.delta.properties.defaults.autoOptimize.autoCompact) control optimized writes and automatic compaction at write time. They are unlikely to play a role when you call deltaTable.optimize().executeCompaction() manually; the explicit call should compact the data using the bin-packing logic described above.

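    The vacuum() call cleans up files that the table no longer references, but it respects a retention period, which is 7 days by default. Files replaced by compaction are therefore not deleted straight away, so many of the small files you still see in storage are probably older versions left over from earlier inserts, updates, and deletes. If you are certain you no longer need old versions, you can shorten the retention period with care: a value below the default only works with spark.databricks.delta.retentionDurationCheck.enabled set to false (as in your code), and it reduces your ability to run time travel queries on older versions of the data.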

    
    deltaTable.vacuum(retentionHours=1)  # 1 hour retention, use with caution
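To check how many of the Parquet files in storage belong to the current table version, one option is to replay the JSON commits in the table's _delta_log folder. This is a minimal sketch under stated assumptions: the table is reachable through a local or mounted filesystem path, the JSON commits from version 0 are all still present, and checkpoint files are ignored. The helper name `live_and_removed_files` is hypothetical.

```python
import json
from pathlib import Path


def live_and_removed_files(table_path):
    """Replay JSON commits in _delta_log; return (live, removed) sets of data file paths."""
    log_dir = Path(table_path) / "_delta_log"
    live, removed = set(), set()
    # Commit file names are zero-padded version numbers, so name order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                path = action["add"]["path"]
                live.add(path)
                removed.discard(path)
            elif "remove" in action:
                path = action["remove"]["path"]
                live.discard(path)
                removed.add(path)
    return live, removed
```

Calling `live_and_removed_files("<path-to-delta-table>")` and comparing the size of the first set with the number of Parquet files in the container gives a rough idea of how many files are only waiting for vacuum to delete them.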
    
