Auto Compaction of delta table in Synapse

Question

Hello,

I have a Delta table on which I run inserts, deletes, and updates. At the moment the table holds 43 records in Azure Synapse. I used the compaction logic below. My assumption was that, because the data is so small, Spark would compact everything into a single Parquet file (two at most), but I see around 20 Parquet files, each smaller than 5 KiB. Can someone please help me understand why Spark is not compacting the data into fewer files?

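from delta.tables import DeltaTable

# Allow VACUUM to run with a retention shorter than the default safety threshold
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# Session-level defaults for the autoOptimize table properties
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact", "true")

deltaTable = DeltaTable.forPath(spark, "")  # table path left empty in the post
deltaTable.optimize().executeCompaction()   # bin-pack small files into larger ones
deltaTable.vacuum()                         # no argument: uses the default retention period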

Here is the image from the storage container: [image]

Accepted Answer

Delta's OPTIMIZE command tries to bin-pack smaller files into larger ones based on a target file size. By default, this target is 1 GB. This means that when your total data size is much smaller than 1 GB (as in your case with 43 records), the data might not necessarily be combined into a single file. You can adjust the target by setting the spark.databricks.delta.optimize.maxFileSize configuration.
If you've applied Z-Ordering to your Delta table, OPTIMIZE might produce multiple files to respect the Z-Order layout. Remember, Z-Ordering is a trade-off between query performance and storage efficiency.
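
As a rough illustration of adjusting that setting, the sketch below sets spark.databricks.delta.optimize.maxFileSize and then runs the manual compaction. The helper name `compact_with_target_size` is hypothetical, and the session and table are passed in as arguments so the sketch makes no assumption about the table path. Whether changing the target actually reduces the file count for a table this small is not guaranteed.

```python
def compact_with_target_size(spark, delta_table, max_file_size_bytes):
    """Set the OPTIMIZE target file size, then run bin-packing compaction.

    spark       -- an active SparkSession
    delta_table -- a delta.tables.DeltaTable for the table to compact
    (hypothetical helper, not part of the Delta API)
    """
    # Configuration key named in the answer; the value is a size in bytes.
    spark.conf.set(
        "spark.databricks.delta.optimize.maxFileSize",
        str(max_file_size_bytes),
    )
    # executeCompaction() returns a DataFrame with the OPTIMIZE metrics.
    return delta_table.optimize().executeCompaction()


# Example call in a notebook where `spark` and `deltaTable` already exist:
# metrics = compact_with_target_size(spark, deltaTable, 128 * 1024 * 1024)
# metrics.show(truncate=False)
```

The returned metrics are one way to see how many files the run added and removed.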

The properties you've set (spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite and spark.databricks.delta.properties.defaults.autoOptimize.autoCompact) affect writes and automatic compaction. However, when you call deltaTable.optimize().executeCompaction() manually, these settings might not play a role. The explicit call should compact the data based on the bin-packing logic described above.

The vacuum() function you've called cleans up old files. However, it respects a retention period, which is 7 days by default. This means it won't immediately delete the files that compaction has replaced, so they can still show up in the storage container. If you're certain about removing old versions and want to reduce the retention period, you can do so cautiously. However, be aware that this reduces your ability to run time travel queries on older versions of the data.


deltaTable.vacuum(retentionHours=1)  # 1 hour retention, use with caution
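
One way to check whether the roughly 20 Parquet files all still belong to the current table version, or whether most are older files waiting out the retention period, is to replay the transaction log. The sketch below is a minimal illustration and not part of the original answer. `live_data_files` is a hypothetical helper, and it assumes the `_delta_log` folder is readable as local files (for example a downloaded copy) and that no checkpoint has replaced the early JSON commits.

```python
import json
from pathlib import Path


def live_data_files(table_dir):
    """Return the data file paths referenced by the latest table version.

    Replays the JSON commits in _delta_log in version order: an "add"
    action makes a file live, a "remove" action retires it.
    Assumption: every commit is still present as JSON (no checkpoint-only
    history), which is typical for a small, recently created table.
    """
    log_dir = Path(table_dir) / "_delta_log"
    live = set()
    # Commit names are zero-padded version numbers, so sorting by name
    # also sorts by version.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live
```

Comparing `len(live_data_files(path))` with the number of Parquet files in the container shows how many files are kept only for time travel and would be removed by vacuum once the retention period has passed.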
