Delta's OPTIMIZE command will try to bin-pack smaller files into larger ones based on a target size. By default, this target is 1 GB. This means that if your total data size is much smaller than 1 GB (as in your case with 43 records), OPTIMIZE will not necessarily combine everything into a single file. You can adjust the target by setting the spark.databricks.delta.optimize.maxFileSize configuration.
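For example, a minimal sketch (this assumes a Databricks environment where that configuration is honored, the spark session available in a notebook, and a hypothetical table path /mnt/delta/my_table):

from delta.tables import DeltaTable

# Lower the OPTIMIZE target file size (value is in bytes; 1 GB is the default)
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)

# Re-run bin-packing compaction against the new target
deltaTable = DeltaTable.forPath(spark, "/mnt/delta/my_table")  # hypothetical path
deltaTable.optimize().executeCompaction()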
If you've performed Z-Ordering on your Delta table, OPTIMIZE may produce multiple files in order to respect the Z-Ordering layout. Remember, Z-Ordering is a trade-off between query performance and storage efficiency.
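For reference, a Z-Ordered optimize via the Delta Lake Python API looks like this (event_date is only an illustrative column name, not one from your table):

# Clusters the data by event_date; this can legitimately produce more than one file
deltaTable.optimize().executeZOrderBy("event_date")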
The properties you've set (spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite and spark.databricks.delta.properties.defaults.autoOptimize.autoCompact) affect writes and auto compaction. However, when you call deltaTable.optimize().executeCompaction() manually, these settings don't come into play: the explicit call compacts the data using the bin-packing logic described above.
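If you also want those defaults applied to future writes, a hedged sketch of setting them on the session (using the exact property names from your question) would be:

# Default table properties picked up by newly created Delta tables
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact", "true")

# The manual compaction below ignores those write-time settings and simply bin-packs existing files
deltaTable.optimize().executeCompaction()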
The vacuum() function you've called cleans up old files, but it respects a retention period, which defaults to 7 days. This means it won't immediately delete old files after compaction. If you're certain you want to remove old versions, you can reduce the retention period cautiously; be aware that this reduces your ability to run time travel queries against older versions of the data, and that a retention shorter than the default also requires disabling Delta's retention duration check:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")  # required for retention below the 7-day default
deltaTable.vacuum(retentionHours=1)  # 1 hour retention, use with caution