Synapse Analytics - Delta Lake - Vacuum is not working (Spark 3.0)

Anonymous
2021-07-15T10:16:56.2+00:00

Hi All.

I tried to use VACUUM on Synapse Analytics and it is not working for me: the old files are not deleted. I am using a Spark pool (Spark 3.0). This is what I tried:
from delta.tables import *
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")
deltaTable = DeltaTable.forPath(spark, "abfss://bronze@xxxxxxxxxxxxx.dfs.core.windows.net/delta/atxx.db/peoplexxxx/")
deltaTable.vacuum(retentionHours = 10)

I also tried this:
deltaTable.vacuum()

I still have files from the 6th of July.

Thanks for your help.

Kind regards,
Anaid


2 answers

  1. Arun Sethia 6 Reputation points Microsoft Employee
    2022-11-07T18:05:12.233+00:00

    Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false. The default retention threshold for vacuum is 7 days (168 hours).
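
    For reference, here is a minimal PySpark sketch of the same advice, assuming a placeholder table path (the <container>, <storageaccount>, and <table> names below are stand-ins, not values from the question); the DRY RUN form previews which files a VACUUM would remove without deleting anything:

    from delta.tables import DeltaTable

    # Placeholder path - replace with your own container, storage account and table
    delta_path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/delta/<table>/"

    # Turn off the safety check only if nothing reads or writes the table
    # for longer than the retention interval you are about to use
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    # Preview which files would be removed, without deleting anything
    spark.sql(f"VACUUM delta.`{delta_path}` RETAIN 10 HOURS DRY RUN").show(truncate=False)

    # Remove files older than 10 hours that are no longer referenced by the table
    DeltaTable.forPath(spark, delta_path).vacuum(10)

    Note that VACUUM only deletes data files that are no longer referenced by the current table version; files that are still part of the latest snapshot stay on storage regardless of the retention setting.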

    1 person found this answer helpful.

  2. Heimo Hiidenkari 1 Reputation point
    2021-08-13T07:17:06.113+00:00

    I got compaction and vacuuming to work in a Synapse PySpark notebook with this code. Hope this helps!

    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    # Target number of files after compaction
    numOfFiles = 1
    deltaPath = "abfss://******@yourstorageaccountname.dfs.core.windows.net/publish/data_table_name"

    # Rewrite the table into fewer files; dataChange=false marks the commit
    # as a pure layout change (no new data)
    (spark.read
        .format("delta")
        .load(deltaPath)
        .repartition(numOfFiles)
        .write
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(deltaPath))

    # Disable the retention safety check, then remove all files no longer referenced
    spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")
    spark.sql("VACUUM `publish`.`data_table_name` RETAIN 0 HOURS").show()

    • Heimo
