Synapse Analytics - Delta Lake - Vacuum is not working (Spark 3.0)

Anonymous
2021-07-15T10:16:56.2+00:00

Hi All.

I tried to use VACUUM on Synapse Analytics and it is not working for me: the old files are not deleted. I am using a Spark pool (Spark 3.0). This is what I tried:
from delta.tables import *
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")
deltaTable = DeltaTable.forPath(spark, "abfss://bronze@xxxxxxxxxxxxx.dfs.core.windows.net/delta/atxx.db/peoplexxxx/")
deltaTable.vacuum(retentionHours = 10)

I also tried this:
deltaTable.vacuum()

I still have files from the 6th of July.

Thanks for your help.

Kind regards,
Anaid


2 answers

  1. Arun Sethia 6 Reputation points Microsoft Employee
    2022-11-07T18:05:12.233+00:00

    Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false. The default retention threshold for vacuum is 7 days (168 hours).
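
    For reference, here is a minimal PySpark sketch of the same advice, assuming a placeholder table path (the <container>, <storageaccount>, and <table> names below are stand-ins, not values from the question); the DRY RUN form previews which files a VACUUM would remove without deleting anything:

    from delta.tables import DeltaTable

    # Placeholder path - replace with your own container, storage account and table
    delta_path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/delta/<table>/"

    # Turn off the safety check only if nothing reads or writes the table
    # for longer than the retention interval you are about to use
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    # Preview which files would be removed, without deleting anything
    spark.sql(f"VACUUM delta.`{delta_path}` RETAIN 10 HOURS DRY RUN").show(truncate=False)

    # Remove files older than 10 hours that are no longer referenced by the table
    DeltaTable.forPath(spark, delta_path).vacuum(10)

    Note that VACUUM only deletes data files that are no longer referenced by the current table version; files that are still part of the latest snapshot stay on storage regardless of the retention setting.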

    1 person found this answer helpful.

  2. Heimo Hiidenkari 1 Reputation point
    2021-08-13T07:17:06.113+00:00

    I got compaction and vacuuming to work in a Synapse PySpark notebook with this code. Hope this helps!

    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    # Target number of files after compaction
    numOfFiles = 1
    deltaPath = "abfss://******@yourstorageaccountname.dfs.core.windows.net/publish/data_table_name"

    # Rewrite the table into fewer files; dataChange=false marks the commit
    # as a pure layout change (no new data)
    (spark.read
        .format("delta")
        .load(deltaPath)
        .repartition(numOfFiles)
        .write
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(deltaPath))

    # Disable the retention safety check, then remove all files no longer referenced
    spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")
    spark.sql("VACUUM `publish`.`data_table_name` RETAIN 0 HOURS").show()

    • Heimo
