Reading Delta logs from PySpark throws: FileNotFoundException

Jose Gonzalez Gongora 25 Reputation points Microsoft Employee
2023-05-12T22:17:54.4933333+00:00

While reading Delta logs, PySpark tries to fetch Parquet files that were already removed by the existing retention policy (data older than 31 days is deleted). I don't believe this issue is related to PySpark itself; it's possible that the process that removes data older than 30 days takes more than an entire day to complete. If that's the case, how can I efficiently read from the Delta logs without running into this issue?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

2 answers

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-05-17T23:45:54.78+00:00

    Hello Jose Gonzalez Gongora,

    Welcome to the MS Q&A platform.

    To avoid this issue, please configure the retention policy for the delta logs to be less than 31 days. You can also use the immediatePurgeDataOn30Days parameter to trigger an immediate purge of older data.

    Reference document: https://learn.microsoft.com/en-us/azure/azure-monitor/logs/data-retention-archive?tabs=portal-1%2Cportal-2

    I hope this helps. Please let us know if you have any further questions.
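
    If the retention being adjusted here is the Delta table's own retention (rather than the Azure Monitor workspace setting in the linked document), a minimal sketch of changing it might look like the following; the table name and interval values are placeholders, not the actual configuration:

    ```python
    # Sketch only: set the Delta table's log and deleted-file retention.
    # "my_delta_table" and the 31-day intervals are placeholder values.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        ALTER TABLE my_delta_table SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 31 days',
            'delta.deletedFileRetentionDuration' = 'interval 31 days'
        )
    """)
    ```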

    1 person found this answer helpful.

  2. Jose Gonzalez Gongora 25 Reputation points Microsoft Employee
    2023-05-17T23:53:35.79+00:00

    I realized that the query I was using to read from the Delta tables wasn't filtering by the partition column (ingestion_date) but by a regular column (timestamp). So, even though both of these answers would help solve my problem, the root cause was that my query was poorly written. A sketch of the corrected filter follows below.
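
    Filtering on the partition column lets Spark prune the old partitions entirely, so it never tries to open data files that the retention job may already have deleted. A minimal sketch of the corrected read, assuming a table partitioned by ingestion_date and a placeholder ADLS path, could look like this:

    ```python
    # Sketch only: read recent data by filtering on the partition column
    # (ingestion_date) rather than a regular column (timestamp).
    # The storage path and the 30-day window are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.format("delta")
        .load("abfss://container@account.dfs.core.windows.net/path/to/table")
        # Partition pruning: only partitions from the last 30 days are scanned,
        # so files removed from older partitions are never requested.
        .filter(F.col("ingestion_date") >= F.date_sub(F.current_date(), 30))
    )
    ```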

