Reading Delta logs from PySpark throws: FileNotFoundException

Jose Gonzalez Gongora 25 Reputation points Microsoft Employee
2023-05-12T22:17:54.4933333+00:00

While reading Delta logs, PySpark tries to fetch Parquet files that were already removed by the existing retention policy (data older than 31 days is deleted). I don't believe this issue is related to PySpark itself; it's possible that the process that removes data older than 30 days takes more than an entire day to complete. If that's the case, how can I efficiently read from the Delta logs without running into this issue?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

2 answers

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-05-17T23:45:54.78+00:00

    Hello Jose Gonzalez Gongora,

    Welcome to the MS Q&A platform.

    To avoid this issue, please configure the retention policy for the delta logs to be less than 31 days. You can also use the immediatePurgeDataOn30Days parameter to trigger an immediate purge of older data.

    Reference document: https://learn.microsoft.com/en-us/azure/azure-monitor/logs/data-retention-archive?tabs=portal-1%2Cportal-2

    I hope this helps. Please let us know if you have any further questions.
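
    If the retention being adjusted here is the Delta table's own retention (rather than the Azure Monitor workspace setting in the linked document), a minimal sketch of changing it might look like the following; the table name and interval values are placeholders, not the actual configuration:

    ```python
    # Sketch only: set the Delta table's log and deleted-file retention.
    # "my_delta_table" and the 31-day intervals are placeholder values.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        ALTER TABLE my_delta_table SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 31 days',
            'delta.deletedFileRetentionDuration' = 'interval 31 days'
        )
    """)
    ```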

    1 person found this answer helpful.

  2. Jose Gonzalez Gongora 25 Reputation points Microsoft Employee
    2023-05-17T23:53:35.79+00:00

    I realized that the query I was using to read from the Delta tables wasn't filtering by the partition column (ingestion_date) but by a regular column (timestamp). So, even though both of these answers would help solve my problem, the root cause was that my query was poorly written. A sketch of the corrected filter follows below.
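
    Filtering on the partition column lets Spark prune the old partitions entirely, so it never tries to open data files that the retention job may already have deleted. A minimal sketch of the corrected read, assuming a table partitioned by ingestion_date and a placeholder ADLS path, could look like this:

    ```python
    # Sketch only: read recent data by filtering on the partition column
    # (ingestion_date) rather than a regular column (timestamp).
    # The storage path and the 30-day window are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.format("delta")
        .load("abfss://container@account.dfs.core.windows.net/path/to/table")
        # Partition pruning: only partitions from the last 30 days are scanned,
        # so files removed from older partitions are never requested.
        .filter(F.col("ingestion_date") >= F.date_sub(F.current_date(), 30))
    )
    ```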

