Azure Databricks workflow job failure

NIKHIL KUMAR 101 Reputation points
2024-05-02T08:23:36.4266667+00:00

We have a stream workflow job that run 24*7 and loads the data in delta table for say: raw.deltaTableA

Now, the problem is in case we are trying to optimize this delta (optimize raw.deltaTableA) table while the table is getting loaded we get frequent failures both the sides (i.e. Optimize query & stream workflow job).

Error: delta.exceptions.ConcurrentDeleteDeleteException: This transaction attempted to delete one or more files that were deleted by a concurrent update.

How can this be made sure that optimize query impact the ongoing data load by stream job ? Or is there any other way the optimization of this delta table be achieved ?

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,272 questions
{count} votes

1 answer

Sort by: Most helpful
  1. PRADEEPCHEEKATLA 90,446 Reputation points
    2024-05-02T10:21:25.59+00:00

    @NIKHIL KUMAR - Thanks for the question and using MS Q&A platform.

    The error message you are seeing indicates that there is a conflict between the ongoing data load and the optimization query. This is because the optimization query is trying to delete files that are being used by the ongoing data load.

    To avoid this conflict, you can try the following options:

    • Schedule the optimization query to run during off-peak hours when the data load is not happening. This will ensure that the optimization query does not interfere with the ongoing data load.
    • Increase the number of files in the delta table. This will reduce the likelihood of the optimization query and the data load trying to access the same file at the same time.
    • Use a different delta table for the optimization query. This will ensure that the optimization query does not interfere with the ongoing data load.
    • Use a different storage account for the delta table. This will ensure that the optimization query does not interfere with the ongoing data load.
    • Use a different cluster for the optimization query. This will ensure that the optimization query does not interfere with the ongoing data load.

    I hope this helps! Let me know if you have any further questions.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.