Data flow randomly failing

Hi all,
I have a data flow that keeps failing at random. Sometimes it works, sometimes it doesn't.
The data flow takes new data from the source (SQL Server), merges it with existing data (Avro on ADLS Gen2), and then saves the result to the same folder as the existing data. This is an approach I use very often, but somehow it fails with this data flow. The error I get is something like:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
In the sink the 'Clear folder' option is on. I have the feeling that this works a bit too well: maybe it first deletes everything from the folder before the source reads the data (so there's no data left to read). The funny thing, however, is that sometimes it just works and sometimes it doesn't. It's so random.
Am I thinking in the right direction? Or are there maybe other reasons why this is happening?
Thanks for helping me out!
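For illustration, here is a minimal PySpark sketch of the read-merge-overwrite pattern described above. It is not the code ADF generates; the paths, connection details, column assumptions, and package version are hypothetical. It shows why a lazy read combined with an overwrite ("Clear folder") on the same folder can fail only some of the time.

```python
# Minimal sketch of the read-merge-overwrite pattern, outside ADF.
# All paths, credentials, and versions below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("merge-into-same-folder")
    # spark-avro must be on the classpath for the "avro" format;
    # the version here is an assumption, match it to your Spark version.
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0")
    .getOrCreate()
)

existing_path = "abfss://container@account.dfs.core.windows.net/data/customers"

# New rows pulled from SQL Server (requires the SQL Server JDBC driver).
new_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb")
    .option("dbtable", "dbo.Customers")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Existing data in ADLS Gen2 -- this read is LAZY: nothing is pulled from
# storage until an action (the write below) actually runs.
existing_df = spark.read.format("avro").load(existing_path)

# Assumes both DataFrames share the same schema.
merged_df = existing_df.unionByName(new_df)

# Writing back to the SAME folder with overwrite ("Clear folder") deletes the
# Avro files while Spark may still need them to compute merged_df. Whether
# this blows up depends on timing and partitioning, which is why the data
# flow fails only sometimes.
merged_df.write.format("avro").mode("overwrite").save(existing_path)
```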
Hi @Erp, Wessel van ,
Just reaching out to see if the answer helped. If this answers your query, please click 'Accept Answer' and 'Up-Vote', as accepted answers help the community as well. If you have any further queries, do let us know.
Hi @AnnuKumari-MSFT ,
I had already implemented the solution you described as a workaround, and that works. I was hoping for a cleaner solution, but it seems that's not possible.
Based on the information I have, this issue looks like the result of some kind of bug, so I hope Microsoft will fix it in the near future. It's very annoying and it occurs in random situations.
For now thank you for your help.
Hi @Erp, Wessel van ,
We would appreciate it if you could share the feedback on our Data Factory feedback channel, which is open for the user community to upvote and comment on. This allows our product team to effectively prioritize your request against our existing feature backlog and gives insight into the potential impact of implementing the suggested feature.
Please consider accepting the answer by clicking on 'Accept Answer', as accepted answers help the community as well.
I was having the same issue and found a workaround: create a new branch in the data flow with a dummy sink, and in the data flow settings select the dummy sink as the first one in the write order, then put the real sink (the one that clears the folder and recreates the files) after it.
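The dummy-sink trick appears to work because the extra sink forces the source data to be read in full before the real sink clears the folder. As a rough analogue outside ADF (reusing the hypothetical merged_df and existing_path names from the sketch above, not the actual mechanism ADF uses), you could force materialization before the overwrite:

```python
# Rough analogue of the dummy-sink/write-order workaround: materialize the
# merged data before the destination folder is cleared by the overwrite.
merged_df = merged_df.persist()
merged_df.count()  # action: reads all source Avro files and caches the rows

merged_df.write.format("avro").mode("overwrite").save(existing_path)
merged_df.unpersist()
```

Caching is best-effort (evicted partitions would be recomputed from the already-deleted files), so the staging-folder pattern sketched further down in this thread is the more robust option.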
@Carlos Muñoz ,
Thanks for sharing the resolution steps with the community.
1 answer
Hi @Erp, Wessel van ,
Welcome to Microsoft Q&A platform and thanks for posting your query.
Failed with an error: "Error while reading file XXX. It is possible the underlying files have been updated"
Symptoms
When you use ADLS Gen2 as a sink in the data flow (to preview data, debug/trigger a run, etc.) and the partition setting in the Optimize tab of the Sink stage is not the default, the job may fail with the following error message:
Job failed due to reason: It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Cause
Recommendation
Please refer to the following article for more details on troubleshooting the issue: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-troubleshoot-connector-format#azure-data-lake-storage-gen2
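For completeness, this is roughly what the cache invalidation suggested by the error message looks like in Spark (table and path names are hypothetical). Note that in the scenario in this thread the files really were deleted, so refreshing the cache alone would not bring them back:

```python
# Invalidate cached metadata for a catalog table, as the error message suggests.
spark.sql("REFRESH TABLE my_table")

# For purely file-based reads, the path-level equivalent.
spark.catalog.refreshByPath(
    "abfss://container@account.dfs.core.windows.net/data/customers"
)

# Recreating the DataFrame (re-running spark.read...) has the same effect.
```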
Hope this will help. Please let us know if you have any further queries.
Hi @Erp, Wessel van ,
Just checking in to see if the above suggestion helped. If this answers your query, please click 'Accept Answer' and 'Up-Vote'. If you have any further queries, do let us know.
Hi @AnnuKumari-MSFT ,
Thank you for your response, and excuse my late reply.
I thought the issue was fixed, but it seems it is still happening.
The thing is, however, that I've set the Optimize tab to default, so that doesn't seem to be the cause.
The issue here is that the sink folder, which is also the source folder, is emptied first (I assume by the sink) before it gets read by the source.
Hi @Erp, Wessel van ,
I tried to repro your scenario and got the same error when I used the same folder for source and sink. It seems it's happening for the reason you mentioned: the source and sink folders are the same, and the 'Clear folder' option is set to true in the sink. Before the data is written to the sink, the files get deleted, which causes the error. Please do try one of the workarounds, such as pointing the sink at a different (staging) folder so that the source and sink no longer share the same location.
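A robust way to follow that advice, sketched below in PySpark with hypothetical paths (in ADF terms: point the sink at a separate staging folder and copy the result back in a second step), is to never overwrite the folder that the same plan is still reading from:

```python
# Staging-folder workaround (paths are hypothetical): write the merged result
# elsewhere first, so the original Avro files are never deleted mid-read.
staging_path = "abfss://container@account.dfs.core.windows.net/data/customers_staging"
final_path = "abfss://container@account.dfs.core.windows.net/data/customers"

# 1. Fully materialize the merged result in the staging folder.
merged_df.write.format("avro").mode("overwrite").save(staging_path)

# 2. Re-read the staged copy and overwrite the original folder from it.
spark.read.format("avro").load(staging_path) \
    .write.format("avro").mode("overwrite").save(final_path)
```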
If this answers your query, please click 'Accept Answer' and 'Up-Vote'. If you have any further queries, do let us know.