Data flow randomly failing

Hi all,
I have a data flow that keeps failing at random. Sometimes it works, sometimes it doesn't.
The data flow takes new data from the source (SQL Server), merges it with existing data (Avro on ADLS Gen2), and then saves the result to the same folder as the existing data. This is an approach I use very often, but somehow it fails with this data flow. The error I get is something like:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
In the sink the 'Clear folder' option is on. I have the feeling that this works a bit too well: maybe it first deletes everything from the folder before the source reads the data (so there's no data left to read). The funny thing, however, is that sometimes it just works and sometimes it doesn't. It's so random.
Am I thinking in the right direction? Or are there maybe other reasons why this is happening?
Thanks for helping me out!
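For illustration, here is a minimal PySpark sketch of the read-merge-overwrite pattern described above. It is not the code ADF generates; the paths, connection details, column assumptions, and package version are hypothetical. It shows why a lazy read combined with an overwrite ("Clear folder") on the same folder can fail only some of the time.

```python
# Minimal sketch of the read-merge-overwrite pattern, outside ADF.
# All paths, credentials, and versions below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("merge-into-same-folder")
    # spark-avro must be on the classpath for the "avro" format;
    # the version here is an assumption, match it to your Spark version.
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0")
    .getOrCreate()
)

existing_path = "abfss://container@account.dfs.core.windows.net/data/customers"

# New rows pulled from SQL Server (requires the SQL Server JDBC driver).
new_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb")
    .option("dbtable", "dbo.Customers")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Existing data in ADLS Gen2 -- this read is LAZY: nothing is pulled from
# storage until an action (the write below) actually runs.
existing_df = spark.read.format("avro").load(existing_path)

# Assumes both DataFrames share the same schema.
merged_df = existing_df.unionByName(new_df)

# Writing back to the SAME folder with overwrite ("Clear folder") deletes the
# Avro files while Spark may still need them to compute merged_df. Whether
# this blows up depends on timing and partitioning, which is why the data
# flow fails only sometimes.
merged_df.write.format("avro").mode("overwrite").save(existing_path)
```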
Hi @Erp, Wessel van ,
Just reaching out to see if the answer helped. If this answers your query, please click 'Accept Answer' and 'Up-Vote', as accepted answers help the community as well. If you have any further queries, do let us know.
Hi @AnnuKumari-MSFT ,
I had already implemented the solution you described as a workaround, and that works. I was hoping for a cleaner solution, but it seems that's not possible.
Based on the information I have, this issue looks like the result of some kind of bug, so I hope Microsoft will fix it in the near future. It's very annoying and it occurs in random situations.
For now thank you for your help.
Hi @Erp, Wessel van ,
We would appreciate it if you could share the feedback on our Data Factory feedback channel, which is open for the user community to upvote and comment on. This allows our product team to effectively prioritize your request against our existing feature backlog and gives insight into the potential impact of implementing the suggested feature.
Please consider accepting the answer by clicking on 'Accept Answer', as accepted answers help the community as well.
I was having the same issue and found a workaround: create a new branch in the data flow with a dummy sink, and in the data flow settings select the dummy sink as the first one in the write order, then put the real sink (the one that clears the folder and recreates the files) after it.
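The dummy-sink trick appears to work because the extra sink forces the source data to be read in full before the real sink clears the folder. As a rough analogue outside ADF (reusing the hypothetical merged_df and existing_path names from the sketch above, not the actual mechanism ADF uses), you could force materialization before the overwrite:

```python
# Rough analogue of the dummy-sink/write-order workaround: materialize the
# merged data before the destination folder is cleared by the overwrite.
merged_df = merged_df.persist()
merged_df.count()  # action: reads all source Avro files and caches the rows

merged_df.write.format("avro").mode("overwrite").save(existing_path)
merged_df.unpersist()
```

Caching is best-effort (evicted partitions would be recomputed from the already-deleted files), so the staging-folder pattern sketched further down in this thread is the more robust option.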
@Carlos Muñoz ,
Thanks for sharing the resolution steps with the community.
1 answer
Hi @Erp, Wessel van ,
Welcome to Microsoft Q&A platform and thanks for posting your query.
Failed with an error: "Error while reading file XXX. It is possible the underlying files have been updated"
Symptoms
When you use ADLS Gen2 as a sink in the data flow (to preview data, debug/trigger a run, etc.) and the partition setting in the Optimize tab of the Sink stage is not the default, the job may fail with the following error message:
Job failed due to reason: It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Cause
Recommendation
Please refer to the following article for more details on troubleshooting the issue: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-troubleshoot-connector-format#azure-data-lake-storage-gen2
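For completeness, this is roughly what the cache invalidation suggested by the error message looks like in Spark (table and path names are hypothetical). Note that in the scenario in this thread the files really were deleted, so refreshing the cache alone would not bring them back:

```python
# Invalidate cached metadata for a catalog table, as the error message suggests.
spark.sql("REFRESH TABLE my_table")

# For purely file-based reads, the path-level equivalent.
spark.catalog.refreshByPath(
    "abfss://container@account.dfs.core.windows.net/data/customers"
)

# Recreating the DataFrame (re-running spark.read...) has the same effect.
```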
Hope this will help. Please let us know if you have any further queries.
Hi @Erp, Wessel van ,
Just checking in to see if the above suggestion helped. If this answers your query, please click 'Accept Answer' and 'Up-Vote'. If you have any further queries, do let us know.
Hi @AnnuKumari-MSFT ,
Thank you for your response, and excuse my late reply.
I thought the issue was fixed, but it seems it is still happening.
The thing is, however, that I've set the Optimize tab to default, so that doesn't seem to be the cause.
The issue here is that the sink folder, which is also the source folder, is emptied first (I assume by the sink) before it gets read by the source.
Hi @Erp, Wessel van ,
I tried to repro your scenario and got the same error when I used the same folder for source and sink. It seems it's happening for the reason you mentioned: the source and sink folders are the same, and the 'Clear folder' option is set to true in the sink. Before the data is written to the sink, the files get deleted, which causes the error. Please do try one of the workarounds, such as pointing the sink at a different (staging) folder so that the source and sink no longer share the same location.
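A robust way to follow that advice, sketched below in PySpark with hypothetical paths (in ADF terms: point the sink at a separate staging folder and copy the result back in a second step), is to never overwrite the folder that the same plan is still reading from:

```python
# Staging-folder workaround (paths are hypothetical): write the merged result
# elsewhere first, so the original Avro files are never deleted mid-read.
staging_path = "abfss://container@account.dfs.core.windows.net/data/customers_staging"
final_path = "abfss://container@account.dfs.core.windows.net/data/customers"

# 1. Fully materialize the merged result in the staging folder.
merged_df.write.format("avro").mode("overwrite").save(staging_path)

# 2. Re-read the staged copy and overwrite the original folder from it.
spark.read.format("avro").load(staging_path) \
    .write.format("avro").mode("overwrite").save(final_path)
```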
If this answers your query, please click 'Accept Answer' and 'Up-Vote'. If you have any further queries, do let us know.