Data flow is not sinking all data when I have 84 JSON files in different sub-folders (78 blank, 6 with data), but when I delete the 78 blank files (leaving the 6 files with data), the data flow sinks all results.

Babu Musanga 20 Reputation points
2024-06-14T17:35:51.8166667+00:00

Hello. Please support: my data flow pipeline is not sinking all data. This happens whenever I have 84 JSON files, each in a different sub-folder (of which 78 files are blank and 6 files have data). But when I delete the 78 blank JSON files (and am left with only the 6 JSON files with data), the data flow sinks all results.

The blank JSON files have 2 bytes each: an open and a close square bracket, like this: [].

Also, the sink data preview shows all results (83 rows), as shown below. But when running the pipeline, only 15 results are returned, as shown in the second image below.

[Image: sink data preview showing 83 results]

Second image with the pipeline result below:

[Image: pipeline run output showing 15 results]

Azure Data Factory

Accepted answer
  1. hossein jalilian 7,845 Reputation points
    2024-06-14T21:29:11.93+00:00

    Hello Babu Musanga,

    Thanks for posting your question in the Microsoft Q&A forum.

    • Azure Data Factory's data flow treats empty JSON files (containing just []) as valid files with zero rows. Even though the files are technically empty, they are still opened and counted as part of the data flow execution, so remove or exclude the empty JSON files from the source folder before running the data flow pipeline (see the sketch after this list).
    • Data flows in Azure Data Factory use partitioning to parallelize data processing. If the partitioning strategy is not optimal for your dataset, it may lead to incomplete or inconsistent results.
    • The data flow engine may also apply certain optimizations or transformations that could cause data loss or inconsistencies in specific scenarios.
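
    As a minimal sketch of the first point, assuming the files sit in Azure Blob Storage / ADLS Gen2 and you can run a small script with the azure-storage-blob SDK, something like the following could delete the 2-byte blank files before the data flow runs. The connection string, container name, and folder prefix are placeholders to replace with your own, and only do this if you don't need to keep the blank files:

        from azure.storage.blob import BlobServiceClient

        # Placeholders -- replace with your own values.
        CONNECTION_STRING = "<storage-account-connection-string>"
        CONTAINER = "<container-name>"
        PREFIX = "source-folder/"  # root folder that holds the sub-folders of JSON files

        service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
        container = service.get_container_client(CONTAINER)

        # A blank file in this scenario is exactly 2 bytes: "[]".
        for blob in container.list_blobs(name_starts_with=PREFIX):
            if blob.name.endswith(".json") and blob.size <= 2:
                print(f"Deleting blank file: {blob.name}")
                container.delete_blob(blob.name)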

    Please don't forget to close the thread here by upvoting and accepting this as an answer if it is helpful.


2 additional answers

  1. phemanth 11,125 Reputation points Microsoft Vendor
    2024-06-18T05:18:44+00:00

    @Babu Musanga

    Thanks for posting your question in the Microsoft Q&A forum.

    I'd be glad to assist you in filtering out empty JSON files within the ADF data flow source activity, while preserving them in the source folder for audit purposes.

    Leverage Data Flow Transformation

    Add a Derived Column: In the data flow source activity, add a derived column named has_data (or any preferred name).

    Expression for has_data: Use the following expression to check for empty content.

    @greater(length(activity('Source').output.content), 2)

    • This expression checks whether the source content is longer than two characters; a blank file contains exactly the two characters [ and ] and nothing else, so anything longer has data and has_data evaluates to true.

    Filter Based on has_data: Add a Filter activity after the source. In the filter settings:
    1. Set the filter condition to has_data (or your chosen column name).
    2. Select the operator as is equal to.
    3. Choose the value as true.

    This approach ensures that only files with actual data (non-empty JSON) are processed further in the data flow, effectively excluding empty files without modifying the source directory.
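
    If you want to double-check outside the data flow which files would count as blank, without touching the source folder, here is a minimal read-only sketch assuming Azure Blob Storage / ADLS Gen2 and the azure-storage-blob SDK; the connection string, container, and prefix are placeholders:

        from azure.storage.blob import BlobServiceClient

        # Placeholders -- replace with your own values.
        CONNECTION_STRING = "<storage-account-connection-string>"
        CONTAINER = "<container-name>"
        PREFIX = "source-folder/"

        service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
        container = service.get_container_client(CONTAINER)

        empty_files, data_files = [], []
        for blob in container.list_blobs(name_starts_with=PREFIX):
            if not blob.name.endswith(".json"):
                continue
            # Treat a bare [] (ignoring surrounding whitespace) as a blank file.
            content = container.download_blob(blob.name).readall().strip()
            (empty_files if content == b"[]" else data_files).append(blob.name)

        print(f"{len(data_files)} files with data, {len(empty_files)} blank files")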

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful". And if you have any further queries, do let us know.


  2. Babu Musanga 20 Reputation points
    2024-07-19T17:20:43.1833333+00:00

    Thank you very much @phemanth and @hossein jalilian!

    It looks like whenever there are many blank JSON files in the data lake storage account, a data flow transformation pipeline might fail to process all JSON files. I modified @hossein jalilian's solution by following the steps below (a sketch of step 2 follows the list):

    1. Extract and load the JSON files to folder "a" in the data lake storage account
    2. If a JSON file in folder "a" is greater than 2 bytes, copy it to folder "b"
    3. Run the data flow transformation pipeline using folder "b" as the source
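
    A minimal sketch of step 2, assuming both folders live in the same Blob Storage / ADLS Gen2 container and using the azure-storage-blob SDK; the connection string, container, and folder names are placeholders. Inside ADF itself, the same size check can be done with a Get Metadata activity that returns the file size, followed by an If Condition and a Copy activity.

        from azure.storage.blob import BlobServiceClient

        # Placeholders -- replace with your own values.
        CONNECTION_STRING = "<storage-account-connection-string>"
        CONTAINER = "<container-name>"
        SOURCE_PREFIX = "a/"   # folder "a" with all extracted JSON files
        TARGET_PREFIX = "b/"   # folder "b" used as the data flow source

        service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
        container = service.get_container_client(CONTAINER)

        for blob in container.list_blobs(name_starts_with=SOURCE_PREFIX):
            # A blank file is exactly 2 bytes ("[]"); only copy files larger than that.
            if blob.name.endswith(".json") and blob.size > 2:
                data = container.download_blob(blob.name).readall()
                target_name = TARGET_PREFIX + blob.name[len(SOURCE_PREFIX):]
                container.upload_blob(name=target_name, data=data, overwrite=True)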

    Thank you

