How to process large files with DataFlow Activity in ADF

Manisha Barnwal 0 Reputation points Microsoft Employee
2024-03-01T17:41:32.44+00:00

Hi Team,

We have an Azure Data Factory pipeline which internally has an Azure Data Flow activity. This is the link to our Dev ADF: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/5571ff7b-2a93-4208-8dad-000ba00d4131/resourceGroups/DefaultResourceGroup-CUS/providers/Microsoft.DataFactory/factories/datafactorywestus2/overview

Context: The input source for the Data Flow activity is the set of files exported by Azure Application Insights. Classic Azure Application Insights originally exported the hourly data into multiple files of at most ~200 MB each. Classic Application Insights has been deprecated, and the migrated Application Insights now exports the hourly data into at most 3 files, each larger than 6 GB (sometimes even 90 GB).

Problem: The ADF pipeline used these files as input, but it now fails because it cannot process such large files. Are there any ways to work around this issue?

Thanks,

Manisha

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Debarchan Sarkar - MSFT 1,131 Reputation points Microsoft Employee
    2024-03-01T21:04:01.3766667+00:00

    It seems that the change in file size from Azure Application Insights has disrupted your Azure Data Factory pipeline. Here are some strategies you could consider to resolve this issue:

    File Splitting: One approach is to break these large files into smaller chunks before ingestion. Azure Data Lake Store can read a large file as smaller chunks or partitions (see the sketch after this answer for a pre-splitting approach).

    Incremental Loading: You might also consider implementing incremental data loading. Instead of reading the entire dataset each time, only new or changed data since the last update would be processed.

    Optimizing Data Factory Performance: There are ways to optimize the performance of Azure Data Factory itself. This includes parallel execution of activities and increasing DIUs (Data Integration Units) for the Copy activity.

    Refactor your Data Flow: If feasible, you can refactor your Data Flow to better handle large files. This could include using transformations that reduce the amount of data held in memory at once, such as the Surrogate Key or Window transformations.

    Use Mapping Data Flow's Optimized Compute Type: Mapping Data Flows provide an optimized compute type, which offers increased memory and processing power. This compute type is specifically designed to handle large-scale data operations.

    Remember to thoroughly test any changes to ensure they solve the problem without introducing new issues. Unfortunately, I don't see a straightforward solution to this problem.
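
    A minimal, hedged sketch of the pre-splitting idea: a plain Python script that cuts a large newline-delimited export file into ~200 MB pieces without breaking lines, so each piece stays valid line-delimited JSON. The paths and chunk size are hypothetical examples; in practice this would run as a pre-processing step (for example in an Azure Function or a Custom activity) before the Data Flow reads the data.

    ```python
    import os

    def split_file(src_path, out_dir, max_bytes=200 * 1024 * 1024):
        """Split a large newline-delimited file into pieces of roughly max_bytes.

        Lines are never split across pieces, so each output file remains valid
        newline-delimited JSON.
        """
        os.makedirs(out_dir, exist_ok=True)
        base = os.path.basename(src_path)
        part, written, out = 0, 0, None
        with open(src_path, "rb") as src:
            for line in src:
                # Start a new output file when the current one would exceed the cap.
                if out is None or written + len(line) > max_bytes:
                    if out:
                        out.close()
                    part += 1
                    written = 0
                    out = open(os.path.join(out_dir, f"{base}.part{part:04d}"), "wb")
                out.write(line)
                written += len(line)
        if out:
            out.close()

    # Hypothetical path: point this at a downloaded copy of the exported file.
    split_file("/data/export/PageViews_2024-03-01_00.json", "/data/export/split")
    ```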


  2. Greg Low 1,980 Reputation points Microsoft Regional Director
    2024-03-05T03:47:57.63+00:00

    Any chance you can make it a native set of activities rather than a data flow? We actively try to avoid data flows as they are so costly to run anyway. The native Copy activity, etc. doesn't have that limit, and it also has a built-in option for splitting files on output (see the sketch below). Might be worth a look. (And it will cost far less to run and start faster.)
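
    For reference, a sketch of the sink fragment of a Copy activity that splits its output, written as a Python dict mirroring the pipeline JSON. The "maxRowsPerFile" and "fileNamePrefix" properties are the documented DelimitedTextWriteSettings options behind the "Max rows per file" sink setting; the store settings type, file extension, and row limit shown here are assumptions and would need to match your own datasets.

    ```python
    # Sketch only: sink fragment of a Copy activity that splits output files.
    copy_sink = {
        "type": "DelimitedTextSink",
        "storeSettings": {
            # Assumes an ADLS Gen2 sink; use the write settings type for your store.
            "type": "AzureBlobFSWriteSettings"
        },
        "formatSettings": {
            "type": "DelimitedTextWriteSettings",
            "fileExtension": ".csv",
            "maxRowsPerFile": 1000000,          # cap each output file at ~1M rows
            "fileNamePrefix": "appinsights_split"
        }
    }
    ```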

