How to process large files with DataFlow Activity in ADF

Manisha Barnwal 0 Reputation points Microsoft Employee
2024-03-01T17:41:32.44+00:00

Hi Team,

We have an Azure Data Factory pipeline which internally has an Azure Data Flow activity. This is the link to our Dev ADF: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/5571ff7b-2a93-4208-8dad-000ba00d4131/resourceGroups/DefaultResourceGroup-CUS/providers/Microsoft.DataFactory/factories/datafactorywestus2/overview

Context: The input source for the Data Flow activity is the set of files exported by Azure Application Insights. Classic Azure Application Insights originally exported the hourly data into multiple files of at most ~200 MB each. Classic Application Insights has been deprecated, and the migrated Application Insights now exports the hourly data into at most 3 files, each larger than 6 GB (sometimes even 90 GB).

Problem: The ADF pipeline used these files as input, but it now fails because it cannot process such large files. Are there any ways to work around this issue?

Thanks,

Manisha

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Debarchan Sarkar - MSFT 1,131 Reputation points Microsoft Employee
    2024-03-01T21:04:01.3766667+00:00

    It seems that the change in file size from Azure Application Insights has disrupted your Azure Data Factory pipeline. Here are some strategies you could consider to resolve this issue:

    File Splitting: One approach is to break these large files into smaller chunks before ingestion. Azure Data Lake Store can read a large file as smaller chunks or partitions (see the sketch after this answer for a pre-splitting approach).

    Incremental Loading: You might also consider implementing incremental data loading. Instead of reading the entire dataset each time, only new or changed data since the last update would be processed.

    Optimizing Data Factory Performance: There are ways to optimize the performance of Azure Data Factory itself. This includes parallel execution of activities and increasing DIUs (Data Integration Units) for the Copy activity.

    Refactor your Data Flow: If feasible, you can refactor your Data Flow to better handle large files. This could include using transformations that reduce the amount of data held in memory at once, such as the Surrogate Key or Window transformations.

    Use Mapping Data Flow's Optimized Compute Type: Mapping Data Flows provide an optimized compute type, which offers increased memory and processing power. This compute type is specifically designed to handle large-scale data operations.

    Remember to thoroughly test any changes to ensure they solve the problem without introducing new issues. Unfortunately, I don't see a straightforward solution to this problem.
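
    A minimal, hedged sketch of the pre-splitting idea: a plain Python script that cuts a large newline-delimited export file into ~200 MB pieces without breaking lines, so each piece stays valid line-delimited JSON. The paths and chunk size are hypothetical examples; in practice this would run as a pre-processing step (for example in an Azure Function or a Custom activity) before the Data Flow reads the data.

    ```python
    import os

    def split_file(src_path, out_dir, max_bytes=200 * 1024 * 1024):
        """Split a large newline-delimited file into pieces of roughly max_bytes.

        Lines are never split across pieces, so each output file remains valid
        newline-delimited JSON.
        """
        os.makedirs(out_dir, exist_ok=True)
        base = os.path.basename(src_path)
        part, written, out = 0, 0, None
        with open(src_path, "rb") as src:
            for line in src:
                # Start a new output file when the current one would exceed the cap.
                if out is None or written + len(line) > max_bytes:
                    if out:
                        out.close()
                    part += 1
                    written = 0
                    out = open(os.path.join(out_dir, f"{base}.part{part:04d}"), "wb")
                out.write(line)
                written += len(line)
        if out:
            out.close()

    # Hypothetical path: point this at a downloaded copy of the exported file.
    split_file("/data/export/PageViews_2024-03-01_00.json", "/data/export/split")
    ```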


  2. Greg Low 1,980 Reputation points Microsoft Regional Director
    2024-03-05T03:47:57.63+00:00

    Any chance you can make it a native set of activities rather than a data flow? We actively try to avoid data flows as they are so costly to run anyway. The native Copy activity, etc. doesn't have that limit, and it also has a built-in option for splitting files on output (see the sketch below). Might be worth a look. (And it will cost far less to run and start faster.)
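
    For reference, a sketch of the sink fragment of a Copy activity that splits its output, written as a Python dict mirroring the pipeline JSON. The "maxRowsPerFile" and "fileNamePrefix" properties are the documented DelimitedTextWriteSettings options behind the "Max rows per file" sink setting; the store settings type, file extension, and row limit shown here are assumptions and would need to match your own datasets.

    ```python
    # Sketch only: sink fragment of a Copy activity that splits output files.
    copy_sink = {
        "type": "DelimitedTextSink",
        "storeSettings": {
            # Assumes an ADLS Gen2 sink; use the write settings type for your store.
            "type": "AzureBlobFSWriteSettings"
        },
        "formatSettings": {
            "type": "DelimitedTextWriteSettings",
            "fileExtension": ".csv",
            "maxRowsPerFile": 1000000,          # cap each output file at ~1M rows
            "fileNamePrefix": "appinsights_split"
        }
    }
    ```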

