How to loop through a file in Azure data factory having more than 5000 rows?

Phani 0 Reputation points
2023-04-20T10:45:55.58+00:00

I have a file with 10,000 rows and a few columns, say 3 or 4.

Requirements:

  1. I want to consume an API that returns some unique IDs, for example 3412, 9812, 1029, etc.
  2. The data from that API is consumed and stored in an ADLS Gen2 storage account as a JSON file.
  3. I then need to pass these unique IDs one by one to another API and get a detailed summary for each ID. I am using Data Factory to solve this.

Questions:

  1. Is there any solution to this problem?
  2. Since the Lookup activity can read only the first 5000 rows of a file in the storage account, how can I get the next 5000 rows so that I can loop over all the data in the file?

Limitations: I don't want to use Azure SQL or any other services; I am looking for a solution in ADF only. Please provide a detailed solution if possible so that it will be helpful for me and others. I didn't find any other solutions on the Azure support portal.
Azure Data Factory

3 answers

  1. Subashri Vasudevan 11,206 Reputation points
    2023-04-21T11:12:13.18+00:00

    Hi, one easy way without involving tables is to split the file using a copy activity. My source file has around 5600 rows. I used the option shown below in the copy activity to split the files; in my case two files got generated, one with 5000 rows and the other with 600 rows.

    [Screenshot 2023-04-21 at 4.39.25 PM: copy activity sink settings showing the file-split option]
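    In pipeline JSON, that split corresponds to the sink's format settings of the copy activity. Below is a minimal sketch, assuming a delimited-text sink writing to ADLS Gen2; the dataset names, file prefix, and extension are placeholders, and the exact property names depend on the sink type you use.

    ```json
    {
      "name": "SplitSourceFile",
      "type": "Copy",
      "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": {
          "type": "DelimitedTextSink",
          "storeSettings": { "type": "AzureBlobFSWriteSettings" },
          "formatSettings": {
            "type": "DelimitedTextWriteSettings",
            "fileExtension": ".csv",
            "maxRowsPerFile": 5000,
            "fileNamePrefix": "ids_chunk"
          }
        }
      },
      "inputs": [ { "referenceName": "SourceIdsFile", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "SplitIdsFolder", "type": "DatasetReference" } ]
    }
    ```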

    Once the files are split, you may loop over them in parallel and then do the processing that you need, for example with a Get Metadata activity feeding a ForEach, as sketched below. Please let us know if you have any questions.
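    A minimal sketch of that loop, assuming the split files land in a folder dataset named SplitIdsFolder and a child pipeline named ProcessIdChunk handles one file at a time (both names are placeholders):

    ```json
    [
      {
        "name": "ListSplitFiles",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "SplitIdsFolder", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "ForEachSplitFile",
        "type": "ForEach",
        "dependsOn": [ { "activity": "ListSplitFiles", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('ListSplitFiles').output.childItems", "type": "Expression" },
          "isSequential": false,
          "activities": [
            {
              "name": "ProcessOneChunk",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": { "referenceName": "ProcessIdChunk", "type": "PipelineReference" },
                "waitOnCompletion": true,
                "parameters": { "fileName": { "value": "@item().name", "type": "Expression" } }
              }
            }
          ]
        }
      }
    ]
    ```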


  2. AnnuKumari-MSFT 33,636 Reputation points Microsoft Employee
    2023-04-24T06:10:14.9566667+00:00

    Hi Phani, in addition to the approach provided by the community expert above, you could have a look at the workarounds below as well:

    1. Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size, as suggested in the Limitations and workarounds section of the Lookup activity documentation. A sketch of this pattern appears at the end of this answer.
    2. Also, you could check out the approach suggested in this thread: How to process more than 5000 records in Lookup Activity?
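    For workaround 1, here is a minimal sketch of the outer pipeline, assuming the source has already been broken into a known number of chunks and an inner pipeline named LookupAndCallDetailApi (a hypothetical name) processes one chunk at a time, so that each Lookup inside it stays under the 5000-row limit. The chunkCount parameter is also an assumption.

    ```json
    {
      "name": "ForEachChunkIndex",
      "type": "ForEach",
      "typeProperties": {
        "items": { "value": "@range(0, pipeline().parameters.chunkCount)", "type": "Expression" },
        "isSequential": false,
        "activities": [
          {
            "name": "RunInnerPipeline",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": { "referenceName": "LookupAndCallDetailApi", "type": "PipelineReference" },
              "waitOnCompletion": true,
              "parameters": { "chunkIndex": { "value": "@item()", "type": "Expression" } }
            }
          }
        ]
      }
    }
    ```

    Hope it helps. Please accept any of the answers if it is helpful. Thank you.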

  3. Izmir Sadigov 111 Reputation points MVP
    2023-04-24T06:17:48.96+00:00

    One common approach to looping through a file with more than 5000 rows in Azure Data Factory is to use the ForEach activity along with custom batch processing logic. Here's a high-level overview of the steps:

    1. Read the file: Use the appropriate connector in Azure Data Factory to read the file from the source, such as Blob storage or an FTP server, into a dataset or data flow.
    2. Configure the ForEach activity: Add a ForEach activity to your pipeline and configure it to loop over the rows in the file. You can process the rows in batches, for example 1000 rows per batch, or determine the batch size dynamically based on your requirements (a sketch appears at the end of this answer).
    3. Implement batch processing logic: Inside the ForEach activity, implement custom logic to process each batch of rows, for example a mapping data flow to transform the data, or filtering and aggregation steps that apply your business logic to each batch.
    4. Monitor and track progress: Use Azure Data Factory's monitoring and logging features to track the processing of each batch, confirm that the pipeline is progressing as expected, and troubleshoot any issues that arise during the loop.
    5. Handle errors and retries: As with any data processing operation, handle errors and retries appropriately. You can configure error handling and retry policies within the pipeline to deal with failures such as network errors or data format issues.
    6. Finalize processing: After processing all batches, add any activities needed to finish up, such as writing the processed data to a destination store or triggering downstream activities.

    By using the ForEach activity in combination with custom batch processing logic, you can effectively loop through a file with more than 5000 rows in Azure Data Factory and process the data in batches, working around the 5000-row limit of a single Lookup activity.
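    To make the pattern concrete, here is a minimal sketch of a Lookup feeding a ForEach that calls the detail API once per ID. The dataset name, the API URL, and the id field are assumptions, and each Lookup must still return 5000 rows or fewer, so this would typically run against one split or chunk file at a time.

    ```json
    [
      {
        "name": "LookupIds",
        "type": "Lookup",
        "typeProperties": {
          "source": { "type": "JsonSource" },
          "dataset": { "referenceName": "IdChunkFile", "type": "DatasetReference" },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachId",
        "type": "ForEach",
        "dependsOn": [ { "activity": "LookupIds", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('LookupIds').output.value", "type": "Expression" },
          "batchCount": 20,
          "activities": [
            {
              "name": "GetDetailSummary",
              "type": "WebActivity",
              "typeProperties": {
                "method": "GET",
                "url": "@concat('https://example.com/api/details/', item().id)"
              }
            }
          ]
        }
      }
    ]
    ```

    Note that the ForEach batchCount controls how many iterations run in parallel (up to 50); it does not change how many rows go into each batch, which comes from how the file is split or looked up.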
