How to post multi-GB data file to REST API using Azure Data Factory?

JegorsM 0 Reputation points
2024-03-18T12:42:07.5466667+00:00

Hello,

I am currently working on a data ingestion pipeline that needs to process large files (up to 5 GB) and post them to a REST API. The files are stored in Azure Blob Storage.

The data cannot be sent in chunks, so it needs to be sent whole. However, I'm running into memory limits in multiple services: I initially tried Azure Functions but hit its 1.5 GB memory limit, and in Data Factory both the Lookup activity (loading from a dataset) and the Web activity response (for a GET request to Blob Storage) are capped at 4 MB.

Is there any other way to load the data into Data Factory without running into these issues?

Thank you in advance for any responses.

Azure
A cloud computing platform and infrastructure for building, deploying and managing applications and services through a worldwide network of Microsoft-managed datacenters.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-03-19T10:25:57.6133333+00:00

    To be honest, your use case is challenging, especially since the data needs to be sent as a whole rather than in chunks.

    Since the files are already in Azure Blob Storage, one workaround could be to generate a Shared Access Signature (SAS) token for the file and then send the URL with the SAS token to the REST API. This method requires the receiving API to be capable of downloading the file directly from Blob Storage, and it bypasses the need to load the file into ADF's memory space entirely.

    1. Use ADF to automate the generation of a SAS token for the Blob Storage file. You can use an Azure Function to generate it if you want to keep this within ADF's workflow (see the sketch after this list).
    2. Use the Web Activity in ADF to call the REST API, providing the SAS URL as a parameter. The REST API is then responsible for downloading the file directly from Blob Storage.
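
    Below is a minimal Python sketch of the SAS-generation step, e.g. inside an Azure Function that ADF invokes. The account, container, and blob names are placeholders, and in practice the account key should come from Key Vault (or you would use a user delegation key with a managed identity).

    ```python
    # Hypothetical sketch: generate a short-lived, read-only SAS URL for a blob
    # using the azure-storage-blob SDK (v12). All names below are placeholders.
    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import generate_blob_sas, BlobSasPermissions

    ACCOUNT_NAME = "mystorageaccount"      # assumption: your storage account
    ACCOUNT_KEY = "<storage-account-key>"  # assumption: load from Key Vault in practice
    CONTAINER = "ingest"
    BLOB_NAME = "large-file.bin"

    sas_token = generate_blob_sas(
        account_name=ACCOUNT_NAME,
        container_name=CONTAINER,
        blob_name=BLOB_NAME,
        account_key=ACCOUNT_KEY,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=2),
    )
    sas_url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{BLOB_NAME}?{sas_token}"

    # Return sas_url from the Function so an ADF Web Activity can pass it
    # to the REST API as a parameter.
    print(sas_url)
    ```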

    Or, if the receiving API cannot be modified to accept a SAS URL, consider setting up a custom Azure VM or Azure Container Instance to handle the file transfer. This intermediate service would download the file from Blob Storage and then upload it to the target REST API without the memory limitations of Azure Functions or ADF:

    1. Deploy a VM or Container Instance with enough resources to handle the file processing requirements.
    2. Develop a script or application that runs on the VM/Container and performs the download from Blob Storage and the upload to the REST API (a streaming sketch follows this list). This script can be triggered by ADF using a Web Activity or a Custom Activity.
    3. Ensure that the VM/Container has the necessary network access to both Blob Storage and the REST API, and implement security measures as required.
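
    A minimal Python sketch of such an intermediate script is below. It streams the blob straight into a single HTTP request body, so the file is still delivered whole (not as separate requests) while memory use stays bounded. The SAS URL and API endpoint are placeholders, and it assumes the target API accepts a chunked transfer-encoded request body.

    ```python
    # Hypothetical sketch: stream a multi-GB blob from Blob Storage into a single
    # POST request without buffering the whole file in memory.
    import requests
    from azure.storage.blob import BlobClient

    SAS_URL = "https://mystorageaccount.blob.core.windows.net/ingest/large-file.bin?<sas>"  # assumption
    API_ENDPOINT = "https://example.com/api/upload"  # assumption: your target REST API

    blob = BlobClient.from_blob_url(SAS_URL)
    downloader = blob.download_blob(max_concurrency=1)

    # Passing an iterator as the request body makes requests use chunked
    # transfer encoding; the file still arrives as one request body.
    response = requests.post(
        API_ENDPOINT,
        data=downloader.chunks(),  # iterator over downloaded byte chunks
        headers={"Content-Type": "application/octet-stream"},
        timeout=3600,
    )
    response.raise_for_status()
    print("Upload finished with status", response.status_code)
    ```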

    Since you mentioned that chunking is not an option, it's worth checking whether there's any flexibility in that requirement. If the REST API can be adjusted to accept chunks of data that it then reassembles, you could use ADF to split the file in Blob Storage and send the parts sequentially.
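
    If that flexibility exists, a rough Python sketch of the part-by-part pattern could look like this; the endpoint, query parameters, and part size are assumptions, and the API would need a corresponding reassembly step on its side.

    ```python
    # Hypothetical sketch: read byte ranges from the blob and POST them
    # sequentially to a part-upload endpoint that reassembles the file.
    import requests
    from azure.storage.blob import BlobClient

    SAS_URL = "https://mystorageaccount.blob.core.windows.net/ingest/large-file.bin?<sas>"  # assumption
    API_ENDPOINT = "https://example.com/api/upload-part"  # assumption: part-upload endpoint
    PART_SIZE = 256 * 1024 * 1024  # 256 MiB per part; each part is held in memory briefly

    blob = BlobClient.from_blob_url(SAS_URL)
    total_size = blob.get_blob_properties().size

    for part_number, offset in enumerate(range(0, total_size, PART_SIZE)):
        length = min(PART_SIZE, total_size - offset)
        part = blob.download_blob(offset=offset, length=length).readall()
        resp = requests.post(
            API_ENDPOINT,
            params={"part": part_number, "totalSize": total_size},  # assumed API contract
            data=part,
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,
        )
        resp.raise_for_status()
    ```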

    If you can switch the storage to Azure Data Lake Store, I would say it offers higher limits and might provide a more straightforward integration point for large files with other Azure services, potentially simplifying your ingestion pipeline.

