How to post multi-GB data file to REST API using Azure Data Factory?

JegorsM 0 Reputation points
2024-03-18T12:42:07.5466667+00:00

Hello,

I am currently working on a data ingestion pipeline that needs to process large files (up to 5 GB) and post them to a REST API. The files are stored in Azure Blob Storage.

The data cannot be sent in chunks, so it needs to be sent whole. However, I'm running into memory limits in multiple services: I initially tried Azure Functions but hit its 1.5 GB memory limit, and in Data Factory both the Lookup activity (loading from a dataset) and the Web activity response (for a GET request to Blob Storage) are capped at 4 MB.

Is there any other way to load the data into Data Factory without running into these issues?

Thank you in advance for any responses.

Azure
A cloud computing platform and infrastructure for building, deploying and managing applications and services through a worldwide network of Microsoft-managed datacenters.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-03-19T10:25:57.6133333+00:00

    To be honest, your use case is challenging, especially since the data needs to be sent as a whole rather than in chunks.

    Since the files are already in Azure Blob Storage, one workaround could be to generate a Shared Access Signature (SAS) token for the file and then send the URL with the SAS token to the REST API. This method requires the receiving API to be capable of downloading the file directly from Blob Storage, and it bypasses the need to load the file into ADF's memory space entirely.

    1. Use ADF to automate the generation of a SAS token for the Blob Storage file. You can use an Azure Function to generate it if you want to keep this within ADF's workflow (see the sketch after this list).
    2. Use the Web Activity in ADF to call the REST API, providing the SAS URL as a parameter. The REST API is then responsible for downloading the file directly from Blob Storage.
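
    Below is a minimal Python sketch of the SAS-generation step, e.g. inside an Azure Function that ADF invokes. The account, container, and blob names are placeholders, and in practice the account key should come from Key Vault (or you would use a user delegation key with a managed identity).

    ```python
    # Hypothetical sketch: generate a short-lived, read-only SAS URL for a blob
    # using the azure-storage-blob SDK (v12). All names below are placeholders.
    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import generate_blob_sas, BlobSasPermissions

    ACCOUNT_NAME = "mystorageaccount"      # assumption: your storage account
    ACCOUNT_KEY = "<storage-account-key>"  # assumption: load from Key Vault in practice
    CONTAINER = "ingest"
    BLOB_NAME = "large-file.bin"

    sas_token = generate_blob_sas(
        account_name=ACCOUNT_NAME,
        container_name=CONTAINER,
        blob_name=BLOB_NAME,
        account_key=ACCOUNT_KEY,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=2),
    )
    sas_url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{BLOB_NAME}?{sas_token}"

    # Return sas_url from the Function so an ADF Web Activity can pass it
    # to the REST API as a parameter.
    print(sas_url)
    ```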

    Or, if the receiving API cannot be modified to accept a SAS URL, consider setting up a custom Azure VM or Azure Container Instance to handle the file transfer. This intermediate service would download the file from Blob Storage and then upload it to the target REST API without the memory limitations of Azure Functions or ADF:

    1. Deploy a VM or Container Instance with enough resources to handle the file processing requirements.
    2. Develop a script or application that runs on the VM/Container and performs the download from Blob Storage and the upload to the REST API (a streaming sketch follows this list). This script can be triggered by ADF using a Web Activity or a Custom Activity.
    3. Ensure that the VM/Container has the necessary network access to both Blob Storage and the REST API, and implement security measures as required.
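
    A minimal Python sketch of such an intermediate script is below. It streams the blob straight into a single HTTP request body, so the file is still delivered whole (not as separate requests) while memory use stays bounded. The SAS URL and API endpoint are placeholders, and it assumes the target API accepts a chunked transfer-encoded request body.

    ```python
    # Hypothetical sketch: stream a multi-GB blob from Blob Storage into a single
    # POST request without buffering the whole file in memory.
    import requests
    from azure.storage.blob import BlobClient

    SAS_URL = "https://mystorageaccount.blob.core.windows.net/ingest/large-file.bin?<sas>"  # assumption
    API_ENDPOINT = "https://example.com/api/upload"  # assumption: your target REST API

    blob = BlobClient.from_blob_url(SAS_URL)
    downloader = blob.download_blob(max_concurrency=1)

    # Passing an iterator as the request body makes requests use chunked
    # transfer encoding; the file still arrives as one request body.
    response = requests.post(
        API_ENDPOINT,
        data=downloader.chunks(),  # iterator over downloaded byte chunks
        headers={"Content-Type": "application/octet-stream"},
        timeout=3600,
    )
    response.raise_for_status()
    print("Upload finished with status", response.status_code)
    ```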

    Since you mentioned that chunking is not an option, it's worth checking whether there's any flexibility in that requirement. If the REST API can be adjusted to accept chunks of data that it then reassembles, you could use ADF to split the file in Blob Storage and send the parts sequentially.
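
    If that flexibility exists, a rough Python sketch of the part-by-part pattern could look like this; the endpoint, query parameters, and part size are assumptions, and the API would need a corresponding reassembly step on its side.

    ```python
    # Hypothetical sketch: read byte ranges from the blob and POST them
    # sequentially to a part-upload endpoint that reassembles the file.
    import requests
    from azure.storage.blob import BlobClient

    SAS_URL = "https://mystorageaccount.blob.core.windows.net/ingest/large-file.bin?<sas>"  # assumption
    API_ENDPOINT = "https://example.com/api/upload-part"  # assumption: part-upload endpoint
    PART_SIZE = 256 * 1024 * 1024  # 256 MiB per part; each part is held in memory briefly

    blob = BlobClient.from_blob_url(SAS_URL)
    total_size = blob.get_blob_properties().size

    for part_number, offset in enumerate(range(0, total_size, PART_SIZE)):
        length = min(PART_SIZE, total_size - offset)
        part = blob.download_blob(offset=offset, length=length).readall()
        resp = requests.post(
            API_ENDPOINT,
            params={"part": part_number, "totalSize": total_size},  # assumed API contract
            data=part,
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,
        )
        resp.raise_for_status()
    ```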

    If you can switch the storage to Azure Data Lake Store, I would say it offers higher limits and might provide a more straightforward integration point for large files with other Azure services, potentially simplifying your ingestion pipeline.

