I need to get data from HTTP APIs into an Azure Storage Gen2 account. What should I use?

Aqshata Akshay Ajnadkar 0 Reputation points
2024-08-06T11:52:22.38+00:00

Basically I have an HTTP API and GraphQL queries through which I can get the data. The limitations are that a single request can only cover one month of data and return at most 2,000 records. For example, if there are 10,000 records for July 2024, I need to send 5 requests to get the entire month via the GraphQL queries. I'm running this loop dynamically so that it can handle any number of records.
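
For reference, a minimal sketch of the paginated loop described above, assuming an offset-based GraphQL query; the endpoint URL, query text, and field names are placeholders rather than the real API:

```python
import requests

GRAPHQL_ENDPOINT = "https://example.com/graphql"  # placeholder URL
PAGE_SIZE = 2000                                   # API limit per request

# Hypothetical query; the real schema will differ.
QUERY = """
query ($from: String!, $to: String!, $limit: Int!, $offset: Int!) {
  records(from: $from, to: $to, limit: $limit, offset: $offset) {
    id
    payload
  }
}
"""

def fetch_month(date_from: str, date_to: str) -> list[dict]:
    """Fetch all records for one month, 2,000 at a time."""
    all_records = []
    offset = 0
    while True:
        variables = {"from": date_from, "to": date_to,
                     "limit": PAGE_SIZE, "offset": offset}
        resp = requests.post(GRAPHQL_ENDPOINT,
                             json={"query": QUERY, "variables": variables},
                             timeout=60)
        resp.raise_for_status()
        batch = resp.json()["data"]["records"]
        all_records.extend(batch)
        if len(batch) < PAGE_SIZE:   # last page for this month
            break
        offset += PAGE_SIZE
    return all_records

# Example: July 2024 (10,000 records would take 5 requests).
records = fetch_month("2024-07-01", "2024-07-31")
```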

Currently I have a pipeline in ADF which gets the monthly data for me. It takes about 3 hours to move one month of data from the source to the Azure storage account.

Now the issue is that I need to implement this for historical data as well, which means getting data for at least 30 months. With the current setup it would take 3 hrs × 30 = 90 hrs, which is practically not acceptable. Can you please suggest a way out of this?
Parallel processing might not work, as the server will start throwing errors due to too many requests.

Can you please suggest an efficient way to implement this faster with any Azure services?

Tags: Azure Functions, Azure Databricks, Azure Data Factory

2 answers

  1. Vinodh247 16,831 Reputation points
    2024-08-07T13:13:25.52+00:00

    Hi Aqshata Akshay Ajnadkar,

    Thanks for reaching out to Microsoft Q&A.

    To efficiently retrieve historical data from your HTTP APIs and GraphQL queries into ADLS Gen2, you can consider using Azure Logic Apps. Logic Apps can handle complex workflows and orchestration, provide built-in connectors for HTTP and Azure storage (GraphQL queries can be sent through the HTTP action), support parallel processing and batching, and are a scalable, cost-effective option.

    Tentative steps:

    1. Create a Logic App.
    2. Add a trigger to initiate the workflow.
      • Use a recurrence trigger to run the workflow on a schedule.
      • Set the recurrence interval to a suitable value (daily, weekly) based on your requirements.
    3. Add an HTTP action to send the GraphQL query and fetch data from the API.
      • Configure the action with the appropriate API endpoint and query parameters.
      • Use the batch size parameter to specify the number of records to fetch per request (for example, 2,000).
      • Use the batch count parameter to specify the number of batches to process in parallel.
    4. Add a For each loop to iterate over the fetched data batches.
    5. Inside the loop, add an ADLS Gen2 action to upload the data to your storage account (a code sketch of this append pattern follows these steps).
      • Configure the action with the appropriate file path and name.
      • Use the Append to file option to append data to an existing file.
    6. Save and run the Logic App.
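
    As a rough illustration of the append pattern in steps 5 and 6, here is a minimal sketch in Python using the Azure Data Lake SDK (the real workflow would use the built-in Logic App action instead); the account URL, container name, and file path are assumptions:

```python
import json

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: replace with your storage account, container, and target path.
ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "raw"
FILE_PATH = "history/2024-07.jsonl"

# `batches` stands in for the 2,000-record pages fetched from the API.
batches = [[{"id": 1, "payload": "..."}]]

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
file_client = service.get_file_system_client(CONTAINER).get_file_client(FILE_PATH)
file_client.create_file()  # create (or overwrite) the target file

offset = 0
for batch in batches:
    # Write each batch as JSON Lines and append it after the previous one.
    data = "".join(json.dumps(record) + "\n" for record in batch).encode("utf-8")
    file_client.append_data(data, offset=offset, length=len(data))
    offset += len(data)

file_client.flush_data(offset)  # commit everything appended so far
```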

    Please try this and let me know if it works.

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.


  2. Pinaki Ghatak 3,830 Reputation points Microsoft Employee
    2024-08-09T09:04:12.46+00:00

    Hello @Aqshata Akshay Ajnadkar

    To efficiently retrieve and store large amounts of data from HTTP APIs in an Azure Storage Gen2 account, you can consider using Azure Functions and Azure Cosmos DB.

    Azure Functions can be used to retrieve data from the HTTP APIs using GraphQL queries and store it in Azure Cosmos DB.

    You can use the Cosmos DB Change Feed to trigger Azure Functions to process the data as it is added to the database. This way, you can process the data in near real-time and avoid the need to retrieve large amounts of data in a single request.

    To handle the limitations of the HTTP APIs, you can use Azure Functions to dynamically generate and execute multiple GraphQL queries to retrieve the data in smaller chunks.

    You can also use Azure Functions to parallelize the retrieval and processing of data to improve performance. Once the data is stored in Azure Cosmos DB, you can use Azure Data Factory to move the data to the Azure Storage Gen2 account.
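
    Since the question notes that the source API starts throwing errors under too many concurrent requests, one practical middle ground is bounded parallelism with retry/backoff, whether that runs inside an Azure Function or a plain script. A rough sketch, assuming a fetch_month helper like the one sketched under the question and an illustrative month list:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

# Illustrative month ranges; extend to all 30 months of history.
MONTHS = [("2022-01-01", "2022-01-31"), ("2022-02-01", "2022-02-28")]
MAX_WORKERS = 4  # keep this low enough that the source API does not throttle

def fetch_month_with_retry(date_from, date_to, attempts=5):
    """Wrap the monthly fetch (hypothetical fetch_month helper) with backoff on HTTP 429."""
    for attempt in range(attempts):
        try:
            return fetch_month(date_from, date_to)
        except requests.HTTPError as err:
            if err.response is not None and err.response.status_code == 429:
                time.sleep(2 ** attempt)  # exponential backoff, then retry
                continue
            raise
    raise RuntimeError(f"Gave up on {date_from}..{date_to} after {attempts} attempts")

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    futures = {pool.submit(fetch_month_with_retry, start, end): (start, end)
               for start, end in MONTHS}
    for future in as_completed(futures):
        month_range = futures[future]
        records = future.result()
        # Hand `records` off to the Cosmos DB / storage write step here.
        print(month_range, len(records), "records")
```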

    You can use the Cosmos DB connector in Azure Data Factory to efficiently retrieve the data from Cosmos DB and store it in the Azure Storage Gen2 account.

    This approach can significantly reduce the time required to retrieve and store large amounts of data from HTTP APIs in an Azure Storage Gen2 account.


    I hope that this response has addressed your query and helped you overcome your challenges. If so, please mark this response as Answered. This will not only acknowledge our efforts, but also assist other community members who may be looking for similar solutions.

