Make 20 million REST API calls using Azure Cloud infrastructure

LucID 5 Reputation points

There is a requirement to fetch data for around 20 million entities from a REST API that has no rate limits and is extremely fast.

  • Each entity has to be queried with two API calls: first get the metadata, then the actual data.
  • The overall size of an entity varies from 1 KB to 100 MB.
  • Entities must be written to a Blob Storage account in CSV format; batch size does not matter (100 items per file, or 1,000).
  • Azure Services must be used for this.
  • The budget for this one-time job is up to 10k USD to be spent on services.

The team tried the following approaches. Note: all used a Premium Storage account and a Linux Premium v3 (P3V3) App Service plan.

  1. Simple way
  • A Function App issues 20 million IDs to Event Hub.
  • Another function listens to the Event Hub, makes the two API calls, and puts the result into Blob Storage (using the Blob client directly rather than an output binding, since items may be up to 100 MB).

Summary: takes ages even when scaled to 20 instances, and occasionally misses items because of rare network exceptions.
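The missed-items problem is typically handled with a small retry wrapper around the two calls. A minimal stdlib sketch, assuming transient failures surface as connection/timeout errors; `fetch_metadata` and `fetch_data` are hypothetical stand-ins for the real API calls:

```python
import time

def with_retry(call, attempts=5, base_delay=0.5):
    """Retry a callable on transient network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # exhausted all attempts; let the caller dead-letter the ID
            time.sleep(base_delay * (2 ** attempt))

def process_entity(entity_id, fetch_metadata, fetch_data):
    """Hypothetical per-entity processing: metadata call, then data call."""
    meta = with_retry(lambda: fetch_metadata(entity_id))
    return with_retry(lambda: fetch_data(entity_id, meta))
```

IDs that still fail after all attempts can be re-queued to the Event Hub instead of being silently dropped.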

  2. Durable Functions
  • An HTTP-triggered function accepts the "signal" to kick off with a chunk size; this starts the orchestration function.
  • The orchestration calculates the number of batches from the input batch size (tried 100k/10k/1k - same results) and issues N sub-orchestrations.
  • Each sub-orchestration triggers an activity that makes the API calls and puts the data into Blob Storage.
  • Everything is backed by MSSQL storage (a 16-vCore Gen5 SQL Server instance).

Summary: runs a few times faster than option 1, but SQL Server is a bottleneck and often times out. Also tried with a Storage Account backend - a few times slower.
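The fan-out step in option 2 (deriving the number of sub-orchestrations from a batch size) comes down to simple range partitioning. A sketch with illustrative names, not the actual orchestrator code:

```python
import math

TOTAL_ENTITIES = 20_000_000

def plan_batches(total, batch_size):
    """Split [0, total) into (start, count) ranges of at most batch_size each.
    Each range would become one sub-orchestration."""
    n = math.ceil(total / batch_size)
    return [(i * batch_size, min(batch_size, total - i * batch_size))
            for i in range(n)]

# e.g. 10k-entity batches -> 2,000 sub-orchestrations
batches = plan_batches(TOTAL_ENTITIES, 10_000)
```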

  3. Synapse pipelines. (Later, these downloaded files would be uploaded to a Dedicated SQL Pool in Synapse for various analysis and reporting.)
  • A pipeline with REST calls and ForEach loops
  • Without execution limits

Summary: takes ages and costs a lot.

These are all options we tried so far.

Ideally, the job would finish in a few hours, or 1-2 days at most. Currently it takes several days with option 2 (Durable Functions).

Dear community, please advise: are there any other services/tools that could achieve this interesting goal?

To clarify again: we must call the REST API to get the data; it is not possible to access the vendor's databases. The database was mentioned ONLY as the metadata storage for Azure Durable Functions.


3 answers

  1. LucID 5 Reputation points

    I think we found a way: create a simple HTTP-triggered function app in .NET that accepts a job definition - paging arguments (start index and number of items to fetch). Internally, it splits the work into bulks of 100 calls each and executes them in parallel. It also writes the results to a Premium Blob Storage account.

    Then we designed a Synapse pipeline with an Until loop that accepts the number of batches to call; inside, it calls another pipeline.

    The latter pipeline executes the function app, supplying the batch details. In the Until activity, "Wait for completion" was disabled, and in the ForEach, "Sequential" was disabled.

    The function app must be on the Consumption plan (Y1).
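    The bulk-of-100 parallel-call pattern inside the function app (written in .NET in the answer) looks roughly like this. A Python sketch of the same idea, where `fetch` is a hypothetical stand-in for the two API calls per entity:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(start, count, fetch, bulk_size=100, workers=100):
    """Fetch entities [start, start+count) in parallel, bulk_size at a time.
    Results come back in entity order because pool.map preserves ordering."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for bulk_start in range(start, start + count, bulk_size):
            ids = range(bulk_start, min(bulk_start + bulk_size, start + count))
            results.extend(pool.map(fetch, ids))
    return results
```

    Each function invocation handles one such range, so the Synapse Until loop only has to hand out (start, count) pairs.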

    It did the job in around 3 hours, which is amazing! Thanks, Azure!

    1 person found this answer helpful.

  2. Amira Bedhiafi 4,631 Reputation points

    Azure Batch allows large-scale parallel and high-performance computing tasks.

    Create a batch job that contains 20 million tasks. Each task is responsible for a single REST API call to fetch metadata, then the actual data, and subsequently put it into Blob Storage.

    Azure Batch supports retry policies, so you can handle transient failures and network blips without losing any data.

    It also supports auto-scaling. Depending on the velocity of the API you're interacting with, you can scale up the number of nodes to maximize throughput. Start with a smaller number and monitor the success rate, then gradually scale up.

    I recommend starting with smaller compute node sizes and scaling out, rather than using large nodes from the start.

    You've mentioned storing these entities in Blob Storage in CSV format. The Azure Batch task can use the Azure SDK to write directly to Azure Blob Storage.

    Rather than writing each entity individually, batch them into reasonably sized chunks. For example, you could batch every 1,000 entities and write them as a single CSV to Blob Storage. This reduces the number of write operations and improves efficiency.
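    Chunking entities into one CSV per N rows can be sketched with the stdlib `csv` module; the column names here are illustrative, not the real schema:

```python
import csv
import io

def to_csv_chunks(rows, chunk_size=1000):
    """Yield (chunk_index, csv_text) for every chunk_size rows.
    Each csv_text would be uploaded as one blob, e.g. batch-0001.csv."""
    for index in range(0, len(rows), chunk_size):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["id", "data"])  # illustrative header row
        writer.writerows(rows[index:index + chunk_size])
        yield index // chunk_size, buf.getvalue()
```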

    Set up monitoring and logging; they are essential for tracking which entities have been successfully retrieved and stored, and which ones might have failed.

    Azure Application Insights can be integrated with Azure Batch to give a detailed view of your application's operations and help diagnose errors without affecting the user's experience.

    Make sure you're leveraging parallel processing capabilities as much as possible. For example, if you're using Python, you can use libraries like asyncio to make asynchronous API calls.
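    The asyncio suggestion, sketched with a semaphore to cap in-flight requests; `fetch` is a stand-in for the real async HTTP call (e.g. via aiohttp):

```python
import asyncio

async def fetch_all(ids, fetch, max_concurrency=200):
    """Run one fetch per id with at most max_concurrency requests in flight.
    gather preserves input order in its results."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(i):
        async with sem:
            return await fetch(i)

    return await asyncio.gather(*(bounded(i) for i in ids))
```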

    Network optimization:

    Ensure that the Azure region you're working in is as close as possible to the REST API's host to reduce latency.

    I advise you to be careful: Azure Batch can potentially spin up a large number of nodes, which can quickly consume your budget.

    If you are looking for other alternatives:

    Instead of Functions or Durable Functions, you might consider using Azure Logic Apps. Logic Apps have built-in connectors for Azure services like Blob Storage and provide a visual way to design workflows. They support batching, parallelism, and retries, which are essential for this task.


  3. MikeUrnun 6,766 Reputation points

    Hello @LucID - Thanks for reaching out. There are indeed many compute options across the FaaS/PaaS/IaaS tiers for the kind of job you're about to undertake, and I'll leave the exact estimates and calculations to you and the rest of the community answers. If you have any questions or need clarification on how to estimate billing, with specific parameters and resiliency expectations, etc., we'd be happy to weigh in and help.

    Pricing aside, though, I'd personally give Logic Apps serious consideration for its quicker time-to-solution, flexible hosting model (Standard vs. Consumption SKUs, priced differently), and the strong factor that it is Azure's purpose-built integration tool for exactly the kind of one-time job described in your post.
