Azure Batch lets you run large-scale parallel and high-performance computing workloads.
Create a Batch job that contains the 20 million tasks. Each task handles a single entity: it calls the REST API to fetch the metadata, then the actual data, and finally writes the result to Blob Storage.
Azure Batch supports retry policies, so you can handle transient failures and network blips without losing any data.
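As a minimal sketch of that setup with the azure-batch Python SDK (assuming a recent SDK version): the account name, key, URL, pool ID, and the `fetch_entity.py` command line below are all placeholders you would replace with your own, and tasks are submitted in chunks because `add_collection` accepts at most 100 tasks per call.

```python
# Sketch: create a Batch job and add tasks with a per-task retry limit.
# Account name, key, URL, pool ID, and the worker command line are placeholders.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.<region>.batch.azure.com"
)

# Create the job against an existing pool.
batch_client.job.add(batchmodels.JobAddParameter(
    id="entity-fetch-job",
    pool_info=batchmodels.PoolInformation(pool_id="entity-fetch-pool"),
))

# Retry transient failures up to 3 times per task before marking it failed.
constraints = batchmodels.TaskConstraints(max_task_retry_count=3)

# One task per entity (a small stand-in range here), submitted in chunks
# because add_collection accepts at most 100 tasks per call.
tasks = [
    batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line=f"python fetch_entity.py --entity-id {i}",
        constraints=constraints,
    )
    for i in range(1000)
]
for start in range(0, len(tasks), 100):
    batch_client.task.add_collection("entity-fetch-job", tasks[start:start + 100])
```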
It also supports auto-scaling. Depending on the rate the API you're interacting with can sustain, you can scale out the number of nodes to maximize throughput. Start with a small number, monitor the success rate, then gradually scale up.
I recommend starting with smaller compute node sizes and scaling out, rather than using large nodes from the start.
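For a rough idea of what that scaling knob looks like, you can attach an auto-scale formula to the pool. In the sketch below the pool ID, the 20-node cap, and the 5-minute interval are placeholder choices, and a production formula would also guard against missing `$PendingTasks` samples.

```python
# Sketch: enable pool auto-scaling based on the pending-task backlog.
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.<region>.batch.azure.com"
)

# Grow the pool with the backlog, capped at 20 dedicated nodes, and only
# remove nodes once their running tasks have completed.
autoscale_formula = """
$pending = max($PendingTasks.GetSample(TimeInterval_Minute * 5));
$TargetDedicatedNodes = min(20, $pending);
$NodeDeallocationOption = taskcompletion;
"""

batch_client.pool.enable_auto_scale(
    pool_id="entity-fetch-pool",
    auto_scale_formula=autoscale_formula,
    auto_scale_evaluation_interval=datetime.timedelta(minutes=5),
)
```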
You've mentioned storing these entities in Blob Storage in CSV format. Each Azure Batch task can use the Azure Storage SDK to write directly to Blob Storage.
Rather than writing each entity individually, batch them into reasonably sized chunks. For example, you could group every 1,000 entities and write them as a single CSV file to Blob Storage. This reduces the number of write operations and improves efficiency.
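Here is a hedged sketch of that chunking pattern with the azure-storage-blob SDK; the connection string and container name are placeholders, and the entities are assumed to be flat dicts sharing the same keys.

```python
# Sketch: buffer 1,000 entities per CSV and upload each chunk as one blob.
import csv
import io
from azure.storage.blob import BlobServiceClient

CHUNK_SIZE = 1000

blob_service_client = BlobServiceClient.from_connection_string("<storage-connection-string>")
container_client = blob_service_client.get_container_client("entities")

def upload_chunk(entities, chunk_index):
    """Serialize one chunk of entities to CSV in memory and upload it as a single blob."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(entities[0].keys()))
    writer.writeheader()
    writer.writerows(entities)
    container_client.upload_blob(
        name=f"entities/chunk-{chunk_index:06d}.csv",
        data=buffer.getvalue(),
        overwrite=True,
    )

def upload_in_chunks(all_entities):
    """Split the full entity list into CHUNK_SIZE-row chunks, one blob per chunk."""
    for i in range(0, len(all_entities), CHUNK_SIZE):
        upload_chunk(all_entities[i:i + CHUNK_SIZE], i // CHUNK_SIZE)
```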
Set up monitoring and logging; they are essential for tracking which entities have been successfully retrieved and stored, and which ones have failed.
Azure Application Insights can be integrated with Azure Batch to give you a detailed view of your application's operations and help diagnose errors without affecting the user experience.
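One possible way to wire this up from your Python tasks is the opencensus-ext-azure log exporter; the instrumentation key and the field names below are placeholders, and each task logs a structured success/failure record you can query later in Application Insights.

```python
# Sketch: send per-entity success/failure logs to Application Insights.
import logging
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("entity-fetch")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=<your-instrumentation-key>"
))

def record_result(entity_id, succeeded, error=None):
    """Log one entity's outcome so failed entities can be queried later in App Insights."""
    props = {"custom_dimensions": {
        "entity_id": entity_id,
        "succeeded": succeeded,
        "error": "" if error is None else str(error),
    }}
    if succeeded:
        logger.info("entity stored", extra=props)
    else:
        logger.error("entity failed", extra=props)
```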
Make sure you're leveraging parallel processing capabilities as much as possible. For example, if you're using Python, you can use libraries like asyncio to make asynchronous API calls.
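A minimal sketch of that pattern is below, assuming the aiohttp library and a made-up API layout (the URLs and the `data_url` field are hypothetical); a semaphore caps concurrency so the upstream API isn't overwhelmed.

```python
# Sketch: fetch metadata, then data, for many entities concurrently with asyncio.
import asyncio
import aiohttp

MAX_CONCURRENCY = 50

async def fetch_entity(session, semaphore, entity_id):
    """Fetch one entity's metadata, then its actual data, as two sequential calls."""
    async with semaphore:
        async with session.get(f"https://api.example.com/entities/{entity_id}/metadata") as resp:
            resp.raise_for_status()
            metadata = await resp.json()
        async with session.get(metadata["data_url"]) as resp:
            resp.raise_for_status()
            return await resp.json()

async def fetch_all(entity_ids):
    """Run the per-entity fetches concurrently under a shared session."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_entity(session, semaphore, eid) for eid in entity_ids)
        )

# Example usage: results = asyncio.run(fetch_all(range(1000)))
```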
- Network Optimization: Ensure that the Azure region you're running in is as close as possible to the REST API's host to reduce latency.
I advise you to be careful: Azure Batch can spin up a large number of nodes, which can quickly consume your budget.
If you are looking for alternatives:
Instead of Functions or Durable Functions, you might consider using Azure Logic Apps. Logic Apps have built-in connectors for Azure services like Blob Storage and provide a visual way to design workflows. They support batching, parallelism, and retries, which are essential for this task.