Hello Arun!
Thank you for posting on Microsoft Learn.
While D32 (32 vCPU, 128 GB RAM) offers strong CPU performance, it may fall short for your use case involving real-time image processing across three AI models.
YOLOv8-Nano runs reasonably well on CPU, but YOLOv8-Medium, PaddleOCR, and especially BLIP are computationally intensive. CPU inference times can cause significant latency, especially as concurrency increases.
Each model has different compute profiles:
- YOLOv8-Medium struggles on CPU under real-time demands and works far better on a GPU
- PaddleOCR can work on CPU if optimized, but benefits hugely from GPU acceleration
- BLIP-vqa is transformer-based and performs very poorly on CPU, often taking 2–3 seconds per image; on GPU, inference can drop to 200–500 ms (a quick way to measure this yourself is sketched below)
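If you want to verify those latency numbers on your own hardware before deciding, here is a minimal timing sketch using Hugging Face Transformers. The `Salesforce/blip-vqa-base` checkpoint, the `sample.jpg` path, and the run count are illustrative assumptions; exact numbers will depend on your CPU and GPU:

```python
import time

import torch
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("sample.jpg").convert("RGB")   # placeholder test image
question = "How many people are in the picture?"

def ms_per_image(device: str, runs: int = 5) -> float:
    """Average generate() latency in milliseconds on the given device."""
    m = model.to(device)
    inputs = processor(image, question, return_tensors="pt").to(device)
    m.generate(**inputs)                          # warm-up (kernel init, caches)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        m.generate(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

print(f"CPU: {ms_per_image('cpu'):.0f} ms/image")
if torch.cuda.is_available():
    print(f"GPU: {ms_per_image('cuda'):.0f} ms/image")
```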
Under these conditions, CPU bottlenecks quickly become a problem, while GPU inference lets you meet both latency and concurrency requirements reliably.
Unfortunately, Azure Container Apps does not support GPU SKUs natively as of now. It’s designed for general-purpose workloads and lightweight containerized apps, not for inference-heavy tasks involving models like BLIP or YOLOv8-M.
Consider migrating to a platform that natively supports GPUs:
- Azure Kubernetes Service (AKS) with GPU node pools (NVIDIA T4, A10G, or A100)
- Azure ML Managed Online Endpoints, which let you deploy models on GPU-backed compute (a minimal deployment sketch follows this list)
- Optionally, Azure VMs (such as the NC- or ND-series) if you want manual control
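As a rough illustration of the Azure ML option, here is a sketch using the `azure-ai-ml` (SDK v2) Python package. All resource names, paths, the container image, and the `Standard_NC4as_T4_v3` (T4) SKU are placeholder assumptions you would replace with your own:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# 1) Create the endpoint.
endpoint = ManagedOnlineEndpoint(name="vision-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# 2) Deploy the model on a GPU instance (T4 in this example).
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="vision-endpoint",
    model=Model(path="./model"),                  # local model folder
    environment=Environment(image="<your-gpu-inference-image>"),
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    instance_type="Standard_NC4as_T4_v3",         # NVIDIA T4 GPU SKU
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```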
Microsoft and open-source benchmarks show that:
- YOLOv8-M can run at 10–20 FPS on a T4 GPU vs 2–3 FPS on CPU
- BLIP-vqa sees 4–10× latency improvements on GPU
- PaddleOCR processes images in under 100 ms on GPU, compared to up to 1 s on CPU
A good rule of thumb: if CPU inference latency exceeds 500 ms per model, or concurrency exceeds 1–2 QPS per model, consider a GPU.
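If it helps to make that concrete, here is a trivial sketch of the same heuristic. The threshold defaults are the values quoted above; tune them to your SLA:

```python
def should_move_to_gpu(cpu_latency_ms: float, qps: float,
                       latency_budget_ms: float = 500.0,
                       qps_threshold: float = 2.0) -> bool:
    """Return True if CPU latency or sustained load suggests a GPU SKU."""
    return cpu_latency_ms > latency_budget_ms or qps > qps_threshold

# Example: BLIP at ~2500 ms/image on CPU trips the rule immediately.
print(should_move_to_gpu(cpu_latency_ms=2500, qps=1.0))  # True
```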
If your goal is real-time processing at scale, sticking with D32 is not sustainable, especially given BLIP-vqa's demands.
For high-performance, low-latency pipelines, GPU-backed AKS or Azure ML endpoints are the most suitable options. You can also scale GPU workloads dynamically to manage costs during off-peak hours.
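For example, a scheduled job (an Azure Function, a cron task, etc.) could shrink a GPU deployment outside business hours. This sketch reuses the hypothetical endpoint and deployment names from the example above:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Fetch the current GPU deployment and reduce its instance count off-peak.
deployment = ml_client.online_deployments.get(
    name="blue", endpoint_name="vision-endpoint"
)
deployment.instance_count = 1  # e.g. drop from 3 GPU instances to 1
ml_client.online_deployments.begin_create_or_update(deployment).result()
```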
Some links to help you understand better: