
Is Azure Container Apps with D32 (32vCPU, 128GB RAM) suitable for running YOLOv8, PaddleOCR, and BLIP models, or should we consider GPU-based alternatives?

Arun Siripuram 911 Reputation points
2025-06-11T08:59:30.5833333+00:00

We are evaluating the right compute environment for deploying a multi-model AI workload inside Azure Container Apps. The models include:

  • YOLOv8-Medium and YOLOv8-Nano (for object and axle detection)
  • PaddleOCR v3.0 (with PaddlePaddle 2.5.0)
  • BLIP-vqa-base (fine-tuned for FASTag dataset)

Our current container app is running in a D32 (32 vCPU, 128 GB RAM) CPU-based Dedicated plan. The use case involves:

  • Accepting high volumes of image requests per second
  • Processing those images through the above models in real-time
  • Calling Azure Document Intelligence for document parsing

❓ We would like to understand:

  1. Is a D32 CPU environment sufficient to support this pipeline with acceptable latency and concurrency?
  2. Are there any benchmarks or guidance from Microsoft that indicate when to move from CPU to GPU-based workloads (e.g., via AKS with GPU nodes or Azure ML)?
  3. Is Azure Container Apps (CPU-only) ideal for inference-heavy models like YOLOv8 and BLIP, or should we consider alternatives with better GPU integration?

We’d appreciate any insights or data points that can help us decide whether to:

  • Stick with a high-end CPU SKU like D32 for performance and cost, or
  • Transition to a GPU-backed setup for real-time processing reliability
Azure Container Apps

An Azure service that provides a general-purpose, serverless container platform.


1 answer

  1. Amira Bedhiafi 41,386 Reputation points MVP Volunteer Moderator
    2025-06-11T09:50:27.5733333+00:00

    Hello Arun!

    Thank you for posting on Microsoft Learn.

    While D32 (32 vCPU, 128 GB RAM) offers strong CPU performance, it may fall short for your use case involving real-time image processing across three AI models.

    YOLOv8-Nano runs reasonably well on CPU, but YOLOv8-Medium, PaddleOCR, and especially BLIP are computationally intensive. CPU inference can introduce significant latency, particularly as concurrency increases.

    Each model has different compute profiles:

    • YOLOv8-Medium struggles on CPU under real-time demands and works far better on a GPU
    • PaddleOCR can work on CPU if optimized, but benefits hugely from GPU acceleration
    • BLIP-vqa is transformer-based and performs very poorly on CPU, often taking 2–3 seconds per image. On GPU, inference can drop to 200–500 ms
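
    If you want to put numbers on this for your own images before committing to a migration, a minimal latency probe along the following lines can help. This is a sketch, not production code: it assumes the public yolov8m.pt and Salesforce/blip-vqa-base checkpoints, and sample.jpg is a placeholder for one of your real images (PaddleOCR can be timed the same way with its own API):

    ```python
    # Minimal per-image latency probe (sketch; not production code).
    # Assumes: pip install ultralytics transformers torch pillow
    # "sample.jpg" is a placeholder for one of your real images.
    import time

    import torch
    from PIL import Image
    from transformers import BlipForQuestionAnswering, BlipProcessor
    from ultralytics import YOLO

    device = "cuda" if torch.cuda.is_available() else "cpu"
    image = Image.open("sample.jpg").convert("RGB")

    # YOLOv8-Medium: warm up once so model load/fusing time is not counted.
    yolo = YOLO("yolov8m.pt")
    yolo.predict(image, device=device, verbose=False)  # warm-up
    t0 = time.perf_counter()
    yolo.predict(image, device=device, verbose=False)
    print(f"YOLOv8-M: {(time.perf_counter() - t0) * 1000:.0f} ms on {device}")

    # BLIP-vqa-base: the transformer, and the main CPU bottleneck.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    blip = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
    inputs = processor(image, "how many axles does the vehicle have?", return_tensors="pt").to(device)
    with torch.no_grad():
        blip.generate(**inputs)  # warm-up
        t0 = time.perf_counter()
        blip.generate(**inputs)
    print(f"BLIP-vqa: {(time.perf_counter() - t0) * 1000:.0f} ms on {device}")
    ```

    Running the same script on your D32 and on any GPU VM gives a like-for-like comparison against the 500 ms rule of thumb mentioned below.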

    At high request volumes, CPU bottlenecks become a problem quickly; GPU inference lets you meet both latency and concurrency requirements reliably.

    Unfortunately, Azure Container Apps does not support GPU SKUs natively as of now. It’s designed for general-purpose workloads and lightweight containerized apps, not for inference-heavy tasks involving models like BLIP or YOLOv8-M.

    Think about migrating to platforms that natively support GPUs:

    • Azure Kubernetes Service (AKS) with GPU node pools (NVIDIA T4, A10, or A100)
    • Azure ML managed online endpoints, which let you deploy models on GPU-backed compute (a deployment sketch follows this list)
    • Azure VMs (NC-series or ND-series), if you want manual control
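
    To make the Azure ML route concrete, here is a hedged sketch using the azure-ai-ml (SDK v2) Python package to put a registered model on a T4-backed managed online endpoint. The endpoint name, model and environment references, and scoring assets are placeholders, not values from your environment:

    ```python
    # Sketch: deploy a registered model to a GPU managed online endpoint.
    # Assumes: pip install azure-ai-ml azure-identity
    # All names below are placeholders, not values from an existing workspace.
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import (
        CodeConfiguration,
        ManagedOnlineDeployment,
        ManagedOnlineEndpoint,
    )
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="<resource-group>",
        workspace_name="<workspace>",
    )

    # 1. Create the endpoint (stable scoring URL + auth).
    endpoint = ManagedOnlineEndpoint(name="vision-pipeline-ep", auth_mode="key")
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()

    # 2. Create a GPU-backed deployment behind it.
    deployment = ManagedOnlineDeployment(
        name="blue",
        endpoint_name="vision-pipeline-ep",
        model="azureml:blip-vqa-fastag:1",        # placeholder registered model
        environment="azureml:pytorch-gpu-env:1",  # placeholder GPU environment
        code_configuration=CodeConfiguration(code="./scoring", scoring_script="score.py"),
        instance_type="Standard_NC4as_T4_v3",     # one NVIDIA T4 per instance
        instance_count=1,
    )
    ml_client.online_deployments.begin_create_or_update(deployment).result()
    ```

    The AKS route is analogous: you add a GPU node pool (the gpu-cluster article linked below walks through it) and schedule your existing container image onto it with a nodeSelector and a GPU resource request.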

    Microsoft and open-source benchmarks suggest that:

    • YOLOv8-M can run at 10–20 FPS on a T4 GPU vs 2–3 FPS on CPU
    • BLIP-vqa sees 4–10× latency improvements on GPU
    • PaddleOCR processes images in under 100 ms on GPU, compared to up to 1 s on CPU

    A good rule of thumb: if inference latency on CPU exceeds 500 ms per model, or if sustained concurrency exceeds 1–2 QPS per model, consider GPU.
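
    To make that rule of thumb concrete, here is a back-of-the-envelope capacity sketch; the latencies and per-instance worker counts are illustrative assumptions, not measurements from your workload:

    ```python
    # Back-of-the-envelope capacity math (illustrative assumptions only).
    # Throughput per instance ~= concurrent_workers / per_request_latency.

    def instances_needed(target_qps: float, latency_s: float, workers: int) -> float:
        """Instances required to sustain target_qps, given per-request latency
        and how many requests one instance can process concurrently."""
        return target_qps / (workers / latency_s)

    # BLIP-vqa on CPU: ~2.5 s/image; assume ~4 truly parallel inferences per
    # D32 (each inference already saturates several cores).
    print(instances_needed(target_qps=20, latency_s=2.5, workers=4))  # -> 12.5 D32 instances

    # BLIP-vqa on a T4 GPU: ~0.3 s/image, one inference at a time (no batching).
    print(instances_needed(target_qps=20, latency_s=0.3, workers=1))  # -> 6.0 T4 instances
    ```

    Batching on the GPU would push the second number down further, which is why the economics usually tip toward GPU once per-model concurrency climbs past a couple of QPS.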

    If your goal is real-time processing at scale, sticking with D32 is not sustainable, especially given BLIP-vqa's demands.

    For high-performance, low-latency pipelines, GPU-backed AKS or Azure ML endpoints are the most suitable options. You can also scale GPU workloads dynamically to manage costs during off-peak hours.

    Some links to help you dig deeper:

    https://learn.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture#gpu-inference

    https://learn.microsoft.com/en-us/azure/aks/gpu-cluster

    https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-online-endpoints?view=azureml-api-2

