Optimize cost for AI workloads on Azure

This article shows early-stage startup teams how to identify and reduce costs in AI workloads on Microsoft Azure. It's written for the founder, CTO, or first engineering hire who owns the cloud bill and the evaluation (eval) set at the same time. It covers tagging and budget hygiene, the four request-path levers (caching, batching, routing, and model selection), GPU right-sizing for self-hosted inference, multi-tenant retrieval patterns, and a safe-change loop you can run without a dedicated platform team. Each section is tagged with the stage from the Azure architecture guide for startups where it applies (Explore, Expand, or Extract), so you can avoid optimizing for problems you don't yet have.

In this article you'll learn how to:

Identify the top cost drivers in an AI workload on Azure.
Match cost-optimization levers to your startup stage.
Apply prompt caching, semantic caching, batching, model routing, and right-sizing.
Design multi-tenant retrieval and database patterns that scale linearly with revenue, not with usage.
Wrap cost changes in an eval gate, budget alerts, and per-tenant rate limits.
Recognize the early signals that you've outgrown a do-it-yourself approach to cost.

Prerequisites

An Azure subscription with at least one AI workload running in production, staging, or a working prototype.
Owner or Contributor permissions on the resources you want to measure.
Comfort opening the Azure portal. No prior experience with Cost Management or Azure Monitor is required. This article points you to the relevant pages.
A small eval set for your AI feature, with 10 to 50 representative prompts and expected behaviors. If you don't have one yet, see the Related articles section. You can build the first version in an afternoon.

Why this matters for startups

For an early-stage startup, AI cost is operational risk. Cheaper inference frees engineering hours for the next experiment, and a stable cost-per-active-user lets you plan past the next funding milestone instead of the next invoice. The patterns in this article are deliberately small. Each one is achievable by a founding engineer over a weekend, with no platform or FinOps team required.

Important

You don't need a dedicated FinOps team to start. The first 80 percent of cost wins come from tagging everything from day one, putting one person in charge of a weekly Cost Management review, and applying the levers in this article in stage order. Bring in formal FinOps tooling and processes only after spend exceeds about $50,000 per month or covers more than five distinct workloads.

Why AI cost shows up differently than traditional cloud cost

In a traditional web app, your monthly bill is dominated by VMs, databases, and egress. You can usually predict it within 10 percent by knowing how many users you serve. AI workloads break that intuition. The same user can cost $0.001 one minute and $0.40 the next, depending on context length, retrieval depth, and which model the request was routed to.

Four cost shapes recur across most AI products on Azure:

Token spend scales with context length, not user count. A naive retrieval-augmented generation (RAG) prompt can balloon from 800 to 12,000 tokens after one product change.
GPU idle time is the largest hidden cost in self-hosted inference. An A100 left running overnight costs more than a month of a small Postgres database.
Retrieval fan-out from search and vector databases compounds. Every chat turn might issue three to eight hidden queries you never see in your logs.
Egress and storage creep in slowly through model artifacts, embeddings, audit logs, and per-tenant indexes.

Each cost driver has a known set of levers. The remaining sections describe them in priority order, tagged with the startup stage where the lever applies, so teams don't over-engineer for problems they don't yet have.

Tip

Use the cost optimization guidance from the Azure Well-Architected Framework in your architecture to sustain and improve your return on investment (ROI).

The stage map: which levers belong where

The Azure architecture guide for startups describes three stages of product development: Explore, Expand, and Extract. The cost-optimization levers in this article align to those stages. Use the following table to scope which sections apply to your team today and which to defer.

Stage	Headcount	Primary cost goal	Levers that pay off
Explore	1-10 engineers	Optionality and speed	Tagging, prompt caching, cheap default model
Expand	10-50 engineers	Stop linear cost-with-revenue	Semantic cache, scale-to-zero, routing, Batch API
Extract	50+ engineers	Margin, predictability, FinOps	Reservations, dedicated indexes, quantization, per-tenant pricing

Identify your top cost drivers

Before optimizing anything, get a flat view of where money is actually going. In Azure, the fastest path is Cost Management, grouped by service and tag, for the last 30 days.

Tag everything from day one

Tagging is the highest-leverage practice for cost visibility. Without consistent tags, you can't attribute spend to a tenant, a feature, or an environment. The Startup Scale Landing Zone (SSLZ) reference enforces tagging at the landing-zone policy level. Use the same approach for AI resources.

costCenter = product | platform | research
tenant     = <customer-id> | shared
workload   = inference | embedding | training | eval
env        = prod | staging | dev
team       = <owning-team>

Where to look first

Cost driver	Where to find it	Typical share of bill
Tokens (LLM API)	Azure OpenAI metrics > Processed Prompt/Completion Tokens	30-60%
GPUs	VM/AKS node hours by SKU (ND, NC, and NV families)	20-50%
Vector/search	AI Search query units, Cosmos DB RU/s	5-20%
Storage	Blob Storage, Azure Files, and Azure Container Registry for model artifacts	3-10%
Egress	Bandwidth out of region, especially cross-cloud calls	2-15%

Export Cost Management to a storage account daily and connect it to your existing analytics stack. A weekly chart of cost-per-active-user is a reliable signal that an optimization had the intended effect.

Lever 1: Caching, batching, routing, and model selection

Stage: Explore through Extract. Start with caching in Explore, add routing and Batch in Expand, and add fine-grained model selection per tenant in Extract.

Tip

Cache embeddings keyed by the source content hash, and use a smaller, cheaper model, such as GPT-4o mini or an open-weights 7B to 13B model, for first-pass classification or extraction. Escalate to a frontier model only on the requests where the small model is uncertain. This pattern alone often cuts inference cost by 60 to 80 percent without measurable quality loss on routine queries.

Caching

Prompt caching: Azure OpenAI automatically discounts repeated prefixes for prompts of at least 1,024 tokens, supported on GPT-4o and newer models. The first 1,024 tokens must be identical to hit the cache, so keep system prompts and tool definitions stable.
Semantic cache: Store embedding and response pairs in Azure Cache for Redis or Cosmos DB. Return the cached response when a new query has cosine similarity above approximately 0.95.
Output cache: For non-personalized endpoints, such as FAQs and deterministic tools, a simple time-to-live (TTL) cache cuts traffic by 30 to 80 percent.

Batching

Embedding and classification jobs are the obvious candidates. Azure OpenAI Batch API gives a 50 percent discount versus real-time for jobs that can wait up to 24 hours, such as nightly index refreshes, evaluator runs, and async summarization.

Routing

Most products don't need the most expensive model on every call. A router, either rule-based or learned, can send 60 to 80 percent of traffic to a cheaper model with no measurable quality drop.

Pattern	Cheap path	Expensive path
Intent classification	GPT-4o mini or Phi-4	GPT-4o for ambiguous requests
Tool use or function calling	Mid-tier model	Top-tier model on retry
Long-context summarization	Sliding window with mid-tier model	Full-context top-tier model
Code generation	Mid-tier model for boilerplate	Top-tier model for refactors

Model selection

Reevaluate model choice every quarter. Prices and quality move fast. A model that was your only option six months ago might now be five times more expensive than a newer SKU that scores within one to two points on your evals.

Lever 2: Right-size infrastructure with autoscale

Stage: Expand and Extract. In Explore, use serverless or platform as a service (PaaS), such as App Service, Container Apps consumption, or Azure OpenAI Service, and skip this lever.

If you self-host inference with vLLM, Triton, or Text Generation Inference (TGI) on Azure Kubernetes Service (AKS) or Container Apps, your second biggest lever is making sure GPUs aren't idling.

Scale to zero on idle workloads

Set minReplicas: 0 on Container Apps with a GPU workload profile, or use Horizontal Pod Autoscaling (HPA) or KEDA on AKS to scale node pools to zero when no requests are in flight. Cold starts are typically tens of seconds. Benchmark with your model, and keep one warm replica during business hours if user-facing latency matters.

Right-size GPU SKU to model size

Match GPU class to parameter count. T4 or L4 is sufficient for models below approximately 13B parameters. A100 or H100 only pays off for models above approximately 34B parameters or sustained high queries per second (QPS). Container Apps serverless GPU currently supports T4 and A100. L4 and H100 require AKS.

Burst training and batch jobs to spot

Run nightly evals, embedding refreshes, and offline summarization on spot node pools, which are typically 60 to 80 percent cheaper than on-demand. Keep production inference on dedicated capacity. The following table summarizes the autoscale strategies and their typical savings.

Caution

Spot capacity can be evicted with as little as 30 seconds' notice. Only use spot for work that can be checkpointed or restarted cleanly, such as batch evals, embedding refreshes, offline summarization, and fine-tuning with frequent checkpoints. Never put user-facing inference or jobs without restart logic on spot.

Strategy	How	Typical savings
Scale to zero	`minReplicas: 0` on Container Apps with GPU workload profile. Cold starts are typically tens of seconds. Benchmark with your model.	Up to 90%
KEDA on queue depth	Scale on Service Bus or queue messages, not CPU.	30-60%
Right-size SKU	T4 or L4 for models with fewer than 13B parameters. A100 or H100 only for models with more than 34B parameters or high QPS. Container Apps serverless GPU currently supports only T4 and A100. L4 and H100 require AKS.	40-70%
Spot capacity	Spot node pools for batch and eval. On-demand capacity for production.	40-80%
Quantization	AWQ or GPTQ 4-bit quantization to fit larger models on smaller GPUs.	Fit 30B on 16 GB

Note

Scaling to zero on a chat surface adds visible cold-start latency. A common pattern is to keep one to two warm replicas during business hours and scale to zero overnight.

Lever 3: Multi-tenant patterns without retrieval cost spikes

Stage: Late Expand and Extract. In Explore, you almost certainly have one tenant: yourself. Skip this section until you have at least three real customers.

Multi-tenant AI products fail at scale when retrieval and database patterns were chosen for the single-tenant prototype. Three patterns recur.

One index per tenant vs. shared index with filters

A dedicated AI Search index per tenant gives clean isolation but charges for every index even when idle. A shared index with a tenant filter is much cheaper at small and medium scale. Switch to dedicated only for enterprise tier or when a tenant exceeds a defined size threshold.

Vector database choice

Choose your vector store based on existing infrastructure and scale. The following table summarizes when each option fits.

Warning

Deleting a vector index or its underlying store is irreversible, and re-embedding a large corpus can cost hundreds to thousands of dollars in model calls plus hours of engineering time. Before any destructive change to a vector store, snapshot the source documents and verify your re-embedding pipeline runs end-to-end on a small subset.

Option	Best for	Cost shape
Azure AI Search (vector)	Hybrid search and facets	Per-replica, predictable
Cosmos DB (vector)	Teams already using Cosmos DB for app data	RU/s, scales with QPS
pgvector on Postgres	Small or medium corpora, simple operations	Per-VM, very cheap
Dedicated vector database	100M+ vectors, high recall needs	Per-node, expensive

Avoid hidden N+1 retrievals

Every agent step that calls search is a billable query. Log retrieval call counts per user turn and alert when the median exceeds your budget. A good starting target is two or fewer retrievals per turn. Re-rankers and rewriters are easy places to accidentally double traffic.

Governance: keep cost changes safe

Stage: All stages. The lightweight version, which includes a budget, a one-line eval check before deployment, and a single rate limit, belongs in Explore from day one. The heavier version, with CI-blocking eval gates and per-tenant rate limits in API Management, belongs in Expand and beyond.

An optimization that breaks quality isn't an optimization. It's an outage. Wrap every cost change in three guardrails. Each guardrail can be set up in under an hour by a single engineer.

Eval check: Run your eval set before deploying any prompt, model, or routing change. At the early stage, this check can be a script you run manually. Block the deployment or revert if the score drops more than your tolerance, such as one point on a 100-point scale.
Budget alerts: Set Azure Cost Management budgets per resource group with alerts at 50, 80, and 100 percent. Route them to the same Slack or Teams channel that gets your error notifications, so spend and incidents land in the same place.
Request rate limit: Even a single per-IP or per-API-key cap in API Management, NGINX, or your gateway prevents one runaway client from emptying your credit balance overnight. Add per-tenant caps later when you have paying customers.

Be cautious about bundling several cost optimizations into a single release. When the change set lands together, attribution becomes difficult and any regression is expensive to bisect.

The two-lever experiment: how to compare before and after

When you're deciding where to begin, choose two levers from the previous sections, ship them behind a feature flag, and measure for 7 to 14 days. Two levers are sufficient to detect meaningful movement. More than two makes attribution unreliable.

Suggested first pair by stage

Stage	Lever A	Lever B
Prelaunch (<100 DAU)	Prompt caching	Model routing with a cheap default model
Early traction (100-10k DAU)	Semantic cache	Scale-to-zero on inference
Scale (10k+ DAU)	Batch API for async work	Per-tenant index strategy
Enterprise tier	Dedicated indexes for top accounts	Quantized models on L4 or H100

Baseline window:   2026-04-15 to 2026-04-28 (14 days)
Treatment window:  2026-05-01 to 2026-05-14 (14 days)
Levers shipped:    1) semantic cache on /chat
                   2) scale-to-zero on vLLM

Metrics:
  cost_per_active_user   (target: down 30%)
  p95_latency_ms         (guardrail: +<= 150 ms)
  eval_score_delta       (guardrail: >= -1.0)

Decision rule: Keep both if all guardrails hold. Otherwise, revert and ship one at a time.

What this article covers and what it doesn't

This article is intentionally scoped. The following sections list topics that are in scope, topics that are out of scope, and the signals that indicate when to add them.

In scope

Tagging, budgets, and Cost Management practices appropriate for any startup.
The four request-path levers: caching, batching, routing, and model selection.
GPU right-sizing and scale-to-zero for self-hosted inference.
Multi-tenant retrieval patterns for products with three to 100 paying tenants.
A safe-change governance loop: eval gate, budget alerts, and per-tenant rate limits.

Out of scope

Topic	When to add it
Reservations and savings plans for AI compute	The inference bill is steady for 90 days, usually mid-Expand.
Dedicated FinOps tooling, such as Apptio Cloudability, Vantage, and similar tools	Cloud spend exceeds approximately $50,000 per month, or you operate multi-cloud. Most early-stage startups don't need this.
Custom token-based billing per end customer	You sell usage-based pricing, or one tenant exceeds 25 percent of the bill.
Training-cost optimization, such as DeepSpeed and FSDP tuning	You train models in-house. Inference-first products don't need this.
Cross-region or multi-cloud cost arbitrage	You're at Extract stage with proven single-region economics.

When this approach is no longer enough

The practices in this article are designed for small teams running their own cloud. At some point, your business outgrows them. The following signals aren't failures. They're growth. When two or more apply, plan to bring in dedicated tooling or a part-time platform owner.

Monthly Azure spend exceeds approximately $50,000, and AI is more than 30 percent of it.
More than 10 engineers can ship changes that move cost by 5 percent or more.
At least one customer has usage above $10,000 per month and is paying you a flat fee.
Your investors or finance partner have started asking for a monthly cost forecast.
The product runs in more than one Azure region or cloud.

Until then, the lightweight loop in this article, which includes tags, budgets, an eval gate, and a monthly review, is the right tool. Resist the temptation to adopt enterprise FinOps tooling early. It adds process overhead before it adds value.

Reference checklist

Use the following items as a monthly review checklist. Each item maps to a section in this article.

All AI resources are tagged with costCenter, tenant, workload, and env.
A Cost Management dashboard exists, is grouped by tag, and is reviewed weekly.
System prompts are stable enough for prompt-cache hits.
Async work, such as embeddings, evals, and summaries, runs on Batch API.
The router sends at least 60 percent of traffic to a cheaper model with no eval regression.
GPU workloads scale to zero outside business hours or use spot for batch.
The median per-turn retrieval count is two or fewer.
The multi-tenant strategy is chosen explicitly: shared with filter or dedicated.
Budgets and per-tenant rate limits are enforced.
Every prompt, model, or routing change runs the eval gate before merge.

Feedback

Was this page helpful?

Last updated on 2026-05-20