A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference
Hello @Justin Encabo
Thank you for Reaching out to Microsoft Q&A,
Based on the current Azure AI Foundry serverless deployment behavior, models such as:
• DeepSeek V4 Pro
• MiniMax 2.7 via Fireworks
are currently billed on a standard pay-per-use/per-token basis, and there is no broadly documented or guaranteed built-in prompt caching discount mechanism exposed through Azure AI Foundry serverless endpoints today.
In practical terms, you should generally assume:
• Every request is processed independently
• Repeated prompts/shared prompt prefixes are effectively treated as cache misses from a billing perspective
• There is currently no customer-visible cache hit/miss telemetry or cached-token discount model exposed through the Foundry serverless abstraction layer for these provider integrations
While some underlying model providers may internally use KV-cache optimizations for runtime efficiency, Azure AI Foundry serverless deployments do not currently expose deterministic prompt-cache billing discounts similar to some native provider APIs.
Because these models are served through third-party provider integrations (for example via Fireworks), caching semantics can vary depending on:
• The underlying provider implementation
• Whether internal KV-cache reuse exists
• Whether cache reuse is request-scoped or session-scoped
• Whether Azure Foundry passes through any provider-native caching capabilities
• Whether billing systems expose cached-token pricing separately
For your scenario using Foundry endpoints for AI-driven development and agentic workflows such as Pi Coding Agent this is especially important because repeated long-context prompts can significantly increase token consumption and cost.
At this time, Azure AI Foundry serverless endpoints should generally be treated as:
“Full token billing per invocation unless explicit caching support is documented for that model/provider.”
If you would like to reduce or optimize costs, here are a few approaches you can consider:
- Implement your own application-side cache layer
• If your agent frequently submits identical or near-identical prompts, you can cache responses in your own database or application layer
• This is often the most effective workaround today for agentic workflows with repeated context reuse - Consider provisioned/reserved capacity options
• If your workload volume becomes more predictable, provisioned or reserved-capacity deployments may provide better cost efficiency compared to pure serverless pay-per-token usage - Explore managed/real-time endpoints
• Managed compute deployments give you more control over runtime behavior and allow you to implement your own warm-worker or caching strategies within the application/service layer - Minimize repeated static context
• For agentic harnesses, reducing repeated system prompts or shared context can significantly reduce token spend when no caching discounts exist - Compare with native provider APIs
• Some providers may expose prompt caching or cached-token billing more explicitly through their direct APIs than through the Azure Foundry abstraction layer
At the moment, there is no publicly documented indication that DeepSeek V4 Pro in Foundry or MiniMax 2.7 via Fireworks
Currently support customer-visible prompt caching discounts through Azure AI Foundry serverless deployments.
Please refer this
Deploy Models via Serverless API: https://learn.microsoft.com/azure/ai-foundry/how-to/deploy-models-serverless
Microsoft Foundry Models overview (serverless deployments): https://learn.microsoft.com/azure/foundry/concepts/foundry-models-overview#serverless-deployments
Deploy Models via Managed Compute (real-time endpoint): https://learn.microsoft.com/azure/ai-foundry/how-to/deploy-models-managed?tabs=azure-studio#deploy-open-models
Thank you!