A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference
Hello Justin Encabo,
Greetings! Thanks for raising this question in Q&A forum.
Great question, and the good news is that the answer is much better than you might expect! Yes, DeepSeek V4 Pro and MiniMax models via Fireworks on Microsoft Foundry do have prompt caching pricing — with a dedicated cached token rate that is significantly lower than the regular input token rate. Let me break this down clearly for you.
DeepSeek V4 Pro — Prompt Caching on Foundry
DeepSeek V4 Pro on Microsoft Foundry via Fireworks has the following serverless pricing: $1.75 per 1M input tokens, $0.15 per 1M cached tokens, and $3.48 per 1M output tokens.
So cached tokens are billed at roughly 91% less than regular input tokens — a very significant discount that makes agentic use cases with repeated context much more cost-efficient.
MiniMax / Kimi K2.6 — Prompt Caching on Foundry
Kimi K2.6 (MoonshotAI) has serverless pricing of $0.95 per 1M input tokens, $0.16 per 1M cached tokens, and $4.00 per 1M output tokens. Both models are available per-token serverless and via PTU through the Foundry model catalog with a single Azure endpoint and the same enterprise controls.
How does the caching work on Fireworks models?
For Fireworks models, cached input tokens are by default priced at 50% for all text and vision language models unless otherwise specified. However, as you can see from the Foundry-specific pricing above, DeepSeek V4 Pro and Kimi K2.6 on Foundry have even deeper cached token discounts (around 91%) compared to the standard 50% discount.
Practical tips for your Pi Coding Agent agentic harness
To maximize cache hits and minimize costs in your agentic workflow, here are the key things to keep in mind:
Step 1: Structure your prompts for cache hits Keep the beginning of your prompts — especially the system prompt, static context, tools definitions, and code context — consistent and identical across calls. The caching mechanism rewards prompts where the prefix is unchanged between requests.
Step 2: Use serverless pay-per-token for prototyping Serverless pay-per-token inference is ideal for experimenting securely and quickly with Data Zone Standard — this is exactly the right option for prototyping with your balance without committing to PTUs.
Step 3: Switch to PTUs when your usage patterns stabilize Once you've validated your agentic harness and have predictable throughput, provisioned throughput units (PTUs) offer predictable, steady-state performance for base or custom models which is better for production agentic workflows.
Step 4: Monitor your cached vs non-cached token usage In the Foundry portal, go to your project > Monitoring to track token usage. Look at the ratio of cached to total input tokens — a good agentic setup with stable system prompts should achieve high cache hit rates over time, keeping your costs very close to the cached token rate.
Step 5: Check for the latest pricing before committing Pricing for these models can change. Always verify the current rates at the Microsoft Foundry model catalog page for DeepSeek V4 Pro and MiniMax before planning your budget. The pricing shown above is current as of May 2026.
If this answer helps you kindly accept the answer which will help others who have similar questions.
Best Regards,
Jerald Felix.