Share via

Caching in Microsoft Foundry serverless deployments

Justin Encabo 0 Reputation points
2026-05-28T06:22:01.8666667+00:00

Hello. I would like to ask if Microsoft Foundry models like DeepSeek V4 Pro and MiniMax 2.7via Fireworks have prompt caching discounts. I want to personally use my balance and use Foundry as an endpoint for prototyping with AI-driven development with an agentic harness (Pi Coding Agent). Are these models available for prompt caching discounts? Or is everything going to be a cache miss?

Foundry Models
Foundry Models

A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference

0 comments No comments

2 answers

Sort by: Most helpful
  1. Jerald Felix 12,050 Reputation points Volunteer Moderator
    2026-05-28T07:43:17.6333333+00:00

    Hello Justin Encabo,

    Greetings! Thanks for raising this question in Q&A forum.

    Great question, and the good news is that the answer is much better than you might expect! Yes, DeepSeek V4 Pro and MiniMax models via Fireworks on Microsoft Foundry do have prompt caching pricing — with a dedicated cached token rate that is significantly lower than the regular input token rate. Let me break this down clearly for you.

    DeepSeek V4 Pro — Prompt Caching on Foundry

    DeepSeek V4 Pro on Microsoft Foundry via Fireworks has the following serverless pricing: $1.75 per 1M input tokens, $0.15 per 1M cached tokens, and $3.48 per 1M output tokens.

    So cached tokens are billed at roughly 91% less than regular input tokens — a very significant discount that makes agentic use cases with repeated context much more cost-efficient.

    MiniMax / Kimi K2.6 — Prompt Caching on Foundry

    Kimi K2.6 (MoonshotAI) has serverless pricing of $0.95 per 1M input tokens, $0.16 per 1M cached tokens, and $4.00 per 1M output tokens. Both models are available per-token serverless and via PTU through the Foundry model catalog with a single Azure endpoint and the same enterprise controls.

    How does the caching work on Fireworks models?

    For Fireworks models, cached input tokens are by default priced at 50% for all text and vision language models unless otherwise specified. However, as you can see from the Foundry-specific pricing above, DeepSeek V4 Pro and Kimi K2.6 on Foundry have even deeper cached token discounts (around 91%) compared to the standard 50% discount.

    Practical tips for your Pi Coding Agent agentic harness

    To maximize cache hits and minimize costs in your agentic workflow, here are the key things to keep in mind:

    Step 1: Structure your prompts for cache hits Keep the beginning of your prompts — especially the system prompt, static context, tools definitions, and code context — consistent and identical across calls. The caching mechanism rewards prompts where the prefix is unchanged between requests.

    Step 2: Use serverless pay-per-token for prototyping Serverless pay-per-token inference is ideal for experimenting securely and quickly with Data Zone Standard — this is exactly the right option for prototyping with your balance without committing to PTUs.

    Step 3: Switch to PTUs when your usage patterns stabilize Once you've validated your agentic harness and have predictable throughput, provisioned throughput units (PTUs) offer predictable, steady-state performance for base or custom models which is better for production agentic workflows.

    Step 4: Monitor your cached vs non-cached token usage In the Foundry portal, go to your project > Monitoring to track token usage. Look at the ratio of cached to total input tokens — a good agentic setup with stable system prompts should achieve high cache hit rates over time, keeping your costs very close to the cached token rate.

    Step 5: Check for the latest pricing before committing Pricing for these models can change. Always verify the current rates at the Microsoft Foundry model catalog page for DeepSeek V4 Pro and MiniMax before planning your budget. The pricing shown above is current as of May 2026.

    If this answer helps you kindly accept the answer which will help others who have similar questions.

    Best Regards,

    Jerald Felix.

    Was this answer helpful?

    0 comments No comments

  2. AI answer

    2026-05-28T06:22:16.6366667+00:00

    Prompt caching support and discounts are determined by whether the underlying model and deployment type support Azure OpenAI-style prompt caching, not just by being a Microsoft Foundry model.

    From the available information:

    • Prompt caching is described and priced specifically for Azure OpenAI models (for example GPT‑4o, GPT‑4.1, GPT‑5.x series) under Foundry/OpenAI endpoints. Cached tokens are billed at a discount for Standard deployments and up to 100% discount for Provisioned deployments, with behavior controlled via parameters like prompt_cache_retention and (optionally) prompt_cache_key.
    • The prompt caching documentation lists only GPT‑4.x and GPT‑5.x families as supporting extended prompt cache retention, and states that all Azure OpenAI models GPT‑4o or newer support in‑memory prompt cache retention.
    • Fireworks-based models (such as DeepSeek and MiniMax variants) are listed separately in the Foundry catalog as partner models, with per‑token or PTU offers, but there is no mention in the provided material that they participate in Azure OpenAI prompt caching or receive prompt caching discounts.

    Based on the context provided, prompt caching discounts are documented only for Azure OpenAI models (GPT‑4o and newer, GPT‑4.1, GPT‑5.x families) and not for Fireworks partner models like DeepSeek or MiniMax. For those partner models, behavior should be assumed to be regular per‑token billing with no documented prompt caching discount, and any reuse of computation via caching is not described.

    To get prompt caching discounts for an agentic development harness, use one of the Azure OpenAI models that explicitly support prompt caching (for example GPT‑4o+ or GPT‑4.1 / GPT‑5.x families) on Standard or Provisioned deployments and structure prompts so that the first 1,024 tokens are identical across calls.


    References:

    AI-generated content may be incorrect. Read our transparency notes for more information.

    Was this answer helpful?

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.