Share via

How do I use prefix caching with DeepSeek V4 Pro?

LOSTMSU 31 Reputation points
2026-05-22T14:07:05.3633333+00:00

I tried using both the /completions and /responses APIs against a deployed DeepSeek V4 Pro resource, and despite using a long common prefix I got:

  • from /completions
"usage": {
  "prompt_tokens": 7673,
  "completion_tokens": 2,
  "total_tokens": 7675
}
  • from /responses
"usage": {
  "input_tokens": 7685,
  "input_tokens_details": {
    "cached_tokens": 0
  },
  "output_tokens": 2,
  "total_tokens": 7687
}

Without prefix caching agentic workloads and multiturn chats are unfeasible on Azure.

Foundry Models
Foundry Models

A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference

0 comments No comments

Answer accepted by question author

SRILAKSHMI C 18,745 Reputation points Microsoft External Staff Moderator
2026-05-22T16:05:57.0266667+00:00

Hello @LOSTMSU

Thank you for reaching out.

Based on the behavior you observed, what you are seeing is currently expected for DeepSeek V4 Pro deployments in Azure AI Foundry.

At this time, prefix caching (also referred to as prompt caching) is not currently supported for DeepSeek V4 Pro or most non-Azure OpenAI Foundry models. Because of this, the service will return:

"cached_tokens": 0

even when a large portion of the prompt remains identical across requests.

This applies to both:

  • /completions
  • /responses

APIs.

Currently, Azure prompt caching support is primarily available for supported Azure OpenAI GPT-series models. DeepSeek V4 Pro does not yet expose server-side token reuse/prefix caching capabilities through Azure AI Foundry.

Because of this:

  • There is currently no API parameter, deployment setting, or portal configuration that enables prefix caching for DeepSeek V4 Pro.
  • Reusing the same long prompt prefix will still result in full prompt token processing on each request.
  • cached_tokens will continue to report 0.

For Azure OpenAI GPT models that support prompt caching, the behavior is different. Those models can:

  • reuse previously processed prompt prefixes,
  • report cached token counts,
  • support prompt_cache_key,
  • and reduce repeated prompt-processing cost/latency for agentic or multi-turn workloads.

Regarding your scenario specifically “Without prefix caching agentic workloads and multiturn chats are unfeasible on Azure.”

We understand the concern. For long-context agentic workflows using DeepSeek models, current recommended approaches are typically application-side optimizations such as:

  • maintaining conversation memory externally,
  • sending only incremental/delta context,
  • summarizing older turns,
  • retrieval-based context injection (RAG),
  • or client-side caching/orchestration.

If server-side prompt caching is a hard requirement, you may currently need to evaluate supported Azure OpenAI GPT-series models instead, where prompt caching support is available.

At this time, there is no public ETA for prompt caching support on DeepSeek V4 Pro within Azure AI Foundry.

Please refer to the following documentation for additional details:

Prompt caching overview (Azure OpenAI) https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching

I Hope this helps. Do let me know if you have any further queries.


If this answers your query, please do click Accept Answer and Yes for was this answer helpful.

Thank you!

Was this answer helpful?

1 person found this answer helpful.

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.