How do I use prefix caching with DeepSeek V4 Pro?

Question

How do I use prefix caching with DeepSeek V4 Pro?

LOSTMSU 31

I tried using both the /completions and /responses APIs against a deployed DeepSeek V4 Pro resource, and despite using a long common prefix I got:

from /completions

"usage": {
  "prompt_tokens": 7673,
  "completion_tokens": 2,
  "total_tokens": 7675
}

from /responses

"usage": {
  "input_tokens": 7685,
  "input_tokens_details": {
    "cached_tokens": 0
  },
  "output_tokens": 2,
  "total_tokens": 7687
}

Without prefix caching agentic workloads and multiturn chats are unfeasible on Azure.

0 comments

Answer accepted by question author

0 additional answers

Your answer

Answer 1

Hello @LOSTMSU

Thank you for reaching out.

Based on the behavior you observed, what you are seeing is currently expected for DeepSeek V4 Pro deployments in Azure AI Foundry.

At this time, prefix caching (also referred to as prompt caching) is not currently supported for DeepSeek V4 Pro or most non-Azure OpenAI Foundry models. Because of this, the service will return:

"cached_tokens": 0

even when a large portion of the prompt remains identical across requests.

This applies to both:

/completions
/responses

APIs.

Currently, Azure prompt caching support is primarily available for supported Azure OpenAI GPT-series models. DeepSeek V4 Pro does not yet expose server-side token reuse/prefix caching capabilities through Azure AI Foundry.

Because of this:

There is currently no API parameter, deployment setting, or portal configuration that enables prefix caching for DeepSeek V4 Pro.
Reusing the same long prompt prefix will still result in full prompt token processing on each request.
cached_tokens will continue to report 0.

For Azure OpenAI GPT models that support prompt caching, the behavior is different. Those models can:

reuse previously processed prompt prefixes,
report cached token counts,
support prompt_cache_key,
and reduce repeated prompt-processing cost/latency for agentic or multi-turn workloads.

Regarding your scenario specifically “Without prefix caching agentic workloads and multiturn chats are unfeasible on Azure.”

We understand the concern. For long-context agentic workflows using DeepSeek models, current recommended approaches are typically application-side optimizations such as:

maintaining conversation memory externally,
sending only incremental/delta context,
summarizing older turns,
retrieval-based context injection (RAG),
or client-side caching/orchestration.

If server-side prompt caching is a hard requirement, you may currently need to evaluate supported Azure OpenAI GPT-series models instead, where prompt caching support is available.

At this time, there is no public ETA for prompt caching support on DeepSeek V4 Pro within Azure AI Foundry.

Please refer to the following documentation for additional details:

Prompt caching overview (Azure OpenAI) https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching

I Hope this helps. Do let me know if you have any further queries.

If this answers your query, please do click Accept Answer and Yes for was this answer helpful.

Thank you!

LOSTMSU 31 Reputation points

2026-05-25T11:57:45.9933333+00:00

TL;DR; is that Azure Foundry does not support it and does not tell the caller about it.

It should be returning an error when caching is requested but can not be provided.

The "current recommended approaches are typically application-side optimizations" suggestion does not make any sense. As I stated above, agentic workloads are not feasible without prefix caching due to costs growing quadratically. None of the suggested options are replacement for prefix caching. If your workload is agentic or multiturn and you need to use open models for any reason, you have to use a provider that supports prefix caching.

Share via

How do I use prefix caching with DeepSeek V4 Pro?

0 additional answers

Your answer