Hello Cortaxiom,
Welcome to Microsoft Q&A,
When you pass previous_response_id, Azure OpenAI reconstitutes the full conversation context server-side and injects all prior turns into the model's input window. Those tokens, including your 19k system prompt, are counted as input tokens on every follow-up request. They are not excluded from billing.
However, they are very likely to qualify for automatic prompt caching, which is where your cost savings come from.
Because the prefix of the reconstructed prompt is identical across calls (same system prompt, same prior turns), Azure OpenAI's prompt caching kicks in automatically with no configuration required. Tokens that match a cached prefix are billed at the cached input token rate, which is a discount over the standard input rate for Standard deployments, and can be up to a 100% discount on Provisioned deployments.
The exact discount varies by model and is listed on the Azure OpenAI pricing page.
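If you want to confirm that caching is actually kicking in, you can inspect the usage data returned with each response. The sketch below assumes the Responses API usage object exposes `input_tokens` and `input_tokens_details.cached_tokens`; the helper name `cached_token_report` and the sample numbers are made up for illustration, and a `SimpleNamespace` stands in for a real `response.usage` so the snippet runs without a network call.

```python
from types import SimpleNamespace


def cached_token_report(usage) -> dict:
    """Summarize how many input tokens were served from the prompt cache.

    `usage` is expected to look like the `usage` object on a Responses API
    result (assumption: `input_tokens` and `input_tokens_details.cached_tokens`).
    """
    input_tokens = usage.input_tokens
    details = getattr(usage, "input_tokens_details", None)
    # Treat a missing or None field as zero cached tokens.
    cached = getattr(details, "cached_tokens", 0) or 0
    fraction = cached / input_tokens if input_tokens else 0.0
    return {
        "input_tokens": input_tokens,
        "cached_tokens": cached,
        "cached_fraction": fraction,
    }


# Stand-in for `response.usage` on a follow-up call (illustrative numbers only).
example_usage = SimpleNamespace(
    input_tokens=20_000,
    input_tokens_details=SimpleNamespace(cached_tokens=18_944),
    output_tokens=300,
)

report = cached_token_report(example_usage)
print(report)
```

In a real call you would pass `response.usage` from the follow-up request that used previous_response_id. A `cached_tokens` value of 0 on repeated follow-ups would suggest the prefix is changing between calls or the cache has expired.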
A few conditions apply to prompt caching:
- The repeated prefix must be at least 1,024 tokens long. At 19k tokens, your system prompt clears this threshold with room to spare.
- Cache hits occur in increments of 128 tokens after the initial 1,024.
- A single character change in the first 1,024 tokens causes a cache miss.
- Caches are typically cleared within 5-10 minutes of inactivity and are always removed within one hour of last use.
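To get a feel for the numbers, the rules above can be turned into a rough estimate. This is only a sketch: it assumes a cache hit covers the first 1,024 matching tokens plus whole 128-token increments beyond that, and the per-million-token rates in the example are placeholders, not real prices. Take the actual rates from the Azure OpenAI pricing page.

```python
MIN_PREFIX_TOKENS = 1024  # minimum prefix length that can be cached
INCREMENT_TOKENS = 128    # cache hits grow in steps of this size


def cacheable_tokens(matching_prefix_tokens: int) -> int:
    """Estimate how many tokens of an identical prefix can be served from cache."""
    if matching_prefix_tokens < MIN_PREFIX_TOKENS:
        return 0
    extra = matching_prefix_tokens - MIN_PREFIX_TOKENS
    # Only whole 128-token increments beyond the first 1,024 count.
    return MIN_PREFIX_TOKENS + (extra // INCREMENT_TOKENS) * INCREMENT_TOKENS


def estimate_input_cost(
    input_tokens: int,
    cached_tokens: int,
    rate_per_million: float,
    cached_rate_per_million: float,
) -> float:
    """Estimate the input cost of one request, given the two token rates."""
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * rate_per_million
        + cached_tokens * cached_rate_per_million
    )
    return total / 1_000_000


# A 19,000-token identical prefix rounds down to a 128-token boundary.
print(cacheable_tokens(19_000))  # 18944

# Placeholder rates (2.00 and 0.50 per million tokens), NOT real prices.
print(estimate_input_cost(20_000, cacheable_tokens(19_000), 2.00, 0.50))
```

The estimate is only an upper bound on the savings: whether a given request actually hits the cache also depends on the cache still being warm, per the inactivity window noted above.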
By default, response data stored via store: true is retained for 30 days. There is no documented parameter to configure a shorter or longer retention window. If you need to remove a stored response before the 30-day window expires, you can delete it explicitly.
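As a minimal sketch of explicit deletion, assuming you are using the `openai` Python SDK and its `client.responses.delete(response_id)` method: the helper name `delete_stored_responses` is hypothetical, and the client setup is left as a comment because the endpoint, key, and API version depend on your deployment.

```python
from typing import Iterable, List


def delete_stored_responses(client, response_ids: Iterable[str]) -> List[str]:
    """Delete stored responses by ID and return the IDs that were deleted.

    `client` is assumed to be an OpenAI/AzureOpenAI client exposing
    `client.responses.delete(response_id)`.
    """
    deleted = []
    for response_id in response_ids:
        client.responses.delete(response_id)
        deleted.append(response_id)
    return deleted


# Example usage (not executed here; the values are placeholders):
# from openai import AzureOpenAI
# client = AzureOpenAI(...)  # endpoint, key, and API version are deployment-specific
# delete_stored_responses(client, ["resp_abc123"])
```

Keep in mind that a follow-up request passing previous_response_id depends on the earlier response still being stored, so delete a chain only once the conversation is finished.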
To extend prompt cache retention beyond the default in-memory window, Azure OpenAI also supports extended prompt caching via the prompt_cache_retention parameter on the Responses API.
https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching
Please Upvote and accept the answer if it helps!!