Azure OpenAI API Caching Issue with Model `gpt-4o-mini-2024-07-18`
I am running into an issue with the Azure OpenAI service. I am using OpenAI model version gpt-4o-mini-2024-07-18 and Azure API version 2024-10-21, and according to Azure's documentation both the model and the API version should be eligible for prompt caching.
My setup is a static system prompt of about 2,000 tokens plus a dynamic user prompt, which should enable caching.
However, after making over 50 API calls (not all concurrent), the OpenAI API reported about 70% cached tokens, while the Azure OpenAI API showed a mere 0.1% cached tokens. Is this a recognized issue, and is anyone else seeing similar results?
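For reference, a minimal sketch of how I read these counts (the resource details and prompt contents are placeholders):

    from openai import AzureOpenAI

    # Placeholder resource details; substitute your own.
    client = AzureOpenAI(
        api_key="<api-key>",
        api_version="2024-10-21",
        azure_endpoint="https://<resource>.openai.azure.com",
    )

    static_system_prompt = "<~2,000 tokens of fixed instructions>"  # identical on every call
    user_prompt = "<dynamic per-request content>"

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the Azure deployment name
        messages=[
            {"role": "system", "content": static_system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )

    # Cache hits surface in usage.prompt_tokens_details.cached_tokens
    # (this field may be None on older SDK versions).
    usage = response.usage
    print("total:", usage.total_tokens, "| cached:", usage.prompt_tokens_details.cached_tokens)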
Azure OpenAI Service
-
Saideep Anchuri • 9,425 Reputation points • Microsoft External Staff • Moderator
2025-02-04T10:23:03.4533333+00:00 Welcome to Microsoft Q&A Forum, thank you for posting your query here!
The issue you are experiencing with caching on the gpt-4o-mini-2024-07-18 model may stem from several factors. While both the model and API versions you mentioned are eligible for caching, effectiveness depends on the structure of the prompts. You are using a mix of a static system prompt (about 2,000 tokens) and a dynamic user prompt. For caching to apply, the first 1,024 tokens of the prompt must be identical across requests; even a single character difference causes a cache miss, and the cached tokens value shows 0. Caching is enabled automatically for supported models, with no extra configuration needed.
To get cache hits, make sure the prefix of your prompt is byte-for-byte identical across requests and is at least 1,024 tokens long.
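As a small illustration of that rule (prompt contents are placeholders): keep all static content at the front of the message list and append anything dynamic at the end, so the first 1,024 tokens never vary:

    # Static content first: byte-for-byte identical on every request,
    # and long enough (>= 1,024 tokens) to form a cacheable prefix.
    STATIC_INSTRUCTIONS = "<fixed system instructions, ~2,000 tokens>"

    def build_messages(question: str) -> list[dict]:
        return [
            {"role": "system", "content": STATIC_INSTRUCTIONS},
            # Dynamic content last, so it never disturbs the cached prefix.
            {"role": "user", "content": f"Customer question: {question}"},
        ]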
Kindly refer to the link below: prompt-caching
Thank You.
-
Dilshan Sandhu • 15 Reputation points
2025-02-04T12:11:48.3566667+00:00 Hi @Saideep Anchuri ,
Thank you for the prompt reply. The SystemMessage part of the prompt is exactly the same between requests; it is static text, with no object notation ({}) present. Even if I make 50 requests in a loop (with or without concurrency), I mostly get 0 in cached_tokens. Here are the stats for about 50 requests, where the user prompt (role: user) and the system prompt (role: system) stay the same throughout:
Azure OpenAI: {'total_tokens': 228560, 'cached_tokens': 18816}
OpenAI: {'total_tokens': 228560, 'cached_tokens': 201600}
All I am changing in my code is from

    self.client = AzureOpenAI(
        api_key=openai.api_key,
        api_version=api_version,
        azure_endpoint=azure_endpoint,
    )

to

    self.client = OpenAI(api_key=openai.api_key)
No change other than this. If required, I can provide a minimal reproducible example.
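Concretely, the switch amounts to something like this (the endpoint is shown as a placeholder):

    import openai
    from openai import AzureOpenAI, OpenAI

    USE_AZURE = True  # flip to False to run the identical test against OpenAI directly

    api_version = "2024-10-21"
    azure_endpoint = "https://<resource>.openai.azure.com"  # placeholder endpoint

    if USE_AZURE:
        client = AzureOpenAI(
            api_key=openai.api_key,
            api_version=api_version,
            azure_endpoint=azure_endpoint,
        )
    else:
        client = OpenAI(api_key=openai.api_key)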
OpenAI SDK version: 1.6.1
As recommended, I have read the documentation multiple times. Only two conditions need to be satisfied:
- At least 1024 tokens
- The prompt prefix must remain the same (for at least the first 1,024 tokens)
I believe my test cases meet both criteria, since the system prompt alone is over 2,000 tokens and never changes (a quick token-count check is sketched below).
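One way to double-check the first criterion is to count the tokens locally; a quick sketch with tiktoken (assuming gpt-4o-mini uses the o200k_base encoding, and with a hypothetical file holding the static prompt):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # encoding used by the gpt-4o family

    # Hypothetical file containing the static system prompt.
    with open("system_prompt.txt") as f:
        system_prompt = f.read()

    n_tokens = len(enc.encode(system_prompt))
    # Message framing adds a few extra tokens, but this gives the right magnitude.
    print(f"system prompt tokens: {n_tokens} (caching needs a prefix of at least 1,024)")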
Regards,
-
JAYA SHANKAR G S • 4,035 Reputation points • Microsoft External Staff • Moderator
2025-02-05T08:35:44.3866667+00:00 @Dilshan Sandhu Did you check the quota and limits of your subscription plan in Azure OpenAI? Please check that once; meanwhile, I will look for more information regarding your case.
-
JAYA SHANKAR G S • 4,035 Reputation points • Microsoft External Staff • Moderator
2025-02-06T04:49:13.4066667+00:00 Hi @Dilshan Sandhu ,
Here are the things you need to check.
Cache Inactivity Timeout: Prompt caches are typically cleared within 5-10 minutes of inactivity and are always removed within one hour of the cache's last use. If there's a delay of more than an hour between your requests, the cache might expire, leading to fewer cache hits.
Prompt Consistency: Even though your prompt exceeds 1,024 tokens, the prefix may not end up byte-identical each time the prompt is constructed. Ensure that the first 1,024 tokens are identical across all requests; even a single character difference results in a cache miss (a small prefix-fingerprint check is sketched below).
To help us understand the issue better, please share reproducible code showing how the prompt is created.
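One way to verify the consistency point is to fingerprint exactly what each request sends; a sketch (the serialization here is illustrative, not part of any SDK):

    import hashlib
    import json

    def prefix_fingerprint(messages: list, tools: list | None = None) -> str:
        # Serialize deterministically so any byte-level drift between requests
        # shows up as a different hash.
        payload = json.dumps({"tools": tools, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    # Log this for every request: if the fingerprint changes between calls that
    # should share a cache, the prefix is not as static as assumed.
    messages = [{"role": "system", "content": "<static instructions>"}]
    print(prefix_fingerprint(messages))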
-
Dilshan Sandhu • 15 Reputation points
2025-02-06T06:58:24.5033333+00:00 Attachment: prompt_caching.txt
Hi @Saideep Anchuri ,
I have attached a Python file that is a slight modification of the code provided by OpenAI (https://cookbook.openai.com/examples/prompt_caching101).
At the top of the file, you can switch between the Azure and OpenAI clients and run the script.
I received the following results at the end:
OpenAI:
Total: 12714 | Cached: 10880 | Percentage cached: 0.855750
Azure:
Total: 12533 | Cached: 3200 | Percentage cached: 0.255326
Note: the earlier results I shared came from my application code, so they differ. Also, one modification was made this time: I switched to a different Azure deployment, and with that change the cached percentage increased. I don't know why that happened, since the model version and API version are the same. Regardless, the cached percentage is still far below what the OpenAI numbers show.
OpenAI SDK version: 1.61.0
Azure model version: 2024-07-18
Azure API version: 2024-10-21
Model: gpt-4o-mini
To your points:
- The cache inactivity timeout does not apply, because the script completes in under a minute.
- According to the OpenAI cookbook linked above, tool definitions are cached as well, so the combined system message + tools token count is greater than 1,024 and caching should happen (an illustrative payload follows this list).
Please let me know if something else is required from my side.
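For reference, the shape of the payload under test is roughly the following (the tool shown is a made-up example; client and messages are as in the script). Per the cookbook, tool definitions count toward the cacheable prefix, so they must also stay byte-identical:

    # Hypothetical tool definition, abbreviated for illustration.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_order_status",
                "description": "Look up the status of an order by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
        }
    ]

    # tools + system message together exceed 1,024 tokens in my tests,
    # so the prefix should qualify for caching.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
    )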
-
JAYA SHANKAR G S • 4,035 Reputation points • Microsoft External Staff • Moderator
2025-02-07T04:31:36.17+00:00 Hi @Dilshan Sandhu ,
We are reaching out to the internal team to get more information related to your query and will get back to you as soon as we have an update.
Thank you
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-02-17T04:55:15.32+00:00 @Dilshan Sandhu Thanks for sharing the details.
I used the sample code you shared above, but I am unable to reproduce your results at my end: the cached tokens always appear as zero, even after multiple runs using OpenAI model version gpt-4o-mini-2024-07-18 and Azure API version 2024-10-21.
    Run 2:
    2025-02-17 10:10:07.297 INFO httpx HTTP Request: POST https://XXXX.openai.azure.com/openai/deployments/gpt4Omini/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-17 10:10:07.297 INFO root Total tokens: 1338
    2025-02-17 10:10:07.297 INFO root Cached tokens: 0
    2025-02-17 10:10:07.297 INFO root Total: 9874 | Cached: 0 | Percentage cached: 0.000000
    2025-02-17 10:10:07.311 INFO root Run 1:
    2025-02-17 10:10:08.233 INFO httpx HTTP Request: POST https://XXXXX.openai.azure.com/openai/deployments/gpt4Omini/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-17 10:10:08.236 INFO root Total tokens: 1336
    2025-02-17 10:10:08.236 INFO root Cached tokens: 0
    2025-02-17 10:10:11.238 INFO root Run 2:
    2025-02-17 10:10:12.073 INFO httpx HTTP Request: POST https://XXXXXX.openai.azure.com/openai/deployments/gpt4Omini/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-17 10:10:12.073 INFO root Total tokens: 1393
    2025-02-17 10:10:12.074 INFO root Cached tokens: 0
    2025-02-17 10:10:12.074 INFO root Total: 12603 | Cached: 0 | Percentage cached: 0.000000
Did you try accessing it using the o1 models? Do you encounter the same issue?
I have also sent you a private message asking for more details; please provide those once you get a chance.
Awaiting your reply.
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-02-19T08:47:22.7666667+00:00 @Dilshan Sandhu A quick follow-up to check if you had a chance to look at my private message. Awaiting your reply.
-
Javier Jiménez de la Jara • 0 Reputation points
2025-02-24T08:25:05.3833333+00:00 I have a similar issue, but I don't get any cached tokens at all. I have tried the script that @Dilshan Sandhu shared, with the same result.
    2025-02-24 09:16:52.467 INFO root Run 1:
    2025-02-24 09:16:53.523 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:16:53.528 INFO root Total tokens: 1096
    2025-02-24 09:16:53.528 INFO root Cached tokens: 0
    2025-02-24 09:16:56.529 INFO root Run 2:
    2025-02-24 09:16:56.807 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:16:56.807 INFO root Total tokens: 1199
    2025-02-24 09:16:56.807 INFO root Cached tokens: 0
    2025-02-24 09:16:56.807 INFO root Total: 2295 | Cached: 0 | Percentage cached: 0.000000
    2025-02-24 09:16:56.807 INFO root Run 1:
    2025-02-24 09:16:57.834 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:16:57.834 INFO root Total tokens: 1199
    2025-02-24 09:16:57.834 INFO root Cached tokens: 0
    2025-02-24 09:17:00.835 INFO root Run 2:
    2025-02-24 09:17:01.751 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:17:01.752 INFO root Total tokens: 1222
    2025-02-24 09:17:01.752 INFO root Cached tokens: 0
    2025-02-24 09:17:01.752 INFO root Total: 4716 | Cached: 0 | Percentage cached: 0.000000
Does anyone know how to fix this?
-
Dilshan Sandhu • 15 Reputation points
2025-02-25T09:42:24.89+00:00 Hi @navba-MSFT , @Javier Jiménez de la Jara
I apologize for the delayed response.
I see why the cached tokens you're getting are 0. Azure appears to take some time to build the cache before hits show up in the cached_tokens field (this is not the case with OpenAI, which caches much faster).
I've made a slight modification to the script so that, after the first loop, the code sleeps for 15 seconds:
    MAIN_LOOPS = 3

    # Run the main function
    for j in range(MAIN_LOOPS):
        for i in range(5):
            main(messages, tools, user_query2)
            logging.info(
                "Run: %s | Total: %d | Cached: %d | Percentage cached: %f",
                str(j) + "_" + str(i),
                total_tokens,
                cached_tokens,
                cached_tokens / total_tokens,
            )
        if j == 0:
            time.sleep(15)
This is what I received at each subsequent run with Azure:

    ... Run: 0_4 | Total: 12605 | Cached: 0 | Percentage cached: 0.000000
    ... Run: 1_4 | Total: 27955 | Cached: 1408 | Percentage cached: 0.050367
    ... Run: 2_4 | Total: 46163 | Cached: 6528 | Percentage cached: 0.141412
This supports my original point that OpenAI's caching works much better than Azure's. This is what I received at each subsequent run with OpenAI:

    ... Run: 0_4 | Total: 12672 | Cached: 9728 | Percentage cached: 0.767677
    ... Run: 1_4 | Total: 28042 | Cached: 23552 | Percentage cached: 0.839883
    ... Run: 2_4 | Total: 46263 | Cached: 39936 | Percentage cached: 0.863238
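In practical terms, the workaround this suggests (my inference from the numbers above, not an official recommendation) is a warm-up pass before the real workload:

    import time

    def warm_up(client, messages, tools):
        # Prime the cache with one throwaway request, then give Azure time
        # to register the prefix before the measured/production calls.
        client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
        time.sleep(15)  # empirically, hits started appearing after a pause

    # warm_up(client, messages, tools)  # then issue the real requests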
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-03-03T08:21:16.26+00:00 @Dilshan Sandhu Thanks for your reply. I am able to see cached tokens after introducing the delay. To investigate the cause of the Azure OpenAI caching gap, please share the details below over private message.
Please provide the below details:
- Azure OpenAI resource URI, in the format below:
/subscriptions/XXXXX/resourceGroups/XXXXX/providers/Microsoft.CognitiveServices/accounts/XXXXX
- Region where your resource is deployed.
Awaiting your reply.
-
Dilshan Sandhu • 15 Reputation points
2025-03-04T07:46:17.3366667+00:00 Hi @navba-MSFT ,
I have shared with you the required details over the private message.
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-03-10T05:25:21.6133333+00:00 @Dilshan Sandhu Apologies for the late reply. I had shared your resource and issue details with the Product Owners.
I have received their reply and am sharing it here as-is.
Updates:
Caching behavior is described as a best-effort optimization in our public documentation. I see that you are on the GlobalStandard tier, where requests are routed globally. In that case a cache hit is not guaranteed; the cache hit rate can be lower on the GlobalStandard tier because dynamic routing is involved.
Current status:
We are currently working on caching optimizations that will help improve cache hit rates for SKUs that involve dynamic routing (GlobalStandard). There is no ETA on when this will be completed. Hope this answers.
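If you want to confirm which SKU your deployment is on, the management SDK can report it; a sketch assuming the azure-identity and azure-mgmt-cognitiveservices packages, with placeholder resource names:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

    # Placeholder identifiers; fill in your own subscription and resource names.
    client = CognitiveServicesManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",
    )
    deployment = client.deployments.get(
        resource_group_name="<resource-group>",
        account_name="<aoai-account>",
        deployment_name="<deployment-name>",
    )
    print(deployment.sku.name)  # e.g. "GlobalStandard" vs "Standard"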
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-03-18T06:46:41.0533333+00:00 @Dilshan Sandhu Just following up to check if you had a chance to look at my above reply. Please let me know if you have any further queries. I would be happy to help.