Azure OpenAI API Caching Issue with Model `gpt-4o-mini-2024-07-18`
I am running into an issue with the Azure OpenAI service. I am using OpenAI model version gpt-4o-mini-2024-07-18 and Azure API version 2024-10-21, and according to Azure's documentation both the model and the API version should be eligible for prompt caching.
My setup is a static system prompt of about 2,000 tokens plus a dynamic user prompt, which should enable caching.
However, after making over 50 API calls (not all concurrent), the OpenAI API reported about 70% cached tokens, while the Azure OpenAI API showed a mere 0.1% cached tokens. Is this a recognized issue, and is anyone else seeing similar results?
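For reference, a minimal sketch of how I read these counts (the resource details and prompt contents are placeholders):

    from openai import AzureOpenAI

    # Placeholder resource details; substitute your own.
    client = AzureOpenAI(
        api_key="<api-key>",
        api_version="2024-10-21",
        azure_endpoint="https://<resource>.openai.azure.com",
    )

    static_system_prompt = "<~2,000 tokens of fixed instructions>"  # identical on every call
    user_prompt = "<dynamic per-request content>"

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the Azure deployment name
        messages=[
            {"role": "system", "content": static_system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )

    # Cache hits surface in usage.prompt_tokens_details.cached_tokens
    # (this field may be None on older SDK versions).
    usage = response.usage
    print("total:", usage.total_tokens, "| cached:", usage.prompt_tokens_details.cached_tokens)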
Azure OpenAI Service
-
Saideep Anchuri • 9,425 Reputation points • Microsoft External Staff • Moderator
2025-02-04T10:23:03.4533333+00:00 Welcome to Microsoft Q&A Forum, thank you for posting your query here!
The issue you are experiencing with caching on the gpt-4o-mini-2024-07-18 model may stem from several factors. While both the model and API versions you mentioned are eligible for caching, effectiveness depends on the structure of the prompts. You are using a mix of a static system prompt (about 2,000 tokens) and a dynamic user prompt. For caching to apply, the first 1,024 tokens of the prompt must be identical across requests; even a single character difference causes a cache miss, and the cached tokens value shows 0. Caching is enabled automatically for supported models, with no extra configuration needed.
To get cache hits, make sure the prefix of your prompt is byte-for-byte identical across requests and is at least 1,024 tokens long.
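As a small illustration of that rule (prompt contents are placeholders): keep all static content at the front of the message list and append anything dynamic at the end, so the first 1,024 tokens never vary:

    # Static content first: byte-for-byte identical on every request,
    # and long enough (>= 1,024 tokens) to form a cacheable prefix.
    STATIC_INSTRUCTIONS = "<fixed system instructions, ~2,000 tokens>"

    def build_messages(question: str) -> list[dict]:
        return [
            {"role": "system", "content": STATIC_INSTRUCTIONS},
            # Dynamic content last, so it never disturbs the cached prefix.
            {"role": "user", "content": f"Customer question: {question}"},
        ]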
Kindly refer to the link below: prompt-caching
Thank You.
-
Dilshan Sandhu • 15 Reputation points
2025-02-04T12:11:48.3566667+00:00 Hi @Saideep Anchuri ,
Thank you for the prompt reply. The SystemMessage part of the prompt is exactly the same between requests; it is static text, with no object notation ({}) present. Even if I make 50 requests in a loop (with or without concurrency), I mostly get 0 in cached_tokens. Here are the stats for about 50 requests, where the user prompt (role: user) and the system prompt (role: system) stay the same throughout:
Azure OpenAI: {'total_tokens': 228560, 'cached_tokens': 18816}
OpenAI: {'total_tokens': 228560, 'cached_tokens': 201600}
All I am changing in my code is from

    self.client = AzureOpenAI(
        api_key=openai.api_key,
        api_version=api_version,
        azure_endpoint=azure_endpoint,
    )

to

    self.client = OpenAI(api_key=openai.api_key)
No change other than this. If required, I can provide a minimal reproducible example.
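Concretely, the switch amounts to something like this (the endpoint is shown as a placeholder):

    import openai
    from openai import AzureOpenAI, OpenAI

    USE_AZURE = True  # flip to False to run the identical test against OpenAI directly

    api_version = "2024-10-21"
    azure_endpoint = "https://<resource>.openai.azure.com"  # placeholder endpoint

    if USE_AZURE:
        client = AzureOpenAI(
            api_key=openai.api_key,
            api_version=api_version,
            azure_endpoint=azure_endpoint,
        )
    else:
        client = OpenAI(api_key=openai.api_key)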
OpenAI SDK version: 1.6.1
As recommended, I have read the documentation multiple times. Only two conditions need to be satisfied:
- At least 1024 tokens
- The prompt prefix must remain the same (for at least the first 1,024 tokens)
I believe my test cases meet both criteria, since the system prompt alone is over 2,000 tokens and never changes (a quick token-count check is sketched below).
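One way to double-check the first criterion is to count the tokens locally; a quick sketch with tiktoken (assuming gpt-4o-mini uses the o200k_base encoding, and with a hypothetical file holding the static prompt):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # encoding used by the gpt-4o family

    # Hypothetical file containing the static system prompt.
    with open("system_prompt.txt") as f:
        system_prompt = f.read()

    n_tokens = len(enc.encode(system_prompt))
    # Message framing adds a few extra tokens, but this gives the right magnitude.
    print(f"system prompt tokens: {n_tokens} (caching needs a prefix of at least 1,024)")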
Regards,
-
JAYA SHANKAR G S • 4,035 Reputation points • Microsoft External Staff • Moderator
2025-02-05T08:35:44.3866667+00:00 @Dilshan Sandhu Did you check the quota and limits of your subscription plan in Azure OpenAI? Please check that once; meanwhile, I will look for more information regarding your case.
-
JAYA SHANKAR G S • 4,035 Reputation points • Microsoft External Staff • Moderator
2025-02-06T04:49:13.4066667+00:00 Hi @Dilshan Sandhu ,
Here are the things you need to check.
Cache Inactivity Timeout: Prompt caches are typically cleared within 5-10 minutes of inactivity and are always removed within one hour of the cache's last use. If there's a delay of more than an hour between your requests, the cache might expire, leading to fewer cache hits.
Prompt Consistency: Even though your prompt exceeds 1,024 tokens, the prefix may not end up byte-identical each time the prompt is constructed. Ensure that the first 1,024 tokens are identical across all requests; even a single character difference results in a cache miss (a small prefix-fingerprint check is sketched below).
To help us understand the issue better, please share reproducible code showing how the prompt is created.
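One way to verify the consistency point is to fingerprint exactly what each request sends; a sketch (the serialization here is illustrative, not part of any SDK):

    import hashlib
    import json

    def prefix_fingerprint(messages: list, tools: list | None = None) -> str:
        # Serialize deterministically so any byte-level drift between requests
        # shows up as a different hash.
        payload = json.dumps({"tools": tools, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    # Log this for every request: if the fingerprint changes between calls that
    # should share a cache, the prefix is not as static as assumed.
    messages = [{"role": "system", "content": "<static instructions>"}]
    print(prefix_fingerprint(messages))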
-
Dilshan Sandhu • 15 Reputation points
2025-02-06T06:58:24.5033333+00:00 Attachment: prompt_caching.txt
Hi @Saideep Anchuri ,
I have attached a Python file that is a slight modification of the code provided by OpenAI (https://cookbook.openai.com/examples/prompt_caching101).
At the top of the file, you can switch between the Azure and OpenAI clients and run the script.
I received the following results at the end:
OpenAI:
Total: 12714 | Cached: 10880 | Percentage cached: 0.855750
Azure:
Total: 12533 | Cached: 3200 | Percentage cached: 0.255326
Note: the earlier results I shared came from my application code, so they differ. Also, one modification was made this time: I switched to a different Azure deployment, and with that change the cached percentage increased. I don't know why that happened, since the model version and API version are the same. Regardless, the cached percentage is still far below what the OpenAI numbers show.
OpenAI SDK version: 1.61.0
Azure model version: 2024-07-18
Azure API version: 2024-10-21
Model: gpt-4o-mini
To your points:
- The cache inactivity timeout does not apply, because the script completes in under a minute.
- According to the OpenAI cookbook linked above, tool definitions are cached as well, so the combined system message + tools token count is greater than 1,024 and caching should happen (an illustrative payload follows this list).
Please let me know if something else is required from my side.
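For reference, the shape of the payload under test is roughly the following (the tool shown is a made-up example; client and messages are as in the script). Per the cookbook, tool definitions count toward the cacheable prefix, so they must also stay byte-identical:

    # Hypothetical tool definition, abbreviated for illustration.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_order_status",
                "description": "Look up the status of an order by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
        }
    ]

    # tools + system message together exceed 1,024 tokens in my tests,
    # so the prefix should qualify for caching.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
    )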
-
JAYA SHANKAR G S • 4,035 Reputation points • Microsoft External Staff • Moderator
2025-02-07T04:31:36.17+00:00 Hi @Dilshan Sandhu ,
We are reaching out to the internal team to get more information related to your query and will get back to you as soon as we have an update.
Thank you
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-02-17T04:55:15.32+00:00 @Dilshan Sandhu Thanks for sharing the details.
I used the sample code you shared above, but I am unable to reproduce your results at my end: the cached tokens always appear as zero, even after multiple runs using OpenAI model version gpt-4o-mini-2024-07-18 and Azure API version 2024-10-21.
    Run 2:
    2025-02-17 10:10:07.297 INFO httpx HTTP Request: POST https://XXXX.openai.azure.com/openai/deployments/gpt4Omini/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-17 10:10:07.297 INFO root Total tokens: 1338
    2025-02-17 10:10:07.297 INFO root Cached tokens: 0
    2025-02-17 10:10:07.297 INFO root Total: 9874 | Cached: 0 | Percentage cached: 0.000000
    2025-02-17 10:10:07.311 INFO root Run 1:
    2025-02-17 10:10:08.233 INFO httpx HTTP Request: POST https://XXXXX.openai.azure.com/openai/deployments/gpt4Omini/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-17 10:10:08.236 INFO root Total tokens: 1336
    2025-02-17 10:10:08.236 INFO root Cached tokens: 0
    2025-02-17 10:10:11.238 INFO root Run 2:
    2025-02-17 10:10:12.073 INFO httpx HTTP Request: POST https://XXXXXX.openai.azure.com/openai/deployments/gpt4Omini/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-17 10:10:12.073 INFO root Total tokens: 1393
    2025-02-17 10:10:12.074 INFO root Cached tokens: 0
    2025-02-17 10:10:12.074 INFO root Total: 12603 | Cached: 0 | Percentage cached: 0.000000
Did you try accessing it using the o1 models? Do you encounter the same issue?
I have also sent you a private message asking for more details; please provide those once you get a chance.
Awaiting your reply.
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-02-19T08:47:22.7666667+00:00 @Dilshan Sandhu A quick follow-up to check if you had a chance to look at my private message. Awaiting your reply.
-
Javier Jiménez de la Jara • 0 Reputation points
2025-02-24T08:25:05.3833333+00:00 I have a similar issue, but I don't get any cached tokens at all. I have tried the script that @Dilshan Sandhu shared, with the same result.
    2025-02-24 09:16:52.467 INFO root Run 1:
    2025-02-24 09:16:53.523 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:16:53.528 INFO root Total tokens: 1096
    2025-02-24 09:16:53.528 INFO root Cached tokens: 0
    2025-02-24 09:16:56.529 INFO root Run 2:
    2025-02-24 09:16:56.807 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:16:56.807 INFO root Total tokens: 1199
    2025-02-24 09:16:56.807 INFO root Cached tokens: 0
    2025-02-24 09:16:56.807 INFO root Total: 2295 | Cached: 0 | Percentage cached: 0.000000
    2025-02-24 09:16:56.807 INFO root Run 1:
    2025-02-24 09:16:57.834 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:16:57.834 INFO root Total tokens: 1199
    2025-02-24 09:16:57.834 INFO root Cached tokens: 0
    2025-02-24 09:17:00.835 INFO root Run 2:
    2025-02-24 09:17:01.751 INFO httpx HTTP Request: POST https://azureml-pocs-aoai.openai.azure.com/openai/deployments/gpt-4o-2/chat/completions?api-version=2024-10-21 "HTTP/1.1 200 OK"
    2025-02-24 09:17:01.752 INFO root Total tokens: 1222
    2025-02-24 09:17:01.752 INFO root Cached tokens: 0
    2025-02-24 09:17:01.752 INFO root Total: 4716 | Cached: 0 | Percentage cached: 0.000000
Does anyone know how to fix this?
-
Dilshan Sandhu • 15 Reputation points
2025-02-25T09:42:24.89+00:00 Hi @navba-MSFT , @Javier Jiménez de la Jara
I apologize for the delayed response.
I see why the cached tokens you're getting are 0. Azure appears to take some time to build the cache before hits show up in the cached_tokens field (this is not the case with OpenAI, which caches much faster).
I've made a slight modification to the script so that, after the first loop, the code sleeps for 15 seconds:
    MAIN_LOOPS = 3

    # Run the main function
    for j in range(MAIN_LOOPS):
        for i in range(5):
            main(messages, tools, user_query2)
            logging.info(
                "Run: %s | Total: %d | Cached: %d | Percentage cached: %f",
                str(j) + "_" + str(i),
                total_tokens,
                cached_tokens,
                cached_tokens / total_tokens,
            )
        if j == 0:
            time.sleep(15)
This is what I received at each subsequent run with Azure:

    ... Run: 0_4 | Total: 12605 | Cached: 0 | Percentage cached: 0.000000
    ... Run: 1_4 | Total: 27955 | Cached: 1408 | Percentage cached: 0.050367
    ... Run: 2_4 | Total: 46163 | Cached: 6528 | Percentage cached: 0.141412
This supports my original point that OpenAI's caching works much better than Azure's. This is what I received at each subsequent run with OpenAI:

    ... Run: 0_4 | Total: 12672 | Cached: 9728 | Percentage cached: 0.767677
    ... Run: 1_4 | Total: 28042 | Cached: 23552 | Percentage cached: 0.839883
    ... Run: 2_4 | Total: 46263 | Cached: 39936 | Percentage cached: 0.863238
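In practical terms, the workaround this suggests (my inference from the numbers above, not an official recommendation) is a warm-up pass before the real workload:

    import time

    def warm_up(client, messages, tools):
        # Prime the cache with one throwaway request, then give Azure time
        # to register the prefix before the measured/production calls.
        client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
        time.sleep(15)  # empirically, hits started appearing after a pause

    # warm_up(client, messages, tools)  # then issue the real requests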
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-03-03T08:21:16.26+00:00 @Dilshan Sandhu Thanks for your reply. I am able to see cached tokens after introducing the delay. To investigate the cause of the Azure OpenAI caching gap, please share the details below over private message.
Please provide the below details:
- Azure OpenAI resource URI, in the format below:
/subscriptions/XXXXX/resourceGroups/XXXXX/providers/Microsoft.CognitiveServices/accounts/XXXXX
- Region where your resource is deployed.
Awaiting your reply.
-
Dilshan Sandhu • 15 Reputation points
2025-03-04T07:46:17.3366667+00:00 Hi @navba-MSFT ,
I have shared with you the required details over the private message.
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-03-10T05:25:21.6133333+00:00 @Dilshan Sandhu Apologies for the late reply. I had shared your resource and issue details with the Product Owners.
I have received their reply and am sharing it here as-is.
Updates:
Caching behavior is described as a best-effort optimization in our public documentation. I see that you are on the GlobalStandard tier, where requests are routed globally. In that case a cache hit is not guaranteed; the cache hit rate can be lower on the GlobalStandard tier because dynamic routing is involved.
Current status:
We are currently working on caching optimizations that will help improve cache hit rates for SKUs that involve dynamic routing (GlobalStandard). There is no ETA on when this will be completed. Hope this answers.
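If you want to confirm which SKU your deployment is on, the management SDK can report it; a sketch assuming the azure-identity and azure-mgmt-cognitiveservices packages, with placeholder resource names:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

    # Placeholder identifiers; fill in your own subscription and resource names.
    client = CognitiveServicesManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",
    )
    deployment = client.deployments.get(
        resource_group_name="<resource-group>",
        account_name="<aoai-account>",
        deployment_name="<deployment-name>",
    )
    print(deployment.sku.name)  # e.g. "GlobalStandard" vs "Standard"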
-
navba-MSFT • 27,540 Reputation points • Microsoft Employee • Moderator
2025-03-18T06:46:41.0533333+00:00 @Dilshan Sandhu Just following up to check if you had a chance to look at my above reply. Please let me know if you have any further queries. I would be happy to help.