Azure OpenAI - Prompt Caching does not improve latency

Vitcu, Marvin 10 Reputation points
2024-10-24T11:36:56.78+00:00

I tested prompt caching with Azure OpenAI using several models, including GPT-4o, GPT-4o-mini, and o1-preview.

For these tests, I used a range of input sizes, from 10k to 100k tokens per request. However, repeating the same user request multiple times (in order to leverage prompt caching) did not lead to faster response times.

In all the models I tested, there were no improvements in latency. The time it took to generate an answer remained consistent.

Has anyone else had a similar experience? Or has anyone achieved faster response times using prompt caching?

Azure OpenAI Service

1 answer

  1. AshokPeddakotla-MSFT 35,971 Reputation points Moderator
    2024-10-24T13:29:53.4533333+00:00

    Vitcu, Marvin, greetings and welcome to the Microsoft Q&A forum!

    I understand your concern. Prompt caching allows you to reduce overall request latency for longer prompts that share an identical prefix.

    For these tests, I used a range of input sizes, from 10k to 100k tokens per request. However, repeating the same user request multiple times (in order to leverage prompt caching) did not lead to faster response times.

    As per my understanding, prompts that haven't been used recently are automatically removed from the cache. To minimize evictions, maintain a consistent stream of requests with the same prompt prefix.

    Also, see Performance and latency and follow the best practices there to improve latency.
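    To illustrate the consistent-prefix point, here is a minimal sketch (not from this thread) using the openai Python package's AzureOpenAI client. The endpoint, API version, and deployment name are placeholders, and the exact minimum prompt length and model support for caching are assumptions to verify against the documentation; the idea is simply to keep the large, unchanging content at the start of the messages and the variable part at the end.

    ```python
    import os
    from openai import AzureOpenAI

    # Placeholder resource details - replace with your own endpoint, key, and deployment.
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-01-preview",  # assumed; use a version that supports prompt caching
    )

    # Keep the long, unchanging content (instructions, reference documents) at the START
    # of the prompt so repeated requests share an identical prefix that the cache can match.
    STATIC_SYSTEM_PROMPT = (
        "You are a support assistant. Answer using the reference material below.\n"
        "<large reference document goes here>"  # the bulk of the 10k-100k input tokens
    )

    def ask(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # your deployment name
            messages=[
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # identical prefix on every call
                {"role": "user", "content": question},                # variable content goes last
            ],
        )
        return response.choices[0].message.content
    ```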

    In all the models I tested, there were no improvements in latency. The time it took to generate an answer remained consistent.

    Unfortunately, I haven't found a way to check this.

    Do you have any samples that show the difference? Did you try with different prompts?
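    On checking whether caching actually kicked in: one hedged option is to time each call and read the cached token count that recent API versions can return under usage.prompt_tokens_details.cached_tokens in the chat completions response. Whether that field is populated depends on the API version and model, so treat the attribute access below as an assumption to verify; the sketch reuses the client and STATIC_SYSTEM_PROMPT from the example above.

    ```python
    import time

    def timed_request(question: str) -> None:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model="gpt-4o",  # your deployment name
            messages=[
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        elapsed = time.perf_counter() - start

        # Newer API versions report how many prompt tokens were served from the cache.
        # The field may be missing on older versions/models, hence the defensive access.
        details = getattr(response.usage, "prompt_tokens_details", None)
        cached = getattr(details, "cached_tokens", None) if details is not None else None
        print(f"latency: {elapsed:.2f}s, "
              f"prompt_tokens: {response.usage.prompt_tokens}, cached_tokens: {cached}")

    # Send the same request twice back-to-back: if caching applied, the second call should
    # report cached_tokens > 0 even when end-to-end latency looks similar, since output
    # generation time often dominates for long answers.
    timed_request("Summarize the key points of the reference document.")
    timed_request("Summarize the key points of the reference document.")
    ```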

