Azure OpenAI - Prompt Caching does not improve latency

Vitcu, Marvin 10 Reputation points
2024-10-24T11:36:56.78+00:00

I tested prompt caching with Azure OpenAI using several models, including GPT-4o, GPT-4o-mini, and o1-preview.

For these tests, I used a range of input sizes, from 10k to 100k tokens per request. However, repeating the same user request multiple times (in order to leverage prompt caching) did not lead to faster response times.

In all the models I tested, there were no improvements in latency. The time it took to generate an answer remained consistent.

Has anyone else had a similar experience? Or has anyone achieved faster response times using prompt caching?

Azure OpenAI Service

1 answer

  1. AshokPeddakotla-MSFT 35,971 Reputation points Moderator
    2024-10-24T13:29:53.4533333+00:00

    Vitcu, Marvin, greetings and welcome to the Microsoft Q&A forum!

    I understand your concern. Prompt caching allows you to reduce overall request latency for longer prompts that share an identical prefix.

    For these tests, I used a range of input sizes, from 10k to 100k tokens per request. However, repeating the same user request multiple times (in order to leverage prompt caching) did not lead to faster response times.

    As per my understanding, prompts that haven't been used recently are automatically removed from the cache. To minimize evictions, maintain a consistent stream of requests with the same prompt prefix.

    Also, see Performance and latency and follow the best practices there to improve latency.
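    To illustrate the consistent-prefix point, here is a minimal sketch (not from this thread) using the openai Python package's AzureOpenAI client. The endpoint, API version, and deployment name are placeholders, and the exact minimum prompt length and model support for caching are assumptions to verify against the documentation; the idea is simply to keep the large, unchanging content at the start of the messages and the variable part at the end.

    ```python
    import os
    from openai import AzureOpenAI

    # Placeholder resource details - replace with your own endpoint, key, and deployment.
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-01-preview",  # assumed; use a version that supports prompt caching
    )

    # Keep the long, unchanging content (instructions, reference documents) at the START
    # of the prompt so repeated requests share an identical prefix that the cache can match.
    STATIC_SYSTEM_PROMPT = (
        "You are a support assistant. Answer using the reference material below.\n"
        "<large reference document goes here>"  # the bulk of the 10k-100k input tokens
    )

    def ask(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # your deployment name
            messages=[
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # identical prefix on every call
                {"role": "user", "content": question},                # variable content goes last
            ],
        )
        return response.choices[0].message.content
    ```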

    In all the models I tested, there were no improvements in latency. The time it took to generate an answer remained consistent.

    Unfortunately, I haven't found a way to check this.

    Do you have any samples that show the difference? Did you try with different prompts?
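    On checking whether caching actually kicked in: one hedged option is to time each call and read the cached token count that recent API versions can return under usage.prompt_tokens_details.cached_tokens in the chat completions response. Whether that field is populated depends on the API version and model, so treat the attribute access below as an assumption to verify; the sketch reuses the client and STATIC_SYSTEM_PROMPT from the example above.

    ```python
    import time

    def timed_request(question: str) -> None:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model="gpt-4o",  # your deployment name
            messages=[
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        elapsed = time.perf_counter() - start

        # Newer API versions report how many prompt tokens were served from the cache.
        # The field may be missing on older versions/models, hence the defensive access.
        details = getattr(response.usage, "prompt_tokens_details", None)
        cached = getattr(details, "cached_tokens", None) if details is not None else None
        print(f"latency: {elapsed:.2f}s, "
              f"prompt_tokens: {response.usage.prompt_tokens}, cached_tokens: {cached}")

    # Send the same request twice back-to-back: if caching applied, the second call should
    # report cached_tokens > 0 even when end-to-end latency looks similar, since output
    # generation time often dominates for long answers.
    timed_request("Summarize the key points of the reference document.")
    timed_request("Summarize the key points of the reference document.")
    ```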

