APPLIES TO: All API Management tiers
Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see Tutorial: Use Azure Cache for Redis as a semantic cache.
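As a rough illustration of the idea (not API Management's internal implementation), a semantic cache compares embedding vectors of prompts and treats prompts whose similarity clears a threshold as equivalent. A minimal Python sketch, assuming you already have embedding vectors for two prompts:

import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two prompts with similar meaning.
embedding_1 = [-0.021, -0.007, -0.028, 0.013]
embedding_2 = [-0.020, -0.006, -0.027, 0.014]

# If the prompts are similar enough, a semantic cache can reuse the stored response
# instead of forwarding the request to the backend.
if cosine_similarity(embedding_1, embedding_2) >= 0.8:
    print("Similar enough: return cached response")
else:
    print("Not similar: forward the request to the backend")

See the policy reference for how the semantic caching policies themselves evaluate similarity.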
Note
The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the Azure AI Model Inference API.
Note
You can only enable the RediSearch module when creating a new Azure Redis Enterprise or Azure Managed Redis cache. You can't add a module to an existing cache. Learn more
First, test the Azure OpenAI deployment to ensure that the Chat Completion API or Chat API is working as expected. For steps, see Import an Azure OpenAI API to Azure API Management.
For example, test the Azure OpenAI Chat API by sending a POST request to the API endpoint with a prompt in the request body. The response should include the completion of the prompt. Example request:
POST https://my-api-management.azure-api.net/my-api/openai/deployments/chat-deployment/chat/completions?api-version=2024-02-01
with request body:
{"messages":[{"role":"user","content":"Hello"}]}
When the request succeeds, the response includes a completion for the chat message.
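You can also script the same test. A minimal sketch in Python using the requests library, assuming your API requires a subscription key (the hostname, API path, and deployment name are the placeholder values from the example request above):

import requests

# Placeholder values from the example request above; replace with your own.
APIM_ENDPOINT = "https://my-api-management.azure-api.net/my-api"
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"

response = requests.post(
    f"{APIM_ENDPOINT}/openai/deployments/chat-deployment/chat/completions",
    params={"api-version": "2024-02-01"},
    headers={
        # Default header name for API Management subscription keys.
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
    },
    json={"messages": [{"role": "user", "content": "Hello"}]},
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])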
Configure a backend resource for the embeddings API deployment with the following settings:
- Name: a name of your choice, such as embeddings-backend. You use this name to reference the backend in policies.
- Runtime URL: the URL of the embeddings API deployment in Azure OpenAI Service, similar to https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
- Authorization credentials: the managed identity of your API Management instance, with resource ID https://cognitiveservices.azure.com/ for Azure OpenAI Service.
To test the backend, create an API operation for your Azure OpenAI Service API:
- In the Frontend section, select POST and enter / as the URL path.
- On the Headers tab, add a required header with the name Content-Type and value application/json.
Configure the following policies in the Inbound processing section of the API operation. In the set-backend-service policy, substitute the name of the backend you created.
<policies>
<inbound>
<set-backend-service backend-id="embeddings-backend" />
<authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
[...]
</inbound>
[...]
</policies>
On the Test tab, test the operation by adding an api-version query parameter with a value such as 2024-02-01. Provide a valid request body. For example:
{"input":"Hello"}
If the request is successful, the response includes a vector representation of the input text:
{
"object": "list",
"data": [{
"object": "embedding",
"index": 0,
"embedding": [
-0.021829502,
-0.007157768,
-0.028619017,
[...]
]
}]
}
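If you prefer to test the operation outside the portal, a minimal Python sketch of the same call, assuming the operation is exposed at the path shown below (a hypothetical example) and the API requires a subscription key:

import requests

# Placeholder values; adjust to your API Management instance and the operation's path.
EMBEDDINGS_OPERATION_URL = "https://my-api-management.azure-api.net/embeddings-test/"
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"

response = requests.post(
    EMBEDDINGS_OPERATION_URL,
    params={"api-version": "2024-02-01"},
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    json={"input": "Hello"},
    timeout=30,
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(f"Received an embedding with {len(embedding)} dimensions")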
To enable semantic caching for Azure OpenAI APIs in Azure API Management, apply the following policies: one to check the cache before sending requests (lookup) and another to store responses for future reuse (store):
In the Inbound processing section for the API, add the azure-openai-semantic-cache-lookup policy. In the embeddings-backend-id attribute, specify the Embeddings API backend you created.
Note
When enabling semantic caching for other large language model APIs, use the llm-semantic-cache-lookup policy instead.
Example:
<azure-openai-semantic-cache-lookup
score-threshold="0.8"
embeddings-backend-id="embeddings-backend"
embeddings-backend-auth="system-assigned"
ignore-system-messages="true"
max-message-count="10">
<vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
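In this example, the optional vary-by element partitions the cache by subscription ID, so a response cached for one subscription isn't returned to callers using a different subscription.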
In the Outbound processing section for the API, add the azure-openai-semantic-cache-store policy.
Note
When enabling semantic caching for other large language model APIs, use the llm-semantic-cache-store policy instead.
Example:
<azure-openai-semantic-cache-store duration="60" />
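The duration attribute is expressed in seconds, so this example caches responses for 60 seconds.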
To confirm that semantic caching is working as expected, trace a test Completion or Chat Completion operation using the test console in the portal. Confirm that the cache was used on subsequent tries by inspecting the trace. Learn more about tracing API calls in Azure API Management.
For example, if the cache was used, the Output section of the trace includes entries showing that the response was served from the semantic cache.
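Another simple check, outside of tracing, is to send two prompts with the same meaning and compare response times. A sketch in Python, assuming caching is configured as above and using the same placeholder endpoint and key as the earlier examples:

import time
import requests

APIM_ENDPOINT = "https://my-api-management.azure-api.net/my-api"
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"

def timed_chat(prompt):
    """Send a chat completion request through API Management and return the elapsed seconds."""
    start = time.perf_counter()
    response = requests.post(
        f"{APIM_ENDPOINT}/openai/deployments/chat-deployment/chat/completions",
        params={"api-version": "2024-02-01"},
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        json={"messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    return time.perf_counter() - start

# The second prompt is similar in meaning, so once the first response is stored,
# it should be served from the semantic cache and return noticeably faster.
print(f"First call:  {timed_chat('How do I reset my password?'):.2f}s")
print(f"Second call: {timed_chat('What are the steps to reset my password?'):.2f}s")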
Documentation
Azure API Management policy reference - azure-openai-semantic-cache-store
Reference for the azure-openai-semantic-cache-store policy available for use in Azure API Management. Provides policy usage, settings, and examples.
Azure API Management policy reference - azure-openai-semantic-cache-lookup
Reference for the azure-openai-semantic-cache-lookup policy available for use in Azure API Management. Provides policy usage, settings, and examples.
Azure API Management policy reference - llm-semantic-cache-lookup
Reference for the llm-semantic-cache-lookup policy available for use in Azure API Management. Provides policy usage, settings, and examples.