Get cached responses of large language model API requests
APPLIES TO: All API Management tiers
Use the llm-semantic-cache-lookup policy to perform cache lookup of responses to large language model (LLM) API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.
Note
- This policy must have a corresponding Cache responses to large language model API requests (llm-semantic-cache-store) policy.
- For prerequisites and steps to enable semantic caching, see Enable semantic caching for Azure OpenAI APIs in Azure API Management.
- Currently, this policy is in preview.
Note
Set the policy's elements and child elements in the order provided in the policy statement. Learn more about how to set or edit API Management policies.
Supported models
Use the policy with LLM APIs added to Azure API Management that are available through the Azure AI Model Inference API.
Policy statement
<llm-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count">
    <vary-by>"expression to partition caching"</vary-by>
</llm-semantic-cache-lookup>
Attributes
Attribute | Description | Required | Default |
---|---|---|---|
score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Learn more. | Yes | N/A |
embeddings-backend-id | Backend ID for OpenAI embeddings API call. | Yes | N/A |
embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to system-assigned. | N/A |
ignore-system-messages | Boolean. If set to true, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |
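As an illustration of how the optional attributes combine with the required ones, the following sketch sets both ignore-system-messages and max-message-count. The backend ID embeddings-backend is hypothetical, and the values 0.05 and 10 are assumptions chosen for illustration rather than recommended settings.

```xml
<!-- Sketch only: "embeddings-backend" is a hypothetical backend ID;
     0.05 and 10 are illustrative values, not recommendations. -->
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</llm-semantic-cache-lookup>
```

With these settings, system messages are removed from the prompt before cache similarity is assessed, and max-message-count sets the number of remaining dialog messages after which caching is skipped.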
Elements
Name | Description | Required |
---|---|---|
vary-by | A custom expression determined at runtime whose value partitions caching. If multiple vary-by elements are added, values are concatenated to create a unique combination. | No |
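To illustrate partitioning with more than one vary-by element, the following sketch keys the cache on both the subscription ID and a request header. The header name x-tenant-id and the backend ID embeddings-backend are hypothetical; substitute whatever identifies a partition in your API.

```xml
<!-- Sketch only: "embeddings-backend" and "x-tenant-id" are hypothetical names. -->
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned">
    <!-- Partition cached responses by subscription -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <!-- Further partition by a request header; an empty string is used when the header is absent -->
    <vary-by>@(context.Request.Headers.GetValueOrDefault("x-tenant-id", ""))</vary-by>
</llm-semantic-cache-lookup>
```

Because the values are concatenated, a cached response would be considered only for requests that share both the same subscription and the same header value.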
Usage
- Policy sections: inbound
- Policy scopes: global, product, API, operation
- Gateways: v2
Usage notes
- This policy can only be used once in a policy section.
Examples
Example with corresponding llm-semantic-cache-store policy
<policies>
    <inbound>
        <base />
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="llm-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
Related policies
- llm-semantic-cache-store
Related content
For more information about working with policies, see:
- Tutorial: Transform and protect your API
- Policy reference for a full list of policy statements and their settings
- Policy expressions
- Set or edit policies
- Reuse policy configurations
- Policy snippets repo
- Author policies using Microsoft Copilot in Azure