APPLIES TO: All API Management tiers
Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.
Note

Set the policy's elements and child elements in the order provided in the policy statement. Learn more about how to set or edit API Management policies.

Note

Use the policy with LLM APIs added to Azure API Management that are available through the Azure AI Model Inference API.
```xml
<llm-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count">
    <vary-by>"expression to partition caching"</vary-by>
</llm-semantic-cache-lookup>
```
| Attribute | Description | Required | Default |
|---|---|---|---|
| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Learn more. | Yes | N/A |
| embeddings-backend-id | Backend ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | `false` |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped (see the sketch after this table). | No | N/A |
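For example, a lookup that ignores system messages when assessing similarity and skips caching once more than 10 dialog messages remain might look like the following sketch. The backend ID `embeddings-backend` is a placeholder for an embeddings backend configured in your instance, not a value from this article:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10" />
```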
| Name | Description | Required |
|---|---|---|
| vary-by | A custom expression determined at runtime whose value partitions caching. If multiple `vary-by` elements are added, values are concatenated to create a unique combination (see the sketch after this table). | No |
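As a sketch, the following partitions the cache by both the subscription ID and the value of a hypothetical `x-deployment` request header; at runtime, the values of the two expressions are concatenated into a single cache partition key (`embeddings-backend` is again a placeholder backend ID):

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned">
    <!-- Values of multiple vary-by expressions are concatenated into one partition key -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <vary-by>@(context.Request.Headers.GetValueOrDefault("x-deployment", ""))</vary-by>
</llm-semantic-cache-lookup>
```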
```xml
<policies>
    <inbound>
        <base />
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="llm-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```
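In this example, a prompt is served from the cache when its similarity score against a previously cached prompt meets the 0.05 threshold, the cache is partitioned by subscription ID, and the paired `llm-semantic-cache-store` policy in the outbound section caches new responses for 60 seconds.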
For more information about working with policies, see the API Management policy reference.