max_tokens in Azure OpenAI not serving its purpose

Dey, Nikita 40 Reputation points
2024-11-05T11:30:04.0666667+00:00

Hello,
I am using Azure's OpenAI REST API to fetch responses but am struggling to limit token usage. According to the documentation, I can set a maximum token limit by using the max_tokens parameter, but it doesn’t seem to work as expected.

For example, when I set max_tokens to 2000 in my API request, here’s the request body:

{
    "max_tokens": 2000,
    "temperature": 0.2,

    "messages": [
        {
            "role": "user",
            "content": "give me xyz"
        }
    ],
    "data_sources": [
        {
            "type": "azure_search",
            "parameters": {
                "endpoint": "xyz",
                "index_name": "xyz",
                "authentication": {
                    "type": "api_key",
                    "key": "xyz"
                },
                "fields_mapping": {
                    "content_fields_separator": "\n",
                    "content_fields": [
                        "content"
                    ],
                    "filepath_field": "metadata_storage_name",
                    "title_field": "title",
                    "url_field": "metadata_storage_path",
                    "vector_fields": []
                },
                "in_scope": "true",
                "role_information": "You are an AI assistant that helps people find information.",
                "strictness": 3,
                "top_n_documents": 1,
                "semantic_configuration": "default-config",
                "query_type": "semantic"
                ": "",
            }
        }
    ]
}

The reported token usage is shown in the attached screenshot (not reproduced here).

Despite setting max_tokens to 2000, the reported token usage seems to be driven mostly by the overall response length; the parameter doesn't appear to cap it consistently.

Could you please provide guidance on how to enforce a strict token limit on responses? Any additional insights into how Azure OpenAI calculates or manages token usage in conjunction with external data sources (like Azure Search) would be very helpful.

Thank you!

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

Accepted answer
  1. santoshkc 15,355 Reputation points Microsoft External Staff Moderator
    2024-11-05T13:11:36.26+00:00

    Hi @Dey, Nikita,

    Thank you for reaching out to Microsoft Q&A forum!

    To enforce a strict token limit when using Azure's OpenAI REST API, start by understanding that the max_tokens parameter only caps the response tokens, not the total input tokens from prompts and external data sources. Since your input, especially when including results from Azure Search, can significantly inflate token usage, consider limiting the number of documents retrieved (top_n_documents) and selectively mapping only essential fields. Additionally, preprocess the data retrieved from Azure Search by truncating or summarizing it before sending it to OpenAI. This approach will help manage the total token count effectively.
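    The preprocessing step suggested above can be sketched as follows. This is a minimal illustration, not Azure code: it trims a retrieved document to a rough token budget using a crude characters-per-token estimate (the 4-characters-per-token ratio is only an assumption; a real tokenizer such as tiktoken would give accurate counts):

    ```python
    def truncate_to_token_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
        """Roughly trim retrieved document text to a token budget.

        Uses a crude characters-per-token estimate; swap in a real
        tokenizer (e.g. tiktoken) for accurate counts.
        """
        budget_chars = max_tokens * chars_per_token
        if len(text) <= budget_chars:
            return text
        # Cut at the last whitespace before the budget so we don't split a word.
        cut = text.rfind(" ", 0, budget_chars)
        return text[: cut if cut > 0 else budget_chars]

    # Example: a long retrieved document trimmed before it is sent to the model.
    doc = "word " * 1000
    trimmed = truncate_to_token_budget(doc, max_tokens=100)
    ```

    Applying this to each document returned by Azure Search (before building the request) keeps the prompt side of the token count bounded, which max_tokens alone cannot do.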

    I hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful.


1 additional answer

Sort by: Most helpful
  1. Ifeoluwa Oduwaiye 0 Reputation points
    2024-11-05T12:47:31.3866667+00:00

    Hello Dey,

    While trying to limit your token usage, you need to understand that the max_tokens parameter in the API request only limits the maximum number of tokens in the response. It doesn’t account for tokens consumed by the input, system messages, or other metadata. This means that if your input uses many tokens, the response may be cut short to fit within the model's context window; max_tokens is an upper bound on the completion, not a target length, and it never constrains the prompt side.

    To limit your token usage, try setting a lower max_tokens value (such as 1000). You can experiment with values lower than 2000 to see how the output changes. Additionally, you can adjust the values of temperature and top_p for more concise responses. Let me know if this works!
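    As a concrete illustration of this accounting, the usage object returned in the API response separates prompt and completion tokens; max_tokens bounds only the latter. The field names below follow the chat completions usage schema, but the numbers are made up for illustration:

    ```python
    # Hypothetical `usage` object mirroring the shape of the API response.
    usage = {
        "prompt_tokens": 3500,      # user message + retrieved Azure Search content
        "completion_tokens": 1800,  # this is the only part max_tokens caps
        "total_tokens": 5300,
    }

    max_tokens = 2000

    # max_tokens constrains only the completion...
    completion_within_cap = usage["completion_tokens"] <= max_tokens
    # ...so total usage can still exceed it because of the prompt side.
    total_exceeds_cap = usage["total_tokens"] > max_tokens
    ```

    This is why a request with max_tokens: 2000 can still report well over 2000 total tokens: the retrieved documents are billed as prompt tokens, outside the cap.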

