max_tokens in Azure OpenAI not serving its purpose

Dey, Nikita 40 Reputation points
2024-11-05T11:30:04.0666667+00:00

Hello,
I am using Azure's OpenAI REST API to fetch responses but am struggling to limit token usage. According to the documentation, I can set a maximum token limit by using the max_tokens parameter, but it doesn’t seem to work as expected.

For example, when I set max_tokens to 2000 in my API request, here’s the request body:

{
    "max_tokens": 2000,
    "temperature": 0.2,

    "messages": [
        {
            "role": "user",
            "content": "give me xyz"
        }
    ],
    "data_sources": [
        {
            "type": "azure_search",
            "parameters": {
                "endpoint": "xyz",
                "index_name": "xyz",
                "authentication": {
                    "type": "api_key",
                    "key": "xyz"
                },
                "fields_mapping": {
                    "content_fields_separator": "\n",
                    "content_fields": [
                        "content"
                    ],
                    "filepath_field": "metadata_storage_name",
                    "title_field": "title",
                    "url_field": "metadata_storage_path",
                    "vector_fields": []
                },
                "in_scope": "true",
                "role_information": "You are an AI assistant that helps people find information.",
                "strictness": 3,
                "top_n_documents": 1,
                "semantic_configuration": "default-config",
                "query_type": "semantic"
                ": "",
            }
        }
    ]
}

The reported token usage is shown in the attached screenshot (not reproduced here).

Despite setting max_tokens to 2000, the reported token usage seems to be driven mostly by the overall response length; the parameter doesn't appear to cap it consistently.

Could you please provide guidance on how to enforce a strict token limit on responses? Any additional insights into how Azure OpenAI calculates or manages token usage in conjunction with external data sources (like Azure Search) would be very helpful.

Thank you!

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

Accepted answer
  1. santoshkc 15,355 Reputation points Microsoft External Staff Moderator
    2024-11-05T13:11:36.26+00:00

    Hi @Dey, Nikita,

    Thank you for reaching out to Microsoft Q&A forum!

    To enforce a strict token limit when using Azure's OpenAI REST API, start by understanding that the max_tokens parameter only caps the response tokens, not the total input tokens from prompts and external data sources. Since your input, especially when including results from Azure Search, can significantly inflate token usage, consider limiting the number of documents retrieved (top_n_documents) and selectively mapping only essential fields. Additionally, preprocess the data retrieved from Azure Search by truncating or summarizing it before sending it to OpenAI. This approach will help manage the total token count effectively.
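    The preprocessing step suggested above can be sketched as follows. This is a minimal illustration, not Azure code: it trims a retrieved document to a rough token budget using a crude characters-per-token estimate (the 4-characters-per-token ratio is only an assumption; a real tokenizer such as tiktoken would give accurate counts):

    ```python
    def truncate_to_token_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
        """Roughly trim retrieved document text to a token budget.

        Uses a crude characters-per-token estimate; swap in a real
        tokenizer (e.g. tiktoken) for accurate counts.
        """
        budget_chars = max_tokens * chars_per_token
        if len(text) <= budget_chars:
            return text
        # Cut at the last whitespace before the budget so we don't split a word.
        cut = text.rfind(" ", 0, budget_chars)
        return text[: cut if cut > 0 else budget_chars]

    # Example: a long retrieved document trimmed before it is sent to the model.
    doc = "word " * 1000
    trimmed = truncate_to_token_budget(doc, max_tokens=100)
    ```

    Applying this to each document returned by Azure Search (before building the request) keeps the prompt side of the token count bounded, which max_tokens alone cannot do.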

    I hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful.


1 additional answer

Sort by: Most helpful
  1. Ifeoluwa Oduwaiye 0 Reputation points
    2024-11-05T12:47:31.3866667+00:00

    Hello Dey,

    While trying to limit your token usage, you need to understand that the max_tokens parameter in the API request only limits the maximum number of tokens in the response. It doesn’t account for tokens consumed by the input, system messages, or other metadata. This means that if your input uses many tokens, the response may be cut short to fit within the model's context window; max_tokens is an upper bound on the completion, not a target length, and it never constrains the prompt side.

    To limit your token usage, try setting a lower max_tokens value (such as 1000). You can experiment with values lower than 2000 to see how the output changes. Additionally, you can adjust the values of temperature and top_p for more concise responses. Let me know if this works!
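    As a concrete illustration of this accounting, the usage object returned in the API response separates prompt and completion tokens; max_tokens bounds only the latter. The field names below follow the chat completions usage schema, but the numbers are made up for illustration:

    ```python
    # Hypothetical `usage` object mirroring the shape of the API response.
    usage = {
        "prompt_tokens": 3500,      # user message + retrieved Azure Search content
        "completion_tokens": 1800,  # this is the only part max_tokens caps
        "total_tokens": 5300,
    }

    max_tokens = 2000

    # max_tokens constrains only the completion...
    completion_within_cap = usage["completion_tokens"] <= max_tokens
    # ...so total usage can still exceed it because of the prompt side.
    total_exceeds_cap = usage["total_tokens"] > max_tokens
    ```

    This is why a request with max_tokens: 2000 can still report well over 2000 total tokens: the retrieved documents are billed as prompt tokens, outside the cap.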

