Hello Prince Solomon,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand you are having issues with setting the `max_tokens` parameter for GPT-4o in Azure OpenAI, especially when trying to utilize its full 16K output token capacity.
Yes, Azure OpenAI does support up to 16,384 output tokens for GPT-4o, but note that this applies to the 2024-08-06 model version; the earlier 2024-05-13 version caps output at 4,096 tokens, which is a common source of the truncation confusion. Even on the newer version, the service will not use that ceiling on its own: the `max_tokens` parameter must be set explicitly in your API request, and the sum of your input tokens and `max_tokens` must fit within the model's 128K context window.
Therefore, to fully leverage the 16K output capability, count the tokens in your input prompt and confirm that the input count plus your chosen `max_tokens` stays within the 128K context window; `max_tokens` itself can be set as high as 16,384. You can use the `tiktoken` library to tokenize your input and verify this before sending the request. Here's a sample configuration for your API request:
```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "max_tokens": 12000,
  "temperature": 0.7
}
```
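To put the token math into practice, here is a minimal Python sketch using the `tiktoken` and `openai` packages. Note that the endpoint, API key, API version, and deployment name below are placeholders for illustration; substitute the values from your own Azure resource:

```python
import tiktoken
from openai import AzureOpenAI

# GPT-4o (2024-08-06) limits: 128K total context window, 16,384 output tokens.
CONTEXT_WINDOW = 128_000
MAX_OUTPUT_TOKENS = 16_384

prompt = "Your prompt here"

# GPT-4o uses the o200k_base encoding in tiktoken. This counts only the raw
# prompt text; the chat format adds a few tokens per message, so treat the
# result as a close approximation rather than an exact figure.
encoding = tiktoken.get_encoding("o200k_base")
input_tokens = len(encoding.encode(prompt))

# max_tokens may go up to the model's output ceiling, but input plus output
# must still fit inside the context window.
max_tokens = min(MAX_OUTPUT_TOKENS, CONTEXT_WINDOW - input_tokens)

# Placeholder credentials; replace with your own resource details.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-08-01-preview",
)

response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name, not necessarily the base model name
    messages=[{"role": "user", "content": prompt}],
    max_tokens=max_tokens,
    temperature=0.7,
)
print(response.choices[0].message.content)
```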
Setting `max_tokens` explicitly like this ensures that your output is not prematurely cut off by a lower default limit. If `max_tokens` is omitted, the service may fall back to a much lower default output cap (commonly 4,096 tokens), and if your input tokens plus `max_tokens` exceed the context window, the API rejects the request, so always account for input length when choosing the value.
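If you want to confirm at runtime whether a response was actually truncated, you can inspect the `finish_reason` field on the returned choice. Continuing from the Python sketch above:

```python
# finish_reason is "length" when the reply was cut off by the max_tokens
# limit, and "stop" when the model ended its answer naturally.
choice = response.choices[0]
if choice.finish_reason == "length":
    print("Output hit the max_tokens limit; raise it or shorten the prompt.")
```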
For more details on token usage and best practices, see the official Azure OpenAI documentation.
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.
Please don't forget to close the thread by upvoting and accepting this as the answer if it resolved your issue.