OpenAI model becomes 'lazy': the responses become incomplete or the model starts to reuse previous answers

Sol Lee 0 Reputation points Microsoft Employee
2024-04-22T19:25:34.3933333+00:00

When using a large language model like GPT-3.5 Turbo to build a chatbot application, the bot may become 'lazy' after long conversations: responses become incomplete, or the bot starts reusing previous answers. What could cause the bot to become lazy, and what methods are there to improve this situation?

A fictional scenario with a chatbot service: **Online tech support chatbot**
User: Hi, my laptop has been overheating whenever I play games. What should I do?
Chatbot: Hi! It sounds like an overheating issue. Please ensure that your laptop is in a well-ventilated area, and check if the cooling fans are running properly.
User: Okay, I checked, and the fans are working fine. But it’s still overheating.
Chatbot: Please ensure that your laptop is in a well-ventilated area, and check if the cooling fans are running properly.
User: You just said that. Is there anything else I can try?
Chatbot: Please ensure that your laptop is in a well-ventilated area.

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.
2,251 questions

1 answer

Sort by: Most helpful
  1. AshokPeddakotla-MSFT 28,221 Reputation points
    2024-04-23T02:39:53.68+00:00

    Sol Lee Greetings!


    It looks like the issue you are facing is related to the max tokens parameter. When you set the maximum token length to a very high value, the model may try to use all of the available tokens, which can lead to repetitive or incomplete responses.
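    Long conversations are another common trigger for the reused-answer symptom: the growing history crowds the context window. A minimal sketch of one mitigation is to trim the history to a token budget before each request, keeping the system prompt and the most recent turns. The ~4-characters-per-token estimate is a rough heuristic of mine, not the real tokenizer (use `tiktoken` for accurate counts):

    ```python
    # Sketch: trim conversation history to a token budget so long chats don't
    # push the model toward repetitive, truncated replies.
    # approx_tokens is a rough heuristic (~4 chars/token), not the tokenizer.

    def approx_tokens(text: str) -> int:
        """Rough token estimate (about 4 characters per token for English)."""
        return max(1, len(text) // 4)

    def trim_history(messages: list[dict], budget: int) -> list[dict]:
        """Keep the system prompt plus the most recent turns within `budget` tokens."""
        system = [m for m in messages if m["role"] == "system"]
        turns = [m for m in messages if m["role"] != "system"]
        kept, used = [], sum(approx_tokens(m["content"]) for m in system)
        for m in reversed(turns):            # walk newest-first
            cost = approx_tokens(m["content"])
            if used + cost > budget:
                break                        # older turns are dropped
            kept.append(m)
            used += cost
        return system + list(reversed(kept))  # restore chronological order
    ```

    The trimmed list is what you would pass as `messages` in the chat completion request; the system prompt always survives, so the bot keeps its instructions even when old turns are dropped.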

    I would suggest checking the documentation on improving performance and latency.

    Here are some best practices to lower latency:

    • Model latency: If model latency is important to you, we recommend trying out the latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that, even in cases where the total number of tokens generated is similar, the request with the higher value set for the max tokens parameter will have more latency.
    • Lower total tokens generated: The fewer tokens generated, the faster the overall response will be. Think of it as a for loop where n tokens means n iterations; lowering the number of tokens generated improves overall response time accordingly.
    • Streaming: Enabling streaming can help manage user expectations by letting the user see the model response as it is being generated, rather than waiting until the last token is ready.
    • Content filtering: Content filtering improves safety, but it also adds latency. Evaluate whether any of your workloads would benefit from modified content filtering policies.

    Also, see Prompt engineering techniques, Learn how to work with the GPT-35-Turbo and GPT-4 models, and Recommended settings, and see if those help.

    I hope this helps. Please let me know if you have any further queries.

    1 person found this answer helpful.