Sol Lee asks:
When using a large language model like GPT-3.5 Turbo to build a chatbot application, the bot can become 'lazy' after extensive conversations, meaning its responses become incomplete or it starts reusing previous answers. What could be the reasons why the bot becomes lazy? What methods are there to improve this situation?
Greetings! It looks like the issue you are facing is related to the maximum token length (max_tokens) parameter. When the maximum token length is set to a high value, the model tries to use all of the available tokens, which can lead to repetitive and incomplete responses.
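For illustration, here is a minimal sketch of a chat completion request with a deliberately bounded max_tokens, using the openai Python SDK against Azure OpenAI. The environment variable names and the deployment name gpt-35-turbo are placeholders for your own setup, not values taken from your question:

```python
import os
from openai import AzureOpenAI

# Assumptions: credentials are supplied via environment variables, and
# "gpt-35-turbo" is the name you gave your model deployment.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # deployment name (placeholder)
    messages=[{"role": "user", "content": "Summarize our conversation so far."}],
    max_tokens=256,  # bound the completion rather than leaving large headroom
)
print(response.choices[0].message.content)
```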
I would suggest checking the documentation on improving performance and latency.
Here are some of the best practices to lower latency:
- Model latency: If model latency is important to you, we recommend trying out our latest models in the GPT-3.5 Turbo model series.
- Lower max tokens: OpenAI has found that, even in cases where the total number of tokens generated is similar, the request with the higher value set for the max_tokens parameter will have higher latency.
- Lower total tokens generated: The fewer tokens generated, the faster the overall response will be. Remember, generation works like a for loop where n tokens = n iterations: lower the number of tokens generated and the overall response time will improve accordingly.
- Streaming: Enabling streaming can be useful in managing user expectations in certain situations by allowing the user to see the model's response as it is being generated rather than having to wait until the last token is ready (see the sketch after this list).
- Content filtering: Content filtering improves safety, but it also impacts latency. Evaluate whether any of your workloads would benefit from modified content filtering policies.
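As a sketch of the streaming option mentioned above (same assumptions about the client credentials and deployment name as in the earlier example):

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

stream = client.chat.completions.create(
    model="gpt-35-turbo",  # deployment name (placeholder)
    messages=[{"role": "user", "content": "Explain token streaming briefly."}],
    max_tokens=256,
    stream=True,  # deliver tokens as they are generated
)

for chunk in stream:
    # On Azure, the first chunk can carry content-filter results with no
    # choices, so guard before reading the delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

With streaming, the user starts reading after the first token instead of waiting for the full completion, which makes long responses feel much faster even though total generation time is unchanged.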
Also, see Prompt engineering techniques, Learn how to work with the GPT-35-Turbo and GPT-4 models, and Recommended settings, and see if that helps.
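For example, one simple prompt-engineering step is a system message that explicitly discourages the 'lazy' behavior you described. The wording below is only an illustrative assumption, not taken from those docs:

```python
# Illustrative only: the exact instruction wording is an assumption.
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful assistant. Always give complete answers; "
            "do not cut responses short or repeat earlier answers verbatim."
        ),
    },
    {"role": "user", "content": "Let's continue where we left off."},
]
```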
I hope this helps. Please let me know if you have any further queries.