Hi sivasankar,
Greetings and welcome to the Microsoft Q&A forum! Thanks for posting your query!
Setting max_tokens to a lower value, like 800, caps the length of the completion, so the model stops generating sooner and total processing time drops. This improves response speed while keeping the output useful and relevant.
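As a rough sketch, here is how you might pass max_tokens with the openai Python SDK (v1.x) against Azure OpenAI; the endpoint, key, API version, and deployment name below are placeholders for your own resource values:

```python
from openai import AzureOpenAI

# Placeholder credentials and endpoint -- replace with your own resource values.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-02-01",  # example API version
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # placeholder deployment name
    messages=[{"role": "user", "content": "Summarize the benefits of caching."}],
    max_tokens=800,  # cap the completion length to cut generation time
)
print(response.choices[0].message.content)
```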
Enabling response streaming (stream=True) sends tokens back as soon as they are generated instead of waiting for the whole completion to finish. Total generation time is unchanged, but the interaction feels much faster because users can start reading the answer immediately.
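A minimal streaming sketch, reusing the client and placeholder deployment name from the snippet above:

```python
stream = client.chat.completions.create(
    model="<your-deployment-name>",  # placeholder deployment name
    messages=[{"role": "user", "content": "Explain response streaming."}],
    stream=True,  # receive tokens as they are generated
)

for chunk in stream:
    # Some chunks (e.g., Azure's initial content-filter chunk) carry no
    # choices, so guard before reading the delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```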
Using a shorter input prompt gives the model fewer tokens to process, which speeds up the response. Keeping prompts clear and concise also helps the model focus on what's important, as the comparison below shows.
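To see the difference concretely, you can count the tokens in a verbose versus a concise prompt; this sketch assumes the tiktoken library is installed (cl100k_base is the encoding used by the GPT-3.5/GPT-4 model families):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = ("Could you please, if at all possible, provide me with a detailed "
           "explanation of what response streaming is and how it works?")
concise = "Explain response streaming briefly."

print(len(enc.encode(verbose)))  # more input tokens -> more processing time
print(len(enc.encode(concise)))  # fewer input tokens -> faster response
```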
I hope this information helps.