Hi Yl,
Azure AI Inference enforces a hard timeout of 120 seconds: if a model takes longer than 2 minutes to generate a response, Azure automatically terminates the request.
To resolve this, I recommend:
- Reducing `max_tokens` (e.g., set `max_tokens=800`).
- Enabling response streaming (`stream=True`) to avoid waiting for the full response (see the sketch below).
- Shortening the input prompt to minimize processing time.
I hope this helps! Don't hesitate to let me know if you have any other questions.