Up to 25s latency on Azure OpenAI service when using structured outputs in function calling

Emil Lienemann 5 Reputation points
2025-03-18T11:24:14.9+00:00

Hi there,

I have been experimenting with the Azure OpenAI service, and latency and response speed were quite satisfactory.

Curiously, the second I enable structured outputs with function calling, latency jumps from around 3 seconds to as much as 25. Here's a video:

https://share.cleanshot.com/pMJbjB8C

The long latency seems to occur even when the plugin is never actually called (e.g., for prompts like "hello").

I've tried custom content filters, streaming, and adjusting max_tokens, but this huge latency difference compared to the exact same call without function calling seems inexplicable.

Note: The JSON I am using is pretty big, around 3k tokens beautified.
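
For reference, here's a minimal sketch of the kind of call I mean, using the openai Python SDK (endpoint, deployment name, and function schema are placeholders; my real schema is the ~3k-token one mentioned above):

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-08-01-preview",  # structured outputs need this version or later
)

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",  # placeholder; my real schema is much larger
        "description": "Look up an order by its ID.",
        "strict": True,  # enabling structured outputs is what triggers the slowdown
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

# Even a prompt that never triggers the function is slow once strict is on.
response = client.chat.completions.create(
    model="gpt-4o",  # deployment name placeholder
    messages=[{"role": "user", "content": "hello"}],
    tools=tools,
)
print(response.choices[0].message)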


1 answer

  1. Pavankumar Purilla 8,335 Reputation points Microsoft External Staff Moderator
    2025-03-18T22:58:14.88+00:00

    Hi Emil Lienemann,

    It sounds like you're encountering significant latency issues when using structured outputs with function calling in the Azure OpenAI service. Here are a few things to try:

    • Reduce JSON complexity: minimize the JSON schema or split large responses across multiple function calls; a ~3k-token schema adds processing overhead to every request.
    • Optimize token limits: set a lower max_tokens value to keep responses short and avoid long generation delays.
    • Use streaming: start receiving tokens as they are generated instead of waiting for the full completion, which improves perceived latency (see the sketch after this list).
    • Experiment with smaller schemas: test a simpler function-calling schema to isolate how much of the latency is schema-related.
    • Optimize Azure region and model selection: try different regions or model versions, as some have lower latency depending on resource availability and service load.
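
    For example, here is a minimal sketch combining streaming with a lower max_tokens, using the openai Python SDK (the endpoint, key, deployment name, and example function are placeholders, not your actual setup):

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-key>",  # placeholder
        api_version="2024-08-01-preview",
    )

    # A deliberately small strict schema; trimming the real ~3k-token schema
    # is the first thing to test.
    tools = [{
        "type": "function",
        "function": {
            "name": "lookup_order",  # hypothetical example function
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
                "additionalProperties": False,
            },
        },
    }]

    # Stream tokens as they arrive and cap generation length; this improves
    # perceived latency even if total generation time is unchanged.
    stream = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[{"role": "user", "content": "hello"}],
        tools=tools,
        max_tokens=256,
        stream=True,
    )

    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)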

    For more information: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure%2Cdotnet-entra-id&pivots=programming-language-csharp

    I hope this information helps. Thank you!

