Up to 25s latency on Azure OpenAI service when using structured outputs in function calling

Emil Lienemann 5 Reputation points
2025-03-18T11:24:14.9+00:00

Hi there,

I have been experimenting with the Azure OpenAI service, and latency and response speed were quite satisfactory.

Curiously, the second I enable structured outputs with function calling, latency jumps from around 3 seconds to as much as 25. Here's a video:

https://share.cleanshot.com/pMJbjB8C

The long latency seems to occur even when the plugin is never actually called (e.g., for prompts like "hello").

I've tried custom content filters, streaming, and adjusting max_tokens, but this huge latency difference compared to the exact same call without function calling seems inexplicable.

Note: The JSON I am using is pretty big, around 3k tokens beautified.
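
For reference, here's a minimal sketch of the kind of call I mean, using the openai Python SDK (endpoint, deployment name, and function schema are placeholders; my real schema is the ~3k-token one mentioned above):

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-08-01-preview",  # structured outputs need this version or later
)

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",  # placeholder; my real schema is much larger
        "description": "Look up an order by its ID.",
        "strict": True,  # enabling structured outputs is what triggers the slowdown
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

# Even a prompt that never triggers the function is slow once strict is on.
response = client.chat.completions.create(
    model="gpt-4o",  # deployment name placeholder
    messages=[{"role": "user", "content": "hello"}],
    tools=tools,
)
print(response.choices[0].message)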


1 answer

  1. Pavankumar Purilla 8,335 Reputation points Microsoft External Staff Moderator
    2025-03-18T22:58:14.88+00:00

    Hi Emil Lienemann,

    It sounds like you're encountering significant latency issues when using structured outputs with function calling in the Azure OpenAI service. Here are a few things to try:

    • Reduce JSON complexity: minimize the JSON schema or split large responses across multiple function calls; a ~3k-token schema adds processing overhead to every request.
    • Optimize token limits: set a lower max_tokens value to keep responses short and avoid long generation delays.
    • Use streaming: start receiving tokens as they are generated instead of waiting for the full completion, which improves perceived latency (see the sketch after this list).
    • Experiment with smaller schemas: test a simpler function-calling schema to isolate how much of the latency is schema-related.
    • Optimize Azure region and model selection: try different regions or model versions, as some have lower latency depending on resource availability and service load.
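
    For example, here is a minimal sketch combining streaming with a lower max_tokens, using the openai Python SDK (the endpoint, key, deployment name, and example function are placeholders, not your actual setup):

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-key>",  # placeholder
        api_version="2024-08-01-preview",
    )

    # A deliberately small strict schema; trimming the real ~3k-token schema
    # is the first thing to test.
    tools = [{
        "type": "function",
        "function": {
            "name": "lookup_order",  # hypothetical example function
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
                "additionalProperties": False,
            },
        },
    }]

    # Stream tokens as they arrive and cap generation length; this improves
    # perceived latency even if total generation time is unchanged.
    stream = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[{"role": "user", "content": "hello"}],
        tools=tools,
        max_tokens=256,
        stream=True,
    )

    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)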

    For more information: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure%2Cdotnet-entra-id&pivots=programming-language-csharp

    I hope this information helps. Thank you!

