Frequent Instability Across all LLM Models

Question

Dan Hastings 10

I have been noticing a lot of instability across a range of models from the foundry.

Model Timeouts

Completion timeouts are now problematic across all models. Some models are completely unusable and never respond, others are very slow, and others fail sporadically at random. What makes me think this is an issue with Azure is that I have seen this happen with many models for simple queries like "tell me a joke", which time out after 600 seconds.

Grok: pretty much never works. It almost always times out after 600 seconds. It's very rare that I can get a completion from Grok models, even for a very simple prompt like "tell me a joke".
Deepseek: The DeepSeek models do complete more often, but it can take up to a minute to tell a joke or answer a simple prompt. Streaming has not helped, as it still returns a few letters and then nothing for several seconds.
OpenAI: Up until recently, OpenAI had been far more stable, but it is now also timing out a lot. It's happening daily, and while we have retries in place to handle it, the time wasted waiting for responses that never come is becoming a problem. It's regularly returning errors like the following:

{
  "error": {
    "message": "azure returned an error: upstream connect error or disconnect/reset before headers. reset reason: connection timeout",
    "status": 503,
    "code": "UPSTREAM_SERVER_ERROR",
    "provider": "azure",
    "providerUrl": "https://xxx.cognitiveservices.azure.com/openai/v1/responses",
    "model": "azure/gpt-5.1-codex-mini",
    "upstreamStatus": 503,
    "upstreamResponse": "upstream connect error or disconnect/reset before headers. reset reason: connection timeout"
  }
}

Aggressive Content Filter

This has been showing up a lot in the past week and I can't work out why it's happening so much. I have seen it return this error when reviewing source code, which makes no sense at all. It has become a common error, and every time I have investigated, I can't find anything that explains it.

status 422: {"error":{"message":"Provider returned an incomplete response (reason: content_filter). The provider's content safety filter blocked the model output.","status":422,"code":"INCOMPLETE_RESPONSE","incompleteReason":"content_filter"}}

This one is quite problematic in that we don't know why it's blocking in the first place. The error is so vague that we can't identify anything different in the content, especially when it's source code.

3 answers

Answer 1

AlbertoCunha-5626 10

Thank you for the post. I've been experiencing the same problem over the last few weeks and was wondering whether it was caused by a change I made, but it seems to be a problem with Azure itself. We requested removal of the filters for our tasks (government), and they are not the problem. Inferences that used to take 10s now take 60s+ with the gpt-5-mini model and deepseek v3.2.

Amer Jusuf 0 Reputation points

2026-04-20T14:52:04.7733333+00:00

We're experiencing the same issue. Towards the end of last week, response times for GPT-5-mini jumped from ~50s to 350–700s, with occasional timeouts. It has since improved to around 150s for the same calls, but that's still slower than what we were used to. Reducing the reasoning effort helped somewhat (from ~150s to ~100s), though not enough to restore previous performance. Microsoft Support indicated no instability on their end, but noted that pay-as-you-go deployments do not come with low-latency guarantees.

Answer 2

Hi Dan Hastings,

Thank you for the detailed description—this is very helpful. Based on what you’ve shared, you’re seeing two overlapping behaviors:

Intermittent timeouts / 503 errors across multiple models, and

Unexpected content filtering responses (422)

I’ve reviewed your observations against documented Azure behavior, and your concern is valid. Below is a consolidated explanation and guidance.

Timeouts, 503 Errors, and Model Instability

The error:

“upstream connect error or disconnect/reset before headers (503)”

typically indicates transient backend failures or capacity constraints, especially when:

It occurs across multiple models
It affects even simple prompts
Responses hang for long durations (e.g., 600 seconds)

What you can do

Check Azure Service Health for any regional incidents or degradation
Monitor metrics (latency, failures, throttling) in your Azure resource
Validate quotas and request rates to ensure you’re within limits
Implement retries with exponential backoff for 5xx/429 errors
Test with different regions or GA models (preview models often have lower stability)
Reduce payload size (prompt + max_tokens) to minimize processing time

These steps align with Microsoft best practices for handling transient faults and improving reliability.

Content Filtering (422 – content_filter)

The error:

“incomplete response (reason: content_filter)”

means the response was blocked or truncated by Azure AI Content Safety, which evaluates both prompts and outputs.

This can occasionally happen even with benign inputs (e.g., source code) due to how the model interprets patterns.

What you can do

Inspect the content filter result in the API response to identify the triggered category
Review your content filtering configuration (if using Foundry / Model Catalog)
Adjust severity thresholds or request modified filtering if your use case requires it
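As a rough illustration of the first step, the sketch below parses the 422 body quoted in the question and pulls out the fields that identify a filter block. The field names (`code`, `incompleteReason`) are taken from that payload, which appears to come from a gateway in front of Azure, so a direct Azure response may be shaped differently. The helper name `describe_filter_block` is hypothetical. Note that the quoted payload does not name a triggered category; if your provider returns per-category details, those would be worth extracting as well.

```python
import json
from typing import Optional

# Error body rebuilt from the 422 quoted in the question (gateway format, an assumption).
SAMPLE_BODY = json.dumps({"error": {
    "message": "Provider returned an incomplete response (reason: content_filter). "
               "The provider's content safety filter blocked the model output.",
    "status": 422,
    "code": "INCOMPLETE_RESPONSE",
    "incompleteReason": "content_filter",
}})


def describe_filter_block(raw_body: str) -> Optional[dict]:
    """Return the filter-related fields if the body reports a content_filter block, else None."""
    try:
        error = json.loads(raw_body).get("error")
    except (json.JSONDecodeError, AttributeError):
        return None  # not JSON, or not a JSON object
    if not isinstance(error, dict) or error.get("incompleteReason") != "content_filter":
        return None
    return {key: error.get(key) for key in ("status", "code", "incompleteReason", "message")}


print(describe_filter_block(SAMPLE_BODY)["code"])  # INCOMPLETE_RESPONSE
```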

Important Clarification

Even though the above steps will help improve reliability:

The pattern you’re seeing (timeouts + 503 across models) strongly suggests backend capacity or regional service instability, particularly for preview models

This is not always fully resolvable from the client side

The behavior you’re experiencing is consistent with transient service-side instability and content safety enforcement, rather than a single configuration issue.

The recommendations above will help:

Improve resiliency (via retries and optimization)
Reduce latency and failures
Provide better visibility into filtering behavior

Microsoft Reference Links:

Azure Service Health https://learn.microsoft.com/azure/service-health/service-health-overview
Monitor Azure OpenAI (metrics, latency, failures) https://learn.microsoft.com/azure/ai-services/openai/how-to/monitor
Azure OpenAI quotas and limits https://learn.microsoft.com/azure/ai-services/openai/quotas-limits
Retry pattern best practices https://learn.microsoft.com/azure/architecture/best-practices/retry-service-specific
Azure OpenAI content filtering concepts https://learn.microsoft.com/azure/ai-services/openai/concepts/content-filter
Azure AI Content Safety overview https://learn.microsoft.com/azure/ai-services/content-safety/overview
Request for modified content filters https://aka.ms/oai/modifiedfilters

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
Please do not forget to "Accept Answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.

SAI JAGADEESH KUDIPUDI 3,205 Reputation points Microsoft External Staff Moderator

2026-05-02T18:36:11.5866667+00:00

Hi Dan Hastings,
I hope you had a chance to review the information shared earlier, and I hope this information has been helpful! If you still have questions, please let us know what is needed in the comments so the question can be answered.

Answer 3

AI answer

The described behavior aligns with two main areas covered in the context: timeouts/instability and increasingly frequent content-filter blocks.

1. Timeouts, slow responses, and 503 errors

From the context, there are several relevant points:

Client-side timeouts vs service latency
- For non-reasoning models, guidance is:
  - Streaming: set client-side timeout up to 60 seconds.
  - Non-streaming: up to 29 minutes.
- For reasoning models, requests (including streaming) can take up to 29 minutes because all reasoning tokens are generated before the first token is returned.
- Long-running calls can be canceled by client-side timeouts if those timeouts are set too low.

Actionable steps:
- Explicitly set client-side timeouts according to the guidance for each workload and model type, and tune them to traffic patterns.
- For long or complex prompts, increase the timeout above 60 seconds if not using streaming, or ensure the HTTP client/proxy in front of the app is not enforcing a shorter timeout than the SDK.
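The timeout guidance above can be captured in a small helper. This is only a sketch: the function name `client_timeout_seconds` is hypothetical, and the numbers are simply the upper bounds quoted in this answer (60 seconds for streaming non-reasoning calls, 29 minutes otherwise).

```python
def client_timeout_seconds(reasoning: bool, streaming: bool) -> int:
    """Upper bound for the client-side timeout, following the guidance quoted above."""
    if reasoning:
        # Reasoning models may take up to 29 minutes, streaming or not.
        return 29 * 60
    # Non-reasoning models: 60 seconds when streaming, 29 minutes otherwise.
    return 60 if streaming else 29 * 60


print(client_timeout_seconds(reasoning=False, streaming=True))   # 60
print(client_timeout_seconds(reasoning=True, streaming=True))    # 1740
```

The result would typically be passed as the timeout setting of whatever HTTP client or SDK you use, then tuned downward to match your traffic.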
Use streaming to avoid client/proxy timeouts
- Streaming is recommended when calls take a long time to process, because many clients and intermediary layers have their own timeouts.
- Streaming ensures incremental data is received, improving perceived latency and reducing the chance that intermediaries cancel the request before any data is seen.

In the DeepSeek/OpenAI scenarios where only a few characters arrive and then nothing for a long time, this suggests either:
- The model is still generating but is slow (large context, complex reasoning, or cache misses), or
- Some intermediary (gateway, proxy, or client) is interrupting the stream.

Actionable steps:
- Confirm that the HTTP client, reverse proxies, and any API gateways allow long-lived streaming connections and do not buffer or cut them off early.
- If streaming is enabled but still timing out, increase the upstream timeout in the gateway/load balancer to be at least as large as the model guidance (or the app’s own timeout, whichever is smaller).
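One way to apply the last step is to compare every hop's timeout against the smaller of the model guidance and the app's own timeout. The sketch below assumes you can read those values from your own configuration; the hop names and numbers are made up for illustration, and `find_short_timeouts` is a hypothetical helper.

```python
from typing import Dict, List


def find_short_timeouts(model_guidance_s: float, app_timeout_s: float,
                        hop_timeouts_s: Dict[str, float]) -> List[str]:
    """Return the hops whose timeout is below min(model guidance, app timeout)."""
    required = min(model_guidance_s, app_timeout_s)
    return [name for name, timeout in hop_timeouts_s.items() if timeout < required]


# Hypothetical values for illustration only.
hops = {"api_gateway": 230, "load_balancer": 60, "reverse_proxy": 1800}
print(find_short_timeouts(29 * 60, 600, hops))  # ['api_gateway', 'load_balancer']
```

Any hop the function reports would cut the request off before the app's own timeout fires, which is worth ruling out before blaming the model.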
Service-side issues and region-specific instability
- The context shows that sometimes issues are region-specific and not clearly surfaced in error messages. A deployment that failed repeatedly in one region succeeded immediately when moved to another region, even though the error did not mention a regional problem.
- Another example shows Document Intelligence requests timing out both from code and from the Studio, indicating a service-side or regional issue rather than client code.

Actionable steps:
- Check Azure Status and the Service Health blade in the Azure portal for the regions where the Foundry models are deployed to see if there are incidents or degraded performance.
- If possible, test the same prompts against a deployment in a different region to see if the behavior is regional.
- If the same simple prompts (like “tell me a joke”) consistently time out across multiple models and regions, capture correlation IDs and timestamps and open a support request so the service team can investigate upstream 503/timeout patterns.

HTTP 429 and 5xx handling
- For rate limits (429), the guidance is to implement retry with exponential backoff and respect Retry-After.
- A 5xx error like the 503 “upstream connect error or disconnect/reset before headers. reset reason: connection timeout” indicates the upstream service did not respond in time.

Actionable steps:
- Ensure robust retry logic for transient 5xx errors with backoff and jitter.
- Log and surface the upstreamStatus, upstreamResponse, and any correlation IDs to support for deeper analysis.
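A minimal sketch of the retry guidance above, assuming a caller-supplied `send` function (hypothetical) that performs one request and returns `(status, headers, body)`. It retries 429 and common transient 5xx codes with exponential backoff and full jitter, and honors a numeric Retry-After value when one is present. The retryable status set, attempt count, and delay constants are illustrative assumptions, not Azure guidance.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

RETRYABLE = {429, 500, 502, 503, 504}  # assumed set of transient statuses


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # respect the server's hint
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    # Full jitter: a random delay between 0 and the capped exponential bound.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(send: Callable[[], Tuple[int, Mapping[str, str], str]],
                      max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Header lookup is case-sensitive for a plain dict.
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise AssertionError("unreachable")
```

Retries multiply the worst-case wait, so with hangs as long as the 600 seconds described in the question it may be worth pairing this with a shorter per-attempt timeout.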
Performance tuning
- Separate workloads: mixing very different workloads on the same endpoint can hurt latency due to batching and cache contention.
- Prompt size and generation size both affect latency; large prompts and large max_tokens will increase response time.
- Batching can reduce the number of requests and sometimes improve overall throughput.

Actionable steps:
- Use separate deployments/endpoints for very different workloads (e.g., short chat vs long code analysis) to avoid cross-impact on latency.
- Review prompt sizes and generation lengths for the slowest workloads and reduce where possible.

2. Aggressive content filter and 422 `content_filter` errors

The context describes Azure OpenAI’s content filtering system:

Prompts and completions are run through an ensemble of classification models to detect potentially harmful content.
If a prompt is flagged, the API returns an error with error.code = "contentFilter" and a message like “Your task failed as a result of our safety system.”
It is also possible for the generated output itself to be filtered, in which case the error message is “Generated image was filtered as a result of our safety system” (for images), or a similar contentFilter error for text.
The system increases safety but also adds latency.

Applied to the 422 error:

The message "Provider returned an incomplete response (reason: content_filter). The provider's content safety filter blocked the model output." indicates that the model started generating a response, but the content safety system blocked part or all of that output.
This can happen even when the input seems benign (e.g., source code) if the model’s output is classified as falling into a restricted category (for example, if the code or comments resemble disallowed content, or if the model attempts to generate something that matches a blocked pattern).

Actionable steps:

Understand that both input and output are filtered
- Even if the input is just source code, the model’s attempted completion may contain patterns that trigger filters.
Inspect prompts and partial outputs where possible
- Log the exact prompt and any partial completion (if available) when a content_filter error occurs to look for patterns (specific libraries, function names, comments, or strings that might resemble disallowed content).
Request content filter policy adjustments for low-risk workloads
- The context notes that content filtering can be modified for certain lower‑risk use cases to improve performance and reduce unnecessary blocking.
- For scenarios like internal source-code review where the risk profile is lower, it is possible to request modifications to the default content filtering policies.
Account for filter latency and failures in the client
- The content filtering system adds latency; for workloads that are very sensitive to latency and are low risk, adjusted policies can help.
- In the client, treat content_filter as a distinct failure mode and decide whether to:
  - Retry with a more constrained prompt (e.g., ask for a summary instead of full code rewrite), or
  - Fall back to a different workflow (e.g., manual review) when blocked.
Use RAG or system prompts to constrain outputs
- For code-review scenarios, constrain the model via system prompts to only produce certain types of output (e.g., “Only output a bullet list of issues, no code snippets”), which can reduce the chance of generating content that triggers filters.
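The client-side handling described in these steps might look roughly like the sketch below. It treats the 422 `content_filter` response as its own failure mode, retries once with the constrained system prompt suggested above, and otherwise routes to manual review. The `ask` callable is hypothetical (it stands in for whatever sends the request and returns `(status, body)`), and the detection relies on the `incompleteReason` field from the error quoted in the question.

```python
import json
from typing import Callable, Optional, Tuple

# Constrained prompt taken from the example in the steps above.
STRICT_SYSTEM_PROMPT = "Only output a bullet list of issues, no code snippets."


def is_content_filter_block(status: int, body: str) -> bool:
    """True if the response is a 422 whose error reports a content_filter block."""
    if status != 422:
        return False
    try:
        error = json.loads(body).get("error")
    except (json.JSONDecodeError, AttributeError):
        return False
    return isinstance(error, dict) and error.get("incompleteReason") == "content_filter"


def review_with_fallback(
    ask: Callable[[Optional[str]], Tuple[int, str]],
) -> Tuple[str, int, str]:
    """Return (route, status, body); other errors pass through unchanged."""
    status, body = ask(None)  # first try with the default system prompt
    if not is_content_filter_block(status, body):
        return "model", status, body
    status, body = ask(STRICT_SYSTEM_PROMPT)  # one retry with constrained output
    if not is_content_filter_block(status, body):
        return "model_constrained", status, body
    return "manual_review", status, body  # still blocked: hand off to a person
```

Keeping the filter check separate from the generic retry logic avoids burning retries on a request that will be blocked again for the same reason.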

3. Next steps to stabilize the environment

Given the symptoms across multiple providers/models:

Explicitly set and tune client-side timeouts per model type (reasoning vs non‑reasoning, streaming vs non‑streaming) following the quotas/limits guidance.
Verify that all intermediaries (API gateways, load balancers, proxies) allow long‑running and streaming requests and are not enforcing shorter timeouts.
Separate high-latency/complex workloads from simple chat workloads into different deployments/endpoints.
Monitor Azure Service Health and, if possible, test in an alternate region to rule out regional service issues.
Implement structured logging for all errors, including 503s and content_filter 422s, capturing timestamps, regions, model names, and correlation IDs, and use these when opening a support case.
For the aggressive content filter behavior, review the affected prompts/outputs and consider requesting content filter policy adjustments for low-risk, internal scenarios like source-code analysis.
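For the structured-logging step, a sketch along these lines could turn each failed response into one flat record. The error field names come from the 503 and 422 payloads quoted in the question; the region and correlation ID are passed in by the caller because those payloads do not contain them. `build_error_record` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_error_record(raw_body: str, region: str,
                       correlation_id: Optional[str] = None) -> dict:
    """Flatten an error response into a JSON-serializable log record."""
    error = {}
    try:
        parsed = json.loads(raw_body)
        if isinstance(parsed, dict) and isinstance(parsed.get("error"), dict):
            error = parsed["error"]
    except json.JSONDecodeError:
        pass  # keep the record even when the body is not JSON
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "region": region,
        "correlationId": correlation_id,
        "model": error.get("model"),
        "status": error.get("status"),
        "code": error.get("code"),
        "upstreamStatus": error.get("upstreamStatus"),
        "upstreamResponse": error.get("upstreamResponse"),
        "incompleteReason": error.get("incompleteReason"),
    }
```

Records in this shape can be written with `json.dumps` to any log sink and attached to a support case as-is.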

AI-generated content may be incorrect. Read our transparency notes for more information.
