Share via

Azure OpenAI Global Batch API jobs stuck in validating indefinitely — multi-region outage

Delora Bradish 20 Reputation points Microsoft Employee
2026-05-21T20:33:44.5+00:00

Related post: This matches a known recurring issue: https://learn.microsoft.com/en-us/answers/questions/5820282/batch-jobs-are-stuck-on-validating-and-then-all-fa — previously resolved via backend fix in March 2026. Requesting the same backend remediation.

Problem:

Since approximately 22:50 on May 20, 2026, all Azure OpenAI Global Batch API jobs are permanently stuck in validating and never progress to in_progress or completed. This blocks our data processing pipeline and we cannot complete any batch jobs.

Affected resources:

Evidence this is a service-side failure:

A single-request batch containing only "Say hello." with max_tokens: 10 was submitted to both resources. Both remain stuck in validating after 4+ minutes of polling. This eliminates batch content, token quota, code, and configuration as possible causes.

Key batch IDs:

  • batch_0aea3966 — 1 request ("Say hello."), East US 2, stuck immediately
  • batch_43ea94a8 — 2 requests, East US 2, stuck since May 20 22:50
  • batch_8c9946fc — 1 request ("Say hello."), North Central US, stuck immediately
  • Last successful batch: batch_9495cf59 — 250 requests, East US 2, completed at 17:11 May 20

Workarounds attempted (all failed):

  • Cancelled all stuck batches to clear quota — no effect
  • Tested API versions 2024-12-01-preview and 2025-04-01-preview — no difference
  • Switched from East US 2 to North Central US region — same result

Impact: Our team is blocked from running our data processing pipeline, which uses the Azure OpenAI Global Batch API (gpt-4.1-global-batch) to analyze data. We are unable to complete any batch jobs. Production runs for multiple months of data are blocked.

Azure OpenAI in Foundry Models
0 comments No comments

Answer accepted by question author

AI answer

2026-05-21T20:33:56.9866667+00:00

This behavior is consistent with a service-side issue in Azure OpenAI Global Batch rather than with batch content, API version, or configuration.

From the available guidance, there are no client-side troubleshooting steps that explain or resolve a situation where:

  • Minimal, valid batches (for example, a single "Say hello." request) remain stuck in validating, and
  • The issue reproduces across multiple regions and API versions after previously successful runs the same day.

The documented troubleshooting items for Global Batch focus on:

  • Input validation errors (for example, invalid_json_line, model_not_found, invalid_request, too_many_tasks, url_mismatch, model_mismatch, empty_file, duplicate_custom_id, input_modified, input_no_permissions). These surface as explicit error codes, not as jobs stuck indefinitely in validating.
  • Known client-side issues such as:
    • Resources deployed via Azure CLI having non-standard endpoint subdomains that prevent Global Batch from working out of the box.
    • UTF-8-BOM–encoded JSONL files being rejected.
    • Modifying the input blob after submission causing the job to fail.

None of these match a scenario where jobs never leave validating and do not transition to a failed state with an error code.

Given the symptoms and the prior similar incident that required a backend fix, the next actionable steps are:

  1. Treat this as a potential regional or platform incident for Azure OpenAI Global Batch.
  2. Check Azure Status and Service Health in the Azure portal for incidents affecting Azure OpenAI in the impacted regions (East US 2 and North Central US).
  3. If no incident is shown, open a support case with Azure Support, providing:
    • Resource regions and deployment names (for example, gpt-4.1-global-batch in East US 2 and North Central US).
    • The affected batch IDs (batch_0aea3966, batch_43ea94a8, batch_8c9946fc) and the last successful batch ID (batch_9495cf59) with timestamps.
    • Confirmation that minimal test batches with valid JSONL input are also stuck in validating.

The support team can then correlate these batch IDs and regions with backend telemetry and apply or request the necessary service-side remediation.


References:

AI-generated content may be incorrect. Read our transparency notes for more information.

Was this answer helpful?

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.