Azure OpenAI Batch Jobs are getting stuck in Validating stage

Maksim Iudin 45 Reputation points
2025-09-12T13:34:32.72+00:00

Hello Azure Support Team,

We’re experiencing a severe delay with our Azure OpenAI batch processing.
Over the past several days, more than a thousand of our batch requests have been stuck in the “Validating” state for over 48 hours before progressing. This behavior is blocking our workflows.

Details:

Region: Sweden Central

Status: Batch jobs remain in “Validating” for 48+ hours before moving to “In Progress.”

Quota: We’re operating well below our quota limits.

Impact: This has caused a large backlog of requests, severely delaying our data processing.

Examples of affected Batch IDs:

batch_67a323b0-edeb-47e7-ba29-779c8d717a17

batch_d7698114-4659-4a1e-8a35-7386943650fb

Observed Pattern:
Jobs created on September 10 took almost two days to leave “Validating” and only then began processing (example: batch_7d1b5132-ce37-4739-8ea7-b2d27b7583f9).

References:
We’ve seen multiple reports of the same issue on Microsoft forums:

https://learn.microsoft.com/en-us/answers/questions/2247507/azure-openai-batch-jobs-are-getting-stuck-in-valid

https://learn.microsoft.com/en-us/answers/questions/2237708/azure-openai-batch-jobs-are-getting-stuck-in-valid

https://learn.microsoft.com/en-us/answers/questions/5551588/azure-openai-batch-jobs-are-getting-stuck-in-valid

Request:
Could you confirm whether this is a known issue in Sweden Central? The long validation times make the service unusable for our workflows.

Thanks in advance!

Best regards,
Maksim Iudin

Azure OpenAI Service

Answer accepted by question author
  1. Nikhil Jha (Accenture International Limited) 3,985 Reputation points Microsoft External Staff Moderator
    2025-09-16T09:27:22.2566667+00:00

    Hello Maksim Iudin,

    The long validation times you are observing in Sweden Central are part of the same known service issue impacting Azure OpenAI batch jobs across multiple regions. While the incident was most visible in West Europe, Australia East, South Central US, and East US, similar effects have been seen in other regions, including Sweden Central, because of the way concurrency limits and dispatcher logic interact globally.

    What happened

    A dispatcher logic bug caused jobs cancelled after validation to remain stuck in Cancelling.

    A missing output collation resource (initially in West Europe) contributed to jobs getting stuck in Finalizing.

    Together, these issues created regional backlogs, which then affected scheduling and validation performance across linked regions such as Sweden Central.

    Current status

    • A hotfix has been deployed to correct the invalid state transitions.
    • GenevaActions and manual interventions were run to move most jobs into terminal states.
    • Concurrency and resource provisioning improvements are in progress to reduce the backlog and restore normal throughput.

    Please accept and upvote this answer so that other community members can find the remediation.
    Thank you for helping the Q&A community. 😊

    1 person found this answer helpful.

Answer accepted by question author
  1. Sina Salam 26,661 Reputation points Volunteer Moderator
    2025-09-16T09:11:22.3866667+00:00

    Hello Maksim Iudin,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    I understand that your batch jobs are getting stuck in the “Validating” stage.

    Yes, this has been a known issue since September 9. Some users reported that their jobs eventually processed after long delays, while others still face the issue, especially with the GPT-4.1, GPT-4.1-mini, and GPT-4o models. While Microsoft teams are working on this, you can mitigate batch jobs stuck in the “Validating” state by reducing the number of batches you submit, switching to more stable models such as o3-mini in regions such as East US 2, and verifying that your storage containers have the correct Storage Blob Data Contributor role. Additionally, implementing retry logic with the Azure OpenAI Python SDK and monitoring Azure Service Health can help mitigate delays and improve reliability.
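
The retry and polling idea mentioned above could look roughly like the following. This is only a minimal sketch under stated assumptions, not an official pattern: `wait_for_batch`, `get_status`, and `TERMINAL_STATES` are hypothetical names chosen for illustration, and the delay and timeout values are arbitrary. The status lookup is passed in as a callable so the sketch runs without credentials; in real use it would presumably wrap a call such as `client.batches.retrieve(batch_id).status` from the `openai` package.

```python
import time
from typing import Callable

# Batch statuses after which no further progress is expected.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    get_status: Callable[[], str],
    initial_delay: float = 60.0,
    max_delay: float = 900.0,
    timeout: float = 72 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a batch with exponential backoff until it reaches a terminal state.

    get_status is a hypothetical hook. With the openai package it might be:
        lambda: client.batches.retrieve(batch_id).status
    where client is an AzureOpenAI instance (an assumption, not run here).
    """
    waited = 0.0  # total time spent sleeping, not wall-clock time
    delay = initial_delay
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"batch still in '{status}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Double the wait each round, capped at max_delay, to avoid
        # hammering the service while a job sits in "validating".
        delay = min(delay * 2, max_delay)
```

Doubling the delay keeps the number of status calls low while a job sits in “Validating” for hours, and the `TimeoutError` gives your pipeline a clear signal to alert on or to resubmit elsewhere.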

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.

    1 person found this answer helpful.