Azure OpenAI Batch Jobs are getting stuck in Validating stage

Vikram Shah 20 Reputation points
2025-09-11T04:45:08.8066667+00:00

The batch jobs appear to have been stuck since September 9th, 9:45 PM. Could someone help me with this?

Please note:

  • I currently have over 100 batch requests in validating status, so cancelling and restarting them is not a feasible option.
  • Both the batch size and queue tokens are well within the limits.

Screenshot 2025-09-11 at 9.43.47 AM

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

2 answers

  1. Alex Burlachenko 18,565 Reputation points Volunteer Moderator
    2025-09-11T07:25:14.4633333+00:00

    Oh wow, Vikram, that is a seriously frustrating situation. Over 100 batches stuck? That's a real pain.

    This hang in the validating stage is a known hiccup that sometimes happens with the Azure OpenAI batch API. It's usually not anything you did wrong in your request. The service itself gets a bit overwhelmed or hits an internal snag while trying to queue your job.

    Since cancelling them all one by one isn't an option, you need a broader solution. The first thing to try is reaching out to Azure support directly. They have tools on their end to look at your specific jobs and potentially force them out of this stuck state. Be sure to give them a few of the batch IDs from your list, like batch_a07da4b7-c10f-4c11-80fa-40f05f542641.
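
    To pull together the batch IDs for a support ticket, a small filter like the one below may help. It is only a sketch: it assumes each batch is available as a plain dict with the `id`, `status` and `created_at` (Unix seconds) fields of the batch object, and the `stuck_batches` name and the six-hour threshold are illustrative choices, not anything the service defines.

```python
import time


def stuck_batches(batches, min_age_seconds=6 * 3600, now=None):
    """Return (batch_id, age_in_hours) for batches still in 'validating'
    after at least min_age_seconds, oldest first.

    `batches` is an iterable of dicts with 'id', 'status' and 'created_at'
    (Unix timestamp in seconds). The threshold is an arbitrary default.
    """
    now = time.time() if now is None else now
    stuck = []
    for batch in batches:
        age = now - batch["created_at"]
        if batch["status"] == "validating" and age >= min_age_seconds:
            stuck.append((batch["id"], round(age / 3600, 1)))
    # Sort by age, largest first, so the longest-stuck jobs lead the list.
    return sorted(stuck, key=lambda item: item[1], reverse=True)
```

    With the OpenAI Python SDK, the records could presumably be collected by iterating over `client.batches.list()` and converting each item to a dict; treat that as an assumption and check it against the SDK version you have installed.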

    While you wait for them, also check the health status of the Azure OpenAI service in your region. Regional issues can sometimes cause this kind of behavior. You can check that on the Azure status page: https://status.azure.com/status

    It's also worth looking into your storage account. Make sure the container holding your input files is still accessible and that the managed identity used for OpenAI batch processing still has the 'Storage Blob Data Contributor' role on it. A permission glitch could sometimes be the culprit. The same check is worth doing for any other tool that reads from that storage account.

    Unfortunately, if support can't push them through, you might be looking at a waiting game until the system automatically clears them out after some time. I know that's not the answer you want to hear, but it's the reality of the platform sometimes.

    Really hope support can get this sorted for you quickly. Those stuck jobs are the worst.

    Best regards,

    Alex



  2. Nikhil Jha (Accenture International Limited) 4,150 Reputation points Microsoft External Staff Moderator
    2025-09-12T08:50:18.8866667+00:00

    Hello Vikram Shah,

    I hope community member Alex Burlachenko’s inputs helped clarify some aspects.

    As mentioned, this might be due to unsupported operations or intermittent server-side issues.
    Please see the known issues section of the batch documentation for reference.

    For future reference, here’s a step-by-step troubleshooting process you can follow:

    1. Check error details – Review the errors property for each file in the batch output and compare it against the error codes section of the documentation.
    2. Test with smaller inputs – Start with smaller files or batches to isolate whether the issue is linked to file content, formatting, or size. Once validated, scale up gradually within the documented batch size and quota limits.
    3. Retry with backoff logic – Implement exponential backoff retry logic to handle transient failures.
    4. API version / deployment selection – Try switching to a newer API version or opt for Global Standard deployments, which automatically route requests to the most stable data centers and provide higher throughput (TPM) and rate limits.
    5. Python SDK debugging – If possible, use the Azure OpenAI Python SDK for better error visibility and iterative testing.
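
    As an illustration of step 3, here is a minimal sketch of exponential backoff with jitter in plain Python. The `retry_with_backoff` helper, its parameter names and its defaults are hypothetical and not part of any Azure or OpenAI SDK; `operation` stands for whatever call you want to retry, such as a batch submission.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, retriable=(Exception,),
                       sleep=time.sleep):
    """Call operation() and retry on a retriable exception.

    Before retry number n (0-based) the wait is a random value between 0 and
    min(max_delay, base_delay * 2**n), i.e. exponential backoff with jitter.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retriable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

    In practice you would narrow `retriable` to the transient errors you actually expect (timeouts, rate limiting) so that genuine request errors fail fast instead of being retried.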

    Additional context from similar threads:

    • Regional capacity or traffic load can cause validation bottlenecks.
    • Model version matters — older models may show inconsistent behavior.
    • Quota/token or enqueue limits can delay jobs, even if usage seems under your quota.
    • Input formatting issues (special characters, malformed JSON, file encoding) can trigger silent failures.
    • In some cases, simply waiting or re-submitting jobs resolved the problem.
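
    To rule out input formatting issues before submitting, a local pre-check of the JSONL file can be sketched as follows. This assumes the usual batch input layout where every line is a JSON object with `custom_id`, `method`, `url` and `body`. The `check_batch_jsonl` helper is hypothetical, and its extra checks (byte order mark, blank lines, duplicate IDs) are my assumptions about what is worth flagging, not the service's documented validation rules.

```python
import json

REQUIRED_KEYS = ("custom_id", "method", "url", "body")


def check_batch_jsonl(raw: bytes):
    """Return a list of (line_number, problem) for a batch input file.

    Line number 0 means the file as a whole could not be decoded.
    """
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append((1, "file starts with a UTF-8 byte order mark"))
        raw = raw[3:]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return problems + [(0, f"not valid UTF-8: {exc.reason}")]

    seen_ids = set()
    # Split on "\n" only, so unusual separators inside JSON strings are kept.
    for number, line in enumerate(text.rstrip("\n").split("\n"), start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"malformed JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((number, "missing keys: " + ", ".join(missing)))
        custom_id = record.get("custom_id")
        if isinstance(custom_id, str):
            if custom_id in seen_ids:
                problems.append((number, f"duplicate custom_id: {custom_id}"))
            seen_ids.add(custom_id)
    return problems
```

    Reading the file with `open(path, "rb").read()` and passing the bytes in keeps encoding problems visible instead of having them silently replaced during decoding.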

    Reference: Batch operation OpenAI
    Disclaimer: The linked document on retry logic is not maintained by Microsoft. It is being shared solely for your convenience.

    Hope this helps you narrow down the cause and stabilize your batch processing.


    If this answer helped, please accept and upvote it so that other community members can find it.
    Thank you for helping Q&A community. 😊

