Hello @Diksha Golait ,
Welcome to Microsoft Q&A. Thank you for reaching out to us.
The behavioral pattern in which Global Batch jobs remain in the “validating” state for extended durations across multiple regions, with reduced but non-zero throughput, is consistent with a service-side validation queue delay rather than an issue with input format, configuration, or SDK usage.
This pattern typically indicates backend capacity constraints or a processing backlog in which only limited validation capacity remains available. As a result, a small number of jobs continue to complete while the majority stay queued.
The following actions can help maintain partial throughput and assess recovery:
- Submit smaller batch jobs
  - Split large JSONL files into smaller datasets.
  - Smaller workloads are more likely to move through validation under constrained conditions.
- Test with a small new batch
  - Submit a lightweight batch job.
  - Observe whether it transitions to in_progress to validate system behavior.
- Stagger job submissions
  - Avoid submitting multiple batches simultaneously.
  - Introduce intervals between submissions to reduce queue contention.
- If available, test an alternate region or deployment
  - Submit a small workload in another supported region.
  - This helps identify whether the impact is localized or broader.
- Cancel selectively, and only long-stuck jobs
  - Jobs stuck for extended periods (for example, beyond 48–72 hours) are unlikely to progress.
  - If required, cancel a limited subset and re-submit as smaller batches.
  - Avoid bulk cancellation, as it may increase queue pressure.
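As a rough illustration of the splitting and staggering suggestions above, here is a minimal Python sketch. The function names `split_jsonl` and `submit_staggered`, the part-file naming scheme, and the `submit` callable are all hypothetical; `submit` stands in for whatever code you already use to upload a file and create a batch, and the delay value is something you would tune yourself.

```python
import time
from pathlib import Path
from typing import Any, Callable, List


def split_jsonl(source: Path, out_dir: Path, max_lines: int) -> List[Path]:
    """Split a JSONL file into parts holding at most max_lines requests each."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    buffer: List[str] = []

    def flush() -> None:
        # Hypothetical naming scheme: <original stem>.partNNNN.jsonl
        part = out_dir / f"{source.stem}.part{len(parts) + 1:04d}.jsonl"
        part.write_text("".join(buffer), encoding="utf-8")
        parts.append(part)
        buffer.clear()

    with source.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            buffer.append(line if line.endswith("\n") else line + "\n")
            if len(buffer) == max_lines:
                flush()
    if buffer:
        flush()  # write the final, possibly shorter, part
    return parts


def submit_staggered(
    parts: List[Path],
    submit: Callable[[Path], Any],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Call submit(part) for each part, pausing between submissions."""
    results = []
    for index, part in enumerate(parts):
        if index > 0:
            sleep(delay_seconds)  # interval between submissions
        results.append(submit(part))
    return results
```

Passing your own upload-and-create function as `submit` keeps the sketch independent of any particular SDK version, and the injectable `sleep` makes the pacing easy to test without waiting.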
Jobs that remain in the “validating” state for multiple days typically do not progress further through client-side actions. Resolution in such cases generally requires backend intervention.
For monitoring and visibility, the Azure service status page may be reviewed periodically; however, partial degradations may not always be reflected.
Diagnostic metrics such as validation duration or queue trends can provide visibility but do not unblock existing jobs.
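To make the validation-duration idea concrete, the sketch below computes how long each job has been in the validating state and picks a limited subset of the oldest ones. It assumes each job is a plain dictionary with `id`, `status`, and a Unix-timestamp `created_at`, which is how batch records are commonly shaped, but please verify the field names against your own data. The helper names, the 48-hour default (taken from the example range above), and the `limit` of 5 are illustrative assumptions only.

```python
import time
from typing import Dict, List, Optional


def validating_age_hours(job: Dict, now: Optional[float] = None) -> Optional[float]:
    """Hours since creation for a job still validating, else None."""
    if job.get("status") != "validating":
        return None
    now = time.time() if now is None else now
    return (now - job["created_at"]) / 3600.0


def stuck_jobs(
    jobs: List[Dict],
    threshold_hours: float = 48.0,  # assumed threshold
    limit: int = 5,                 # assumed cap, to avoid bulk cancellation
    now: Optional[float] = None,
) -> List[str]:
    """Return ids of the oldest validating jobs at or beyond the threshold."""
    now = time.time() if now is None else now
    aged = []
    for job in jobs:
        age = validating_age_hours(job, now)
        if age is not None and age >= threshold_hours:
            aged.append((age, job["id"]))
    aged.sort(reverse=True)  # oldest first
    return [job_id for _, job_id in aged[:limit]]
```

The returned ids are only candidates for review; the sketch does not cancel anything, and tracking these ages over time is one simple way to see whether the queue is recovering.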
The following references might be helpful; please check them out.
Thank you