Hello George R,
Welcome to Microsoft Q&A, and thank you for sharing the additional details and the screenshot.
Based on the behavior you’re seeing, where SDS_Recogniser_V2 fails repeatedly with “InternalServerError: An unexpected error occurred” while other models such as TDS_Recogniser train without issues, this points to a problem specific to the SDS_Recogniser_V2 model version or its training dataset rather than to training limits or document volume.
Below is a consolidated breakdown of what’s going on and how you can proceed.
What we can confirm from your screenshot
- TDS_Recogniser trained successfully on Nov 17.
- SDS_Recogniser_V3 is currently running.
- Multiple SDS_Recogniser_V2 training attempts failed almost instantly (within seconds or minutes).
This indicates the failure is not caused by:
- The 30-hour training timeout
- The number of documents/pages
- The API version
- Region or performance issues
The failure occurs before the training pipeline even begins, which strongly suggests an issue with the dataset or the backend state of that specific model version.
Most likely causes of repeated early failures
Based on similar cases, the most common reasons are:
1. A corrupted or unsupported document in the SDS V2 dataset
Even a single file that is:
- partially corrupted
- password-protected
- a malformed PDF
- extremely large
- an image that OCR cannot read
can cause an instant generic error with no detailed message.
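If you keep a local copy of the SDS V2 training documents, a quick scan can surface most of these problem files before the next run. The sketch below is only a rough starting point: it assumes the documents are PDFs sitting in a local folder and that the pypdf package is installed, and the folder path and size threshold are placeholders you would adjust.

```python
# Sketch: scan a local copy of the SDS V2 training folder for files that
# commonly trigger an instant, generic training failure.
# Assumes PDFs downloaded locally and pypdf installed (pip install pypdf).
from pathlib import Path

from pypdf import PdfReader

DATASET_DIR = Path("./sds_v2_dataset")   # hypothetical local copy of the container
MAX_SIZE_MB = 200                        # flag unusually large files for manual review

for pdf_path in sorted(DATASET_DIR.glob("**/*.pdf")):
    size_mb = pdf_path.stat().st_size / (1024 * 1024)
    if pdf_path.stat().st_size == 0:
        print(f"ZERO-BYTE FILE    : {pdf_path}")
        continue
    if size_mb > MAX_SIZE_MB:
        print(f"VERY LARGE FILE ({size_mb:.0f} MB): {pdf_path}")
    try:
        reader = PdfReader(str(pdf_path))
        if reader.is_encrypted:
            print(f"PASSWORD-PROTECTED: {pdf_path}")
        elif len(reader.pages) == 0:
            print(f"NO READABLE PAGES : {pdf_path}")
    except Exception as exc:  # malformed / partially corrupted PDFs land here
        print(f"CORRUPT/MALFORMED : {pdf_path} ({exc})")
```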
2. Label schema mismatches
If some fields were renamed or removed between versions, or certain documents are missing labels that are not marked as optional, the training pipeline will fail during validation.
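A quick way to rule this out is to compare the field keys declared in fields.json against the labels actually used in each *.labels.json file. The sketch below assumes a local copy of the labeling files and the common Document Intelligence labeling layout (a top-level "fields" list in fields.json and a top-level "labels" list per document); exact key names can vary with the labeling-tool version, so treat this as a starting point rather than a definitive checker.

```python
# Sketch: check that every *.labels.json only uses field names declared in fields.json.
# Assumes the common layout where fields.json has a "fields" list (entries with a
# "fieldKey") and each *.labels.json has a "labels" list (entries with a "label").
import json
from pathlib import Path

DATASET_DIR = Path("./sds_v2_dataset")   # hypothetical local copy of the container

fields = json.loads((DATASET_DIR / "fields.json").read_text(encoding="utf-8"))
declared = {f["fieldKey"] for f in fields.get("fields", [])}

for label_file in sorted(DATASET_DIR.glob("**/*.labels.json")):
    labels = json.loads(label_file.read_text(encoding="utf-8"))
    # Table/nested labels can appear as paths like "Items/0/Price",
    # so compare only the first path segment against the declared keys.
    used = {entry["label"].split("/")[0] for entry in labels.get("labels", [])}

    unknown = used - declared          # labels that no longer exist in the schema
    missing = declared - used          # declared fields this document never labels
    if unknown:
        print(f"{label_file.name}: uses undeclared fields {sorted(unknown)}")
    if missing:
        print(f"{label_file.name}: never labels {sorted(missing)} (fine only if optional)")
```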
3. Model version metadata corruption
Sometimes a specific version (V2) becomes “stuck” on the backend. This kind of corruption explains why:
- V2 fails repeatedly
- V3, a new version, is able to run successfully
4. A backend service issue
InternalServerError is a generic fallback when the training service can’t generate a detailed error message. This can happen if there is a pipeline failure in the ingestion/indexing stage.
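One thing worth checking before anything else: the portal surfaces only the generic message, but the underlying training operation sometimes carries a more specific error code. Here is a hedged sketch using the azure-ai-formrecognizer Python SDK (3.2+); the endpoint and key are placeholders, and the attribute names should be double-checked against the SDK version you have installed.

```python
# Sketch: pull the underlying error details for recent failed training operations,
# which are often more specific than the portal's generic "InternalServerError".
from azure.ai.formrecognizer import DocumentModelAdministrationClient
from azure.core.credentials import AzureKeyCredential

client = DocumentModelAdministrationClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

for op in client.list_operations():
    if op.kind == "documentModelBuild" and op.status == "failed":
        details = client.get_operation(op.operation_id)
        err = details.error
        if err is not None:
            print(op.operation_id, err.code, err.message)
```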
Troubleshooting Steps:
1. Continue with SDS_Recogniser_V3
V3 is running, which suggests V2’s metadata or internal state is corrupted. If V3 succeeds, that confirms the issue is isolated to SDS_Recogniser_V2.
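If you ever need to rebuild from code rather than from the portal (for example, to move the V2 dataset onto a completely fresh model ID), you can start training with the same SDK. Again, this is only a sketch: the build mode, endpoint, key, container SAS URL, and new model ID are all placeholders to replace with your own values.

```python
# Sketch: build a fresh custom model under a new model ID instead of reusing
# the possibly-corrupted SDS_Recogniser_V2 entry.
from azure.ai.formrecognizer import DocumentModelAdministrationClient, ModelBuildMode
from azure.core.credentials import AzureKeyCredential

client = DocumentModelAdministrationClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

poller = client.begin_build_document_model(
    ModelBuildMode.TEMPLATE,                   # or ModelBuildMode.NEURAL, whichever you used
    blob_container_url="<container-sas-url>",  # same training container as before
    model_id="SDS_Recogniser_V3",              # a new, unused model ID
    description="Rebuild of SDS recogniser after repeated V2 failures",
)
model = poller.result()
print(model.model_id, model.created_on)
```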
2. Validate the training dataset
Please check that:
- No PDF/image is password-protected
- No file is 0 bytes
- No file has unusually large or damaged pages
- All labels exist across every sample (or are marked optional)
- All samples use the same schema version
Also try isolating documents that were added recently or that are known to have quality issues; the snippet below shows one way to spot the most recently uploaded files.
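To find the newest uploads, you can list the training container's blobs with their last-modified timestamps and review (or temporarily move out) the most recent ones before retraining. A small sketch using the azure-storage-blob package; the container SAS URL is a placeholder.

```python
# Sketch: list training documents by upload time so recently added files can be
# reviewed or temporarily excluded before the next training run.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_container_url("<container-sas-url>")

blobs = sorted(container.list_blobs(), key=lambda b: b.last_modified, reverse=True)
for blob in blobs[:20]:  # newest 20 entries first
    print(f"{blob.last_modified:%Y-%m-%d %H:%M}  {blob.size:>10}  {blob.name}")
```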
The repeated SDS_Recogniser_V2 failures are not caused by training limits or the number of documents. They are most likely due to:
- A corrupted file
- A schema mismatch
- Corruption in the V2 model version metadata on the backend
Since SDS_Recogniser_V3 is training successfully, I recommend continuing with that version while we validate the dataset.
I hope this helps. Do let me know if you have any further queries.
Thank you!