
Model Training Never completes for Document Intelligence Studio > Custom extraction model

rke 0 Reputation points
2026-02-22T17:34:07.47+00:00

I uploaded a form with many fields and, after assigning fields to regions, clicked the Train button and filled out the Model ID form. After a day, the Model Status still says "running". When I upload a filled-out copy of the form the model was based on, there are no models to select, since my extraction model still does not seem to be complete and available.

Should I save the extraction model under a new name, or are there other steps to get the training to finish?

Might it be that my form has too many regions/Fields?

Azure Document Intelligence in Foundry Tools

2 answers

  1. Karnam Venkata Rajeswari 300 Reputation points Microsoft External Staff Moderator
    2026-02-25T04:15:57.0433333+00:00

    Hello rke,

    Welcome to Microsoft Q&A. Thank you for reaching out with a detailed case description.

    I understand that the model got stuck while training and still shows the status “running” after a day.

    Hope the model is working fine now.

    In addition to the suggestions provided by Jerald Felix, please let me know if the following helps.

    You asked whether having too many regions/fields might be the cause. That is typically not the case. Extremely complex labelling (hundreds of overlapping regions, inconsistent tagging, or very large PDFs) can slow training, but it does not normally freeze it for 24+ hours. Training may take longer to finish, but it should not stretch beyond a day.

    While there is no strict “too many fields” failure limit, performance may degrade if hundreds of fields are defined or highly dense region tagging is used. Incorrectly labelled complex table structures can also be a cause.

    Since it has been stuck for over a day, it is very likely a backend training job failure. Kindly retry as follows:

    1. Delete the stuck model.
    2. Recreate the model with a new Model ID and the same dataset.
    3. Retrain the model with less complexity, for example 5 documents and 5 fields.
    4. If that succeeds, gradually increase the complexity.
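
One way to carry out the reduced retraining in step 3 is to copy only a handful of labelled documents into a separate container or folder. The following is a minimal sketch, assuming the Studio labelling layout in which each document `X` has sidecar files `X.labels.json` and `X.ocr.json` plus a shared `fields.json`. The function name and the sample file names are hypothetical.

```python
from typing import Iterable, List

# Sidecar files that Studio is assumed to write next to each labelled document.
SIDE_SUFFIXES = (".labels.json", ".ocr.json")


def pick_training_subset(blob_names: Iterable[str], max_docs: int = 5) -> List[str]:
    """Return the blob names needed to train on at most `max_docs` labelled documents."""
    names = sorted(set(blob_names))
    name_set = set(names)
    # A document counts as labelled only if its .labels.json sidecar exists.
    docs = [n for n in names
            if not n.endswith(".json") and n + ".labels.json" in name_set]
    subset: List[str] = []
    for doc in docs[:max_docs]:
        subset.append(doc)
        subset.extend(doc + s for s in SIDE_SUFFIXES if doc + s in name_set)
    if "fields.json" in name_set:
        subset.append("fields.json")  # shared field schema for the project
    return subset
```

This only limits the number of documents. Reducing the number of fields would still need to be done in the labelling project itself.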

    To answer your question: saving the model under a new name will not resume the stuck training. The model must be retrained.

    Document Intelligence does not support cancelling a training job through the portal/UI once it has started. The recommended approach is to delete the model using the REST API:

    DELETE /documentModels/{modelId}
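
As a rough illustration of that call from Python, the sketch below builds the request with the standard library only. It assumes the v4.0 route (`/documentintelligence/...` with `api-version=2024-11-30`) and key-based authentication through the `Ocp-Apim-Subscription-Key` header; older resources use a different path and version. The function name is hypothetical.

```python
import urllib.parse
import urllib.request

API_VERSION = "2024-11-30"  # assumption: v4.0 GA; adjust for your resource


def build_delete_request(endpoint: str, key: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE request for a custom model."""
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"{urllib.parse.quote(model_id, safe='')}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Ocp-Apim-Subscription-Key": key},
    )
```

To send it, pass the returned object to `urllib.request.urlopen(...)` with your real endpoint and key. A successful delete typically returns an empty response with a 2xx status.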

    The following resource may be useful.

    Document Models - Delete Model - REST API (Azure Azure AI Services) | Microsoft Learn

     

    Since training has been stuck this long, the most likely contributing factors are:

    1. A failed backend training job
    2. Storage access issue
    3. Quota exhaustion
    4. Regional service issue

    To confirm that it is not a model issue alone, kindly check Azure Service Health in the Azure Portal to confirm there are no outages affecting Azure AI Document Intelligence in the region where the resource is deployed.

    To rule out a region-specific problem, if the issue persists:

    • Try creating a new Document Intelligence resource in another supported region.
    • Train the same dataset there.
    • Compare results.

    To check for quota exhaustion, review the quotas for your Document Intelligence resource: the training job quota for the subscription, the number of custom models already deployed, and whether training concurrency limits were exceeded.
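
The custom model count can also be read programmatically. The sketch below is a minimal example that assumes the resource info endpoint (`GET {endpoint}/documentintelligence/info?api-version=2024-11-30`) returns a `customDocumentModels` object with `count` and `limit`. The function name is hypothetical and the sample numbers in the usage note are made up.

```python
import json


def remaining_model_slots(info_json: str) -> int:
    """Return how many more custom models the resource can hold.

    `info_json` is the raw JSON body of the resource info response
    (assumed shape: {"customDocumentModels": {"count": ..., "limit": ...}}).
    """
    info = json.loads(info_json)
    models = info["customDocumentModels"]
    return models["limit"] - models["count"]
```

For example, a body of `{"customDocumentModels": {"count": 98, "limit": 100}}` would leave two slots. This covers only the deployed model count, not training concurrency.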

    The following reference is a helpful read on service quotas and limits:

    Service quotas and limits - Document Intelligence - Foundry Tools | Microsoft Learn

    Please refer to the resource below for a head start:

    Build and train a custom model - Document Intelligence - Foundry Tools | Microsoft Learn

    For further reading, please refer to:

    Document Intelligence documentation - Quickstarts, Tutorials, API Reference - Foundry Tools | Microsoft Learn 

    Please let me know if you have any questions.

    Thank you!


  2. Jerald Felix 10,975 Reputation points
    2026-02-25T03:55:41.5366667+00:00

    Hello rke,

    Thanks for raising this question in Azure Q&A forum.

    Document Intelligence custom model training getting stuck and never completing is a frequently reported issue, particularly with custom neural models or when using larger datasets. The good news is there are several proven troubleshooting steps and workarounds.

    Common Causes

    The training process hangs most often due to:

    Data validation issues — invalid PDFs, corrupted files, or documents that exceed size limits (500MB per document, 1GB total)

    Quota exhaustion — hitting subscription limits on concurrent training jobs (20 for S0 tier)

    Regional service load — backend processing queues back up during peak times

    Blob storage access problems — expired SAS tokens, incorrect permissions, or networking issues
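
The SAS token cause in particular is easy to check before retraining. Here is a minimal sketch that inspects a container SAS URL for an expired `se` (expiry) value and for missing permissions in `sp`. It assumes training needs at least read (`r`) and list (`l`) on the container; the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import parse_qs, urlparse


def sas_problems(container_sas_url: str, now: Optional[datetime] = None) -> List[str]:
    """Return a list of human-readable problems found in a container SAS URL."""
    now = now or datetime.now(timezone.utc)
    params = parse_qs(urlparse(container_sas_url).query)
    problems: List[str] = []

    expiry = params.get("se", [None])[0]
    if expiry is None:
        problems.append("no expiry (se) found; is this a SAS URL?")
    else:
        # SAS expiry is an ISO 8601 UTC value such as 2026-03-01T00:00:00Z.
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        if expires_at <= now:
            problems.append(f"token expired at {expiry}")

    perms = params.get("sp", [""])[0]
    missing = [p for p in "rl" if p not in perms]  # assumed minimum: read + list
    if missing:
        problems.append("missing permissions: " + "".join(missing))
    return problems
```

An empty list means the token looks usable on these two points. It does not prove that the signature is valid or that network rules allow access.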

    Immediate Troubleshooting Steps

    1. Check the training status via API (most reliable). Don't rely solely on the Studio UI; poll the API directly to see the actual status:

    python
    from
    

    2. Validate your training dataset. Run the Document Intelligence Layout model against all training documents to ensure they can be processed:

    Upload to Document Intelligence Studio → Layout → Test all files

    • Any failures here can cause training to fail or hang

    3. Check quotas and limits. Azure Portal → Your Document Intelligence resource → Usages and quotas → verify you have remaining training capacity. If you are at the limit, request a quota increase.

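    4. Delete the stuck model and restart. Stuck models cannot be canceled from the UI. Use the API to delete the model:

    python
    # Assumes admin_client is a DocumentIntelligenceAdministrationClient (azure-ai-documentintelligence)
    admin_client.delete_model(model_id="stuck-model-id")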
    

    Wait 5-10 minutes, then submit a fresh training job.
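
For step 1, polling the service directly could look like the sketch below, which lists operations over REST and flags build operations that have been pending for too long. It assumes the v4.0 route (`/documentintelligence/operations`, `api-version=2024-11-30`), key-based authentication, and operation entries carrying `kind`, `status`, and `createdDateTime` in UTC. The function names and the 24-hour threshold are illustrative choices.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Dict, List

API_VERSION = "2024-11-30"  # assumption: v4.0 GA; adjust for your resource


def fetch_operations(endpoint: str, key: str) -> List[Dict]:
    """Fetch the first page of operations for the resource."""
    url = f"{endpoint.rstrip('/')}/documentintelligence/operations?api-version={API_VERSION}"
    req = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("value", [])


def stuck_builds(operations: List[Dict], now: datetime,
                 max_age: timedelta = timedelta(hours=24)) -> List[Dict]:
    """Return build operations still pending after `max_age`."""
    stuck = []
    for op in operations:
        if op.get("kind") != "documentModelBuild":
            continue
        if op.get("status") not in ("notStarted", "running"):
            continue
        # Keep only the seconds-resolution prefix; timestamps are assumed to be UTC.
        created = datetime.strptime(op["createdDateTime"][:19], "%Y-%m-%dT%H:%M:%S")
        if now - created.replace(tzinfo=timezone.utc) > max_age:
            stuck.append(op)
    return stuck
```

A typical call would be `stuck_builds(fetch_operations(endpoint, key), datetime.now(timezone.utc))`. Any operation it returns is a candidate for the delete-and-retrain step.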

    Reliable Workarounds

    1. Switch to Template mode (fastest fix). If you can use Template build mode instead of Neural, it trains in minutes instead of hours and is far less prone to hanging:

    python
    build_mode="template"  # Instead of "neural"
    

    2. Reduce dataset size for testing. Train with 5-10 documents first to validate the process, then scale up. Large datasets (>100 docs) are much more likely to encounter backend issues.

    3. Use a different region. Capacity can vary by region, so if the problem looks regional, create a new resource in another supported region (for example East US, West US 2, South Central US, or West Europe) and train there.

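    4. Enable diagnostic logging. Resource → Diagnostic settings → send the model build logs to Log Analytics, then query for failed training jobs. The category and table names shown here may differ from what your resource exposes, so check the categories listed under Diagnostic settings: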

    text
    DocumentIntelligenceModelBuildStatus | where status == "failed"
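
For workaround 1, a Template-mode build submitted over REST might look like the following sketch. It assumes the v4.0 build route (`POST {endpoint}/documentintelligence/documentModels:build?api-version=2024-11-30`) and a request body with `modelId`, `buildMode`, and `azureBlobSource.containerUrl`. The function name is hypothetical.

```python
from typing import Dict

VALID_BUILD_MODES = ("template", "neural")


def build_request_body(model_id: str, container_sas_url: str,
                       build_mode: str = "template") -> Dict:
    """Return the JSON body for a custom model build request."""
    if build_mode not in VALID_BUILD_MODES:
        raise ValueError(f"build_mode must be one of {VALID_BUILD_MODES}")
    return {
        "modelId": model_id,
        "buildMode": build_mode,
        # Container holding the labelled training documents (SAS URL).
        "azureBlobSource": {"containerUrl": container_sas_url},
    }
```

The body would be sent as JSON with the resource key in the `Ocp-Apim-Subscription-Key` header. The service is expected to answer with an `Operation-Location` header that can be polled for status.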
    

    When to Escalate

    If training hangs for >48 hours despite the above fixes, open an Azure Support ticket immediately. Include:

    Model ID and training timestamp

    Dataset size and build mode

    Region and resource name

    Log Analytics query results (if enabled)

    Microsoft typically resolves backend-stuck trainings within 24 hours and often issues credits as goodwill.

    If this helps, kindly accept the answer.

    Best Regards,

    Jerald Felix

