Training Custom Extraction Model in Azure AI Document Intelligence for Medical Data

Question

Mike Schuckenbrock 0

Seeking advice on effectively training a custom extraction model in Azure AI Document Intelligence, specifically for analyzing cancer treatment documents. The model is designed to extract key data such as chemotherapy received, radiation sites and levels, and surgeries, but it faces several challenges:

Chemotherapy Information: The model recognizes terms in a string containing chemotherapy drugs and biologic agents, but it has low confidence levels and is not fully accurate despite labeling specific terms for the appropriate fields.
Radiation Sites and Doses: While the model captures data accurately, the confidence levels remain too low to be reliable.
Surgery Listings: The model identifies the first surgery correctly but struggles to extract multiple surgeries, even with training examples provided.

Approximately 30-40 example documents have been labeled, covering a range of values across chemotherapy drugs, biologic agents, radiation sites, and different types and numbers of surgeries. Despite following guidelines from Azure documentation, the model continues to struggle with accuracy and confidence levels. Any recommendations for enhancing training techniques would be greatly appreciated.

Answer 1

To enhance the training of your custom extraction model in Azure AI Document Intelligence for analyzing cancer treatment documents, consider the following recommendations based on best practices:

Increase Training Data: Although you have labeled 30-40 example documents, increasing the dataset can significantly improve model performance. Aim for at least 10-15 examples for each variation of chemotherapy drugs, radiation sites, and surgeries. This helps the model learn from a more diverse set of examples.
Labeling Consistency: Ensure that the terms used for labeling are consistent across all documents. For instance, if you have different ways of describing chemotherapy drugs or surgeries, standardize these terms to help the model recognize them better.
Field Naming: When labeling fields, use clear and relevant names that reflect the content. For example, instead of generic names, use specific identifiers like chemotherapy_drugs or surgery_list. This can improve the accuracy of key-value pair extraction.
Address Variations: Since the model struggles with multiple surgeries, ensure that your training data includes examples with varying numbers of surgeries. Label each surgery distinctly to help the model learn to identify and extract multiple entries.
Quality of Documents: Ensure that the documents used for training are of high quality. Low-quality scans or images can affect the model's ability to accurately extract information. Aim for clear, high-resolution images.
Review and Adjust Training Data: If the model shows low confidence levels, review the training data for any inconsistencies or missing examples. Adding more labeled documents that reflect the expected variations can help improve confidence.
Test and Iterate: Regularly test the model with new documents and iterate on the training process based on the results. If certain fields consistently yield low confidence, consider retraining the model with additional examples focused on those areas.

By applying these strategies, you should be able to enhance the accuracy and confidence levels of your custom extraction model.

Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-03-17T00:50:33.22+00:00

Hi Mike Schuckenbrock,

Following up to see if the answer below was helpful. If it answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further queries, do let us know.

Thank You.

Answer 2

Prashanth Veeragoni 4,930 Microsoft External Staff Moderator

Hi Mike Schuckenbrock,

I understand that your Azure AI Document Intelligence model is struggling to accurately extract cancer treatment details, with low confidence for chemotherapy drugs, radiation data, and multiple surgeries, even after training with 30-40 documents.

Enhance Model Training Techniques:

Use Prebuilt Models for Boosting Performance:

Instead of training from scratch, fine-tune Microsoft’s Prebuilt Healthcare model available in Azure AI Document Intelligence.

Leverage Azure’s Custom Classification to separate treatment types (Chemotherapy, Radiation, Surgery).

Define Custom Field Relationships:

Use custom fields to specify relationships between drugs, doses, and treatments.

Example: Define a "Chemotherapy Treatment" entity that links to specific drug names and dosages.

Segment Complex Fields:

Instead of extracting all surgeries into one field, use multi-instance fields where each surgery is extracted as a separate entity.

Example:

Surgery Type 1: Appendectomy

Surgery Type 2: Lumpectomy

Refine Labelling Strategy:

Ensure that labelled entities are consistent across all documents. Inconsistent annotations can confuse the model.

Use multiple labelers to cross-validate and remove errors.

Clearly differentiate between chemotherapy drugs vs. biologic agents and ensure they are labelled precisely in context.

Model Configuration & Retraining:

Increase Training Iterations:

If accuracy is low, retrain with different versions of the dataset by removing low-confidence entities and keeping only high-accuracy labels.

Try multiple training runs (3-5 iterations) while adjusting labelled examples.

Augment with Synonyms and Context Awareness:

Medical documents may contain synonyms (e.g., "Adriamycin" vs. "Doxorubicin" for chemotherapy).

Use custom dictionaries or Azure AI Knowledge Mining to handle terminology variations.

Optimize Confidence Thresholds:

If the model has low confidence but correct predictions, adjust post-processing rules to accept lower-confidence values and validate manually.

Post-Processing & Validation:

Apply Rule-Based Validation with Azure Logic Apps:

Use regular expressions (regex) and rule-based filters to validate extracted data, such as:

Chemotherapy drugs must be from a predefined list.

Radiation dose units (Gy, cGy) must be valid.
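As a rough illustration of the two rules above, here is a minimal Python sketch. The allow-list, the dose pattern, and the function names are assumptions made up for this example and are not part of any Azure API; the same checks could run in a Logic App step or any other post-processing stage.

```python
import re

# Hypothetical allow-list; in practice this would come from a maintained formulary.
KNOWN_CHEMO_DRUGS = {"doxorubicin", "cyclophosphamide", "cisplatin", "paclitaxel"}

# A number followed by a radiation dose unit (Gy or cGy), e.g. "50 Gy" or "4500cGy".
DOSE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(c?Gy)\s*$")


def is_known_drug(value: str) -> bool:
    """Return True if the extracted value is on the allow-list (case-insensitive)."""
    return value.strip().lower() in KNOWN_CHEMO_DRUGS


def parse_dose(value: str):
    """Return (amount, unit) for a valid dose string, or None if it does not match."""
    match = DOSE_PATTERN.match(value)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)
```

Values that fail either check can be routed to manual review instead of being rejected outright.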

Use Human Review for Low-Confidence Cases:

Integrate Azure AI Human-in-the-Loop for manual review of low-confidence predictions to improve accuracy over time.

Alternate Approach:

Combine NLP-Based Models with Azure AI:

If Azure AI struggles, use Azure Machine Learning (AML) with NLP models like BERT or ClinicalBERT to extract medical entities with higher accuracy.

Integrate Azure Cognitive Search to index and retrieve structured treatment data.

I hope the above steps help to resolve your issue. If you have any further queries, do let us know.

Thank you!

Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-03-18T00:57:35.9966667+00:00

Hi Mike Schuckenbrock,

Just checking in to see if you have got a chance to see my response to your question in resolving the issue.

If you are still facing any further issues, please don't hesitate to reach out to us. We are happy to assist you.

Looking forward to your response and appreciate your time on this.

If you feel that your queries have been resolved, please accept the answer by clicking "Upvote" and "Accept Answer" on the post.

Thank you.
Mike Schuckenbrock 0 Reputation points

2025-03-18T15:01:03.9966667+00:00
Thank you for the reply and follow-up! I'm looking into all the things you suggested. I did want to ask you a few follow-up questions myself:

Where in Azure AI Document Intelligence do you see a prebuilt Healthcare model? I am currently using a custom extraction model; I reviewed the prebuilt models but I do not see one related to healthcare except one focused on insurance cards.

Can you expand on the "multiple labeler" suggestion?

For the string section that contains both chemo and biologic agents, I have clearly labeled each term as chemo or biologic agent across 90+ documents but the model still struggles both with accuracy and confidence. I have used every iteration of chemo drug and biologic agent that should be recognized, as well as added in a review of confidence levels and a kickout to manual review for any low confidence items, but I'm trying to get those to be an exception instead of the norm. I will look into multiple labels for the surgery items as that may work for the forms I'm seeing.
Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-03-19T07:52:42.9066667+00:00

Hi Mike Schuckenbrock,

Since you're working with cancer treatment documents, a fully custom extraction model is the best approach.

Addressing your follow-up queries:

1. Prebuilt Healthcare Model in Azure AI Document Intelligence

You're correct that Azure AI Document Intelligence does not currently offer a dedicated prebuilt Healthcare model. The prebuilt models available mainly focus on invoices, receipts, ID cards, and contracts.

Leveraging Azure AI Custom Named Entity Recognition (NER) in Azure Machine Learning (AML)

Train a custom NER model using Azure AutoML or Azure Machine Learning’s NLP capabilities.

Fine-tune a pre-trained medical language model (such as ClinicalBERT or BioBERT) for entity extraction.

Combining Azure AI Document Intelligence with Azure Cognitive Search

Use Document Intelligence for structured extraction and Azure Cognitive Search for semantic lookup of medical terms.

2. Expanding on the "Multiple Labeler" Suggestion

The key idea behind using multiple labelers is to eliminate inconsistencies in training data that might be confusing the model. Here’s how you can apply it:

Cross-validation:

Have at least two independent annotators label the same documents.

Compare their labels and resolve discrepancies manually before finalizing the dataset.
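The comparison step can be sketched in a few lines of Python. This assumes each annotator's labels for a document have been exported to a plain dictionary of field name to labeled text; that structure and the function name are hypothetical, chosen only for the example.

```python
def find_label_disagreements(labels_a: dict, labels_b: dict) -> dict:
    """Compare two annotators' labels for one document.

    Each argument maps field name -> labeled text. Returns the fields where
    the annotators differ, including fields only one of them labeled.
    """
    disagreements = {}
    for field in sorted(set(labels_a) | set(labels_b)):
        value_a = labels_a.get(field)
        value_b = labels_b.get(field)
        if value_a != value_b:
            disagreements[field] = (value_a, value_b)
    return disagreements
```

Any field that shows up in the result is a candidate for manual resolution before the document goes into the training set.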

Annotation Best Practices:

If a drug name appears in different formats (e.g., "Doxorubicin" vs. "Adriamycin"), decide on a single canonical format.

Define strict annotation rules: Ensure that chemotherapy drugs, biologic agents, radiation sites, and surgery types are consistently labeled across all documents.
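One possible way to enforce a single canonical format is to normalize the extracted values after extraction, sketched below in Python. The synonym table is a small hypothetical sample and the function name is an assumption for the example; a real table would be maintained by domain experts.

```python
# Hypothetical synonym table: brand name -> generic name (all lowercase).
CANONICAL_NAMES = {
    "adriamycin": "doxorubicin",
    "cytoxan": "cyclophosphamide",
    "taxol": "paclitaxel",
}


def canonicalize(drug_name: str) -> str:
    """Map an extracted drug name to its canonical lowercase generic form."""
    # Lowercase and collapse whitespace so "  Adriamycin " and "adriamycin" match.
    key = " ".join(drug_name.lower().split())
    return CANONICAL_NAMES.get(key, key)
```

Names that are not in the table are returned lowercased and trimmed, so unknown drugs pass through unchanged rather than being dropped.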

Using Azure Machine Teaching for Label Review:

Azure AI provides label review workflows, allowing domain experts to refine entity labels.

Enable Human-in-the-Loop (HITL) review for low-confidence predictions.

3. Improving Model Accuracy & Confidence

Despite clear labelling of chemo drugs and biologic agents, your model is struggling. Here’s how you can fix it:

Increase Training Data Diversity:

Expand beyond 90+ documents by including more variations in:

Drug spellings (brand names vs. generic names)

Sentence structures (e.g., "Patient received Doxorubicin" vs. "Doxorubicin was administered.")

Common abbreviations (e.g., "CAR-T" for chimeric antigen receptor T-cell therapy).

Optimize Feature Engineering:

Instead of only labelling individual words, capture surrounding context.

Example: Instead of labelling just “Doxorubicin,” also label related terms:

"Patient received Doxorubicin 50mg" → Label the entire phrase instead of just "Doxorubicin."

Use a Hybrid Approach (ML + Rule-Based Validation):

Apply regular expressions (regex) for known chemotherapy drugs and biologic agents.

Post-process results using Azure Logic Apps to filter and validate extracted entities.

Fine-Tune the Confidence Threshold:

If low confidence is the issue, try adjusting the post-processing thresholds:

Set a lower confidence threshold for high-frequency terms.

Use manual review only for less common entities.

Retrain with Different Labelling Strategies:

Instead of extracting multiple chemotherapy drugs from a single string, split the field:

Chemotherapy Drug 1: Doxorubicin

Chemotherapy Drug 2: Cyclophosphamide

This improves the model’s ability to recognize multiple entities correctly.

Hope this helps.

Thank you!
Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-03-20T07:30:54.0866667+00:00

Hi Mike Schuckenbrock,

Following up to see if the above answer was helpful. If it answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further queries, do let us know.

Thanks,
Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-03-21T05:05:07.1066667+00:00

Hi Mike Schuckenbrock,

Just checking in to see if you have got a chance to see my response to your question in resolving the issue.

If you are still facing any further issues, please don't hesitate to reach out to us. We are happy to assist you.

Looking forward to your response and appreciate your time on this.

If you feel that your queries have been resolved, please accept the answer by clicking "Upvote" and "Accept Answer" on the post.

Thank You.
Mike Schuckenbrock 0 Reputation points

2025-03-28T12:54:03.5366667+00:00
I continue to have struggles with this neural custom extraction model. The model has gotten much better in terms of accuracy with the fields I'm extracting, but confidence scores still remain too low to avoid human review; this will be frustrating to reviewers since the data in most cases is correct, it's just that the confidence score is low. Here's what I have done to try to resolve the situation:

Increased training volume - previous model training attempts used 40-60 documents. I have increased that to 300 documents, which represents 25% of the target documents to initially process (with more in the future).

Reviewed all data labels to ensure there are no mistakes on any of those 300 documents; the labeling is accurate.

Increased diversity in training examples - the documents selected comprise 2 data sets; 1) random selection of about 270 documents representing the "real world" variety in data, plus 2) about 30 documents manually created to ensure certain data variations were covered.

Switched from single labels (ex. "Chemotherapy") to capture multiple responses to multiple labels (ex., "Chemo1", "Chemo2", etc.) to distinctly capture data. This has been much more successful from an accuracy perspective.

Switched from capturing data from text strings (ex., string listing of chemo drugs used in treatment) to capturing from more structured lists. This has increased accuracy.

However, the confidence scores remain too low to trust the data extracted. When it comes to data like the chemo drugs, a given document may have anywhere from 0-25 drugs listed. I generally am not having issues with Chemo1-Chemo3 being captured with good confidence scores, but beyond that, confidence scores are not good. In my training set of 300 documents, I have ensured that I have a wide range of documents to cover all situations (some have no chemo drugs, some have up to 25, and everything in between).

I'm not fully understanding how the neural model works and what else is needed to train it appropriately. For the fields Chemo1-Chemo25, there is a range of chemo drugs that could be in those fields for a given document but I cannot create every scenario to train the model (ex., documents that have "cisplatin" in Chemo1, or Chemo2, or Chemo3, etc.) as that's not realistic. But the model seems to struggle if it "sees" a drug name in Chemo7 that it never saw in Chemo7 before (although it's seen it in Chemo1, or Chemo2, or Chemo10). Again, it's accurately capturing the drug but the confidence score is low (could be 0.15-0.70).

Suggestions? I'm at a loss as to how to improve this.
Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-04-01T12:53:38.82+00:00

@Mike Schuckenbrock ,

This is a complex issue involving Azure AI Document Intelligence’s neural custom extraction model for medical data. The core problem is not just accuracy but confidence scores, which remain too low even when the model extracts the correct data. Let’s go step by step and address this with a structured approach.

Root Cause Analysis of Low Confidence Scores:

Confidence scores are low beyond Chemo3 → The model is struggling with recognizing patterns beyond the most frequently seen positions.

Confidence scores remain low even with more training data → The model likely lacks a strong contextual understanding of positional variation.

The model does not generalize well to new placements of the same entity → If it sees "cisplatin" in Chemo1 but never in Chemo7 before, it assigns low confidence.

Data augmentation and increased diversity helped accuracy but not confidence → This indicates the model is still overly dependent on position-based learning.

Resolutions & Advanced Techniques:

1. Improve Generalization with Positional Independence

Azure AI Document Intelligence models rely on learned position-based relationships. Your issue suggests that the model is overfitting to positional data instead of understanding entity context.

Solution: Relative Position Encoding & Context Expansion

Instead of labeling Chemo1, Chemo2, ..., Chemo25, consider labeling just "Chemotherapy Drug" across all instances and let the post-processing logic assign order.

Alternative: Use a bounding box-based approach where you extract the chemotherapy drug irrespective of its order in the document.

In the training phase, introduce documents where chemotherapy drugs appear in random orders (e.g., different tables, different list formats).

Implementation:

Flatten Labels: Instead of defining “Chemo1-Chemo25,” create a single “Chemotherapy Drug” label.

Augment Layouts: Vary document structures so that drugs appear in multiple positions.

Train with Bounding Boxes: Let the model learn to extract the drugs independent of order, then sort them later.
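The "sort them later" step could look roughly like the Python sketch below. It assumes the extraction step returns each drug as a separate item with a page number and a top/left position; whether that is available depends on how the field is defined in the model, and the dictionary keys and function name here are hypothetical.

```python
def assign_order(extractions):
    """Order extracted drugs by reading position and number them Chemo1, Chemo2, ...

    Each item is assumed to be a dict with "content", "page", "top" and "left",
    where "top" and "left" are coordinates of the item's bounding region.
    """
    ordered = sorted(extractions, key=lambda e: (e["page"], e["top"], e["left"]))
    return {f"Chemo{i}": e["content"] for i, e in enumerate(ordered, start=1)}
```

With this approach the numbering is derived from layout after extraction, so the model itself never has to learn which slot a drug belongs in.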

2. Confidence Score Calibration

Even when Azure AI extracts correct values, confidence scores can be low due to underlying model uncertainty.

Solution: Confidence Score Normalization & Ensemble Methods

Instead of relying solely on the Azure AI model’s raw confidence score, apply post-processing techniques:

Histogram-Based Normalization: Adjust confidence scores based on prior distributions in training data.

Hybrid Ensemble with Regex Matching: If extracted drugs match a predefined medical dictionary, boost their confidence scores.

Metadata-Based Scoring: If a chemotherapy drug is listed within a known section of the document, raise its score in post-processing.

Implementation:

Extract all entities using Azure AI Document Intelligence.

Compare against a verified drug list (match known drugs → boost score).

Recalibrate scores based on prior distributions using a rule-based approach.
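One reading of the histogram-based recalibration is classic histogram binning: on a labeled validation set, measure how often extractions in each raw-confidence range were actually correct, then report that observed accuracy in place of the raw score. The Python sketch below works under that assumption; the function names and the ten-bin default are illustrative choices, not Azure features.

```python
def fit_histogram_calibrator(samples, n_bins=10):
    """Compute the observed accuracy per confidence bin.

    samples: iterable of (raw_confidence, was_correct) pairs from a labeled
    validation set. Returns a list of length n_bins; empty bins hold None.
    """
    totals = [0] * n_bins
    correct = [0] * n_bins
    for confidence, was_correct in samples:
        index = min(int(confidence * n_bins), n_bins - 1)
        totals[index] += 1
        correct[index] += int(was_correct)
    return [c / t if t else None for c, t in zip(correct, totals)]


def calibrate(raw_confidence, bin_accuracy):
    """Replace a raw score with its bin's observed accuracy, if the bin has data."""
    n_bins = len(bin_accuracy)
    index = min(int(raw_confidence * n_bins), n_bins - 1)
    observed = bin_accuracy[index]
    return raw_confidence if observed is None else observed
```

If extractions scored around 0.3 turn out to be correct most of the time on the validation set, the calibrated score for that range rises accordingly, which is the situation described in this thread. Bins with few samples give noisy estimates, so the validation set needs to be reasonably large.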

3. Use Named Entity Recognition (NER) for Drug Identification

Azure AI Document Intelligence’s custom model struggles with long lists of drugs because it was not specifically designed for complex biomedical text extraction.

Solution: Integrate NLP-Based Clinical NER (e.g., BioBERT, Azure ML)

Train a ClinicalBERT model in Azure Machine Learning that can recognize chemotherapy drug names with higher accuracy.

Pass document text through ClinicalBERT before feeding into Azure AI Document Intelligence.

Combine outputs (if ClinicalBERT recognizes "cisplatin" in Chemo7 with high confidence, override the Azure AI score).

Implementation:

Use BioBERT/ClinicalBERT in Azure Machine Learning for better extraction.

Integrate BERT’s results with Azure AI Document Intelligence’s output for confidence adjustment.

4. Augment Model Training with Synonyms & Variations

Medical texts use multiple variations of the same term, affecting model confidence.

Solution: Train Using Synonyms, Abbreviations, and Variations

Expand the training dataset by automatically replacing drug names with synonyms and re-labeling them.

Use Azure Cognitive Search to enrich extracted entities with external medical knowledge.

Implementation:

Generate synthetic training samples with drug variations.

Use Azure Cognitive Search for synonym resolution.
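As a starting point for synthetic samples, the text content of a drug list can be generated by swapping in name variants and shuffling the order, as in the Python sketch below. The variant table and function name are hypothetical, and the sketch only produces the list of strings; rendering them into a document layout and labeling the result still has to be done separately.

```python
import random

# Hypothetical variant table: generic name -> spellings seen in documents.
NAME_VARIANTS = {
    "doxorubicin": ["Doxorubicin", "Adriamycin"],
    "cyclophosphamide": ["Cyclophosphamide", "Cytoxan"],
    "cisplatin": ["Cisplatin"],
}


def make_synthetic_drug_list(generic_names, rng):
    """Pick one name variant per drug, then shuffle the list order in place."""
    chosen = [rng.choice(NAME_VARIANTS[name]) for name in generic_names]
    rng.shuffle(chosen)
    return chosen
```

Passing a seeded `random.Random` instance as `rng` keeps the generated set reproducible between runs.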

5. Post-Processing with Azure Logic Apps for Human-in-the-Loop Review

Implement a confidence threshold re-adjustment workflow.

Rule-Based Processing: If a recognized chemotherapy drug’s confidence is > 0.30 and it matches the medical dictionary, override the model’s confidence.

Implementation:

Flag low-confidence entities for review but automatically approve high-probability matches.

Use business rules to override incorrect low scores.
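The rule above (confidence above 0.30 plus a dictionary match) can be expressed as a small decision function. This Python sketch is only an illustration: the 0.80 auto-accept level, the function name, and the shape of the drug dictionary are assumptions, and the 0.30 floor is taken from the rule as stated.

```python
def review_decision(value, confidence, known_drugs, floor=0.30, auto_accept=0.80):
    """Return "accept" or "review" for one extracted drug.

    known_drugs is assumed to be a set of lowercase drug names.
    """
    if confidence >= auto_accept:
        # The model is already confident enough on its own.
        return "accept"
    if value.strip().lower() in known_drugs and confidence > floor:
        # Low score, but the value is a verified drug name: override.
        return "accept"
    return "review"
```

Both thresholds should be tuned against a labeled validation set, since the right values depend on how often low-confidence dictionary matches turn out to be wrong.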

By implementing these, you should see:

Higher confidence scores on correctly extracted drugs.

Better generalization across document variations.

Lower false negatives in post-processing due to threshold tuning.

Hope this helps.

Thank you!
Mike Schuckenbrock 0 Reputation points

2025-04-01T18:55:46.3466667+00:00

Thank you for the very detailed response. My team and I are looking at what is suggested in #2 above in that we'll be comparing the extracted output to a set dictionary of chemo drugs, boosting the confidence score if there's a match.

Regarding #1, I understand that the training of the model may have overfit to positional data; much of the training document volume came from real-world examples that tend to be regimented protocols of drugs based on diagnosis and are listed in alpha-order, so the model may be overfitting due to that. To resolve, I have follow-up questions/comments on suggested methods:
- I cannot flatten labels (ex., "ChemoDrugs" instead of "Chemo1" - "Chemo25") b/c I need to be able to separate the drugs out distinctly; I could potentially do this post-extraction but there would be no delimiters avail to easily separate out terms if captured in one field
- I can introduce custom training documents with chemo drugs in various/random order to break away from the positional overfitting; that will be time-intensive to create those documents but it is doable
- question regarding the bounding boxes approach and how to capture irrespective of order in the document; if there is a list of 9 chemo drugs, I'm currently labeling them where Chemo1 = first drug in the list, Chemo2 = second drug in the list, etc. Would I improve things if I tag label the first drug with, for example, Chemo4, the second drug with Chemo9, etc.? Basically label drugs in the list with random labels from my list? At the end of the day, I don't really care if a captured drug is in Chemo1 field or Chemo12 field, as long as it's captured accurately and with confidence.

We're also looking at enhancing the human review process but if we get a cleaner extraction and match up to the avail dictionaries, I'm hoping that will be good enough.
Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-04-01T22:58:59.9733333+00:00

Mike Schuckenbrock,

"Great insights! We'll focus on boosting confidence with dictionary matching, introducing varied training docs to reduce positional bias, and exploring randomized labeling to improve extraction accuracy."

Let me know if you have any issues.

Thanks
Mike Schuckenbrock 0 Reputation points

2025-04-03T17:49:23.2633333+00:00

Follow-up question (and one I have raised with support as well): is there an accuracy score available for custom extraction models (neural models)? My understanding is that custom neural models don't provide accuracy scores for each extracted field during training (it sounds like other models do), but it's also my understanding that a general accuracy score should be available for the model overall; per the documentation, "custom template models generate an estimated accuracy score when trained". Support is investigating as well, but I wanted to ask here.
Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator

2025-04-04T06:54:59+00:00

@Mike Schuckenbrock ,

You're correct — currently, custom neural models in Azure AI Document Intelligence do not provide a model-wide accuracy score during training, unlike custom template models. However, confidence scores are returned at inference time for each extracted field.

To track performance:

Use a validation dataset post-training and compute metrics like precision, recall, and F1-score.

Track confidence scores and compare them with known ground-truth using dictionaries for accuracy estimation.

These steps can serve as a strong substitute for the missing training-time accuracy score, and also help you fine-tune your human review thresholds more effectively.
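A minimal Python sketch of the validation-set metrics is shown below. It assumes predictions and ground truth have each been flattened into a dictionary keyed by (document id, field name); that layout and the function name are assumptions for the example. A wrong value counts as both a false positive and a false negative, which is a common convention for extraction tasks.

```python
def field_metrics(predicted: dict, truth: dict) -> dict:
    """Compute precision, recall and F1 over extracted fields.

    predicted and truth map (document_id, field_name) -> value.
    """
    true_positive = sum(1 for key, value in predicted.items() if truth.get(key) == value)
    false_positive = len(predicted) - true_positive
    false_negative = sum(1 for key, value in truth.items() if predicted.get(key) != value)

    precision = true_positive / (true_positive + false_positive) if predicted else 0.0
    recall = true_positive / (true_positive + false_negative) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running this separately per field (for example, only the chemo fields) shows where accuracy is already high enough to justify relaxing the review threshold.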

Thank you!
