Training Custom Extraction Model in Azure AI Document Intelligence for Medical Data

Mike Schuckenbrock 0 Reputation points
2025-03-10T20:53:37.26+00:00

Seeking advice on effectively training a custom extraction model in Azure AI Document Intelligence, specifically for analyzing cancer treatment documents. The model is designed to extract key data such as chemotherapy received, radiation sites and levels, and surgeries, but it faces several challenges:

  1. Chemotherapy Information: The model recognizes terms in a string containing chemotherapy drugs and biologic agents, but it has low confidence levels and is not fully accurate despite labeling specific terms for the appropriate fields.
  2. Radiation Sites and Doses: While the model captures data accurately, the confidence levels remain too low to be reliable.
  3. Surgery Listings: The model identifies the first surgery correctly but struggles to extract multiple surgeries, even with training examples provided.

Approximately 30-40 example documents have been labeled, covering a range of values across chemotherapy drugs, biologic agents, radiation sites, and different types and numbers of surgeries. Despite following guidelines from Azure documentation, the model continues to struggle with accuracy and confidence levels. Any recommendations for enhancing training techniques would be greatly appreciated.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

2 answers

  1. Q&A Assist
    2025-03-10T20:53:37.28+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    To enhance the training of your custom extraction model in Azure AI Document Intelligence for analyzing cancer treatment documents, consider the following recommendations based on best practices:

    1. Increase Training Data: Although you have labeled 30-40 example documents, increasing the dataset can significantly improve model performance. Aim for at least 10-15 examples for each variation of chemotherapy drugs, radiation sites, and surgeries. This helps the model learn from a more diverse set of examples.
    2. Labeling Consistency: Ensure that the terms used for labeling are consistent across all documents. For instance, if you have different ways of describing chemotherapy drugs or surgeries, standardize these terms to help the model recognize them better.
    3. Field Naming: When labeling fields, use clear and relevant names that reflect the content. For example, instead of generic names, use specific identifiers like chemotherapy_drugs or surgery_list. This can improve the accuracy of key-value pair extraction.
    4. Address Variations: Since the model struggles with multiple surgeries, ensure that your training data includes examples with varying numbers of surgeries. Label each surgery distinctly to help the model learn to identify and extract multiple entries.
    5. Quality of Documents: Ensure that the documents used for training are of high quality. Low-quality scans or images can affect the model's ability to accurately extract information. Aim for clear, high-resolution images.
    6. Review and Adjust Training Data: If the model shows low confidence levels, review the training data for any inconsistencies or missing examples. Adding more labeled documents that reflect the expected variations can help improve confidence.
    7. Test and Iterate: Regularly test the model with new documents and iterate on the training process based on the results. If certain fields consistently yield low confidence, consider retraining the model with additional examples focused on those areas.
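
    To make point 7 concrete, here is a minimal sketch that averages confidence per field across a batch of test documents and reports the fields that fall below a cutoff. The input shape (one dict of field name to confidence per document), the `radiation_site` field name, the sample numbers, and the 0.8 cutoff are all assumptions for illustration, not values from the service.

```python
from collections import defaultdict
from statistics import mean


def weakest_fields(results, threshold=0.8):
    """Return {field: mean confidence} for fields whose mean is below threshold."""
    scores = defaultdict(list)
    for doc in results:
        for field, confidence in doc.items():
            scores[field].append(confidence)
    return {
        field: round(mean(values), 3)
        for field, values in scores.items()
        if mean(values) < threshold
    }


# Hypothetical per-field confidences collected from three test documents.
test_results = [
    {"chemotherapy_drugs": 0.42, "radiation_site": 0.61, "surgery_list": 0.90},
    {"chemotherapy_drugs": 0.55, "radiation_site": 0.58, "surgery_list": 0.85},
    {"chemotherapy_drugs": 0.47, "radiation_site": 0.64, "surgery_list": 0.95},
]

print(weakest_fields(test_results))
```

    Fields that show up in this report are the ones to prioritize when adding or correcting labelled examples before the next training run.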

    By applying these strategies, you should be able to enhance the accuracy and confidence levels of your custom extraction model.


  2. Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator
    2025-03-14T07:53:38.77+00:00

    Hi Mike Schuckenbrock,

    I understand that your Azure AI Document Intelligence model is struggling to accurately extract cancer treatment details, with low confidence on chemotherapy drugs and radiation data and trouble with multiple surgeries, even after training with 30-40 documents.

    Enhance Model Training Techniques:

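    Choose the Right Model Type:

    To my knowledge, Document Intelligence prebuilt models cannot be fine-tuned, and there is no general prebuilt healthcare model to start from. For treatment documents whose layout varies, train a custom neural model rather than a custom template model.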

    Use a custom classification model to route documents by type (for example chemotherapy summaries, radiation reports, and operative notes) to separate extraction models.

    Define Custom Field Relationships:

    Keep related values together, for example by labelling a drug and its dose as columns of one table field so that each row links them.

    Example: Define a "Chemotherapy Treatment" table field with one row per drug and columns for the drug name and dosage.

    Segment Complex Fields:

    Instead of extracting all surgeries into one field, label them as rows of a table field with a variable number of rows, so that each surgery is extracted as a separate entry.

    Example:

    Surgery Type 1: Appendectomy

    Surgery Type 2: Lumpectomy
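
    As a rough sketch of how such a repeating field can be consumed downstream, the snippet below flattens an array-of-objects field into one entry per surgery. The `valueArray` / `valueObject` / `valueString` keys follow the shape I understand the REST analyze result to use for array fields; treat that shape, the `SurgeryType` column name, and the sample confidences as assumptions and compare them with your own JSON output.

```python
def surgeries_from_field(field, column="SurgeryType"):
    """Return a list of (value, confidence) pairs, one per row of an array field."""
    rows = []
    for item in field.get("valueArray", []):
        cell = item.get("valueObject", {}).get(column)
        if cell is None:
            continue  # row has no value for this column
        rows.append((cell.get("valueString"), cell.get("confidence")))
    return rows


# Hypothetical field, hand-written to mimic an analyze result.
sample_field = {
    "type": "array",
    "valueArray": [
        {"type": "object", "valueObject": {
            "SurgeryType": {"type": "string", "valueString": "Appendectomy", "confidence": 0.91}}},
        {"type": "object", "valueObject": {
            "SurgeryType": {"type": "string", "valueString": "Lumpectomy", "confidence": 0.78}}},
    ],
}

for value, confidence in surgeries_from_field(sample_field):
    print(value, confidence)
```

    If only the first surgery ever appears in the array, that usually points back to how the rows were labelled rather than to the parsing code.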

    Refine Labelling Strategy:

    Ensure that labelled entities are consistent across all documents. Inconsistent annotations can confuse the model.

    Use multiple labelers to cross-validate and remove errors.

    Clearly differentiate between chemotherapy drugs vs. biologic agents and ensure they are labelled precisely in context.

    Model Configuration & Retraining:

    Retrain Iteratively:

    If accuracy is low, retrain on revised versions of the dataset: remove or correct labels that are ambiguous or inconsistent, and keep only the labels you are confident are right.

    Try multiple training runs (3-5 iterations) while adjusting labelled examples.

    Augment with Synonyms and Context Awareness:

    Medical documents may contain synonyms (e.g., "Adriamycin" vs. "Doxorubicin" for chemotherapy).

    Use custom dictionaries or Azure AI Knowledge Mining to handle terminology variations.
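
    A custom dictionary can be as simple as a lookup applied to the extracted text. The sketch below maps brand names to generic names after extraction; the three entries are only examples, and the `CANONICAL_NAMES` table and `normalize_drug` helper are hypothetical names that you would replace with your own formulary.

```python
# Brand name -> generic name. Illustrative entries only; extend as needed.
CANONICAL_NAMES = {
    "adriamycin": "doxorubicin",
    "taxol": "paclitaxel",
    "herceptin": "trastuzumab",
}


def normalize_drug(name):
    """Lower-case and trim the name, then map known brand names to the generic."""
    key = name.strip().lower()
    return CANONICAL_NAMES.get(key, key)


print(normalize_drug(" Adriamycin "))
print(normalize_drug("Doxorubicin"))
```

    Normalizing before comparison means "Adriamycin" and "Doxorubicin" count as the same value when you measure accuracy or validate against a drug list.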

    Optimize Confidence Thresholds:

    If the model has low confidence but correct predictions, adjust post-processing rules to accept lower-confidence values and validate manually.
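
    One way to express such a rule is to split extracted fields into those accepted automatically and those sent for manual validation. This is a minimal sketch: the `(value, confidence)` input shape, the field names, and the 0.8 default cutoff are assumptions to tune against your own validation results.

```python
def route_by_confidence(fields, accept_at=0.8):
    """Split {name: (value, confidence)} into (accepted, needs_review) dicts."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= accept_at else needs_review
        target[name] = value
    return accepted, needs_review


# Hypothetical extraction output for one document.
extracted = {
    "chemotherapy_drugs": ("doxorubicin", 0.46),
    "radiation_site": ("left breast", 0.62),
    "surgery_list": ("Lumpectomy", 0.93),
}

accepted, needs_review = route_by_confidence(extracted)
print("accepted:", accepted)
print("needs review:", needs_review)
```

    Lowering `accept_at` for a field that is usually correct despite low scores reduces the review queue, at the cost of letting more unverified values through.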

    Post-Processing & Validation:

    Apply Rule-Based Validation with Azure Logic Apps:

    Use regular expressions (regex) and rule-based filters to validate extracted data, such as:

    Chemotherapy drugs must be from a predefined list.

    Radiation dose units (Gy, cGy) must be valid.
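
    The two rules above can be sketched in plain Python, independent of where the validation is hosted. The drug list is a placeholder, the dose pattern assumes values written like "50.4 Gy" or "5040 cGy" with case-sensitive units, and all function names are hypothetical.

```python
import re

# Placeholder list; a real deployment would load an approved formulary.
ALLOWED_DRUGS = {"doxorubicin", "cyclophosphamide", "paclitaxel"}

# A number, optional whitespace, then a unit of Gy or cGy (case-sensitive).
DOSE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(Gy|cGy)\s*$")


def is_known_drug(name):
    """True if the trimmed, lower-cased name is in the predefined list."""
    return name.strip().lower() in ALLOWED_DRUGS


def parse_dose(text):
    """Return (amount, unit) for a valid dose string, or None if it is invalid."""
    match = DOSE_PATTERN.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def dose_in_gray(text):
    """Convert a valid dose string to Gy (1 Gy = 100 cGy), or return None."""
    parsed = parse_dose(text)
    if parsed is None:
        return None
    amount, unit = parsed
    return amount / 100 if unit == "cGy" else amount


print(is_known_drug("Doxorubicin"), parse_dose("5040 cGy"), dose_in_gray("5040 cGy"))
```

    Values that fail either check are good candidates for manual review, whatever confidence the model reported for them.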

    Use Human Review for Low-Confidence Cases:

    Route low-confidence predictions to a human-in-the-loop review step, and feed the corrected values back into the training set to improve accuracy over time.

    Alternate Approach:

    Combine NLP-Based Models with Azure AI:

    If Document Intelligence alone struggles, use Azure Machine Learning (AML) with NLP models such as BERT or ClinicalBERT, which may extract medical entities from free text more accurately.

    Integrate Azure AI Search (formerly Azure Cognitive Search) to index and retrieve structured treatment data.

    I hope these steps help resolve your issue. If you have any further questions, do let us know.

    Thank you!

