Is it possible to train a model against different students' academic transcripts for extracting information?

Elangovan, Nandha 40 Reputation points
2024-09-23T05:31:03.6533333+00:00

I am trying to train a custom model that can process and extract information from different types of student academic transcripts. I am finding it difficult because cross-page labeling is not currently supported in Document Intelligence. Also, prebuilt models such as Read or the general document model do not extract all the content correctly from the PDFs when working with different formats.

So, is there a better approach for training a model against different types of transcripts, or which prebuilt model is best suited for this?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. Vinodh247 21,881 Reputation points
    2024-09-23T06:53:02.05+00:00

    Hi Elangovan, Nandha,

    Thanks for reaching out to Microsoft Q&A.

    To train a model that can effectively extract information from various types of student academic transcripts, particularly given the challenges you're facing with cross-page labeling and inconsistent extraction using pre-built models, here are a few approaches you can consider:

    1. Custom Model with Azure Form Recognizer (Document Intelligence)

    Labeling Strategy: While cross-page labeling is a limitation, you can break the document into individual pages and label the fields on each page, then combine the extracted data at a post-processing stage. This approach requires additional logic but can help with multi-page transcripts.
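    The post-processing merge described above could look roughly like the sketch below. The per-page result shape and the field names (`student_name`, `student_id`, `courses`) are hypothetical; substitute whatever labels your custom model defines.

```python
from typing import Dict, List

# Hypothetical header fields; replace with the labels used in your own model.
HEADER_FIELDS = ("student_name", "student_id")


def merge_pages(pages: List[Dict]) -> Dict:
    """Combine per-page extraction results into one transcript record.

    Each page is assumed to look like:
    {"fields": {"student_name": "..."}, "courses": [{"code": "...", "grade": "..."}]}
    """
    merged: Dict = {name: None for name in HEADER_FIELDS}
    merged["courses"] = []
    for page in pages:
        fields = page.get("fields", {})
        for name in HEADER_FIELDS:
            # Keep the first non-empty value; header fields often repeat on every page.
            if merged[name] is None and fields.get(name):
                merged[name] = fields[name]
        # Course rows continue across pages, so append them in page order.
        merged["courses"].extend(page.get("courses", []))
    return merged
```

    Taking the first non-empty header value is only one possible policy; if pages disagree, you may prefer to flag the record for manual review instead.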

    Train on Specific Sections: Instead of training the model on the entire transcript, identify common sections across various transcript formats (e.g., student details, courses, grades) and train separate models for each section.

    Boosting Accuracy: If the general document or "Read" prebuilt models are not extracting key data accurately, a custom-trained model with a well-labeled dataset might yield better results. Add variety to your training set by including transcripts with different layouts and structures, so the model sees each format you expect in production.

    2. Combining Pre-built Models with Custom Post-Processing

    Layering Pre-built Models: You can use the "Read" model (text only) or the "Layout" model (text plus tables and document structure) as a first step to extract basic content from the transcripts. Then apply custom logic (e.g., regex or NLP methods) to the extracted data to identify key fields like student name, course codes, or grades.

    Custom Script-Based Solutions: If certain patterns are consistently missed by the models (e.g., grade formatting), consider writing custom scripts or regex-based extraction rules tailored to the structure of the missing data.
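    As an illustration of the regex-based post-processing mentioned above, here is a minimal sketch that parses course rows from text lines such as those returned by the Read or Layout model. The row format (course code, title, credits, letter grade) is an assumption; real transcripts will need their own patterns.

```python
import re
from typing import Dict, List

# Assumed row shape: "<CODE> <title> <credits> <grade>", e.g. "CS 101 Intro to Programming 3.0 A-".
COURSE_ROW = re.compile(
    r"^(?P<code>[A-Z]{2,4}\s?\d{3,4})\s+"
    r"(?P<title>.+?)\s+"
    r"(?P<credits>\d+(?:\.\d+)?)\s+"
    r"(?P<grade>[A-F][+-]?)$"
)


def parse_course_lines(lines: List[str]) -> List[Dict[str, str]]:
    """Return one dict per line that matches the assumed course-row pattern."""
    rows = []
    for line in lines:
        match = COURSE_ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows
```

    Lines that do not match are skipped silently here; logging them is a cheap way to find formats the pattern misses.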

    3. Document Intelligence Custom Neural Model

    Explore Azure's Neural Custom Model capabilities if you haven't already. It allows for field tagging across complex document layouts and can be more adaptable to different transcript formats. Although this requires more substantial training data, the model can potentially generalize better across different formats of transcripts compared to rule-based approaches.

    4. Use an Ensemble of Models

    You could also consider using multiple models in parallel, where each model is trained to extract certain specific parts of the transcript, followed by combining the extracted results. This reduces the reliance on a single model getting everything right and gives more flexibility.
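    One simple way to combine the outputs of several models is to keep, for each field, the value with the highest confidence score. The sketch below assumes each model's output has already been reduced to a dict of `field -> (value, confidence)`; that shape and the field names are illustrative, not the service's response format.

```python
from typing import Dict, List, Tuple

FieldResult = Tuple[str, float]  # (extracted value, confidence between 0 and 1)


def combine_results(results: List[Dict[str, FieldResult]]) -> Dict[str, FieldResult]:
    """For each field, keep the candidate with the highest confidence."""
    best: Dict[str, FieldResult] = {}
    for model_output in results:
        for field, (value, confidence) in model_output.items():
            # On a tie, the earlier model in the list wins.
            if field not in best or confidence > best[field][1]:
                best[field] = (value, confidence)
    return best
```

    Confidence scores from different models are not necessarily comparable, so treat this as a starting point and validate it against labeled samples.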

    5. Form Recognizer Pre-built Models

    If you prefer to stick with pre-built models, the "Layout" model from Azure Document Intelligence is likely the better fit for complex documents that contain tabular data (e.g., grades). The "Invoice" model can sometimes pick up line-item style rows (like courses and grades) and might be worth experimenting with, but it is trained on invoices and returns invoice fields, so expect to remap or discard much of its output.
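    Tables returned by the Layout model are lists of cells with a row index, a column index, and text content. The sketch below rebuilds rows from such cells and turns them into records keyed by the header row. It works on plain dicts whose keys (`row_index`, `column_index`, `content`) are chosen to mirror the SDK's cell attributes; treating the first row as the header is an assumption that will not hold for every transcript.

```python
from typing import Dict, List


def cells_to_rows(cells: List[Dict], row_count: int, column_count: int) -> List[List[str]]:
    """Place each cell's text into a row_count x column_count grid."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid


def rows_to_records(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Use the first row as column names and return one dict per remaining row."""
    if not grid:
        return []
    header, *body = grid
    keys = [name.strip().lower() for name in header]
    return [dict(zip(keys, row)) for row in body]
```

    Merged cells and multi-row headers are not handled; cells that are missing from the input simply stay as empty strings.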

    By training custom models and combining pre-built models with custom post-processing, you can improve extraction accuracy for varied transcript formats.

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.


1 additional answer

  1. sam ganjoriyan 0 Reputation points
    2024-09-23T09:44:16.83+00:00

    It should definitely be implemented from designs with different colors and prioritized based on their value and time
