Is it possible to train a model against different students' academic transcripts for extracting information?

Question

Elangovan, Nandha 45

I am trying to train a custom model that can process and extract information from different types of student academic transcripts. I am finding it difficult because cross-page labeling is not currently supported in Document Intelligence. Also, prebuilt models such as the Read model or the general document model do not extract all the content correctly from the PDFs when working with different formats.

So, is there a better approach for training a model against different types of transcripts, or is there a prebuilt model that is best suited for this?

Answer 1

Hi Elangovan, Nandha,

Thanks for reaching out to Microsoft Q&A.

To train a model that can effectively extract information from various types of student academic transcripts, particularly given the challenges you're facing with cross-page labeling and inconsistent extraction using pre-built models, here are a few approaches you can consider:

Custom Model with Azure Form Recognizer (Document Intelligence)

Labeling Strategy: While cross-page labeling is a limitation, you can break the document into individual pages and label the fields on each page, then combine the extracted data at a post-processing stage. This approach requires additional logic but can help with multi-page transcripts.

Train on Specific Sections: Instead of training the model on the entire transcript, identify common sections across various transcript formats (e.g., student details, courses, grades) and train separate models for each section.

Boosting Accuracy: If the "Document" or "Read" pre-built models are not extracting key data accurately, a custom-trained model with a well-labeled dataset might yield better results. Add variety to your training set by including transcripts with different layouts and structures, so the model sees each format it is expected to handle.
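The page-by-page labeling strategy above needs a merge step once each page has been analyzed. Here is a minimal sketch in Python, assuming each page's result has already been reduced to a plain dict of field name to value. The helper `merge_page_results` and the field names `student_name` and `courses` are hypothetical and not part of any Azure SDK.

```python
from typing import Any


def merge_page_results(pages: list[dict[str, Any]]) -> dict[str, Any]:
    """Combine per-page field dicts into one transcript-level record.

    Scalar fields keep the first non-empty value seen. List fields
    (for example course rows that continue on later pages) are
    concatenated in page order.
    """
    merged: dict[str, Any] = {}
    for page in pages:
        for name, value in page.items():
            if isinstance(value, list):
                merged.setdefault(name, []).extend(value)
            elif value not in (None, "") and name not in merged:
                merged[name] = value
    return merged


# Hypothetical per-page output for a two-page transcript.
pages = [
    {"student_name": "Jane Doe", "courses": [{"code": "MATH101", "grade": "A"}]},
    {"student_name": "", "courses": [{"code": "PHYS201", "grade": "B+"}]},
]
record = merge_page_results(pages)
```

Keeping the first non-empty scalar is only one possible rule. If the same field appears on several pages with different values, you may prefer to compare confidence scores instead.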

Combining Pre-built Models with Custom Post-Processing

Layering Pre-built Models: You can use the "Read" model (text only) or the "Layout" model (text, tables, and structure) as a first step to extract basic content from the transcripts. Then apply custom logic (e.g., regex or NLP methods) to the extracted data to identify key fields like student name, course codes, or grades.

Custom Script-Based Solutions: If certain patterns are consistently missed by the models (e.g., grade formatting), consider writing custom scripts or regex-based extraction rules tailored to the structure of the missing data.
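As an illustration of the regex-based post-processing described above, here is a minimal sketch in Python. It assumes the text lines have already been pulled out of the Read or Layout result as plain strings. The row pattern (course code, title, credits, letter grade) and the helper `parse_course_rows` are hypothetical and would need adjusting to your transcript formats.

```python
import re

# Hypothetical row shape: "MATH101 Calculus I 4 A-"
COURSE_ROW = re.compile(
    r"^(?P<code>[A-Z]{2,4}\s?\d{3})\s+"   # course code, e.g. MATH101 or PHYS 201
    r"(?P<title>.+?)\s+"                  # course title (lazy match)
    r"(?P<credits>\d+(?:\.\d+)?)\s+"      # credits, e.g. 4 or 3.0
    r"(?P<grade>[A-F][+-]?)$"             # letter grade with optional +/-
)


def parse_course_rows(lines):
    """Return a dict per line that matches the course-row pattern."""
    rows = []
    for line in lines:
        match = COURSE_ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


lines = [
    "Student Name: Jane Doe",
    "MATH101 Calculus I 4 A-",
    "PHYS 201 Mechanics 3.0 B+",
    "Total credits 7",
]
courses = parse_course_rows(lines)
```

Lines that do not match are skipped silently here. Logging them can help you find formats the pattern does not yet cover.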

Document Intelligence Custom Neural Model

Explore the custom neural model in Document Intelligence if you haven't already. It allows for field tagging across complex document layouts and can be more adaptable to different transcript formats. Although it may need a larger and more varied training set, the model can potentially generalize better across transcript formats than rule-based approaches.

Use an Ensemble of Models

You could also consider using multiple models in parallel, where each model is trained to extract certain specific parts of the transcript, followed by combining the extracted results. This reduces the reliance on a single model getting everything right and gives more flexibility.
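One simple way to combine the results is to keep, for each field, the value with the highest confidence. Here is a minimal sketch in Python, assuming each model's output has been reduced to a dict of field name to a (value, confidence) pair. The helper `combine_model_outputs` and the field names are hypothetical.

```python
def combine_model_outputs(outputs):
    """Keep the highest-confidence (value, confidence) pair per field.

    `outputs` is a list of dicts mapping field name -> (value, confidence),
    one dict per model.
    """
    best = {}
    for output in outputs:
        for name, (value, confidence) in output.items():
            if name not in best or confidence > best[name][1]:
                best[name] = (value, confidence)
    return best


# Hypothetical outputs from two section-specific models.
header_model = {"student_name": ("Jane Doe", 0.97), "gpa": ("3.7", 0.40)}
grades_model = {"gpa": ("3.70", 0.92)}
combined = combine_model_outputs([header_model, grades_model])
```

Confidence scores from different models are not necessarily comparable, so treat this rule as a starting point and check it against labeled samples.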

Form Recognizer Pre-built Models

If you prefer to stick with pre-built models, the "Layout" model from Azure Document Intelligence might be better suited for handling complex documents where tabular data (e.g., grades) is involved. The "Invoice" model can sometimes perform well for line-item type extractions (like courses and grades) and might be worth experimenting with.
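To show how tabular output can be turned into usable records, here is a minimal sketch in Python. It assumes each table cell has been reduced to a plain dict with `row_index`, `column_index`, and `content` keys, which mirrors how the Layout model reports table cells, though you should verify the exact shape against the SDK version you use. The helpers `cells_to_grid` and `grid_to_records` are hypothetical.

```python
def cells_to_grid(cells, row_count, column_count):
    """Place each cell's content into a row_count x column_count grid."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid


def grid_to_records(grid):
    """Treat the first row as the header and return one dict per data row."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


# Hypothetical cells for a 3 x 2 grades table.
cells = [
    {"row_index": 0, "column_index": 0, "content": "Course"},
    {"row_index": 0, "column_index": 1, "content": "Grade"},
    {"row_index": 1, "column_index": 0, "content": "MATH101"},
    {"row_index": 1, "column_index": 1, "content": "A-"},
    {"row_index": 2, "column_index": 0, "content": "PHYS201"},
    {"row_index": 2, "column_index": 1, "content": "B+"},
]
records = grid_to_records(cells_to_grid(cells, row_count=3, column_count=2))
```

This assumes the first row is a header and that no cells span multiple rows or columns. Transcripts with merged cells would need extra handling.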

By training custom models and combining pre-built models with custom post-processing, you can improve extraction accuracy for varied transcript formats.

Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

Elangovan, Nandha 45 Reputation points

2024-09-23T09:25:00.7533333+00:00

Thanks a lot for your suggestions. Let me try it out.

Answer 2

sam ganjoriyan 0

It should definitely be implemented from designs with different colors and prioritized based on their value and time
