Thanks for reaching out to Microsoft Q&A.
To train a model that reliably extracts information from different types of student academic transcripts, given the problems you describe with cross-page labeling and inconsistent results from the pre-built models, here are a few approaches you can consider:
- Custom Model with Azure Form Recognizer (Document Intelligence)
Labeling Strategy: Because cross-page labeling is a limitation, you can split the document into individual pages, label the fields on each page, and then combine the extracted data in a post-processing step. This approach requires additional logic but makes multi-page transcripts workable.
Train on Specific Sections: Instead of training one model on the entire transcript, identify the sections that are common across transcript formats (e.g., student details, courses, grades) and train a separate model for each section.
Boosting Accuracy: If the "Document" or "Read" pre-built models are not extracting key data accurately, a custom-trained model with a well-labeled dataset might yield better results. Include transcripts with a variety of layouts and structures in your training set, with several samples of each layout, so that extraction holds up across formats.
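The page-by-page labeling strategy above needs a merge step after extraction. Here is a minimal sketch of that step, assuming each page's result has already been reduced to a plain dictionary of field name to value. The field names and the `merge_page_results` helper are hypothetical and not part of any Azure SDK.

```python
from typing import Dict, List, Optional

# Assumed shape: one dict per page, mapping field name -> extracted value (or None).
PageFields = Dict[str, Optional[str]]

# Illustrative field names. Fields listed here are expected once per transcript,
# so the first non-empty value wins. All other fields are collected into lists.
SINGLE_VALUE_FIELDS = {"student_name", "student_id"}


def merge_page_results(pages: List[PageFields]) -> Dict[str, object]:
    """Combine per-page extraction results into one transcript-level record."""
    merged: Dict[str, object] = {}
    for page in pages:
        for name, value in page.items():
            if value is None or value == "":
                continue  # the field was not found on this page
            if name in SINGLE_VALUE_FIELDS:
                merged.setdefault(name, value)
            else:
                merged.setdefault(name, []).append(value)
    return merged
```

For example, if the student name appears only on page 1 and course rows appear on every page, the merged record keeps the single name and a list of all course rows in page order.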
- Combining Pre-built Models with Custom Post-Processing
Layering Pre-built Models: You can use the "Read" or "Layout" model as a first step to extract the basic content from the transcripts. "Read" returns the text only, while "Layout" also returns tables and document structure. Then apply custom logic to the extracted data (e.g., regex or NLP methods) to identify key fields like student name, course codes, or grades.
Custom Script-Based Solutions: If certain patterns are consistently missed by the models (e.g., grade formatting), consider writing custom scripts or regex-based extraction rules tailored to the structure of the missing data.
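As an illustration of the regex-based post-processing described above, here is a small sketch that pulls course rows out of text lines returned by a first-pass model. The line format it assumes (course code, title, credits, grade) and the `extract_courses` helper are assumptions for this example, so you would need to adapt the pattern to your own transcripts.

```python
import re
from typing import Dict, List

# Assumed line format: "<course code> <title> <credits> <grade>",
# e.g. "CS101 Intro to Programming 3.0 A-". Adjust for real transcript layouts.
COURSE_LINE = re.compile(
    r"^(?P<code>[A-Z]{2,4}\s?\d{3,4})\s+"   # course code such as CS101 or MATH 2010
    r"(?P<title>.+?)\s+"                    # course title (shortest match)
    r"(?P<credits>\d+(?:\.\d+)?)\s+"        # credits such as 3 or 3.0
    r"(?P<grade>[A-F][+-]?|P|W|I)$"         # letter grade, or P / W / I
)


def extract_courses(lines: List[str]) -> List[Dict[str, str]]:
    """Return one dict per line that matches the assumed course-row pattern."""
    courses = []
    for line in lines:
        match = COURSE_LINE.match(line.strip())
        if match:
            courses.append(match.groupdict())
    return courses
```

Lines that do not match, such as headers or student details, are skipped, so it is worth logging them during development to see which patterns are still being missed.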
- Document Intelligence Custom Neural Model
Explore the custom neural model in Document Intelligence if you haven't already. It supports field labeling on complex document layouts and can adapt better to different transcript formats. Although it may require more training data, it can potentially generalize across transcript formats better than rule-based approaches.
- Use an Ensemble of Models
You could also run multiple models in parallel, with each model trained to extract specific parts of the transcript, and then combine the extracted results. This reduces the reliance on a single model getting everything right and gives you more flexibility.
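One simple way to combine the results is to keep, for each field, the value with the highest confidence score. The sketch below assumes each model's output has been reduced to a dictionary of field name to a (value, confidence) pair. The `FieldResult` type, the `combine_model_outputs` helper, and the 0.5 threshold are hypothetical choices for illustration.

```python
from typing import Dict, List, NamedTuple


class FieldResult(NamedTuple):
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def combine_model_outputs(
    outputs: List[Dict[str, FieldResult]],
    min_confidence: float = 0.5,
) -> Dict[str, FieldResult]:
    """For each field, keep the highest-confidence result across all models."""
    best: Dict[str, FieldResult] = {}
    for output in outputs:
        for name, result in output.items():
            if result.confidence < min_confidence:
                continue  # drop low-confidence values instead of trusting them
            current = best.get(name)
            if current is None or result.confidence > current.confidence:
                best[name] = result
    return best
```

Fields that no model extracts above the threshold are left out of the result, which makes them easy to route to manual review.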
- Form Recognizer Pre-built Models
If you prefer to stick with pre-built models, the "Layout" model from Azure Document Intelligence might be better suited for handling complex documents where tabular data (e.g., grades) is involved. The "Invoice" model can sometimes perform well for line-item type extractions (like courses and grades) and might be worth experimenting with.
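If you go with the "Layout" model, the tables come back as a flat list of cells with row and column positions. Here is a minimal sketch of turning such a list into one record per course row. The `Cell` class is a stand-in for the SDK's cell objects, and treating row 0 as the header row is an assumption that will not hold for every transcript.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cell:
    """Stand-in for a table cell from a layout result (illustrative only)."""
    row_index: int
    column_index: int
    content: str


def table_to_records(cells: List[Cell]) -> List[Dict[str, str]]:
    """Turn a flat cell list into one dict per data row, keyed by header text."""
    grid: Dict[int, Dict[int, str]] = {}
    for cell in cells:
        grid.setdefault(cell.row_index, {})[cell.column_index] = cell.content.strip()
    if 0 not in grid:
        return []  # no header row, nothing to key the records by
    header = grid[0]  # assumption: the first row holds the column names
    records = []
    for row_index in sorted(grid):
        if row_index == 0:
            continue
        row = grid[row_index]
        # Missing cells become empty strings so every record has the same keys.
        records.append({header[col]: row.get(col, "") for col in sorted(header)})
    return records
```

The resulting records can then be passed through the same kind of custom validation or regex checks used for the other fields.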
By training custom models and combining pre-built models with custom post-processing, you can improve extraction accuracy for varied transcript formats.