Hi Max, greetings!
Welcome to Microsoft Q&A.
It sounds like you’re facing a challenge with extracting data from Sysco food order invoices, particularly with tightly packed columns like "Pack" and "Size" that are causing the model to misinterpret or combine values. Here’s a structured approach to resolve these issues using the various Microsoft AI tools and resources available.
- Use the Prebuilt Invoice Model
The Prebuilt Invoice Model from Microsoft Syntex is designed to extract structured data from invoices automatically. You can test this model to see if it handles the layout of Sysco invoices well. The prebuilt model may already process common fields like "Item Description" and "Total Amount" effectively, but you can always combine its output with a custom approach if it doesn't fully meet your needs.
- Improve Training Data and Preprocessing
To ensure your model accurately processes invoices, you can improve your training data and preprocessing techniques. Follow the guidelines in Improving Form Processing Performance to optimize the performance of your custom model. Key recommendations include:
- Enhancing Labeling: Make sure fields such as "Pack" and "Size" are clearly labeled with precise, non-overlapping bounding boxes in your training data.
- Increasing Training Data: Aim for a diverse dataset to capture various invoice formats, ideally with more than five training examples.
- Whitespace Adjustments: Preprocess the images to artificially add more spacing between tightly packed columns to help the model better distinguish between them.
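As a rough illustration of the whitespace adjustment above, here is a minimal Python sketch. It assumes the table region has already been cropped and loaded as a list of rows of grayscale pixel values (0 = black, 255 = white). The function names, the `extra` parameter, and the `ink_threshold` value are hypothetical and chosen only for this example; they are not part of any Microsoft tool.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels: 0 = black ink, 255 = white


def blank_columns(image: Image, ink_threshold: int = 128) -> List[bool]:
    """Mark each x position that contains no ink in any row."""
    width = len(image[0])
    return [all(row[x] >= ink_threshold for row in image) for x in range(width)]


def widen_gaps(image: Image, extra: int = 2,
               ink_threshold: int = 128, fill: int = 255) -> Image:
    """Insert `extra` background columns into every blank run that sits
    between two inked regions (leading and trailing margins are untouched)."""
    blank = blank_columns(image, ink_threshold)
    width = len(blank)

    # Remember the last blank column of each interior gap.
    insert_at = set()
    seen_ink = False
    for x in range(width):
        if not blank[x]:
            seen_ink = True
        elif seen_ink and x + 1 < width and not blank[x + 1]:
            insert_at.add(x)

    widened = []
    for row in image:
        new_row = []
        for x, pixel in enumerate(row):
            new_row.append(pixel)
            if x in insert_at:
                new_row.extend([fill] * extra)
        widened.append(new_row)
    return widened


# Two "columns" of ink separated by a single white pixel.
tiny = [[0, 255, 0],
        [0, 255, 0]]
print(widen_gaps(tiny, extra=2))
# [[0, 255, 255, 255, 0], [0, 255, 255, 255, 0]]
```

On a real scan you would load the pixels with an imaging library and probably only widen gaps inside the line-item table, since this sketch treats any fully blank vertical strip as a column boundary.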
- Leverage Entity Extraction for Postprocessing
Once your data is extracted using the form processing model, you can use the Entity Extraction feature in AI Builder to refine the output. Entity extraction can help you accurately pull out specific pieces of information, such as the "Pack" and "Size" fields, from unstructured or semi-structured text. Here’s how:
- Define Custom Entities: For tightly packed fields, define "Pack" and "Size" as separate entities and train the model to recognize their patterns (e.g., numeric values followed by unit descriptors like "oz" or "lb").
- Postprocess Data: After extracting raw text from the invoice, run a postprocessing step to separate incorrectly combined values based on learned patterns.
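To make the postprocessing idea concrete, here is a small Python sketch that separates a merged "Pack"/"Size" cell. It is a plain regular-expression stand-in for the pattern described above, not AI Builder's entity extraction itself. The unit list, the accepted separators (a space or a slash), and the function name `split_pack_size` are assumptions for illustration; adjust them to the values that actually appear on your invoices.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <pack count> <space or slash> <quantity> <unit>
# The unit list is a guess at common abbreviations, not an official list.
PACK_SIZE = re.compile(
    r"^\s*(?P<pack>\d+)\s*[/ ]\s*"
    r"(?P<qty>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>OZ|LB|GAL|CT|KG|G|ML|L)\s*$",
    re.IGNORECASE,
)


def split_pack_size(raw: str) -> Optional[Tuple[int, str]]:
    """Split a merged cell such as '12 4 OZ' into (pack, size).

    Returns None when the text does not match the assumed pattern, so
    ambiguous values (for example '124 OZ') can be routed to manual review
    instead of being guessed.
    """
    match = PACK_SIZE.match(raw)
    if match is None:
        return None
    size = f"{match.group('qty')} {match.group('unit').upper()}"
    return int(match.group("pack")), size


print(split_pack_size("12 4 OZ"))   # (12, '4 OZ')
print(split_pack_size("6/5 lb"))    # (6, '5 LB')
print(split_pack_size("124 OZ"))    # None
```

Returning `None` for values with no visible separator is deliberate: "124 OZ" could be 1 x 24 OZ or 12 x 4 OZ, and the text alone cannot tell you which.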
- Create Custom Extractors
If the prebuilt models don’t meet your specific requirements, you can create custom extractors using Microsoft Syntex Custom Extractors. This gives you more control over how data is extracted, especially for fields like "Pack" and "Size" that require specific attention due to their proximity in the layout.
By combining these approaches:
- Test the prebuilt invoice model to see if it handles your layout well.
- Improve training data using best practices for preprocessing and annotation.
- Use entity extraction to handle tightly packed columns like "Pack" and "Size."
- If needed, create custom extractors for even finer control over data extraction.
With these strategies, you should be able to significantly improve the accuracy and efficiency of your model in processing Sysco invoices.
For more detailed steps, follow the links:
- Prebuilt Invoice Model
- Improving Form Processing Performance
- Entity Extraction Overview
- Creating Custom Extractors
I hope this helps! If you have further questions or need more assistance, feel free to ask.
If you found this response helpful, please accept the answer.
Regards,
Chakaravarthi Rangarajan Bhargavi