Custom Extraction Model not extracting correct data after training

max 20 Reputation points
2024-12-26T03:28:01.5033333+00:00

Process-food-invoices-Sysco-Wasabi

Heya there, I am training a custom model to extract info from a Sysco food order invoice. The problem I'm having is that the values for different columns are printed incredibly close together, and the model is having trouble parsing them as separate columns. I went through and individually demarcated the values, put them in table format, and trained the model, but it still has the same issue. Is there anything I can do to fix this? (Do I need more training files? I currently have 5. Or is there some hidden feature that I am missing?) I am attaching a picture of a Sysco invoice so you can see what I mean. The main problem is with the Pack and Size columns, which the model either treats as one value or splits incorrectly. Sometimes Item Description and Size also bleed into one another. Like I said, I trained a model to discern them as separate, but it did not work. Any help would be greatly appreciated.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
2,100 questions

Accepted answer
  1. Chakaravarthi Rangarajan Bhargavi 1,115 Reputation points MVP
    2024-12-26T04:41:18.3533333+00:00

    Hi max, Greetings!

    Welcome to Microsoft Q&A.

    It sounds like you’re facing a challenge with extracting data from Sysco food order invoices, particularly with tightly packed columns like "Pack" and "Size" that are causing the model to misinterpret or combine values. Here’s a structured approach to resolve these issues using the various Microsoft AI tools and resources available.

    1. Use the Prebuilt Invoice Model

    The Prebuilt Invoice Model from Microsoft Syntex is designed to extract structured data from invoices automatically. Since your question is about Azure AI Document Intelligence, note that it also offers a prebuilt invoice model (model ID prebuilt-invoice). Test it to see how well it handles the Sysco layout. It may already extract common fields such as "Item Description" and "Total Amount" effectively, and you can combine its output with a custom model if it doesn't fully meet your needs.

    2. Improve Training Data and Preprocessing

    To ensure your model accurately processes invoices, you can improve your training data and preprocessing techniques. Follow the guidelines in Improving Form Processing Performance to optimize the performance of your custom model. Key recommendations include:

    • Enhancing Labeling: Make sure fields such as "Pack" and "Size" are each labeled with their own precise bounding box in your training data.
    • Increasing Training Data: Aim for a diverse dataset to capture various invoice formats, ideally with more than five training examples.
    • Whitespace Adjustments: Preprocess the images to artificially add more spacing between tightly packed columns to help the model better distinguish between them.
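
    As a rough illustration of the whitespace adjustment, here is a minimal sketch that inserts blank pixels at known column boundaries. It assumes the page is already loaded as a grayscale grid (a list of rows of 0-255 values) and that you know the x positions where columns meet; the function name and parameters are hypothetical, and in practice you would probably do the same slicing with an imaging library. Whatever spacing you add to the training images should also be applied to the images you send for analysis, and labels drawn before the change would need to be redone.

```python
WHITE = 255  # grayscale value used to fill the inserted gaps


def widen_column_gaps(rows, boundaries, gap):
    """Return a copy of a grayscale image with `gap` white pixels
    inserted at every x position listed in `boundaries`.

    rows       -- list of pixel rows, each a list of 0-255 ints
    boundaries -- x positions (in the original image) where columns meet
    gap        -- number of white pixels to insert at each boundary
    """
    cuts = sorted(set(boundaries))
    widened = []
    for row in rows:
        new_row = []
        previous = 0
        for x in cuts:
            # Copy the pixels up to the boundary, then pad with white.
            new_row.extend(row[previous:x])
            new_row.extend([WHITE] * gap)
            previous = x
        # Copy whatever is left after the last boundary.
        new_row.extend(row[previous:])
        widened.append(new_row)
    return widened


# Tiny 2x4 "image": insert 3 white pixels between x=1 and x=2.
image = [
    [0, 0, 0, 0],
    [0, 255, 0, 0],
]
wider = widen_column_gaps(image, boundaries=[2], gap=3)
```

    Each row grows by gap pixels per boundary, so the example rows go from 4 to 7 pixels wide.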

    3. Leverage Entity Extraction for Postprocessing

    Once your data is extracted using the form processing model, you can use the Entity Extraction feature in AI Builder to refine the output. Entity extraction can help you accurately pull out specific pieces of information, such as the "Pack" and "Size" fields, from unstructured or semi-structured text. Here’s how:

    • Define Custom Entities: For tightly packed fields, define "Pack" and "Size" as separate entities and train the model to recognize their patterns (e.g., numeric values followed by unit descriptors like "oz" or "lb").
    • Postprocess Data: After extracting raw text from the invoice, run a postprocessing step to separate incorrectly combined values based on learned patterns.
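
    The postprocessing step can also be a plain rule rather than a trained entity model. The sketch below is one way to do it with a regular expression, assuming the merged text looks like a whole-number Pack, some whitespace, then a Size made of a number and a unit (for example "6 5 LB"). The unit list, the function name, and the choice to return None for ambiguous text such as "65 LB" are all assumptions for illustration, not something taken from the Sysco format.

```python
import re

# Hypothetical unit list; extend it with the units on your own invoices.
UNITS = ("LB", "OZ", "GAL", "CT", "KG", "ML", "L")

# Pack = integer, then whitespace, then Size = number + unit.
PACK_SIZE = re.compile(
    r"^\s*(?P<pack>\d+)\s+(?P<num>\d+(?:\.\d+)?)\s*(?P<unit>%s)\s*$"
    % "|".join(UNITS),
    re.IGNORECASE,
)


def split_pack_size(text):
    """Split a merged 'Pack Size' string into (pack, size).

    Returns None when the text does not match the assumed pattern,
    so ambiguous values can be flagged for manual review instead of
    being split by guesswork.
    """
    match = PACK_SIZE.match(text)
    if match is None:
        return None
    size = f"{match.group('num')} {match.group('unit').upper()}"
    return int(match.group("pack")), size


print(split_pack_size("6 5 LB"))   # (6, '5 LB')
print(split_pack_size("4 1GAL"))   # (4, '1 GAL')
print(split_pack_size("65 LB"))    # None: no space between pack and size
```

    Returning None for text without a clear separator keeps the rule conservative: "65 LB" could be a pack of 6 at 5 LB or a single 65 LB item, and a regex alone cannot tell which.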

    4. Create Custom Extractors

    If the prebuilt models don’t meet your specific requirements, you can create custom extractors using Microsoft Syntex Custom Extractors. This gives you more control over how data is extracted, especially for fields like "Pack" and "Size" that require specific attention due to their proximity in the layout.

    By combining these approaches:

    • Test the prebuilt invoice model to see if it handles your layout well.
    • Improve training data using best practices for preprocessing and annotation.
    • Use entity extraction to handle tightly packed columns like "Pack" and "Size."
    • If needed, create custom extractors for even finer control over data extraction.

    With these strategies, you should be able to significantly improve the accuracy and efficiency of your model in processing Sysco invoices.

    For more detailed steps, see the Microsoft documentation for each of the features mentioned above.

    I hope this helps! If you have further questions or need more assistance, feel free to ask.

    If you found this response helpful, please accept the answer.

    Regards,

    Chakaravarthi Rangarajan Bhargavi


0 additional answers
