Content Understanding: issues recognizing grouped line items in invoices

Question

Alexandre Perebaskine

Hello,

I'm currently trying to use Content Understanding to extract data from invoices. The issue I'm running into is that many of those invoices are consolidated: they contain references to multiple delivery notes and group one or more line items below each delivery note reference. Moreover, the delivery note references are often found inside the description field of line items instead of in a separate column.

Here's an example:

User's image

I am finding that I cannot get the analyzer to reliably associate the delivery note info with the corresponding line items. This is especially true where a line item group spans multiple pages (worse when the delivery note info is on the last line of a page, with every associated line item on the following pages). Adding labeled data doesn't seem to affect the output.

Currently, the only thing I can get to work reliably is to extract delivery note IDs separately from the line items. I am thinking I should be able to use the positional data returned by the analyzer in my own backend logic to associate the delivery note numbers with the relevant line items.

I'm also finding that processing large invoices (5 or more pages) fails when using a custom schema unless said schema is very lightweight. prebuilt-invoice seems to work fine. I was thinking of running prebuilt-invoice, then a custom schema for the extra data that I need, although that has the downside of incurring extra costs from running multiple analyses per invoice.

Do I have the right idea here, or am I missing something?

Thank you.

Answer accepted by question author

Answer 1

Hi,

Thanks for reaching out to Microsoft Q&A.

Content Understanding will not reliably infer parent-child relationships when the grouping key (delivery note) is embedded inside the description or appears across page breaks. Prebuilt models handle layout reasonably, but custom schemas break far sooner because they enforce structure that simply is not present in the document.

Expanded:

Grouping by delivery note inside CU is unreliable

The model cannot reliably segment table “sub-groups” when the only signal is free text inside descriptions, particularly when a group spans pages. Labeled data helps detect fields, but it does not teach the model to interpret implicit grouping logic.

Extracting delivery notes separately + using coordinates is valid

Using line bounding boxes and Y-position logic server-side is the most reliable method today. Essentially:

OCR -> extract lines + positions

Track the last seen delivery note above the row

Assign it to subsequent rows until another note appears

This works even across page boundaries.
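
The steps above can be sketched as a forward-fill over rows sorted into reading order. This is only an illustration under stated assumptions, not Content Understanding's API: the `Row` type, its `page`/`top` fields, and the `is_note` flag are hypothetical stand-ins for whatever positional data you map out of the analyzer result, and the sketch assumes the Y coordinate grows downward on each page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Hypothetical, simplified view of one extracted row (not a CU type)."""
    page: int      # 1-based page number
    top: float     # Y coordinate of the row's top edge (assumed to grow downward)
    text: str      # line item description, or the delivery note ID
    is_note: bool  # True if this row is a delivery note reference


def assign_delivery_notes(rows: list[Row]) -> list[tuple[Row, Optional[str]]]:
    """Pair each line item with the most recent delivery note above it."""
    current_note: Optional[str] = None
    assigned = []
    # Reading order: page first, then vertical position on the page.
    for row in sorted(rows, key=lambda r: (r.page, r.top)):
        if row.is_note:
            current_note = row.text  # a new group starts here
        else:
            assigned.append((row, current_note))
    return assigned


rows = [
    Row(1, 100.0, "DN-001", True),
    Row(1, 120.0, "Widget A", False),
    Row(1, 700.0, "DN-002", True),    # note on the last line of page 1
    Row(2, 50.0, "Widget B", False),  # its line items start on page 2
    Row(2, 70.0, "Widget C", False),
]
for item, note in assign_delivery_notes(rows):
    print(item.text, "->", note)
```

Because the current note is carried across the sort rather than reset per page, the case where the note sits on the last line of one page and its items follow on the next needs no special handling. Line items that appear before any note get `None`, which you may want to flag for review.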

Prebuilt + custom is a real trade-off

Prebuilt invoice will scale better on long documents because it leans on layout heuristics and dynamic tables instead of strict schemas. Running a second custom model for additional fields is pragmatic, and yes it costs more, but it saves you from schema failures.

Alternative options if scale becomes painful

Split pages and process in batches, then recombine logic yourself

Switch to Azure Document Intelligence with the layout model + your own post-processing

Build a small ML/NLP classifier to detect lines containing delivery note references
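
Before investing in a trained classifier for the last option, a plain regular expression can serve as a baseline for spotting delivery note references inside description text. Note that this swaps in a rule-based detector rather than the ML/NLP classifier mentioned above. The pattern is an assumption for illustration: the keywords ("delivery note", "DN") and the ID shape are hypothetical and need adapting to the invoice formats you actually receive.

```python
import re
from typing import Optional

# Hypothetical pattern: a keyword, an optional separator ("no.", "#", ":", "-"),
# then an ID of at least four characters that contains at least one digit.
DELIVERY_NOTE_RE = re.compile(
    r"\b(?:delivery\s+note|DN)\s*(?:no\.?|#|:|-)?\s*"
    r"((?=[A-Z0-9\-/]*\d)[A-Z0-9][A-Z0-9\-/]{3,})",
    re.IGNORECASE,
)


def find_delivery_note(description: str) -> Optional[str]:
    """Return the delivery note ID found in a description, or None."""
    match = DELIVERY_NOTE_RE.search(description)
    return match.group(1) if match else None


for text in (
    "Delivery note no. 4500123 dated 01/02",
    "Ref. DN: 88-1042 / blue widgets",
    "Widget A, 10 pcs",
):
    print(repr(text), "->", find_delivery_note(text))
```

Rows where this returns an ID can be treated as group headers in your post-processing, and everything else as ordinary line items. If the rule misses too many formats, the same function signature can later be backed by a trained model.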

You are exactly on the right track. Treat CU like a text extractor + entity detector, and handle grouping logic yourself. Trying to force the model to infer grouping rules the document does not explicitly encode is a dead end.

Please 'Upvote' (thumbs-up) and 'Accept' as answer if the reply was helpful. This will benefit other community members who face the same issue.
