Table extraction with custom neural model given varying data types in user input and, by consequence, in train data

jum 20 Reputation points
2025-06-02T15:38:47.15+00:00

Hi there,

we are implementing a processing pipeline that extracts information from various order forms. As each form already comes with a significant amount of variation in layout and in the fields to be extracted (due to existing parallel versions), we decided to train one custom neural model per form. The individual models will then be composed.

For field types string and selection mark, everything works pretty well. We are running into issues when extracting info from tables. One of the tables in question holds data that captures quantities per class. This table is of particular importance as it occurs across forms. Real user data shows that some users input numbers (as expected, a quantity per class), while others just use crosses and treat table cells as checkboxes. In the latter case, extraction performs poorly. Questions I have:

  • Is the overall approach (composed model) sound?
  • What is the best approach to custom table extraction? For the table mentioned above, is it possible to have a flexible fieldType? In fields.json, each field in definitions currently shows up as type string. I assume this causes the poor performance.
  • As I understand it, the model does not seem to learn that input data can be a selection mark OR a number. So labelling examples of each variant has no effect? Or would it be more reasonable to break the table structure into individual fields when labelling and ignore the table layout itself, assuming the model then learns mixed data types?

I am training the neural models with GA 4.0. I cannot disclose any custom data, but a simplified mock-up of the table would look like the screenshot below. As stated above, instead of the numbers there could be crosses recognized as selection marks. Also note that users only fill in a subset of the available cells.

[Screenshot: simplified mock-up of the table, 2025-06-02]

Thanks in advance :)

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  Jerald Felix 1,630 Reputation points
    2025-06-02T16:25:06.23+00:00

    Hello jum,

    Yes. When every form family has its own layout quirks, the recommended pattern is to train a separate custom neural model for each family and then compose them into a single endpoint. The Compose feature performs an automatic “best fit” classification at run-time and currently supports up to 200 child models per composed model.

    Why the table with “numbers or check-boxes” is tricky

    • Custom neural v4.0 understands three data types: key-value pairs (string/number/date), selection-marks, and tabular fields. It does not let the same field switch type between samples.
    • Inside a table, every cell value is ultimately treated as text. If a user draws a check-box, the Layout engine emits the Unicode symbols ☒ / ☐ (selected / unselected).

    When your fields.json says the column is string, cells that contain only a selection-mark end up as empty strings, so the model gets no signal and accuracy drops.


    Pattern that works well in production

    1. Pick a single field type per column – either keep the column as string and map "☒" to "1" in post-processing, or split the logic into two separate fields (qtyNumber, qtyChecked). This keeps the label schema consistent, so the network converges.
    2. Label both variants – include at least 5 documents with numbers and 5 documents with check-boxes in the same training set, so the neural model learns both visual patterns.
    3. If you must keep the table semantics but also detect the mark – use the new overlapping-fields capability (v4.0, 2024-11-30 API): label the cell once as part of the table and overlay a selection field on the same tokens. Remember the limits – at most two overlapping fields, and they cannot span multiple rows. This provides the best of both worlds: structured rows plus a boolean field you can map to "quantity = 1".
    4. Fixed vs. dynamic table – if the column/row count never changes, choose Fixed Table; otherwise label it as Dynamic Table so variable row counts don't hurt recall. This gives the model the right structural prior.
    5. Post-process for business meaning – convert: number ⇒ quantity; selected mark ⇒ quantity = 1; empty cell ⇒ quantity = 0. This keeps the extraction model generic while business rules live in code.

    Does adding more labels help?

    Only if they follow the rules above. Simply mixing “123” and “☒” in the same field without consistent typing will keep the network confused and you will continue to see empty cells for the mark variant.


    Quick checklist

    Schema – verify every field in fields.json has exactly one type.

    Samples – balanced dataset: ≥ 5 pages per visual variant (numbers vs. marks).

    API version – train and run with 2024-11-30 (v4.0 GA) so that tabular fields, cell-confidence and overlapping-fields are all enabled.

    Compose rebuilt – after re-training a child model, remember to re-compose it so the new extractor is available through the composed model.
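The schema item of the checklist is easy to automate. A small sanity check, assuming the fields.json layout the question describes (a top-level definitions map with a "type" entry per field) – verify this against your own files, as the exact schema is an assumption here:

```python
def fields_with_ambiguous_type(schema: dict) -> list[str]:
    """Return names of definitions whose "type" is missing or empty.

    Hypothetical sanity check over a parsed fields.json; the
    definitions/type layout is assumed from the question, not
    taken from official documentation.
    """
    bad = []
    for name, spec in schema.get("definitions", {}).items():
        field_type = spec.get("type")
        if not isinstance(field_type, str) or not field_type:
            bad.append(name)
    return bad
```

Run it against each labeling project before training; any name it returns points to a field that will silently degrade accuracy.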


    TL;DR

    Your composed-model architecture is solid. For the “quantity or check-box” column either (a) treat everything as text and map “☒” later, or (b) split the information into two fields using the v4.0 overlapping-fields feature. What you cannot do is let one field randomly switch between string and selection-mark across documents — Document Intelligence will treat the mismatched samples as “missing” and accuracy will suffer.

    Hope that clears things up!

    Best Regards,

    Jerald Felix

