Table extraction with custom neural model given varying data types in user input and, by consequence, in train data

jum 20 Reputation points
2025-06-02T15:38:47.15+00:00

Hi there,

we are implementing a processing pipeline that extracts information from various order forms. As each form already comes with a significant amount of variation in layout and in the fields to be extracted (due to existing parallel versions), we decided to train one custom neural model per form. The individual models will then be composed.

For field types string and selection mark, everything works pretty well. We are running into issues when extracting info from tables. One of the tables in question holds data that captures quantities per class. This table is of particular importance as it occurs across forms. Real user data shows that some users input numbers (as expected, a quantity per class), while others just use crosses and treat table cells as checkboxes. In the latter case, extraction performs poorly. Questions I have:

  • Is the overall approach (composed model) sound?
  • What is the best approach to custom table extraction? For the table mentioned above, is it possible to have a flexible fieldType? In fields.json, each field in definitions currently shows up as type string. I assume this causes the poor performance.
  • As I understand it, the model does not seem to learn that input data can be a selection mark OR a number. So labelling examples of each variant has no effect? Or would it be more reasonable to break the table structure into individual fields when labelling and ignore the table layout itself, assuming the model then learns mixed data types?

I am training the neural models with GA 4.0. I cannot disclose any custom data, but a simplified mock-up of the table would look like the screenshot below. As stated above, instead of the numbers there could be crosses recognized as selection marks. Also note that users only fill in a subset of the available cells.

[Screenshot: simplified mock-up of the table, 2025-06-02]

Thanks in advance :)

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  Jerald Felix 1,630 Reputation points
    2025-06-02T16:25:06.23+00:00

    Hello jum,

    Yes. When every form family has its own layout quirks, the recommended pattern is to train a separate custom neural model for each family and then compose them into a single endpoint. The Compose feature performs an automatic “best fit” classification at run-time and currently supports up to 200 child models per composed model.

    Why the table with “numbers or check-boxes” is tricky

    • Custom neural v4.0 understands three data types: key-value pairs (string/number/date), selection-marks, and tabular fields. It does not let the same field switch type between samples.
    • Inside a table, every cell value is ultimately treated as text. If a user draws a check-box, the Layout engine emits the Unicode symbols ☒ / ☐ (selected / unselected).

    When your fields.json says the column is string, cells that contain only a selection-mark end up as empty strings, so the model gets no signal and accuracy drops.


    Pattern that works well in production

    1. Pick a single field type per column – either keep the column as string and map "☒" to "1" in post-processing, or split the logic into two separate fields (qtyNumber, qtyChecked). This keeps the label schema consistent, so the network converges.
    2. Label both variants – include at least 5 documents with numbers and 5 documents with check-boxes in the same training set, so the neural model learns both visual patterns.
    3. If you must keep the table semantics but also detect the mark – use the new overlapping-fields capability (v4.0, 2024-11-30 API): label the cell once as part of the table and overlay a selection field on the same tokens. Remember the limits – at most two overlapping fields, and they cannot span multiple rows. This provides the best of both worlds: structured rows plus a boolean field you can map to "quantity = 1".
    4. Fixed vs. dynamic table – if the column/row count never changes, choose Fixed Table; otherwise label it as Dynamic Table so variable row counts don't hurt recall. This gives the model the right structural prior.
    5. Post-process for business meaning – convert: number ⇒ quantity; selected mark ⇒ quantity = 1; empty cell ⇒ quantity = 0. This keeps the extraction model generic while business rules live in code.

    Does adding more labels help?

    Only if they follow the rules above. Simply mixing “123” and “☒” in the same field without consistent typing will keep the network confused and you will continue to see empty cells for the mark variant.


    Quick checklist

    Schema – verify every field in fields.json has exactly one type.

    Samples – balanced dataset: ≥ 5 pages per visual variant (numbers vs. marks).

    API version – train and run with 2024-11-30 (v4.0 GA) so that tabular fields, cell-confidence and overlapping-fields are all enabled.

    Compose rebuilt – after re-training a child model, remember to re-compose it so the new extractor is available through the composed model.
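The schema item of the checklist is easy to automate. A small sanity check, assuming the fields.json layout the question describes (a top-level definitions map with a "type" entry per field) – verify this against your own files, as the exact schema is an assumption here:

```python
def fields_with_ambiguous_type(schema: dict) -> list[str]:
    """Return names of definitions whose "type" is missing or empty.

    Hypothetical sanity check over a parsed fields.json; the
    definitions/type layout is assumed from the question, not
    taken from official documentation.
    """
    bad = []
    for name, spec in schema.get("definitions", {}).items():
        field_type = spec.get("type")
        if not isinstance(field_type, str) or not field_type:
            bad.append(name)
    return bad
```

Run it against each labeling project before training; any name it returns points to a field that will silently degrade accuracy.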


    TL;DR

    Your composed-model architecture is solid. For the “quantity or check-box” column either (a) treat everything as text and map “☒” later, or (b) split the information into two fields using the v4.0 overlapping-fields feature. What you cannot do is let one field randomly switch between string and selection-mark across documents — Document Intelligence will treat the mismatched samples as “missing” and accuracy will suffer.

    Hope that clears things up!

    Best Regards,

    Jerald Felix

