When training a custom extraction model, how should column names be labeled to ensure accurate recognition during testing?

Question

I have a collection of documents containing various tables. Column names cannot be labeled directly: we first have to insert the column names in the first row (i.e., #0) and then use that row as a template to rename the columns. However, when I train a model with this data and later test it with new documents, the model does not recognize the column names from the test document, even if the document's structure matches the training data. Instead, it defaults to the column names from the training document. Could someone advise on how to label the column names so that the model recognizes the correct column names from the test documents?

Answer

@neha_b I see you have given your columns the same names as the columns in your document. This means every document from which the table is extracted will return those same column names. If you expect the column names to change, provide generic column names that can be easily identified in your extracted response.

Also, it seems you have used a dynamic table tag, where only the columns can be named. If your documents contain dynamic tables, i.e., tables with a varying number of rows, this is the best approach. If your tables have a fixed number of rows and columns, use a fixed table for your table tag and name both the columns and the rows.

Now, for the final labeling of cells during training: you don't have to label every table in your form with a table tag, and your table tags don't have to replicate the structure of every table found in your form. Tables extracted automatically by Document Intelligence will be included in the pageResults section of the JSON output.
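
If you need the column names exactly as they are printed in each test document, one option is to read them from the automatically extracted tables rather than from the table tag. Here is a minimal sketch of that idea. It assumes the JSON has an `analyzeResult.pageResults[].tables[]` layout where each table carries `rows`, `columns`, and a list of `cells` with `rowIndex`, `columnIndex`, and `text`; check these field names against your actual response, since they can differ between API versions. The function name and the sample data are made up for illustration.

```python
from typing import Any


def tables_from_page_results(response: dict[str, Any]) -> list[list[list[str]]]:
    """Rebuild every automatically extracted table as a grid of cell texts.

    Assumes each table has "rows", "columns", and "cells", and that each cell
    has "rowIndex", "columnIndex", and "text" (verify against your own JSON).
    """
    grids = []
    pages = response.get("analyzeResult", {}).get("pageResults", [])
    for page in pages:
        for table in page.get("tables", []):
            # Start from an empty grid so missing cells stay as "".
            grid = [[""] * table["columns"] for _ in range(table["rows"])]
            for cell in table["cells"]:
                grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
            grids.append(grid)
    return grids


# Hypothetical, hand-written response fragment used only for illustration.
sample = {
    "analyzeResult": {
        "pageResults": [
            {
                "page": 1,
                "tables": [
                    {
                        "rows": 2,
                        "columns": 2,
                        "cells": [
                            {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
                            {"rowIndex": 0, "columnIndex": 1, "text": "Qty"},
                            {"rowIndex": 1, "columnIndex": 0, "text": "Pen"},
                            {"rowIndex": 1, "columnIndex": 1, "text": "2"},
                        ],
                    }
                ],
            }
        ]
    }
}

grids = tables_from_page_results(sample)
# The first row of the first table holds the headers as printed in the document.
header, *rows = grids[0]
```

With this approach the first row of each grid reflects whatever header text the test document actually contains, independent of the names chosen for the table tag during labeling.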

This means table tags are helpful whenever you want to extract items as a table from your form; the items need not be part of an actual table. For instance, suppose your form has a list of people that includes a first name, a last name, and an email address for each person, and you would like to extract this information. In this case, you could use a table tag with first name, last name, and email address as columns, where each row is populated with information about one person from your list.
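
To make the people-list example concrete, here is a rough sketch of turning such a table tag into plain rows once the model returns it. It assumes the dynamic table field comes back as an array of row objects (`valueArray` of `valueObject`, with each cell exposing `valueString` or `content`); that shape is an assumption, so compare it with your own response before relying on it. The column names `FirstName`, `LastName`, and `Email`, the helper name, and the sample values are hypothetical.

```python
from typing import Any


def rows_from_dynamic_table(field: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten an assumed dynamic-table field into one dict per row."""
    rows = []
    for item in field.get("valueArray", []):
        columns = item.get("valueObject", {})
        # Prefer the typed string value; fall back to raw content, then "".
        rows.append(
            {
                name: cell.get("valueString", cell.get("content", ""))
                for name, cell in columns.items()
            }
        )
    return rows


# Hypothetical field as it might appear for a table tag with three columns.
people_field = {
    "type": "array",
    "valueArray": [
        {
            "type": "object",
            "valueObject": {
                "FirstName": {"valueString": "Ana"},
                "LastName": {"valueString": "Lopez"},
                "Email": {"valueString": "ana@example.com"},
            },
        },
        {
            "type": "object",
            "valueObject": {
                "FirstName": {"valueString": "Raj"},
                "LastName": {"valueString": "Patel"},
                "Email": {"content": "raj@example.com"},
            },
        },
    ],
}

people = rows_from_dynamic_table(people_field)
```

Each entry in `people` is keyed by the column names defined on the table tag, which is why generic, stable names are easier to work with downstream.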

Please review the guidance on usage of table tags and try to create a new model with relevant tags to extract information from the form.
