Document Intelligence custom classification model
Important
- Document Intelligence public preview releases provide early access to features that are in active development.
- Features, approaches, and processes may change before General Availability (GA) based on user feedback.
- The public preview versions of the Document Intelligence client libraries default to REST API version 2023-10-31-preview.
This content applies to: v4.0 (preview) | Previous version: v3.1 (GA)
This content applies to: v3.1 (GA) | Latest version: v4.0 (preview)
Important
- Starting with the 2023-10-31-preview API, analyzing documents with the custom classification model won't split documents by default.
- You need to explicitly set the splitMode property to auto to preserve the behavior from previous releases. The default for splitMode is none.
- If your input file contains multiple documents, you need to enable splitting by setting splitMode to auto.
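As a minimal sketch of opting back in to splitting, the helper below builds an analyze URL that carries an explicit split mode. The path follows the style of the other URLs in this article. The helper name `build_analyze_url` is hypothetical, and the query parameter name `split` is an assumption: this article only names the splitMode property, so confirm in the REST API reference how the service expects it to be passed.

```python
from urllib.parse import urlencode

# Split modes named in this article; the service may accept others.
KNOWN_SPLIT_MODES = {"auto", "none"}


def build_analyze_url(endpoint, classifier_id, split_mode="none",
                      api_version="2023-10-31-preview",
                      split_param="split"):
    """Build a classifier analyze URL with an explicit split mode.

    ASSUMPTION: the split mode travels as a query parameter whose name
    is `split_param`; verify the real name in the API reference.
    """
    if split_mode not in KNOWN_SPLIT_MODES:
        raise ValueError(f"unsupported split mode: {split_mode!r}")
    query = urlencode({"api-version": api_version, split_param: split_mode})
    return (f"https://{endpoint}/documentintelligence/"
            f"documentClassifiers/{classifier_id}:analyze?{query}")


# Example: request splitting for a file that holds several documents.
print(build_analyze_url("example.test", "demo2.1", split_mode="auto"))
```

Keeping the parameter name as an argument makes it easy to correct if the service uses a different name.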
Custom classification models are deep-learning-model types that combine layout and language features to accurately detect and identify documents you process within your application. Custom classification models perform classification of an input file one page at a time to identify the document(s) within and can also identify multiple documents or multiple instances of a single document within an input file.
Model capabilities
Custom classification models can analyze single- or multi-file documents to identify whether any of the trained document types are contained within an input file. Here are the currently supported scenarios:
A single file containing one document. For instance, a loan application form.
A single file containing multiple documents. For instance, a loan application package containing a loan application form, payslip, and bank statement.
A single file containing multiple instances of the same document. For instance, a collection of scanned invoices.
✔️ Training a custom classifier requires at least two distinct classes and a minimum of five document samples per class. The model response contains the page ranges for each of the classes of documents identified.
✔️ The maximum allowed number of classes is 500. The maximum allowed number of document samples per class is 100.
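The limits above can be checked locally before a training request is submitted. The following is a minimal sketch, assuming the training set is held as a dictionary that maps each class name to its list of sample files. The function name and the data layout are hypothetical and not part of the service.

```python
# Limits quoted in this article.
MIN_CLASSES = 2
MAX_CLASSES = 500
MIN_SAMPLES_PER_CLASS = 5
MAX_SAMPLES_PER_CLASS = 100


def check_training_set(samples_by_class):
    """Return a list of problems; an empty list means the limits are met.

    `samples_by_class` maps a class name to its list of sample files
    (a hypothetical layout chosen for this sketch).
    """
    problems = []
    n_classes = len(samples_by_class)
    if n_classes < MIN_CLASSES:
        problems.append(f"need at least {MIN_CLASSES} classes, got {n_classes}")
    if n_classes > MAX_CLASSES:
        problems.append(f"at most {MAX_CLASSES} classes allowed, got {n_classes}")
    for name, samples in sorted(samples_by_class.items()):
        count = len(samples)
        if count < MIN_SAMPLES_PER_CLASS:
            problems.append(f"class {name!r} has {count} samples, "
                            f"needs at least {MIN_SAMPLES_PER_CLASS}")
        elif count > MAX_SAMPLES_PER_CLASS:
            problems.append(f"class {name!r} has {count} samples, "
                            f"at most {MAX_SAMPLES_PER_CLASS} allowed")
    return problems
```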
The model classifies each page of the input document to one of the classes in the labeled dataset. Use the confidence score from the response to set the threshold for your application.
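One way to apply a confidence threshold is to separate the classified documents into those accepted automatically and those sent for review. This is a minimal sketch, assuming each document is a dictionary with the `docType` and `confidence` fields of the model response. The function name and the 0.8 threshold are illustrative assumptions; choose a threshold that suits your own data.

```python
def split_by_confidence(documents, threshold):
    """Partition classified documents around a confidence threshold.

    Returns (accepted, needs_review). A document with no confidence
    value is treated as 0.0 and therefore lands in needs_review.
    """
    accepted, needs_review = [], []
    for doc in documents:
        if doc.get("confidence", 0.0) >= threshold:
            accepted.append(doc)
        else:
            needs_review.append(doc)
    return accepted, needs_review


# Illustrative input; the values are made up for this sketch.
docs = [
    {"docType": "formA", "confidence": 0.97},
    {"docType": "formB", "confidence": 0.41},
]
accepted, needs_review = split_by_confidence(docs, threshold=0.8)
print([d["docType"] for d in accepted], [d["docType"] for d in needs_review])
```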
Compare custom classification and composed models
A custom classification model can replace a composed model in some scenarios but there are a few differences to be aware of:
Capability | Custom classifier process | Composed model process |
---|---|---|
Analyze a single document of unknown type belonging to one of the types trained for extraction model processing. | ● Requires multiple calls. ● Call the classification model based on the document class. This step allows for a confidence-based check before invoking the extraction model analysis. ● Invoke the extraction model. | ● Requires a single call to a composed model containing the model corresponding to the input document type. |
Analyze a single document of unknown type belonging to several types trained for extraction model processing. | ● Requires multiple calls. ● Make a call to the classifier that ignores documents not matching a designated type for extraction. ● Invoke the extraction model. | ● Requires a single call to a composed model. The service selects a custom model within the composed model with the highest match. ● A composed model can't ignore documents. |
Analyze a file containing multiple documents of known or unknown type belonging to one of the types trained for extraction model processing. | ● Requires multiple calls. ● Call the extraction model for each identified document in the input file. ● Invoke the extraction model. | ● Requires a single call to a composed model. ● The composed model invokes the component model once on the first instance of the document. ● The remaining documents are ignored. |
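The classifier-first flow in the table can be sketched as a routing step: each classified document is paired with the extraction model to call, and documents with no designated extraction model are ignored. The function name, the mapping, and the model IDs below are hypothetical. The sketch only plans the calls and doesn't contact the service.

```python
def plan_extraction_calls(documents, model_for_doc_type, min_confidence=0.0):
    """Pair each classified document with the extraction model to call.

    Documents whose docType has no entry in `model_for_doc_type`, or
    whose confidence is below `min_confidence`, are skipped. This is
    the "ignore" behavior a composed model cannot provide.
    """
    plan = []
    for doc in documents:
        model_id = model_for_doc_type.get(doc["docType"])
        if model_id is None:
            continue  # no extraction model designated for this type
        if doc.get("confidence", 0.0) < min_confidence:
            continue  # fails the confidence-based check
        plan.append((model_id, doc))
    return plan


# Hypothetical mapping from classifier classes to extraction model IDs.
models = {"loan-form": "loan-form-extractor", "payslip": "payslip-extractor"}
classified = [
    {"docType": "loan-form", "confidence": 0.95},
    {"docType": "bank-statement", "confidence": 0.99},
    {"docType": "payslip", "confidence": 0.30},
]
calls = plan_extraction_calls(classified, models, min_confidence=0.5)
print([model_id for model_id, _ in calls])
```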
Language support
Classification models currently only support English language documents.
Input requirements
For best results, provide one clear photo or high-quality scan per document.
Supported file formats:
Model | PDF | Image: JPEG/JPG, PNG, BMP, TIFF, HEIF | Microsoft Office: Word (DOCX), Excel (XLSX), PowerPoint (PPTX), and HTML |
---|---|---|---|
Read | ✔ | ✔ | ✔ |
Layout | ✔ | ✔ | ✔ (2023-10-31-preview) |
General Document | ✔ | ✔ | |
Prebuilt | ✔ | ✔ | |
Custom | ✔ | ✔ | |
✱ Microsoft Office files are currently not supported for other models or versions.
For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).
The file size limit for analyzing documents is 500 MB for the paid (S0) tier and 4 MB for the free (F0) tier.
Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
If your PDFs are password-locked, you must remove the lock before submission.
The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).
For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.
For custom extraction model training, the total size of training data is 50 MB for the template model and 1 GB for the neural model.
For custom classification model training, the total size of training data is 1 GB with a maximum of 10,000 pages.
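Some of the analysis limits above can be checked on the client before a file is uploaded. The following is a minimal sketch, assuming the file size and image dimensions are already known. The function name is hypothetical, and treating 1 MB as 1024 x 1024 bytes is an assumption, because this article doesn't state which definition the service uses.

```python
MB = 1024 * 1024  # ASSUMPTION: binary megabytes; the article does not say

# Limits quoted in this article, keyed by pricing tier.
MAX_FILE_BYTES = {"S0": 500 * MB, "F0": 4 * MB}
MIN_IMAGE_SIDE_PX = 50
MAX_IMAGE_SIDE_PX = 10_000


def check_input(file_size_bytes, tier="S0", image_size=None):
    """Return a list of limit violations; an empty list means none found.

    `image_size` is an optional (width, height) tuple in pixels.
    """
    problems = []
    limit = MAX_FILE_BYTES[tier]
    if file_size_bytes > limit:
        problems.append(f"file is {file_size_bytes} bytes; "
                        f"{tier} limit is {limit}")
    if image_size is not None:
        for label, side in zip(("width", "height"), image_size):
            if not MIN_IMAGE_SIDE_PX <= side <= MAX_IMAGE_SIDE_PX:
                problems.append(f"image {label} {side}px is outside "
                                f"{MIN_IMAGE_SIDE_PX}-{MAX_IMAGE_SIDE_PX}px")
    return problems
```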
Best practices
Custom classification models require a minimum of five samples per class to train. If the classes are similar, adding extra training samples improves model accuracy.
Training a model
Custom classification models are supported by v4.0:2023-10-31-preview and v3.1:2023-07-31 (GA) APIs. Document Intelligence Studio provides a no-code user interface to interactively train a custom classifier.
When using the REST API, if you organize your documents by folders, you can use the azureBlobSource property of the request to train a classification model.
https://{endpoint}/documentintelligence/documentClassifiers:build?api-version=2023-10-31-preview
{
  "classifierId": "demo2.1",
  "description": "",
  "docTypes": {
    "car-maint": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/car-maint/"
      }
    },
    "cc-auth": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/cc-auth/"
      }
    },
    "deed-of-trust": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/deed-of-trust/"
      }
    }
  }
}
https://{endpoint}/formrecognizer/documentClassifiers:build?api-version=2023-07-31
{
  "classifierId": "demo2.1",
  "description": "",
  "docTypes": {
    "car-maint": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/car-maint/"
      }
    },
    "cc-auth": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/cc-auth/"
      }
    },
    "deed-of-trust": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/deed-of-trust/"
      }
    }
  }
}
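When there are many classes, the folder-based request body can be generated instead of written by hand. This is a minimal sketch that reproduces the JSON shape shown above from a mapping of class names to folder prefixes. The function name is hypothetical, and the container URL is the same placeholder used in the samples.

```python
import json


def build_classifier_request(classifier_id, container_url, prefixes,
                             description=""):
    """Build a build-classifier request body that uses azureBlobSource.

    `prefixes` maps each document class name to its folder prefix.
    """
    doc_types = {
        doc_type: {
            "azureBlobSource": {
                "containerUrl": container_url,
                "prefix": prefix,
            }
        }
        for doc_type, prefix in prefixes.items()
    }
    return {
        "classifierId": classifier_id,
        "description": description,
        "docTypes": doc_types,
    }


body = build_classifier_request(
    "demo2.1",
    "SAS URL to container",  # placeholder, as in the samples above
    {
        "car-maint": "sample1/car-maint/",
        "cc-auth": "sample1/cc-auth/",
        "deed-of-trust": "sample1/deed-of-trust/",
    },
)
print(json.dumps(body, indent=2))
```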
Alternatively, if you have a flat list of files or only plan to use a few select files within each folder to train the model, you can use the azureBlobFileListSource property to train the model. This step requires a file list in JSON Lines format. For each class, add a new file with a list of files to be submitted for training.
{
  "classifierId": "demo2",
  "description": "",
  "docTypes": {
    "car-maint": {
      "azureBlobFileListSource": {
        "containerUrl": "SAS URL to container",
        "fileList": "sample1/car-maint.jsonl"
      }
    },
    "cc-auth": {
      "azureBlobFileListSource": {
        "containerUrl": "SAS URL to container",
        "fileList": "sample1/cc-auth.jsonl"
      }
    },
    "deed-of-trust": {
      "azureBlobFileListSource": {
        "containerUrl": "SAS URL to container",
        "fileList": "sample1/deed-of-trust.jsonl"
      }
    }
  }
}
File list car-maint.jsonl contains the following files.
{"file":"sample1/car-maint/Commercial Motor Vehicle - Adatum.pdf"}
{"file":"sample1/car-maint/Commercial Motor Vehicle - Fincher.pdf"}
{"file":"sample1/car-maint/Commercial Motor Vehicle - Lamna.pdf"}
{"file":"sample1/car-maint/Commercial Motor Vehicle - Liberty.pdf"}
{"file":"sample1/car-maint/Commercial Motor Vehicle - Trey.pdf"}
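A file list in this format can be produced with a few lines of code. The following is a minimal sketch that emits one JSON object per line, each with a single `file` key, to match the sample above. The function name is hypothetical. Save the returned text as the .jsonl file for the class and upload it to the container.

```python
import json


def make_file_list(paths):
    """Return JSON Lines text with one {"file": ...} object per path."""
    lines = [
        # Compact separators reproduce the spacing used in the sample.
        json.dumps({"file": path}, separators=(",", ":"), ensure_ascii=False)
        for path in paths
    ]
    return "".join(line + "\n" for line in lines)


file_list = make_file_list([
    "sample1/car-maint/Commercial Motor Vehicle - Adatum.pdf",
    "sample1/car-maint/Commercial Motor Vehicle - Fincher.pdf",
])
print(file_list, end="")
```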
Model response
Analyze an input file with the document classification model
https://{endpoint}/documentintelligence/documentClassifiers/{classifier}:analyze?api-version=2023-10-31-preview
https://{service-endpoint}/formrecognizer/documentClassifiers/{classifier}:analyze?api-version=2023-07-31
The response contains the identified documents with the associated page ranges in the documents section of the response.
{
  ...
  "documents": [
    {
      "docType": "formA",
      "boundingRegions": [
        { "pageNumber": 1, "polygon": [...] },
        { "pageNumber": 2, "polygon": [...] }
      ],
      "confidence": 0.97,
      "spans": []
    },
    {
      "docType": "formB",
      "boundingRegions": [
        { "pageNumber": 3, "polygon": [...] }
      ],
      "confidence": 0.97,
      "spans": []
    }, ...
  ]
}
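To read the page ranges out of a response like the one above, collect the page numbers from each document's bounding regions. This is a minimal sketch, assuming the parsed result is a dictionary with a top-level `documents` list, as in the sample. The actual response may nest this list inside a larger envelope, which is elided here. The function name is hypothetical.

```python
def summarize_documents(result):
    """List (docType, pages, confidence) for each identified document.

    ASSUMPTION: `result` has a top-level "documents" list shaped like
    the sample response; adjust the lookup if it is nested deeper.
    """
    summary = []
    for doc in result.get("documents", []):
        pages = sorted({region["pageNumber"]
                        for region in doc.get("boundingRegions", [])})
        summary.append((doc["docType"], pages, doc.get("confidence")))
    return summary


# Input modeled on the sample response, with polygons omitted.
sample = {
    "documents": [
        {"docType": "formA", "confidence": 0.97,
         "boundingRegions": [{"pageNumber": 1}, {"pageNumber": 2}]},
        {"docType": "formB", "confidence": 0.97,
         "boundingRegions": [{"pageNumber": 3}]},
    ]
}
for doc_type, pages, confidence in summarize_documents(sample):
    print(doc_type, pages, confidence)
```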
Next steps
Learn to create custom classification models: