Document Intelligence custom classification model

Important

  • Document Intelligence public preview releases provide early access to features that are in active development.
  • Features, approaches, and processes may change, prior to General Availability (GA), based on user feedback.
  • The public preview version of the Document Intelligence client libraries defaults to REST API version 2023-10-31-preview.
  • Public preview version 2023-10-31-preview is currently only available in the following Azure regions:
    • East US
    • West US2
    • West Europe

This content applies to: v4.0 (preview) | Previous version: v3.1 (GA)

This content applies to: v3.1 (GA) | Latest version: v4.0 (preview)

Important

  • Starting with the 2023-10-31-preview API, analyzing documents with the custom classification model won't split documents by default.
  • You need to explicitly set the splitMode property to auto to preserve the behavior from previous releases. The default for splitMode is none.
  • If your input file contains multiple documents, you need to enable splitting by setting the splitMode to auto.

Custom classification models are deep-learning-model types that combine layout and language features to accurately detect and identify documents you process within your application. They classify an input file one page at a time to identify the documents within it, and they can also identify multiple documents, or multiple instances of a single document, within an input file.

Model capabilities

Custom classification models can analyze single- or multi-file documents to identify whether any of the trained document types are contained within an input file. Here are the currently supported scenarios:

  • A single file containing one document. For instance, a loan application form.

  • A single file containing multiple documents. For instance, a loan application package containing a loan application form, payslip, and bank statement.

  • A single file containing multiple instances of the same document. For instance, a collection of scanned invoices.

✔️ Training a custom classifier requires at least two distinct classes and a minimum of five document samples per class. The model response contains the page ranges for each of the classes of documents identified.

✔️ The maximum allowed number of classes is 500. The maximum allowed number of document samples per class is 100.
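
The limits above can be checked locally before a training job is submitted. The following is a minimal sketch, not part of any SDK: the function name check_training_set and the shape of its input (a mapping from class name to a list of sample file names) are assumptions made for illustration.

```python
from typing import Dict, List

# Limits taken from the text above.
MIN_CLASSES = 2
MAX_CLASSES = 500
MIN_SAMPLES_PER_CLASS = 5
MAX_SAMPLES_PER_CLASS = 100


def check_training_set(samples_by_class: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    class_count = len(samples_by_class)
    if class_count < MIN_CLASSES:
        problems.append(f"need at least {MIN_CLASSES} classes, found {class_count}")
    if class_count > MAX_CLASSES:
        problems.append(f"at most {MAX_CLASSES} classes allowed, found {class_count}")
    for name, files in sorted(samples_by_class.items()):
        if len(files) < MIN_SAMPLES_PER_CLASS:
            problems.append(f"class '{name}' has {len(files)} samples, minimum is {MIN_SAMPLES_PER_CLASS}")
        elif len(files) > MAX_SAMPLES_PER_CLASS:
            problems.append(f"class '{name}' has {len(files)} samples, maximum is {MAX_SAMPLES_PER_CLASS}")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a dataset in one pass.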

The model classifies each page of the input document to one of the classes in the labeled dataset. Use the confidence score from the response to set the threshold for your application.
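
One way to apply such a threshold is to separate confident results from those that need review. This is a sketch only: it assumes each result is a dictionary with the docType and confidence fields shown in the sample response later in this article, and the function name and the threshold you pass in are illustrative choices, not recommended values.

```python
from typing import List, Tuple


def split_by_confidence(documents: List[dict], threshold: float) -> Tuple[List[dict], List[dict]]:
    """Partition classifier results into (accepted, needs_review)."""
    accepted, needs_review = [], []
    for doc in documents:
        # Results at or above the threshold are accepted as classified.
        if doc["confidence"] >= threshold:
            accepted.append(doc)
        else:
            needs_review.append(doc)
    return accepted, needs_review
```

A suitable threshold depends on your documents and on the cost of a wrong classification, so it is worth tuning against a labeled validation set.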

Compare custom classification and composed models

A custom classification model can replace a composed model in some scenarios, but there are a few differences to be aware of:

Capability: Analyze a single document of unknown type belonging to one of the types trained for extraction model processing.

  Custom classifier process:
  • Requires multiple calls.
  • Call the classification model based on the document class. This step allows for a confidence-based check before invoking the extraction model analysis.
  • Invoke the extraction model.

  Composed model process:
  • Requires a single call to a composed model containing the model corresponding to the input document type.

Capability: Analyze a single document of unknown type belonging to several types trained for extraction model processing.

  Custom classifier process:
  • Requires multiple calls.
  • Make a call to the classifier that ignores documents not matching a designated type for extraction.
  • Invoke the extraction model.

  Composed model process:
  • Requires a single call to a composed model. The service selects a custom model within the composed model with the highest match.
  • A composed model can't ignore documents.

Capability: Analyze a file containing multiple documents of known or unknown type belonging to one of the types trained for extraction model processing.

  Custom classifier process:
  • Requires multiple calls.
  • Call the extraction model for each identified document in the input file.
  • Invoke the extraction model.

  Composed model process:
  • Requires a single call to a composed model.
  • The composed model invokes the component model once on the first instance of the document.
  • The remaining documents are ignored.
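
The classifier-side flow in the comparison above (classify first, check confidence, then invoke an extraction model per identified document) can be sketched as a planning step that makes no service calls. Everything named here is hypothetical: plan_extraction_calls, the model_for_type mapping, and the extraction model IDs are illustrative, and the input is assumed to have the docType, confidence, and boundingRegions fields shown in the sample response later in this article.

```python
from typing import Dict, List


def plan_extraction_calls(documents: List[dict],
                          model_for_type: Dict[str, str],
                          min_confidence: float) -> List[dict]:
    """Decide which extraction model to invoke for each classified document."""
    calls = []
    for doc in documents:
        model_id = model_for_type.get(doc["docType"])
        if model_id is None:
            continue  # no designated extraction model: ignore this document
        if doc["confidence"] < min_confidence:
            continue  # confidence-based check before extraction
        pages = sorted({r["pageNumber"] for r in doc["boundingRegions"]})
        calls.append({"modelId": model_id, "pages": pages})
    return calls
```

Each entry in the returned list corresponds to one extraction call, which is why this approach requires multiple calls where a composed model requires one.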

Language support

In v3.1 (GA), classification models support only English-language documents.

In v4.0 (preview), classification models can be trained on documents in different languages. See supported languages for a complete list.

Input requirements

  • For best results, provide five clear photos or high-quality scans per document type.

  • Supported file formats:

    Model | PDF | Image: JPEG/JPG, PNG, BMP, TIFF, HEIF | Microsoft Office: Word (DOCX), Excel (XLSX), PowerPoint (PPTX), and HTML
    Read
    Layout ✔ (2023-10-31-preview)
    General Document
    Prebuilt
    Custom
  • For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).

  • The maximum file size for analyzing documents is 500 MB for the paid (S0) tier and 4 MB for the free (F0) tier.

  • Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.

  • If your PDFs are password-locked, you must remove the lock before submission.

  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).

  • For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.

  • For custom extraction model training, the total size of training data is 50 MB for the template model and 1 GB for the neural model.

  • For custom classification model training, the total size of training data is 1 GB with a maximum of 10,000 pages.
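
The file size and image dimension limits for analysis can be checked before upload. The sketch below is an illustration under stated assumptions: the function name check_input is hypothetical, and a megabyte is assumed to mean 1024 x 1024 bytes, which the requirements above do not specify.

```python
from typing import List, Optional

MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILE_BYTES = {"S0": 500 * MB, "F0": 4 * MB}  # paid and free tiers
MIN_SIDE_PX = 50
MAX_SIDE_PX = 10_000


def check_input(size_bytes: int, tier: str,
                width_px: Optional[int] = None,
                height_px: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    limit = MAX_FILE_BYTES[tier]  # raises KeyError for an unknown tier
    if size_bytes > limit:
        problems.append(f"file is {size_bytes} bytes, {tier} limit is {limit}")
    # Dimensions apply to images only, so they are optional.
    for label, side in (("width", width_px), ("height", height_px)):
        if side is not None and not MIN_SIDE_PX <= side <= MAX_SIDE_PX:
            problems.append(f"{label} {side}px is outside {MIN_SIDE_PX}-{MAX_SIDE_PX}px")
    return problems
```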

Document splitting

When you have more than one document in a file, the classifier can identify the different document types contained within the input file. The classifier response contains the page ranges for each of the identified document types contained within a file. This response can include multiple instances of the same document type.

The analyze operation now includes a splitMode property that gives you granular control over the splitting behavior.

  • To treat the entire input file as a single document for classification, set the splitMode to none. When you do so, the service returns just one class for the entire input file.
  • To classify each page of the input file, set the splitMode to perPage. The service attempts to classify each page as an individual document.
  • To have the service identify the documents and their associated page ranges, set the splitMode to auto.
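
The three modes can be validated on the client before an analyze request is built. This is a sketch under an explicit assumption: this article names the setting splitMode, and the query parameter name used by default below (split) is an assumption that you should verify against the REST reference for your API version. The function name analyze_url is hypothetical; the URL path follows the analyze URL shown later in this article.

```python
from urllib.parse import quote, urlencode

SPLIT_MODES = ("none", "perPage", "auto")


def analyze_url(endpoint: str, classifier_id: str, split_mode: str = "none",
                api_version: str = "2023-10-31-preview",
                split_param: str = "split") -> str:
    """Build a classifier analyze URL that carries the chosen split mode.

    split_param is an assumed query parameter name; check it before use.
    """
    if split_mode not in SPLIT_MODES:
        raise ValueError(f"split_mode must be one of {SPLIT_MODES}")
    query = urlencode({"api-version": api_version, split_param: split_mode})
    base = endpoint.rstrip("/")
    return (f"{base}/documentintelligence/documentClassifiers/"
            f"{quote(classifier_id, safe='')}:analyze?{query}")
```

Passing the mode explicitly on every call, even when it matches the default, keeps the behavior stable if you later move between API versions with different defaults.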

Best practices

Custom classification models require a minimum of five samples per class to train. If the classes are similar, adding extra training samples improves model accuracy.

The classifier attempts to assign each document to one of the classes. If you expect the model to see document types that aren't among the classes in the training dataset, plan to set a threshold on the classification score or add a few representative samples of those document types to an "other" class. Adding an "other" class ensures that unneeded documents don't impact your classifier quality.

Training a model

Custom classification models are supported by v4.0:2023-10-31-preview and v3.1:2023-07-31 (GA) APIs. Document Intelligence Studio provides a no-code user interface to interactively train a custom classifier. Follow the how-to guide to get started.

When using the REST API, if you organize your documents by folders, you can use the azureBlobSource property of the request to train a classification model.


https://{endpoint}/documentintelligence/documentClassifiers:build?api-version=2023-10-31-preview

{
  "classifierId": "demo2.1",
  "description": "",
  "docTypes": {
    "car-maint": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/car-maint/"
      }
    },
    "cc-auth": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/cc-auth/"
      }
    },
    "deed-of-trust": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "sample1/deed-of-trust/"
      }
    }
  }
}

https://{endpoint}/formrecognizer/documentClassifiers:build?api-version=2023-07-31

{
  "classifierId": "demo2.1",
  "description": "",
  "docTypes": {
    "car-maint": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "{path to dataset root}/car-maint/"
      }
    },
    "cc-auth": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "{path to dataset root}/cc-auth/"
      }
    },
    "deed-of-trust": {
      "azureBlobSource": {
        "containerUrl": "SAS URL to container",
        "prefix": "{path to dataset root}/deed-of-trust/"
      }
    }
  }
}
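
When every class has its own folder, a request body like the ones above can be generated rather than written by hand. The following is a minimal sketch: the function name build_request is hypothetical, and it assumes each folder is named after its class, as in the examples above.

```python
from typing import Dict, List


def build_request(classifier_id: str, container_url: str,
                  dataset_root: str, classes: List[str],
                  description: str = "") -> Dict:
    """Build a classifier build request with one azureBlobSource per class."""
    root = dataset_root.strip("/")
    doc_types = {}
    for name in classes:
        # Assumption: the folder for a class has the same name as the class.
        prefix = f"{root}/{name}/" if root else f"{name}/"
        doc_types[name] = {
            "azureBlobSource": {"containerUrl": container_url, "prefix": prefix}
        }
    return {
        "classifierId": classifier_id,
        "description": description,
        "docTypes": doc_types,
    }
```

Serializing the returned dictionary with json.dumps yields the JSON body for the build request.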

Alternatively, if you have a flat list of files or only plan to use a few select files within each folder to train the model, you can use the azureBlobFileListSource property to train the model. This step requires a file list in JSON Lines format. For each class, add a new file with a list of files to be submitted for training.

{
  "classifierId": "demo2",
  "description": "",
  "docTypes": {
    "car-maint": {
      "azureBlobFileListSource": {
        "containerUrl": "SAS URL to container",
        "fileList": "{path to dataset root}/car-maint.jsonl"
      }
    },
    "cc-auth": {
      "azureBlobFileListSource": {
        "containerUrl": "SAS URL to container",
        "fileList": "{path to dataset root}/cc-auth.jsonl"
      }
    },
    "deed-of-trust": {
      "azureBlobFileListSource": {
        "containerUrl": "SAS URL to container",
        "fileList": "{path to dataset root}/deed-of-trust.jsonl"
      }
    }
  }
}

As an example, the file list car-maint.jsonl contains the following files.

{"file":"classifier/car-maint/Commercial Motor Vehicle - Adatum.pdf"}
{"file":"classifier/car-maint/Commercial Motor Vehicle - Fincher.pdf"}
{"file":"classifier/car-maint/Commercial Motor Vehicle - Lamna.pdf"}
{"file":"classifier/car-maint/Commercial Motor Vehicle - Liberty.pdf"}
{"file":"classifier/car-maint/Commercial Motor Vehicle - Trey.pdf"}
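
A file list in this shape can be produced from a list of blob paths. This is a sketch with a hypothetical function name, file_list_jsonl; it writes one JSON object per line with a single "file" key, matching the example above.

```python
import json
from typing import Iterable


def file_list_jsonl(paths: Iterable[str]) -> str:
    """Render blob paths as JSON Lines: one {"file": ...} object per line."""
    return "".join(
        json.dumps({"file": path}, separators=(",", ":")) + "\n"
        for path in paths
    )
```

Using json.dumps rather than string formatting keeps paths that contain quotes or backslashes correctly escaped.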

Model response

Analyze an input file with the document classification model

https://{endpoint}/documentintelligence/documentClassifiers/{classifier}:analyze?api-version=2023-10-31-preview
https://{service-endpoint}/formrecognizer/documentClassifiers/{classifier}:analyze?api-version=2023-07-31

The response contains the identified documents with the associated page ranges in the documents section of the response.

{
  ...

    "documents": [
      {
        "docType": "formA",
        "boundingRegions": [
          { "pageNumber": 1, "polygon": [...] },
          { "pageNumber": 2, "polygon": [...] }
        ],
        "confidence": 0.97,
        "spans": []
      },
      {
        "docType": "formB",
        "boundingRegions": [
          { "pageNumber": 3, "polygon": [...] }
        ],
        "confidence": 0.97,
        "spans": []
      }, ...
    ]
  }
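
The page range for each identified document can be read from the pageNumber values in its boundingRegions. The sketch below uses a hypothetical function name, summarize_documents, and takes the documents list as input, however you reach it in your response. It reports the first and last page of each document, which assumes the pages of a document are contiguous.

```python
from typing import List


def summarize_documents(documents: List[dict]) -> List[str]:
    """Describe each identified document by type, page range, and confidence."""
    lines = []
    for doc in documents:
        pages = sorted({r["pageNumber"] for r in doc.get("boundingRegions", [])})
        if not pages:
            span = "no pages"
        elif pages[0] == pages[-1]:
            span = f"page {pages[0]}"
        else:
            span = f"pages {pages[0]}-{pages[-1]}"
        lines.append(f'{doc["docType"]}: {span} (confidence {doc["confidence"]:.2f})')
    return lines
```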
