Build and train a custom classification model

This content applies to: v4.0 (preview) | Previous versions: v3.1 (GA), v3.0 (GA)

Important

The custom classification model is currently in public preview. Features, approaches, and processes may change before General Availability (GA) based on user feedback.

Custom classification models classify each page in an input file to identify one or more documents within it. Classifier models can also identify multiple documents, or multiple instances of a single document, in the input file. To start training a custom classification model, you need at least two classes of documents and at least five training documents for each class.
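The minimums above can be checked before any upload. The following is a minimal sketch, not part of the service: the function name `check_minimum_training_set` and the shape of its input (a mapping of class name to document count) are assumptions made for illustration.

```python
# Minimums stated in this article: two classes, five documents per class.
MIN_CLASSES = 2
MIN_DOCS_PER_CLASS = 5


def check_minimum_training_set(class_counts: dict) -> list:
    """Return a list of problems; an empty list means the minimums are met.

    `class_counts` maps a class name to its number of training documents
    (hypothetical input shape, chosen for this sketch).
    """
    problems = []
    if len(class_counts) < MIN_CLASSES:
        problems.append(
            f"need at least {MIN_CLASSES} classes, found {len(class_counts)}"
        )
    for name, count in sorted(class_counts.items()):
        if count < MIN_DOCS_PER_CLASS:
            problems.append(
                f"class '{name}' has {count} documents, "
                f"needs at least {MIN_DOCS_PER_CLASS}"
            )
    return problems
```

For example, `check_minimum_training_set({"invoice": 5, "receipt": 7})` returns an empty list, while a single class with three documents produces two problems.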

Custom classification model input requirements

Make sure your training data set follows the input requirements for Document Intelligence.

  • Supported file formats:

    File format groups:
      PDF
      Image: JPEG/JPG, PNG, BMP, TIFF, HEIF
      Microsoft Office: Word (DOCX), Excel (XLSX), PowerPoint (PPTX), HTML

    Models: Read, Layout, General Document, Prebuilt, Custom extraction, Custom classification

    Microsoft Office support by API version:
      Layout: ✔ (2024-07-31-preview, 2024-02-29-preview, 2023-10-31-preview)
      Custom classification: ✔ (2024-07-31-preview, 2024-02-29-preview)

  • For best results, provide one clear photo or high-quality scan per document.

  • For PDF and TIFF, up to 2,000 pages can be processed (with a free tier subscription, only the first two pages are processed).

  • The maximum file size for analyzing documents is 500 MB for the paid (S0) tier and 4 MB for the free (F0) tier.

  • Image dimensions must be between 50 pixels x 50 pixels and 10,000 pixels x 10,000 pixels.

  • If your PDFs are password-locked, you must remove the lock before submission.

  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8 point text at 150 dots per inch (DPI).

  • For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.

    • For custom extraction model training, the maximum total size of training data is 50 MB for the template model and 1 GB for the neural model.

    • For custom classification model training, the maximum total size of training data is 1 GB, with a maximum of 10,000 pages. For 2024-07-31-preview and later, the maximum total size of training data is 2 GB, with a maximum of 10,000 pages.
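Several of the limits listed above are simple numeric bounds, so they can be checked locally before training. The sketch below is illustrative only: the helper names are hypothetical, the byte values assume binary units (1 MB = 1,048,576 bytes), and the version comparison assumes API versions begin with a `YYYY-MM-DD` date.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB


@dataclass(frozen=True)
class ClassifierTrainingLimits:
    max_total_bytes: int
    max_total_pages: int


def limits_for(api_version: str) -> ClassifierTrainingLimits:
    """Classification training limits quoted in this article, by API version."""
    # The first 10 characters are the date part, which sorts correctly as text.
    if api_version[:10] >= "2024-07-31":
        return ClassifierTrainingLimits(2 * GB, 10_000)
    return ClassifierTrainingLimits(1 * GB, 10_000)


def check_training_set(total_bytes: int, total_pages: int, api_version: str) -> list:
    """Return a list of limit violations; an empty list means within limits."""
    limits = limits_for(api_version)
    problems = []
    if total_bytes > limits.max_total_bytes:
        problems.append(f"training data is {total_bytes} bytes, limit is {limits.max_total_bytes}")
    if total_pages > limits.max_total_pages:
        problems.append(f"training data has {total_pages} pages, limit is {limits.max_total_pages}")
    return problems


def image_dimensions_ok(width: int, height: int) -> bool:
    """Images must be between 50 x 50 and 10,000 x 10,000 pixels."""
    return 50 <= width <= 10_000 and 50 <= height <= 10_000
```

How you count pages and bytes (for example, by listing blobs or walking a local folder) is left open here, since it depends on where the data set lives.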

Training data tips

Follow these tips to further optimize your data set for training:

  • If possible, use text-based PDF documents instead of image-based documents. Scanned PDFs are handled as images.

  • If your form images are of lower quality, use a larger data set (10-15 images, for example).

Upload your training data

After you assemble the set of forms or documents for training, upload it to an Azure blob storage container. If you don't know how to create an Azure storage account with a container, follow the Azure Storage quickstart for Azure portal. You can use the free pricing tier (F0) to try the service, and upgrade later to a paid tier for production. If your dataset is organized as folders, preserve that structure, because the Studio can use your folder names as labels to simplify the labeling process.
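One way to preserve the folder structure during upload is to derive each blob name from the file's path relative to the data set root. The sketch below assumes the `azure-storage-blob` package and a container SAS URL with write permission; the function names `plan_blob_names` and `upload_training_data` are hypothetical.

```python
from pathlib import Path


def plan_blob_names(local_root: Path) -> dict:
    """Map each local file to a blob name that keeps its relative folder path."""
    return {
        path: path.relative_to(local_root).as_posix()
        for path in sorted(local_root.rglob("*"))
        if path.is_file()
    }


def upload_training_data(local_root: Path, container_sas_url: str) -> None:
    """Upload every file under local_root, keeping class folders intact."""
    # Imported here so the planning helper works without the SDK installed.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_container_url(container_sas_url)
    for path, blob_name in plan_blob_names(local_root).items():
        with path.open("rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)
```

A local file at `invoice/a.pdf` under the root becomes the blob `invoice/a.pdf`, so the folder name `invoice` remains available as a label.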

Create a classification project in the Document Intelligence Studio

The Document Intelligence Studio provides and orchestrates all the API calls required to complete your dataset and train your model.

  1. Start by navigating to the Document Intelligence Studio. The first time you use the Studio, you need to initialize your subscription, resource group, and resource. Then, follow the prerequisites for custom projects to configure the Studio to access your training dataset.

  2. In the Studio, select the Custom classification model tile in the custom models section of the page, and then select the Create a project button.

    Screenshot of how to create a classifier project in the Document Intelligence Studio.

    1. On the Create Project dialog, provide a name for your project and, optionally, a description, and then select Continue.

    2. Next, choose an existing Document Intelligence resource, or create a new one, before you continue.

    Screenshot showing the project setup dialog window.

  3. Next, select the storage account you used to upload your custom model training dataset. Leave the Folder path empty if your training documents are in the root of the container. If your documents are in a subfolder, enter the relative path from the container root in the Folder path field. Once your storage account is configured, select Continue.

    Important

    You can either organize the training dataset by folders, where each folder name is the label or class for the documents it contains, or create a flat list of documents that you label in the Studio.

    Screenshot showing how to select the Document Intelligence resource.

  4. Training a custom classifier requires the output from the Layout model for each document in your dataset. Run layout on all documents before the model training process.

  5. Finally, review your project settings and select Create Project to create a new project. You should now be in the labeling window and see the files in your dataset listed.
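The folder-based layout described in step 3 can be previewed outside the Studio. The following is a minimal sketch of that convention, not the Studio's own logic: `group_by_folder_label` is a hypothetical helper that treats the first folder below the Folder path as the class label and groups files with no class folder under `None`.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_folder_label(
    blob_names: List[str], folder_path: str = ""
) -> Dict[Optional[str], List[str]]:
    """Group blob names by the first folder below `folder_path`.

    Files that sit directly in `folder_path` have no folder label and are
    grouped under None (they would be labeled by hand in the Studio).
    """
    prefix = folder_path.strip("/")
    if prefix:
        prefix += "/"
    groups = defaultdict(list)
    for name in blob_names:
        if not name.startswith(prefix):
            continue  # outside the configured folder path
        parts = name[len(prefix):].split("/")
        label = parts[0] if len(parts) > 1 else None
        groups[label].append(name)
    return dict(groups)
```

With a Folder path of `train`, the blob `train/invoice/a.pdf` falls under the label `invoice`, and `train/d.pdf` is left unlabeled.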

Label your data

In your project, you only need to label each document with the appropriate class label.

Screenshot showing how to select the Document Intelligence resource.

You see the files you uploaded to storage in the file list, ready to be labeled. You have a few options to label your dataset.

  1. If the documents are organized in folders, the Studio prompts you to use the folder names as labels. This step reduces your labeling to a single selection.

  2. To assign a label to a document, select the add label selection mark.

  3. To assign a label to several documents at once, hold Control (Ctrl) and select each document.

You should now have all the documents in your dataset labeled. If you look at the storage account, you find .ocr.json files that correspond to each document in your training dataset and a new class-name.jsonl file for each class labeled. This training dataset is submitted to train the model.
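To confirm that every document has a matching layout file, you can compare blob names in the container. This sketch assumes that the layout result for a document is stored as the document's blob name with `.ocr.json` appended (for example, `a.pdf.ocr.json`); verify that naming against your own container before relying on it. The helper name is hypothetical.

```python
from typing import List


def missing_layout_results(blob_names: List[str]) -> List[str]:
    """Return documents that have no companion `.ocr.json` blob.

    Assumption: the layout result for `<name>` is stored as `<name>.ocr.json`.
    Class list files (`.jsonl`) and layout files themselves are skipped.
    """
    names = set(blob_names)
    documents = [
        name for name in blob_names
        if not name.endswith((".ocr.json", ".jsonl"))
    ]
    return sorted(doc for doc in documents if f"{doc}.ocr.json" not in names)
```

An empty result means each document in the listing has a layout file next to it.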

Train your model

With your dataset labeled, you're now ready to train your model. Select the train button in the upper-right corner.

  1. On the train model dialog, provide a unique classifier ID and, optionally, a description. The classifier ID accepts a string data type.

  2. Select Train to initiate the training process.

  3. Classifier models train in a few minutes.

  4. Navigate to the Models menu to view the status of the train operation.

Test the model

Once the model training is complete, you can test your model by selecting the model on the models list page.

  1. Select the model, and then select the Test button.

  2. Add a new file by browsing for a file or dropping a file into the document selector.

  3. With a file selected, choose the Analyze button to test the model.

  4. The model results are displayed as a list of identified documents, with a confidence score and a page range for each document.

  5. Validate your model by evaluating the results for each document identified.
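The same review can be scripted against the JSON returned by the service. The sketch below assumes the result contains a `documents` list whose entries carry `docType`, `confidence`, and `boundingRegions` with a `pageNumber`, as in the REST response as I understand it; the 0.8 review threshold and the helper names are arbitrary choices for illustration.

```python
from typing import List, NamedTuple, Optional


class ClassifiedDocument(NamedTuple):
    doc_type: str
    confidence: float
    first_page: Optional[int]
    last_page: Optional[int]


def summarize_documents(analyze_result: dict) -> List[ClassifiedDocument]:
    """List each identified document with its confidence and page range."""
    rows = []
    for doc in analyze_result.get("documents", []):
        pages = [region["pageNumber"] for region in doc.get("boundingRegions", [])]
        rows.append(
            ClassifiedDocument(
                doc["docType"],
                doc["confidence"],
                min(pages, default=None),
                max(pages, default=None),
            )
        )
    return rows


def needs_review(rows: List[ClassifiedDocument], threshold: float = 0.8) -> List[str]:
    """Return the document types whose confidence is below the threshold."""
    return [row.doc_type for row in rows if row.confidence < threshold]
```

Documents flagged by `needs_review` are good candidates for manual inspection, or for adding more training samples to that class.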

Training a custom classifier using the SDK or API

The Studio orchestrates the API calls for you to train a custom classifier. The classifier training dataset requires the output from the layout API that matches the version of the API for your training model. Using layout results from an older API version can result in a model with lower accuracy.

The Studio generates the layout results for your training dataset if the dataset doesn't contain layout results. When using the API or SDK to train a classifier, you need to add the layout results to the folders containing the individual documents. The layout results should be in the format of the API response when calling layout directly. The SDK object model is different. Make sure that the layout results are the API results and not the SDK response.
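A quick structural check can catch both problems described above: a layout file saved from an SDK object instead of the REST response, and a layout file produced with a different API version. This sketch assumes the REST response wraps its content in an `analyzeResult` object that carries `apiVersion` and `modelId` fields, and that the layout model ID is `prebuilt-layout`; confirm these names against the API reference for your version.

```python
from typing import List


def check_layout_result(payload: dict, training_api_version: str) -> List[str]:
    """Return problems found in one layout result file (empty list if none)."""
    result = payload.get("analyzeResult")
    if not isinstance(result, dict):
        return [
            "no 'analyzeResult' object found; the file may be an SDK object "
            "dump rather than the REST API response"
        ]
    problems = []
    if result.get("modelId") != "prebuilt-layout":
        problems.append(f"unexpected modelId: {result.get('modelId')!r}")
    if result.get("apiVersion") != training_api_version:
        problems.append(
            f"layout apiVersion {result.get('apiVersion')!r} does not match "
            f"training API version {training_api_version!r}"
        )
    return problems
```

Running this over each layout file (after `json.load`) before you start training gives an early warning about version mismatches.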

Troubleshoot

The classification model requires results from the layout model for each training document. If you don't provide the layout results, the Studio attempts to run the layout model for each document before training the classifier. This process is throttled and can result in a 429 response.

Before you train the classification model in the Studio, run the layout model on each document and upload the layout result to the same location as the original document. Once the layout results are added, you can train the classifier model with your documents.
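If you run the layout model yourself, pacing the requests helps avoid the same 429 responses. The following is a generic retry sketch with exponential backoff, not a service-specific client: the `send` callable is a hypothetical stand-in for your own request function, and it is assumed to return the HTTP status code together with the `Retry-After` value in seconds (or `None` if the header is absent).

```python
import time
from typing import Callable, Optional, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, Optional[float]]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a status other than 429.

    Waits for the server-provided retry-after value when there is one,
    otherwise doubles the delay on each attempt. Returns the last status.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 429
    for attempt in range(max_attempts):
        status, retry_after = send()
        if status != 429 or attempt == max_attempts - 1:
            break
        delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
        sleep(delay)
    return status
```

The `sleep` parameter is injectable so the pacing can be tested without real waiting.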

Next steps