Build and train a custom model

This article applies to: Form Recognizer v3.0. Earlier version: Form Recognizer v2.1

Form Recognizer models require as few as five training documents to get started. You can train either a custom template model (custom form) or a custom neural model (custom document). The training process is identical for both model types, and this document walks you through it.

Custom model input requirements

First, make sure your training data set follows the input requirements for Form Recognizer.

  • For best results, provide one clear photo or high-quality scan per document.

  • Supported file formats:

    PDF: Read, Layout, General Document, Prebuilt, Custom
    Image (JPEG/JPG, PNG, BMP, and TIFF): Read, Layout, General Document, Prebuilt, Custom
    Microsoft Office (Word (DOCX), Excel (XLS), PowerPoint (PPT)) and HTML: Read only, with REST API version 2022/06/30-preview ✱

    ✱ Microsoft Office files are currently not supported for other models or versions.

  • For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).

  • The file size for analyzing documents must be less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier.

  • Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.

  • PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.

  • If your PDFs are password-locked, you must remove the lock before submission.

  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).

  • For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.

  • For custom model training, the total size of training data must not exceed 50 MB for the template model and 1 GB for the neural model.
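The training limits above can be checked locally before you upload anything. The following is a minimal sketch rather than an official validator: the `TrainingFile` type and `check_training_set` function are hypothetical names, the neural size limit is read as 1 GB, a megabyte is assumed to be 1024 x 1024 bytes, and the `.tif` extension is assumed to be accepted alongside `.tiff`.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Limits copied from the list above; the byte values assume 1 MB = 1024 * 1024 bytes.
TRAINING_LIMITS = {
    "template": {"max_pages": 500, "max_total_bytes": 50 * 1024 * 1024},
    "neural": {"max_pages": 50_000, "max_total_bytes": 1024 * 1024 * 1024},
}

# PDF and image formats from the list above (".tif" is an assumed alias of ".tiff").
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif"}


@dataclass
class TrainingFile:
    name: str        # file name, for example "invoice-01.pdf"
    size_bytes: int  # size on disk
    pages: int       # page count (1 for a single image)


def check_training_set(files, build_mode):
    """Return a list of problems; an empty list means none were found."""
    limits = TRAINING_LIMITS[build_mode]
    problems = []

    if len(files) < 5:
        problems.append(f"need at least 5 documents, found {len(files)}")

    for f in files:
        extension = PurePosixPath(f.name).suffix.lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"{f.name}: unsupported file type {extension!r}")

    total_pages = sum(f.pages for f in files)
    if total_pages > limits["max_pages"]:
        problems.append(
            f"total pages {total_pages} exceeds {limits['max_pages']} for {build_mode}"
        )

    total_bytes = sum(f.size_bytes for f in files)
    if total_bytes > limits["max_total_bytes"]:
        problems.append(
            f"total size {total_bytes} bytes exceeds "
            f"{limits['max_total_bytes']} for {build_mode}"
        )

    return problems
```

Calling `check_training_set(files, "template")` or `check_training_set(files, "neural")` shows whether the same data set fits one model type but not the other.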

Training data tips

Follow these tips to further optimize your data set for training:

  • If possible, use text-based PDF documents instead of image-based documents. Scanned PDFs are handled as images.
  • For forms with input fields, use examples that have all of the fields completed.
  • Use forms with different values in each field.
  • If your form images are of lower quality, use a larger data set (10-15 images, for example).

Upload your training data

Once you've put together the set of forms or documents for training, you'll need to upload it to an Azure blob storage container. If you don't know how to create an Azure storage account with a container, follow the Azure Storage quickstart for Azure portal. You can use the free pricing tier (F0) to try the service, and upgrade later to a paid tier for production.
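If you prefer to script the upload, it could look roughly like the sketch below. It assumes the `azure-storage-blob` package is installed and that you have a container SAS URL with write permission. The function names `blob_name_for` and `upload_training_set` and the `folder_path` parameter are hypothetical and chosen for illustration.

```python
from pathlib import Path


def blob_name_for(local_file, local_root, folder_path=""):
    """Map a local file to a blob name, optionally under a folder in the container."""
    relative = Path(local_file).relative_to(local_root).as_posix()
    folder = folder_path.strip("/")
    return f"{folder}/{relative}" if folder else relative


def upload_training_set(local_root, container_sas_url, folder_path=""):
    """Upload every file under local_root to the container behind the SAS URL."""
    # Imported here so blob_name_for works even without the SDK installed.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_container_url(container_sas_url)
    for path in sorted(Path(local_root).rglob("*")):
        if path.is_file():
            with path.open("rb") as data:
                container.upload_blob(
                    name=blob_name_for(path, local_root, folder_path),
                    data=data,
                    overwrite=True,  # replace blobs left over from an earlier upload
                )
```

Leaving `folder_path` empty places the documents at the root of the container. A non-empty value places them in a subfolder, and that subfolder is the relative path to enter when the Studio later asks for a folder path.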

Video: Train your custom model

  • Once you've gathered and uploaded your training dataset, you're ready to train your custom model. In the following video, we'll create a project and explore some of the fundamentals for successfully labeling and training a model.

Create a project in the Form Recognizer Studio

The Form Recognizer Studio provides and orchestrates all the API calls required to complete your dataset and train your model.

  1. Start by navigating to the Form Recognizer Studio. The first time you use the Studio, you'll need to initialize your subscription, resource group, and resource. Then, follow the prerequisites for custom projects to configure the Studio to access your training dataset.

  2. In the Studio, select the Custom models tile. Then, on the custom models page, select the Create a project button.

    Screenshot: Create a project in the Form Recognizer Studio.

    1. On the create project dialog, provide a name for your project, optionally a description, and select continue.

    2. On the next step in the workflow, choose or create a Form Recognizer resource before you select continue.

    Important

    Custom neural models are only available in a few regions. If you plan to train a neural model, select or create a resource in one of these supported regions.

    Screenshot: Select the Form Recognizer resource.

  3. Next, select the storage account you used to upload your custom model training dataset. Leave the Folder path field empty if your training documents are in the root of the container. If your documents are in a subfolder, enter the relative path from the container root in the Folder path field. Once your storage account is configured, select continue.

    Screenshot: Select the storage account.

  4. Finally, review your project settings and select Create Project to create a new project. You should now be in the labeling window and see the files in your dataset listed.

Label your data

In your project, your first task is to label your dataset with the fields you wish to extract.

You'll see the files you uploaded to storage on the left of your screen, with the first file ready to be labeled.

  1. To start labeling your dataset, create your first field by selecting the plus (➕) button on the top-right of the screen to select a field type.

    Screenshot: Create a label.

  2. Enter a name for the field.

  3. To assign a value to the field, choose a word or words in the document and select the field in either the dropdown or the field list on the right navigation bar. You'll see the labeled value below the field name in the list of fields.

  4. Repeat the process for all the fields you wish to label for your dataset.

  5. Label the remaining documents in your dataset by selecting each document and selecting the text to be labeled.

You now have all the documents in your dataset labeled. If you look at the storage account, you'll find .labels.json and .ocr.json files that correspond to each document in your training dataset, along with a new fields.json file. This training dataset will be submitted to train the model.
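Before training, you may want to confirm that every document has its companion files. The sketch below works on a plain list of blob names, which you could obtain from your storage account. It assumes the companion files are named by appending `.labels.json` and `.ocr.json` to the full document name (for example, `invoice.pdf.labels.json`). That naming and the function name `find_unlabeled` are assumptions to verify against your own container.

```python
def find_unlabeled(blob_names):
    """Return {document: [missing suffixes]} for documents that lack companion files."""
    names = set(blob_names)
    companion_suffixes = (".labels.json", ".ocr.json")

    # Treat every blob that is not a JSON file as a training document.
    documents = sorted(n for n in names if not n.endswith(".json"))

    missing = {}
    for document in documents:
        absent = [s for s in companion_suffixes if document + s not in names]
        if absent:
            missing[document] = absent
    return missing


if __name__ == "__main__":
    listing = [
        "a.pdf", "a.pdf.labels.json", "a.pdf.ocr.json",
        "b.pdf", "b.pdf.ocr.json",
        "fields.json",
    ]
    print(find_unlabeled(listing))            # documents that still need labeling
    print("fields.json" in listing)           # the shared field definitions file
```

An empty result means each document has both companion files under the assumed naming scheme.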

Train your model

With your dataset labeled, you're now ready to train your model. Select the train button in the upper-right corner.

  1. On the train model dialog, provide a unique model ID and, optionally, a description. The model ID accepts a string data type.

  2. For the build mode, select the type of model you want to train. Learn more about the model types and capabilities.

    Screenshot: Train model dialog

  3. Select Train to initiate the training process.

  4. Template models train in a few minutes. Neural models can take up to 30 minutes to train.

  5. Navigate to the Models menu to view the status of the train operation.
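The Studio issues the training call for you. If you script training instead, the request body could be assembled along the lines of this sketch. The field names (`modelId`, `buildMode`, `azureBlobSource`, `containerUrl`, `prefix`) are assumptions about the v3.0 REST build operation and aren't stated in this article, so check them against the API reference before relying on them. The function name `build_model_request` is hypothetical.

```python
import json

BUILD_MODES = ("template", "neural")


def build_model_request(model_id, build_mode, container_sas_url,
                        folder_path="", description=None):
    """Assemble a build request body for a custom model (field names are assumed)."""
    if build_mode not in BUILD_MODES:
        raise ValueError(f"build_mode must be one of {BUILD_MODES}, got {build_mode!r}")
    if not model_id:
        raise ValueError("model_id must be a non-empty string")

    source = {"containerUrl": container_sas_url}
    if folder_path:
        # Only needed when the training documents are not in the container root.
        source["prefix"] = folder_path

    body = {"modelId": model_id, "buildMode": build_mode, "azureBlobSource": source}
    if description:
        body["description"] = description
    return body


if __name__ == "__main__":
    print(json.dumps(build_model_request("my-model", "template", "<SAS URL>"), indent=2))
```

The three inputs mirror the dialog above: a unique model ID, an optional description, and the build mode.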

Test the model

Once the model training is complete, you can test your model by selecting the model on the models list page.

  1. Select the model, and then select the Test button.

  2. Select the + Add button to select a file to test the model.

  3. With a file selected, choose the Analyze button to test the model.

  4. The model results are displayed in the main window and the fields extracted are listed in the right navigation bar.

  5. Validate your model by evaluating the results for each field.

  6. The right navigation bar also has the sample code to invoke your model and the JSON results from the API.
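When you review the JSON results outside the Studio, a small helper can point you to the fields that deserve a closer look. This sketch assumes the result nests fields under `analyzeResult` > `documents` > `fields`, with `content` and `confidence` on each field. That shape, the function name `low_confidence_fields`, and the 0.8 threshold are all assumptions to adapt to the JSON you actually receive.

```python
def low_confidence_fields(result, threshold=0.8):
    """Return (document index, field name, content, confidence) for weak fields."""
    flagged = []
    documents = result.get("analyzeResult", {}).get("documents", [])
    for index, document in enumerate(documents):
        for name, field in document.get("fields", {}).items():
            confidence = field.get("confidence")
            # A missing confidence is flagged too, so nothing is silently skipped.
            if confidence is None or confidence < threshold:
                flagged.append((index, name, field.get("content"), confidence))
    return flagged


if __name__ == "__main__":
    sample = {
        "analyzeResult": {
            "documents": [
                {
                    "fields": {
                        "Total": {"content": "42.00", "confidence": 0.97},
                        "Vendor": {"content": "Contoso", "confidence": 0.55},
                    }
                }
            ]
        }
    }
    for row in low_confidence_fields(sample):
        print(row)
```

Fields that are flagged repeatedly across test documents are good candidates for additional labeled examples.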

Congratulations, you've trained a custom model in the Form Recognizer Studio! Your model is ready for use with the REST API or the SDK to analyze documents.

Next steps

Applies to: Form Recognizer v2.1. Other versions: Form Recognizer v3.0

When you use the Form Recognizer custom model, you provide your own training data to the Train Custom Model operation, so that the model can train to your industry-specific forms. Follow this guide to learn how to collect and prepare data to train the model effectively.

You need at least five filled-in forms of the same type.

If you want to use manually labeled training data, you must start with at least five filled-in forms of the same type. You can still use unlabeled forms in addition to the required data set.

Custom model input requirements

First, make sure your training data set follows the input requirements for Form Recognizer.

  • For best results, provide one clear photo or high-quality scan per document.

  • Supported file formats:

    PDF: Read, Layout, General Document, Prebuilt, Custom
    Image (JPEG/JPG, PNG, BMP, and TIFF): Read, Layout, General Document, Prebuilt, Custom
    Microsoft Office (Word (DOCX), Excel (XLS), PowerPoint (PPT)) and HTML: Read only, with REST API version 2022/06/30-preview ✱

    ✱ Microsoft Office files are currently not supported for other models or versions.

  • For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).

  • The file size for analyzing documents must be less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier.

  • Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.

  • PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.

  • If your PDFs are password-locked, you must remove the lock before submission.

  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).

  • For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.

  • For custom model training, the total size of training data must not exceed 50 MB for the template model and 1 GB for the neural model.

Training data tips

Follow these tips to further optimize your data set for training.

  • If possible, use text-based PDF documents instead of image-based documents. Scanned PDFs are handled as images.
  • For filled-in forms, use examples that have all of their fields filled in.
  • Use forms with different values in each field.
  • If your form images are of lower quality, use a larger data set (10-15 images, for example).

Upload your training data

When you've put together the set of form documents that you'll use for training, you need to upload it to an Azure blob storage container. If you don't know how to create an Azure storage account with a container, follow the Azure Storage quickstart for Azure portal. Use the standard performance tier.

If you want to use manually labeled data, you'll also have to upload the .labels.json and .ocr.json files that correspond to your training documents. You can use the Sample Labeling tool (or your own UI) to generate these files.

Organize your data in subfolders (optional)

By default, the Train Custom Model API only uses documents located at the root of your storage container. However, you can train with data in subfolders if you specify it in the API call. Normally, the body of the Train Custom Model call has the following format, where <SAS URL> is the shared access signature (SAS) URL of your container:

{
  "source":"<SAS URL>"
}

If you add the following content to the request body, the API will train with documents located in subfolders. The "prefix" field is optional and will limit the training data set to files whose paths begin with the given string. So a value of "Test", for example, will cause the API to look at only the files or folders that begin with the word "Test".

{
  "source": "<SAS URL>",
  "sourceFilter": {
    "prefix": "<prefix string>",
    "includeSubFolders": true
  },
  "useLabelFile": false
}
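The two request bodies above can be generated from one small helper, which keeps the optional filter fields out of the body unless you ask for them. This is a sketch only: the function name `build_train_request` and its parameters are hypothetical, and it always writes `useLabelFile` explicitly, which is a choice made here rather than a requirement stated in this article.

```python
import json


def build_train_request(sas_url, prefix=None, include_subfolders=False,
                        use_label_file=False):
    """Build the JSON body for the Train Custom Model call described above."""
    body = {"source": sas_url}

    # Add sourceFilter only when a prefix or subfolder search is requested.
    if prefix is not None or include_subfolders:
        source_filter = {"includeSubFolders": include_subfolders}
        if prefix is not None:
            source_filter["prefix"] = prefix
        body["sourceFilter"] = source_filter

    body["useLabelFile"] = use_label_file
    return body


if __name__ == "__main__":
    # Train on files whose paths begin with "Test", including subfolders.
    request = build_train_request("<SAS URL>", prefix="Test", include_subfolders=True)
    print(json.dumps(request, indent=2))
```

Set `use_label_file=True` when you have uploaded manually labeled data and want training to use the label files.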

Next steps

Now that you've learned how to build a training data set, follow a quickstart to train a custom Form Recognizer model and start using it on your forms.

See also