Customize a speech model

Article
10/09/2024

Note

Speech model customization, including pronunciation training, is only supported in Video Indexer Azure trial accounts and Resource Manager accounts. It is not supported in classic accounts. For guidance on how to update your account type at no cost, see the Update your Azure AI Video Indexer account. For guidance on using the custom language experience, see Customize a Language model.

Azure AI Video Indexer lets you create custom speech models to customize speech recognition by uploading datasets that are used to create a speech model. This article goes through the steps to do so through the Video Indexer website. You can also use the API, as described in Customize speech model using API.

For a detailed overview and best practices for custom speech models, see Customize a speech model with Azure AI Video Indexer.

Prerequisites

Read the Speech model training best practices guide.
An Azure account
An Azure AI Video Indexer account

Web portal
API

Create a dataset

As all custom models must contain a dataset, we'll start with the process of how to create and manage datasets.

Select the Model customization button.
Select the Speech (new) tab.
Select Upload dataset.
Select either Plain text or Pronunciation from the Dataset type dropdown menu. Every speech model must have a plain text dataset and can optionally have a pronunciation dataset.
Select Browse and select the dataset file. You can choose only one.
Select a Language for the model. Choose the language that is spoken in the media files you plan on indexing with this model. The Dataset name is prepopulated with the name of the file but you can modify the name.
You can optionally add a description of the dataset. This could be helpful to distinguish each dataset if you expect to have multiple datasets.
Select Upload. When the dataset creation is complete, you can use it for training and creation of new models.

Review and update a dataset

You can view a dataset and its properties by:

Clicking on the dataset name
Hovering over the dataset
Selecting the ellipsis

Then, select View Dataset.

You can then view the name, description, language, and status of the dataset plus the following properties:

Number of lines: indicates the number of lines successfully loaded out of the total number of lines in the file. If the entire file is loaded successfully the numbers will match (for example, 10 of 10 normalized). If the numbers don't match (for example, 7 of 10 normalized), this means that only some of the lines successfully loaded and the rest had errors. Common causes of errors are formatting issues with a line, such as not spacing a tab between each word in a pronunciation file. Reviewing the plain text and pronunciation data for training articles should be helpful in finding the issue. To troubleshoot the cause, review the error details, which are contained in the report. Select View report to view the error details regarding the lines that didn't load successfully (errorKind). This can also be viewed by selecting the Report tab.

Dataset ID: Each dataset has a unique GUID, which is needed when using the API for operations that reference the dataset.

Plain text (normalized): This contains the normalized text of the loaded dataset file. Normalized text is the recognized text in plain form without formatting.

Edit Details: To edit a dataset's name or description, when hovering over the dataset, select on the ellipsis and then select Edit details. You're then able to edit the dataset name and description.

Note

The data in a dataset can't be edited or updated once the dataset has been uploaded. If you need to edit or update the data in a dataset, download the dataset, perform the edits, save the file, and upload the new dataset file.

Download: To download a dataset file, when hovering over the dataset, select on the ellipsis and then select Download. Alternatively, when viewing the dataset, you can select Download and then have the option of downloading the dataset file or the upload report in JSON form.

Delete: To delete a dataset, when hovering over the dataset, select on the ellipsis and then select Delete.

Create a custom speech model

Datasets are used in the creation and training of models. Once you have created a plain text dataset, you can create and start using a custom speech model.

Keep in mind the following when creating and using custom speech models:

A new model must include at least one plain text dataset and can have multiple plain text datasets.
It's optional to include a pronunciation dataset and no more than one can be included.
Once a model is created, you can't add additional datasets to it or perform any modifications to its datasets. If you need to add or modify datasets, create a new model.
If you have indexed a video using a custom speech model and then delete the model, the transcript isn't impacted unless you perform a reindex.
If you deleted a dataset that was used to train a custom model, as the speech model was already trained by the dataset, it continues to use it until the speech model is deleted.
If you delete a custom model, it has no impact of the transcription of videos that were already indexed using the model.

Train a model

Note

Once a model is created, datasets can't be added. A model can only contain datasets of the same language.

There are two ways to train a model – through the dataset tab and through the model tab.

Train a model through the Datasets tab

View the list of datasets.
Select a plain text dataset. The Train new model icon above can then be selected.
Select Train new model.
Enter a name for the model, a language, and optionally add a description.
Select the Datasets tab
Select the datasets you want to be included in the model.
Select Create and train.

Train a model through the Models tab

Select the Models tab.
Select Train new model icon.
Select the datasets that you want to be part of the model.
Enter a name for the model, a language, and optionally add a description.
Select the Datasets tab.
Select the datasets you want to be included in the model.
Select Create and train.

Review and update a model

View Model: You can view a model and its properties by either clicking on the model’s name or when hovering over the model, clicking on the ellipsis and then selecting View Model.

You'll then see in the Details tab the name, description, language, and status of the model plus the following properties:

Model ID: Each model has a unique GUID, which is needed when using the API for operations that reference the model.

Created on: The date the model was created.

Edit Details: To edit a model’s name or description, when hovering over the model, select on the ellipsis and then select Edit details. You're then able to edit the model’s name and description.

Note

Only the model’s name and description can be edited. If you want to make any changes to its datasets or add datasets, a new model must be created.

Delete: To delete a model, when hovering over the dataset, select on the ellipsis and then select Delete.

Included datasets: Select on the Included datasets tab to view the model’s datasets.

Use a custom language model when indexing a video

A custom language model isn't used by default for indexing jobs, so must be selected during the index upload process.

During the upload process, select your custom language model source from the language drop-down menu.
Select Upload.

The same steps apply when you want to reindex a video with a custom model.

Note

The following is a table of descriptions of some of the parameters used with the speech model requests:

Name	Type	Description
`displayName`	string	The desired name of the dataset/model.
`locale`	string	The language code of the dataset/model. For full list, see Language support.
`kind`	integer	0 for a plain text dataset, 1 for a pronunciation dataset.
`description`	string	Optional description of the dataset/model.
`contentUrl`	uri	URL of source file used in creation of dataset.
`customProperties`	object	Optional properties of dataset/model.

Create a speech dataset

The Create Speech Dataset request creates a dataset for training a speech model. Upload a file that is used to create a dataset with this request. The content of a dataset can't be modified after it's created.

Define the parameters in the request body, including a URL to the text file to be uploaded. The description and custom properties fields are optional. This is an example of a request body:

{
    "displayName": "Pronunciation Dataset",
    "locale": "en-US",
    "kind": "Pronunciation",
    "description": "This is a pronunciation dataset.",
    "contentUrl": https://contoso.com/location,
    "customProperties": {
        "tag": "Pronunciation Dataset Example"
    }
}

Example response

{ 
    "id": "000000-0000-0000-0000-f58ac7002ae9", 
    "properties": { 
        "acceptedLineCount": 0, 
        "rejectedLineCount": 0, 
        "duration": null, 
        "error": null 
    }, 
    "displayName": "Contoso plain text", 
    "description": "VI dataset", 
    "locale": "en-US", 
    "kind": "Language", 
    "status": "Waiting", 
    "lastActionDateTime": "2023-02-28T13:24:27Z", 
    "createdDateTime": "2023-02-28T13:24:27Z", 
    "customProperties": null 
}

Create a speech model

The Create Speech Model request creates and trains a custom speech model that can be used to improve the transcription accuracy of your videos. It must contain at least one plain text dataset. It can optionally have pronunciation datasets. Create it with all of the relevant dataset files as a model’s datasets can't be added or updated after its creation.

Define the parameters in the request body, including a list of strings that the dataset or datasets for the model to include. The description and custom properties fields are optional. This is a sample of a request body:

{
    "displayName": "Contoso Speech Model",
    "locale": "en-US",
    "datasets": ["ff3d2bc4-ab5a-4522-b599-b3d5ba768c75", "87c8962d-1d3c-44e5-a2b2-c696fddb9bae"],
    "description": "Contoso ads example model",
    "customProperties": {
        "tag": "Example Model"
    }
}

Example response

{ 
    "id": "00000000-0000-0000-0000-85be4454cf", 
    "properties": { 
        "deprecationDates": { 
            "adaptationDateTime": null, 
            "transcriptionDateTime": "2025-04-15T00:00:00Z" 
        }, 
        "error": null 
    }, 
    "displayName": "Contoso speech model", 
    "description": "Contoso speech model for video indexer", 
    "locale": "en-US", 
    "datasets": ["00000000-0000-0000-0000-f58ac7002ae9"], 
    "status": "Processing", 
    "lastActionDateTime": "2023-02-28T13:36:28Z", 
    "createdDateTime": "2023-02-28T13:36:28Z", 
    "customProperties": null 
}

Get speech dataset

The Get Speech Dataset API returns information on the specified dataset.

Example response

{ 
    "id": "00000000-0000-0000-0000-f58002ae9", 
    "properties": { 
        "acceptedLineCount": 41, 
        "rejectedLineCount": 0, 
        "duration": null, 
        "error": null 
    }, 
    "displayName": "Contoso plain text", 
    "description": "VI dataset", 
    "locale": "en-US", 
    "kind": "Language", 
    "status": "Complete", 
    "lastActionDateTime": "2023-02-28T13:24:43Z", 
    "createdDateTime": "2023-02-28T13:24:27Z", 
    "customProperties": null 
}

Get speech datasets files

The Get Speech Dataset Files request returns the files and metadata of the specified dataset.

Example response

[{ 
    "datasetId": "00000000-0000-0000-0000-f58ac72a", 
    "fileId": "00000000-0000-0000-0000-cb190769c", 
    "name": "languagedata", 
    "contentUrl": "", 
    "kind": "LanguageData", 
    "createdDateTime": "2023-02-28T13:24:43Z", 
    "properties": { 
        "size": 1517 
    } 
}, { 
    "datasetId": "00000000-0000-0000-0000-f58ac72” 
    "fileId": "00000000-0000-0000-0000-2369192e", 
    "name": "normalized.txt", 
    "contentUrl": "", 
    "kind": "LanguageData", 
    "createdDateTime": "2023-02-28T13:24:43Z", 
    "properties": { 
        "size": 1517 
    } 
}, { 
    "datasetId": "00000000-0000-0000-0000-f58ac7", 
    "fileId": "00000000-0000-0000-0000-05f1e306", 
    "name": "report.json", 
    "contentUrl": "", 
    "kind": "DatasetReport", 
    "createdDateTime": "2023-02-28T13:24:43Z", 
    "properties": { 
        "size": 78 
    } 
}]

Get the specified account datasets

The Get Speech Datasets request returns information on all of the specified accounts datasets.

Example response

[{ 
    "id": "00000000-0000-0000-abf5-4dad0f", 
    "properties": { 
        "acceptedLineCount": 41, 
        "rejectedLineCount": 0, 
        "duration": null, 
        "error": null 
    }, 
    "displayName": "test", 
    "description": "string", 
    "locale": "en-US", 
    "kind": "Language", 
    "status": "Complete", 
    "lastActionDateTime": "2023-02-27T08:42:02Z", 
    "createdDateTime": "2023-02-27T08:41:39Z", 
    "customProperties": null 
}]

Get the specified speech model

The Get Speech Model API returns information on the specified model.

Example response

{ 
    "id": "00000000-0000-0000-0000-5685be445", 
    "properties": { 
        "deprecationDates": { 
            "adaptationDateTime": null, 
            "transcriptionDateTime": "2025-04-15T00:00:00Z" 
        }, 
        "error": null 
    }, 
    "displayName": "Contoso speech model", 
    "description": "Contoso speech model for video indexer", 
    "locale": "en-US", 
    "datasets": ["00000000-0000-0000-0000-f58ac7002"], 
    "status": "Complete", 
    "lastActionDateTime": "2023-02-28T13:36:38Z", 
    "createdDateTime": "2023-02-28T13:36:28Z", 
    "customProperties": null 
}

Get the specified account speech models

The Get Speech Models API returns information on all of the models in the specified account.

Example response

[{ 
    "id": "00000000-0000-0000-0000-5685be445", 
    "properties": { 
        "deprecationDates": { 
            "adaptationDateTime": null, 
            "transcriptionDateTime": "2025-04-15T00:00:00Z" 
        }, 
        "error": null 
    }, 
    "displayName": "Contoso speech model", 
    "description": "Contoso speech model for video indexer", 
    "locale": "en-US", 
    "datasets": ["00000000-0000-0000-0000-f58ac7002a"], 
    "status": "Complete", 
    "lastActionDateTime": "2023-02-28T13:36:38Z", 
    "createdDateTime": "2023-02-28T13:36:28Z", 
    "customProperties": null 
}]

Delete speech dataset

The Delete Speech Dataset API deletes the specified dataset. Any model that was trained with the deleted dataset continues to be available until the model is deleted. You can't delete a dataset while it is in use for indexing or training.

Example response

There's no returned content when the dataset is deleted successfully.

Delete a speech model

The Delete Speech Model API deletes the specified speech model. You can't delete a model while it is in use for indexing or training.

Response

There's no returned content when the speech model is deleted successfully.

Share via

Customize a speech model

Prerequisites

Create a dataset

Review and update a dataset

Create a custom speech model

Train a model

Train a model through the Datasets tab

Train a model through the Models tab

Review and update a model

Use a custom language model when indexing a video

Create a speech dataset

Example response

Create a speech model

Example response

Get speech dataset

Example response

Get speech datasets files

Example response

Get the specified account datasets

Example response

Get the specified speech model

Example response

Get the specified account speech models

Example response

Delete speech dataset

Example response

Delete a speech model

Response

Feedback

Additional resources