Speech model customization, including pronunciation training, is supported only in Video Indexer Azure trial accounts and Resource Manager accounts. It isn't supported in classic accounts. For guidance on how to update your account type at no cost, see Update your Azure AI Video Indexer account. For guidance on using the custom language experience, see Customize a Language model.
Azure AI Video Indexer lets you create custom speech models to customize speech recognition by uploading datasets that are used to create a speech model. This article goes through the steps to do so through the Video Indexer website. You can also use the API, as described in Customize speech model using API.
As all custom models must contain a dataset, we'll start with the process of how to create and manage datasets.
Select the Model customization button.
Select the Speech (new) tab.
Select Upload dataset.
Select either Plain text or Pronunciation from the Dataset type dropdown menu. Every speech model must have a plain text dataset and can optionally have a pronunciation dataset.
Select Browse and select the dataset file. You can choose only one.
Select a Language for the model. Choose the language that is spoken in the media files you plan on indexing with this model. The Dataset name is prepopulated with the name of the file but you can modify the name.
You can optionally add a description of the dataset. This could be helpful to distinguish each dataset if you expect to have multiple datasets.
Select Upload. When the dataset has been created, you can use it to create and train new models.
Review and update a dataset
You can view a dataset and its properties in either of two ways:
Select the dataset name.
Hover over the dataset, select the ellipsis, and then select View Dataset.
You can then view the name, description, language, and status of the dataset plus the following properties:
Number of lines: Indicates the number of lines successfully loaded out of the total number of lines in the file. If the entire file loads successfully, the numbers match (for example, 10 of 10 normalized). If the numbers don't match (for example, 7 of 10 normalized), only some of the lines loaded and the rest had errors. A common cause is a formatting issue in a line, such as words in a pronunciation file that aren't separated by a tab. To troubleshoot, select View report, or select the Report tab, to see the error details (errorKind) for the lines that didn't load. The articles on plain text and pronunciation data for training can also help you find the issue.
Dataset ID: Each dataset has a unique GUID, which is needed when using the API for operations that reference the dataset.
Plain text (normalized): This contains the normalized text of the loaded dataset file. Normalized text is the recognized text in plain form without formatting.
Edit Details: To edit a dataset's name or description, hover over the dataset, select the ellipsis, and then select Edit details. You can then edit the dataset name and description.
Note
The data in a dataset can't be edited or updated once the dataset has been uploaded. If you need to edit or update the data in a dataset, download the dataset, perform the edits, save the file, and upload the new dataset file.
Download: To download a dataset file, hover over the dataset, select the ellipsis, and then select Download. Alternatively, when viewing the dataset, select Download and then choose whether to download the dataset file or the upload report in JSON form.
Delete: To delete a dataset, hover over the dataset, select the ellipsis, and then select Delete.
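Because a missing tab in a pronunciation file is named above as a common cause of lines failing to load, a quick local check before uploading may save a round trip. The following Python sketch assumes that each non-empty line holds exactly two tab-separated fields. That two-column layout, the function name `find_malformed_lines`, and the sample data are assumptions for illustration, not something this article specifies, so adjust the rule to the pronunciation format you actually use.

```python
def find_malformed_lines(text: str) -> list[int]:
    """Return 1-based numbers of non-empty lines that are not two tab-separated fields.

    Assumption: a well-formed pronunciation line has exactly two non-empty
    fields separated by a single tab character.
    """
    malformed = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        fields = line.split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            malformed.append(number)
    return malformed


# Hypothetical sample: line 2 uses a space where a tab was intended.
sample = "Fabrikam\tfab ri kam\nContoso con to so\n"
print(find_malformed_lines(sample))
```

Lines reported by a check like this are candidates to fix before you upload the file; the upload report remains the authoritative source for what actually failed.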
Create a custom speech model
Datasets are used in the creation and training of models. Once you have created a plain text dataset, you can create and start using a custom speech model.
Keep in mind the following when creating and using custom speech models:
A new model must include at least one plain text dataset and can have multiple plain text datasets.
It's optional to include a pronunciation dataset and no more than one can be included.
Once a model is created, you can't add datasets to it or modify its datasets. If you need to add or modify datasets, create a new model.
If you have indexed a video using a custom speech model and then delete the model, the transcript isn't impacted unless you perform a reindex.
If you delete a dataset that was used to train a custom model, the model isn't affected. Because the model was already trained on the dataset, it continues to use that training until the model itself is deleted.
If you delete a custom model, it has no impact on the transcription of videos that were already indexed using the model.
Train a model
Note
Once a model is created, datasets can't be added.
A model can only contain datasets of the same language.
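The rules stated so far (at least one plain text dataset, at most one pronunciation dataset, and a single language across all datasets) can be checked before you start training. The following Python sketch is one way to express them. The `Dataset` class, the `DatasetType` enum, and `check_model_datasets` are hypothetical names introduced here for illustration; they aren't part of Video Indexer.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetType(Enum):
    PLAIN_TEXT = "plain text"
    PRONUNCIATION = "pronunciation"


@dataclass(frozen=True)
class Dataset:
    # Hypothetical local record of an uploaded dataset.
    name: str
    dataset_type: DatasetType
    language: str


def check_model_datasets(datasets: list[Dataset]) -> list[str]:
    """Return a list of rule violations; an empty list means the selection looks valid."""
    problems = []
    plain = [d for d in datasets if d.dataset_type is DatasetType.PLAIN_TEXT]
    pronunciation = [d for d in datasets if d.dataset_type is DatasetType.PRONUNCIATION]
    if not plain:
        problems.append("at least one plain text dataset is required")
    if len(pronunciation) > 1:
        problems.append("no more than one pronunciation dataset is allowed")
    if len({d.language for d in datasets}) > 1:
        problems.append("all datasets must share one language")
    return problems


selection = [
    Dataset("product-terms", DatasetType.PLAIN_TEXT, "en-US"),
    Dataset("acronyms", DatasetType.PRONUNCIATION, "en-US"),
]
print(check_model_datasets(selection))
```

Because datasets can't be added after a model is created, running a check like this on the full selection first helps avoid having to create a second model.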
There are two ways to train a model: through the Datasets tab and through the Models tab.
Train a model through the Datasets tab
View the list of datasets.
Select a plain text dataset. The Train new model icon above then becomes selectable.
Select Train new model.
Enter a name for the model, a language, and optionally add a description.
Select the Datasets tab.
Select the datasets you want to be included in the model.
Select Create and train.
Train a model through the Models tab
Select the Models tab.
Select the Train new model icon.
Select the datasets that you want to be part of the model.
Enter a name for the model, a language, and optionally add a description.
Select the Datasets tab.
Select the datasets you want to be included in the model.
Select Create and train.
Review and update a model
View Model: You can view a model and its properties either by selecting the model's name or by hovering over the model, selecting the ellipsis, and then selecting View Model.
The Details tab then shows the name, description, language, and status of the model plus the following properties:
Model ID: Each model has a unique GUID, which is needed when using the API for operations that reference the model.
Created on: The date the model was created.
Edit Details: To edit a model's name or description, hover over the model, select the ellipsis, and then select Edit details. You can then edit the model's name and description.
Note
Only the model’s name and description can be edited. If you want to make any changes to its datasets or add datasets, a new model must be created.
Delete: To delete a model, hover over the model, select the ellipsis, and then select Delete.
Included datasets: Select the Included datasets tab to view the model's datasets.
Use a custom language model when indexing a video
A custom language model isn't used by default for indexing jobs, so you must select it during the upload process.
During the upload process, select your custom language model source from the language drop-down menu.
Select Upload.
The same steps apply when you want to reindex a video with a custom model.
The following table describes some of the parameters used with the speech model requests:

| Name | Type | Description |
|---|---|---|
| displayName | string | The desired name of the dataset/model. |
| locale | string | The language code of the dataset/model. For the full list, see Language support. |
| kind | integer | 0 for a plain text dataset, 1 for a pronunciation dataset. |
| description | string | Optional description of the dataset/model. |
| contentUrl | uri | URL of the source file used in creation of the dataset. |
| customProperties | object | Optional properties of the dataset/model. |
Create a speech dataset
The Create Speech Dataset request creates a dataset for training a speech model. With this request, you upload a file that is used to create the dataset. The content of a dataset can't be modified after it's created.
Define the parameters in the request body, including a URL to the text file to be uploaded. The description and custom properties fields are optional. This is an example of a request body:
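As a rough sketch of how such a body might be assembled, the following Python helper uses the parameter names from the table above (`displayName`, `locale`, `kind`, `contentUrl`, `description`, `customProperties`). The helper name `build_dataset_body`, the sample values, and the example URL are placeholders assumed for illustration, and the sketch only builds the JSON; it doesn't send the request.

```python
import json
from typing import Optional


def build_dataset_body(
    display_name: str,
    locale: str,
    kind: int,
    content_url: str,
    description: Optional[str] = None,
    custom_properties: Optional[dict] = None,
) -> dict:
    """Assemble a request body for creating a speech dataset."""
    if kind not in (0, 1):
        # Per the parameter table: 0 is plain text, 1 is pronunciation.
        raise ValueError("kind must be 0 (plain text) or 1 (pronunciation)")
    body = {
        "displayName": display_name,
        "locale": locale,
        "kind": kind,
        "contentUrl": content_url,
    }
    # description and customProperties are optional, so include them only when set.
    if description is not None:
        body["description"] = description
    if custom_properties is not None:
        body["customProperties"] = custom_properties
    return body


# Placeholder values; replace them with your own file URL and names.
dataset_body = build_dataset_body(
    display_name="Product terms",
    locale="en-US",
    kind=0,
    content_url="https://example.com/datasets/product-terms.txt",
    description="Plain text dataset of product vocabulary",
)
print(json.dumps(dataset_body, indent=2))
```

Leaving the optional fields out of the dictionary entirely, rather than sending them as null, keeps the body minimal; whether the service accepts null values isn't stated here.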
Create a speech model
The Create Speech Model request creates and trains a custom speech model that can be used to improve the transcription accuracy of your videos. The model must contain at least one plain text dataset and can optionally include pronunciation data. Create the model with all of the relevant datasets, because a model's datasets can't be added to or updated after it's created.
Define the parameters in the request body, including a list of strings that identifies the dataset or datasets for the model to include. The description and custom properties fields are optional. This is a sample of a request body:
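A body of this kind could be assembled as in the following Python sketch. The `displayName`, `locale`, `description`, and `customProperties` names come from the parameter table; the key name `datasets` for the list of dataset identifiers is an assumption, as are the helper name `build_model_body` and the placeholder ID. Check the API reference for the exact field name before relying on it.

```python
import json
from typing import Optional


def build_model_body(
    display_name: str,
    locale: str,
    dataset_ids: list[str],
    description: Optional[str] = None,
    custom_properties: Optional[dict] = None,
) -> dict:
    """Assemble a request body for creating and training a speech model."""
    if not dataset_ids:
        # A model must contain at least one (plain text) dataset.
        raise ValueError("at least one dataset ID is required")
    body = {
        "displayName": display_name,
        "locale": locale,
        "datasets": list(dataset_ids),  # assumed key name for the list of strings
    }
    if description is not None:
        body["description"] = description
    if custom_properties is not None:
        body["customProperties"] = custom_properties
    return body


# Placeholder GUID; use the Dataset ID values of your own datasets.
model_body = build_model_body(
    display_name="Product speech model",
    locale="en-US",
    dataset_ids=["00000000-0000-0000-0000-000000000000"],
)
print(json.dumps(model_body, indent=2))
```

The sketch can only confirm that the list is non-empty; it can't tell from an ID whether a dataset is plain text or pronunciation, so that requirement still has to be verified separately.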
Delete a speech dataset
The Delete Speech Dataset API deletes the specified dataset. Any model that was trained with the deleted dataset continues to be available until the model is deleted. You can't delete a dataset while it's in use for indexing or training.
Example response
There's no returned content when the dataset is deleted successfully.
Delete a speech model
The Delete Speech Model API deletes the specified speech model. You can't delete a model while it's in use for indexing or training.
Response
There's no returned content when the speech model is deleted successfully.