How to use batch synthesis for text to speech avatar
The batch synthesis API for text to speech avatar synthesizes text asynchronously into a video file of a talking avatar. Publishers and video content platforms can use this API to create avatar video content in batches. This approach suits use cases such as training materials, presentations, and advertisements.
The avatar video is generated asynchronously after the system receives the text input. You submit text for synthesis, poll for the synthesis status, and download the video output when the status indicates success. The text input must be plain text or Speech Synthesis Markup Language (SSML).
This diagram provides a high-level overview of the workflow.
To perform batch synthesis, you can use the following REST API operations.
Operation | Method | REST API call |
---|---|---|
Create batch synthesis | PUT | avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01 |
Get batch synthesis | GET | avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01 |
List batch synthesis | GET | avatar/batchsyntheses/?api-version=2024-08-01 |
Delete batch synthesis | DELETE | avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01 |
You can refer to the code samples on GitHub.
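As a rough illustration of the table above, the following Python sketch maps each operation to an unsent HTTP request. The helper name build_request and the OPERATIONS mapping are hypothetical and not part of the service or its samples; the host name pattern and the Ocp-Apim-Subscription-Key header are taken from the curl examples in this article.

```python
from urllib.parse import quote
from urllib.request import Request

API_VERSION = "2024-08-01"

# (HTTP method, whether the path includes a synthesis ID), as listed in the table.
OPERATIONS = {
    "create": ("PUT", True),
    "get": ("GET", True),
    "list": ("GET", False),
    "delete": ("DELETE", True),
}


def build_request(operation, region, key, synthesis_id=None):
    """Return an unsent urllib Request for one of the four operations.

    Hypothetical helper: a body for "create" still has to be attached
    by the caller before the request is sent.
    """
    method, needs_id = OPERATIONS[operation]
    if needs_id and not synthesis_id:
        raise ValueError(f"{operation!r} requires a synthesis ID")
    base = f"https://{region}.api.cognitive.microsoft.com/avatar/batchsyntheses"
    path = f"/{quote(synthesis_id, safe='')}" if needs_id else ""
    url = f"{base}{path}?api-version={API_VERSION}"
    return Request(url, headers={"Ocp-Apim-Subscription-Key": key}, method=method)
```

Passing the returned object to urllib.request.urlopen would send it; the sketch stops short of that so it can be inspected without a Speech resource.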
Create a batch synthesis request
Some properties in JSON format are required when you create a new batch synthesis job. Other properties are optional. The batch synthesis response includes other properties to provide information about the synthesis status and results. For example, the outputs.result property contains the location from which you can download a video file containing the avatar video. From outputs.summary, you can access the summary and debug details.
To submit a batch synthesis request, construct the HTTP PUT request body following these instructions:
- Set the required inputKind property.
- If the inputKind property is set to PlainText, you must also set the voice property in synthesisConfig. In the example below, inputKind is set to SSML, so synthesisConfig isn't set.
- Set the required SynthesisId in the request URI. Choose a SynthesisId that is unique within the same Speech resource. The SynthesisId can be a string of 3 to 64 characters, including letters, numbers, '-', or '_', and it must start and end with a letter or number.
- Set the required talkingAvatarCharacter and talkingAvatarStyle properties. You can find supported avatar characters and styles here.
- Optionally, you can set the videoFormat, backgroundColor, and other properties. For more information, see batch synthesis properties.
Note
The maximum JSON payload size accepted is 500 kilobytes.
Each Speech resource can have up to 200 batch synthesis jobs running concurrently.
The maximum length for the output video is currently 20 minutes, with potential increases in the future.
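The rules above can be checked on the client before a request is sent. The following Python sketch is one way to do that under stated assumptions: the names is_valid_synthesis_id and build_body are hypothetical, and "500 kilobytes" is read as 500 * 1024 bytes, which the service might count differently.

```python
import json
import re

# 3 to 64 characters: letters, digits, '-' or '_', starting and ending
# with a letter or digit.
SYNTHESIS_ID_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{1,62}[A-Za-z0-9]")

# Assumption: "500 kilobytes" is interpreted as 500 * 1024 bytes.
MAX_PAYLOAD_BYTES = 500 * 1024


def is_valid_synthesis_id(synthesis_id):
    """Check the SynthesisId format described in the instructions above."""
    return SYNTHESIS_ID_PATTERN.fullmatch(synthesis_id) is not None


def build_body(input_kind, contents, character, style, voice=None):
    """Return the JSON request body as UTF-8 bytes (hypothetical helper)."""
    if input_kind not in ("PlainText", "SSML"):
        raise ValueError("inputKind must be 'PlainText' or 'SSML'")
    body = {
        "inputKind": input_kind,
        "inputs": [{"content": text} for text in contents],
        "avatarConfig": {
            "talkingAvatarCharacter": character,
            "talkingAvatarStyle": style,
        },
    }
    if input_kind == "PlainText":
        # Plain text carries no <voice> element, so the voice goes in synthesisConfig.
        if not voice:
            raise ValueError("PlainText input requires a voice in synthesisConfig")
        body["synthesisConfig"] = {"voice": voice}
    payload = json.dumps(body).encode("utf-8")
    if len(payload) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds the 500 KB limit")
    return payload
```

Validating locally only avoids an obviously malformed request; the service remains the authority on what it accepts.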
To make an HTTP PUT request, use the URI format shown in the following example. Replace YourSpeechKey with your Speech resource key, YourSpeechRegion with your Speech resource region, and set the request body properties as described above.
curl -v -X PUT -H "Ocp-Apim-Subscription-Key: YourSpeechKey" -H "Content-Type: application/json" -d '{
"inputKind": "SSML",
"inputs": [
{
"content": "<speak version='\''1.0'\'' xml:lang='\''en-US'\''><voice name='\''en-US-AvaMultilingualNeural'\''>The rainbow has seven colors.</voice></speak>"
}
],
"avatarConfig": {
"talkingAvatarCharacter": "lisa",
"talkingAvatarStyle": "graceful-sitting"
}
}' "https://YourSpeechRegion.api.cognitive.microsoft.com/avatar/batchsyntheses/my-job-01?api-version=2024-08-01"
You should receive a response body in the following format:
{
"id": "my-job-01",
"internalId": "5a25b929-1358-4e81-a036-33000e788c46",
"status": "NotStarted",
"createdDateTime": "2024-03-06T07:34:08.9487009Z",
"lastActionDateTime": "2024-03-06T07:34:08.9487012Z",
"inputKind": "SSML",
"customVoices": {},
"properties": {
"timeToLiveInHours": 744
},
"avatarConfig": {
"talkingAvatarCharacter": "lisa",
"talkingAvatarStyle": "graceful-sitting",
"videoFormat": "Mp4",
"videoCodec": "hevc",
"subtitleType": "soft_embedded",
"bitrateKbps": 2000,
"customized": false
}
}
The status property should progress from NotStarted to Running and finally to Succeeded or Failed. You can periodically call the GET batch synthesis API until the returned status is Succeeded or Failed.
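The polling described above could be sketched in Python as follows. The function wait_for_completion, its default interval and timeout, and the fetch_job callback are all assumptions made for illustration; fetch_job stands for whatever code performs the GET request and returns the parsed JSON response.

```python
import time

TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_completion(fetch_job, interval_seconds=5.0, timeout_seconds=600.0,
                        sleep=time.sleep):
    """Call fetch_job() until the job reaches a terminal status, then return it.

    fetch_job is a caller-supplied function returning the job as a dict.
    The interval and timeout defaults are arbitrary illustrative values.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still {job['status']!r} after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

Injecting the fetch and sleep functions keeps the loop independent of any HTTP library and makes it easy to exercise with canned responses.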
Get batch synthesis
To retrieve the status of a batch synthesis job, make an HTTP GET request using the URI as shown in the following example.
Replace YourSynthesisId with your batch synthesis ID, YourSpeechKey with your Speech resource key, and YourSpeechRegion with your Speech resource region.
curl -v -X GET "https://YourSpeechRegion.api.cognitive.microsoft.com/avatar/batchsyntheses/YourSynthesisId?api-version=2024-08-01" -H "Ocp-Apim-Subscription-Key: YourSpeechKey"
You should receive a response body in the following format:
{
"id": "my-job-01",
"internalId": "5a25b929-1358-4e81-a036-33000e788c46",
"status": "Succeeded",
"createdDateTime": "2024-03-06T07:34:08.9487009Z",
"lastActionDateTime": "2024-03-06T07:34:12.5698769",
"inputKind": "SSML",
"customVoices": {},
"properties": {
"timeToLiveInHours": 744,
"sizeInBytes": 344460,
"durationInMilliseconds": 2520,
"succeededCount": 1,
"failedCount": 0,
"billingDetails": {
"neuralCharacters": 29,
"talkingAvatarDurationSeconds": 2
}
},
"avatarConfig": {
"talkingAvatarCharacter": "lisa",
"talkingAvatarStyle": "graceful-sitting",
"videoFormat": "Mp4",
"videoCodec": "hevc",
"subtitleType": "soft_embedded",
"bitrateKbps": 2000,
"customized": false
},
"outputs": {
"result": "https://stttssvcprodusw2.blob.core.windows.net/batchsynthesis-output/xxxxx/xxxxx/0001.mp4?SAS_Token",
"summary": "https://stttssvcprodusw2.blob.core.windows.net/batchsynthesis-output/xxxxx/xxxxx/summary.json?SAS_Token"
}
}
From the outputs.result field, you can download a video file containing the avatar video. The outputs.summary field allows you to download the summary and debug details. For more information on batch synthesis results, see batch synthesis results.
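To illustrate how a client might read those two fields, here is a small Python sketch. The function name output_urls is hypothetical, and refusing to return URLs for any status other than Succeeded is a design choice of this sketch rather than documented service behavior.

```python
def output_urls(job):
    """Return (result_url, summary_url) from a parsed GET response.

    Hypothetical helper: raises ValueError unless the job has succeeded,
    so callers do not try to download a video that does not exist yet.
    """
    status = job.get("status")
    if status != "Succeeded":
        raise ValueError(f"job {job.get('id')!r} has status {status!r}")
    outputs = job["outputs"]
    return outputs["result"], outputs["summary"]
```

The job argument is the response body after parsing, for example with json.loads.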
List batch synthesis
To list all batch synthesis jobs for your Speech resource, make an HTTP GET request using the URI as shown in the following example.
Replace YourSpeechKey with your Speech resource key and YourSpeechRegion with your Speech resource region. Optionally, you can set the skip and maxpagesize (page size) query parameters in the URL. The default value for skip is 0, and the default value for maxpagesize is 100.
curl -v -X GET "https://YourSpeechRegion.api.cognitive.microsoft.com/avatar/batchsyntheses?skip=0&maxpagesize=2&api-version=2024-08-01" -H "Ocp-Apim-Subscription-Key: YourSpeechKey"
You should receive a response body in the following format:
{
"value": [
{
"id": "my-job-02",
"internalId": "14c25fcf-3cb6-4f46-8810-ecad06d956df",
"status": "Succeeded",
"createdDateTime": "2024-03-06T07:52:23.9054709Z",
"lastActionDateTime": "2024-03-06T07:52:29.3416944",
"inputKind": "SSML",
"customVoices": {},
"properties": {
"timeToLiveInHours": 744,
"sizeInBytes": 502676,
"durationInMilliseconds": 2950,
"succeededCount": 1,
"failedCount": 0,
"billingDetails": {
"neuralCharacters": 32,
"talkingAvatarDurationSeconds": 2
}
},
"avatarConfig": {
"talkingAvatarCharacter": "lisa",
"talkingAvatarStyle": "casual-sitting",
"videoFormat": "Mp4",
"videoCodec": "h264",
"subtitleType": "soft_embedded",
"bitrateKbps": 2000,
"customized": false
},
"outputs": {
"result": "https://stttssvcprodusw2.blob.core.windows.net/batchsynthesis-output/xxxxx/xxxxx/0001.mp4?SAS_Token",
"summary": "https://stttssvcprodusw2.blob.core.windows.net/batchsynthesis-output/xxxxx/xxxxx/summary.json?SAS_Token"
}
},
{
"id": "my-job-01",
"internalId": "5a25b929-1358-4e81-a036-33000e788c46",
"status": "Succeeded",
"createdDateTime": "2024-03-06T07:34:08.9487009Z",
"lastActionDateTime": "2024-03-06T07:34:12.5698769",
"inputKind": "SSML",
"customVoices": {},
"properties": {
"timeToLiveInHours": 744,
"sizeInBytes": 344460,
"durationInMilliseconds": 2520,
"succeededCount": 1,
"failedCount": 0,
"billingDetails": {
"neuralCharacters": 29,
"talkingAvatarDurationSeconds": 2
}
},
"avatarConfig": {
"talkingAvatarCharacter": "lisa",
"talkingAvatarStyle": "graceful-sitting",
"videoFormat": "Mp4",
"videoCodec": "hevc",
"subtitleType": "soft_embedded",
"bitrateKbps": 2000,
"customized": false
},
"outputs": {
"result": "https://stttssvcprodusw2.blob.core.windows.net/batchsynthesis-output/xxxxx/xxxxx/0001.mp4?SAS_Token",
"summary": "https://stttssvcprodusw2.blob.core.windows.net/batchsynthesis-output/xxxxx/xxxxx/summary.json?SAS_Token"
}
}
],
"nextLink": "https://YourSpeechRegion.api.cognitive.microsoft.com/avatar/batchsyntheses/?api-version=2024-08-01&skip=2&maxpagesize=2"
}
From outputs.result, you can download a video file containing the avatar video. From outputs.summary, you can access the summary and debug details. For more information, see batch synthesis results.
The value property in the JSON response lists your synthesis requests. The list is paginated, with a maximum page size of 100. The nextLink property is provided as needed to get the next page of the paginated list.
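Following nextLink until it is absent can be sketched as a Python generator. The name iter_jobs and the fetch_page callback are assumptions for illustration; fetch_page stands for code that sends the GET request for a URL and returns the parsed JSON page.

```python
def iter_jobs(fetch_page, first_url):
    """Yield every job across pages, following nextLink while it is present.

    fetch_page is a caller-supplied function that takes a URL and returns
    the parsed response dict with "value" and, optionally, "nextLink".
    """
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("value", [])
        url = page.get("nextLink")
```

Because it is a generator, pages are only requested as the caller consumes jobs, which avoids fetching the whole history when only the first few entries are needed.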
Get batch synthesis results file
Once you get a batch synthesis job with a status of "Succeeded", you can download the video output results. Use the URL from the outputs.result property of the get batch synthesis response.
To get the batch synthesis results file, make an HTTP GET request using the URI as shown in the following example. Replace YourOutputsResultUrl with the URL from the outputs.result property of the get batch synthesis response. Replace YourSpeechKey with your Speech resource key.
curl -v -X GET "YourOutputsResultUrl" -H "Ocp-Apim-Subscription-Key: YourSpeechKey" > output.mp4
To get the batch synthesis summary file, make an HTTP GET request using the URI as shown in the following example. Replace YourOutputsSummaryUrl with the URL from the outputs.summary property of the get batch synthesis response. Replace YourSpeechKey with your Speech resource key.
curl -v -X GET "YourOutputsSummaryUrl" -H "Ocp-Apim-Subscription-Key: YourSpeechKey" > summary.json
The summary file contains the synthesis results for each text input. Here's an example summary.json file:
{
"jobID": "5a25b929-1358-4e81-a036-33000e788c46",
"status": "Succeeded",
"results": [
{
"texts": [
"<speak version='1.0' xml:lang='en-US'><voice name='en-US-AvaMultilingualNeural'>The rainbow has seven colors.</voice></speak>"
],
"status": "Succeeded",
"videoFileName": "244a87c294b94ddeb3dbaccee8ffa7eb/5a25b929-1358-4e81-a036-33000e788c46/0001.mp4",
"TalkingAvatarCharacter": "lisa",
"TalkingAvatarStyle": "graceful-sitting"
}
]
}
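A summary file with this shape could be inspected with a few lines of Python. The sketch below assumes only the fields visible in the example above (results, status, texts, videoFileName); the function names are hypothetical, and a real summary may contain further fields that are not shown here.

```python
def succeeded_videos(summary):
    """Return the videoFileName of every input that was synthesized."""
    return [
        entry["videoFileName"]
        for entry in summary.get("results", [])
        if entry.get("status") == "Succeeded"
    ]


def failed_texts(summary):
    """Return the input texts of every result whose status is not Succeeded."""
    texts = []
    for entry in summary.get("results", []):
        if entry.get("status") != "Succeeded":
            texts.extend(entry.get("texts", []))
    return texts
```

The summary argument is the downloaded summary.json after parsing, for example with json.load.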
Delete batch synthesis
After you retrieve the video output results and no longer need the batch synthesis job history, you can delete it. The Speech service retains each synthesis history for up to 31 days or the duration specified by the request's timeToLiveInHours property, whichever comes sooner. For synthesis jobs with a status of "Succeeded" or "Failed", the date and time of automatic deletion is calculated as the sum of the lastActionDateTime and timeToLiveInHours properties.
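The deletion time calculation can be worked through in Python as a rough sketch. The function names are hypothetical. The sketch assumes that timestamps are in UTC even when the trailing Z is missing, as in some of the sample responses, and it truncates the seven fractional digits to the six that Python's datetime supports.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse a service timestamp such as '2024-03-06T07:34:12.5698769'.

    Assumption: the value is UTC whether or not it ends with 'Z'.
    """
    text = value.rstrip("Z")
    main, _, fraction = text.partition(".")
    micros = (fraction + "000000")[:6]  # datetime keeps at most 6 fractional digits
    parsed = datetime.strptime(f"{main}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def deletion_time(job):
    """Return lastActionDateTime plus timeToLiveInHours for a finished job."""
    last_action = parse_timestamp(job["lastActionDateTime"])
    ttl = timedelta(hours=job["properties"]["timeToLiveInHours"])
    return last_action + ttl
```

With the sample job shown earlier, a timeToLiveInHours of 744 corresponds to 31 days after the last action.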
To delete a batch synthesis job, make an HTTP DELETE request using the following URI format. Replace YourSynthesisId with your batch synthesis ID, YourSpeechKey with your Speech resource key, and YourSpeechRegion with your Speech resource region.
curl -v -X DELETE "https://YourSpeechRegion.api.cognitive.microsoft.com/avatar/batchsyntheses/YourSynthesisId?api-version=2024-08-01" -H "Ocp-Apim-Subscription-Key: YourSpeechKey"
The response headers include HTTP/1.1 204 No Content if the delete request was successful.