Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
MAI-Voice is a family of neural text-to-speech models available through Azure Speech in Foundry Tools in public preview. Built on Microsoft's in-house speech foundation models, MAI-Voice models produce expressive, natural speech output with consistent voice persona quality. Similar to Azure Neural HD voices, MAI-Voice models understand input text holistically and automatically adapt tone, emotion, and speaking style. This adaptation enables more human-like and conversational speech without requiring extensive manual tuning.
Speech offers the following MAI-Voice models:
| Model | Voice Count | Key Characteristics | Best For |
|---|---|---|---|
| MAI-Voice-1 | Six prebuilt English (US) voices | Emotionally rich, highly expressive, consistent persona quality, SSML style control | Conversational AI, creative applications, long-form narration |
| MAI-Voice-2 | Multilingual prebuilt voices across 10+ languages | High-fidelity expressive synthesis, multilingual, voice prompting (gated), long-form and multi-speaker generation | Multilingual conversational AI, expressive long-form content, multi-speaker scenarios |
MAI-Voice-1
MAI-Voice-1 is optimized for expressive, conversational, and long-form scenarios in English (US).
Key features
| Key features | Description |
|---|---|
| Human-like speech generation | MAI-Voice-1 generates highly natural and emotionally rich speech. The model interprets input text holistically and automatically adjusts emotion, pace, and rhythm without manual configuration. |
| Conversational expressiveness | MAI-Voice-1 is optimized for conversational scenarios, producing engaging and context-aware speech suitable for assistants and interactive experiences. |
| Emotion and style control | Developers can influence speaking style by using SSML with mstts:express-as, enabling control over emotions such as joy, excitement, empathy, and more. |
| Consistent voice persona | MAI-Voice-1 maintains a stable and consistent voice persona across long-form content while still allowing expressive variation. |
| High fidelity audio | The model produces high-quality neural speech with natural prosody and clarity suitable for production-grade applications. |
| Real-time synthesis | MAI-Voice-1 supports real-time speech synthesis by using the Speech SDK and APIs. |
Prerequisites
- An Azure account. Create one for free.
- A Speech resource in a region that supports MAI-Voice-1 (region support).
Use MAI-Voice-1
MAI-Voice-1 uses the same Azure Speech SDKs and APIs as other Azure Neural and HD voices. Follow the Text to speech quickstart for the platform of your choice. Use a speech synthesis method that accepts SSML, and specify one of the available MAI-Voice-1 prebuilt voices in the name attribute of the <voice> element.
For example, the following Python code synthesizes speech using en-us-Jasper:MAI-Voice-1 and saves it to output.mp3. Replace <key> with your Speech resource key and eastus with the region of your Speech resource.
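import azure.cognitiveservices.speech as speechsdk

# Configure the Speech resource. Replace <key> and the region with your own values.
speech_config = speechsdk.SpeechConfig(
    subscription="<key>", region="eastus"
)
# Write the synthesized audio to a file instead of the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.mp3")
# Request MP3 output so that the file content matches the .mp3 extension.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio24Khz160KBitRateMonoMp3
)
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)
# The xmlns:mstts declaration is required because the SSML uses mstts:express-as.
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='en-us-Jasper:MAI-Voice-1'>
<mstts:express-as style="excitement">hello world.</mstts:express-as>
</voice>
</speak>
"""
# Block until synthesis finishes and the file is written.
synthesizer.speak_ssml_async(ssml).get()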
On success, an output.mp3 file containing the synthesized speech is saved to the current directory.
Reference: SpeechConfig | AudioOutputConfig | SpeechSynthesizer | speak_ssml_async
SSML examples
Basic SSML
The following SSML synthesizes a greeting in the excitement style using the en-us-Jasper:MAI-Voice-1 voice.
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='en-us-Jasper:MAI-Voice-1'>
<mstts:express-as style="excitement">hello world.</mstts:express-as>
</voice>
</speak>
Submit this SSML to the Speech REST API or SDK to receive synthesized audio.
Reference: Speech Synthesis Markup Language (SSML) | <voice> element
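When the text to synthesize comes from user input, characters such as & and < need to be escaped before they're placed in SSML. The following sketch builds a document with the same shape as the preceding example by using only the Python standard library. The build_ssml helper and its parameters are illustrative names and aren't part of the Speech SDK.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"


def build_ssml(text: str, voice: str, style: Optional[str] = None, lang: str = "en-US") -> str:
    """Return an SSML document for one voice (hypothetical helper, not an SDK API)."""
    body = escape(text)  # escapes &, <, and > in the spoken text
    if style:
        # Wrap the text in mstts:express-as only when a style is requested.
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
    return (
        f"<speak version='1.0' xmlns='{SSML_NS}' "
        f"xmlns:mstts='{MSTTS_NS}' xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


ssml = build_ssml("Tom & Jerry say hello.", "en-us-Jasper:MAI-Voice-1", style="excitement")
print(ssml)
```

The resulting string can be passed to speak_ssml_async in the same way as a hand-written SSML document.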
Personal voice (MAI-Voice-1 prompt mode)
To use personal voice (voice cloning) with MAI-Voice-1, follow these steps:
- Apply for gated access through the Azure AI Custom Neural Voice and Custom Avatar Limited Access Review.
- After approval, access the personal voice APIs through the samples at cognitive-services-speech-sdk/samples/custom-voice.
- Upload the consent audio and the prompt audio to create a personal voice.
- Synthesize text with the created voice and the MAI-Voice-1 model by using the following SSML:
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='MAI-voice-1'>
<mstts:ttsembedding speakerProfileId='your speaker profile ID here'>
I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
</mstts:ttsembedding>
</voice>
</speak>
Prebuilt voices
| Voice ID | Gender | Recommended use case |
|---|---|---|
| en-us-Jasper:MAI-Voice-1 | Male | General Conversation, Sales, Emotional styles |
| en-us-June:MAI-Voice-1 | Female | General Conversation, Customer Service, Professional, Emotional styles |
| en-us-Grant:MAI-Voice-1 | Male | General Conversation, Professional, Emotional styles |
| en-us-Iris:MAI-Voice-1 | Female | General Conversation, Narration, Emotional styles |
| en-us-Reed:MAI-Voice-1 | Male | General Conversation |
| en-us-Joy:MAI-Voice-1 | Female | General Conversation |
Usage: Available for third-party developers. Microsoft holds full licensing rights for commercial use.
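The voice IDs in the preceding table appear to follow a locale-name:model pattern. Assuming that pattern holds (it's inferred from the table, not from a published specification), the following sketch splits an ID into its parts, which can help when you log or select voices. The parse_voice_id function and the VoiceId type are hypothetical names.

```python
from typing import NamedTuple


class VoiceId(NamedTuple):
    locale: str
    name: str
    model: str


def parse_voice_id(voice_id: str) -> VoiceId:
    """Split an ID such as 'en-us-Jasper:MAI-Voice-1' into locale, name, and model."""
    short_name, sep, model = voice_id.partition(":")
    parts = short_name.split("-")
    # Assumed shape: <language>-<region>-<name>, then ':' and the model name.
    if not sep or len(parts) != 3 or not all(parts) or not model:
        raise ValueError(f"Unexpected voice ID format: {voice_id!r}")
    language, region, name = parts
    return VoiceId(locale=f"{language}-{region}", name=name, model=model)


parsed = parse_voice_id("en-us-Jasper:MAI-Voice-1")
print(parsed.locale, parsed.name, parsed.model)
```

IDs that don't match the assumed shape raise ValueError instead of being silently misparsed.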
MAI-Voice-2
MAI-Voice-2 is a high-fidelity, expressive, prompted text-to-speech model that supports multilingual synthesis across more than 10 languages. It extends the MAI-Voice family with multilingual coverage, voice prompting (gated), long-form generation, and multi-speaker generation.
Key features
| Key features | Description |
|---|---|
| High-fidelity natural synthesis | Produces highly natural voice output with expressive control. |
| Multilingual support | Supports synthesis across more than 10 languages with locale-specific prebuilt voices. |
| Expressive SSML control | Supports mstts:express-as with style and styledegree for fine-grained expressive control (for example, happy). |
| Voice prompting (gated) | Supports voice prompting with short reference clips (10–120 seconds), subject to gated access approval and consent safeguards. |
| Long-form generation | Optimized for long-form narration with stable persona quality across extended content. |
| Multi-speaker generation | Supports multi-speaker scenarios within a single synthesis flow. |
| Out-of-scope note | The model prioritizes naturalness and expressivity over low latency, so it isn't intended for latency-critical scenarios. |
Prerequisites
- An Azure account. Create one for free.
- A Speech resource in a region that supports MAI-Voice-2.
- For voice prompting, apply for limited access approval and complete consent safeguards.
Use MAI-Voice-2
You can access MAI-Voice-2 through the Azure Speech REST API. Send an SSML POST request to the cognitiveservices/v1 endpoint of your Speech resource, and place the desired MAI-Voice-2 voice in the name attribute of the <voice> element.
The following Python example sends an SSML request to the REST endpoint and writes the resulting 24 kHz MP3 to disk. You can authenticate by using an API key (Ocp-Apim-Subscription-Key) or Entra ID (Authorization: Bearer ...).
import os
from pathlib import Path

import requests
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Load settings such as MAI_VOICE_2_ENDPOINT and MAI_VOICE_2_KEY from a local env file.
load_dotenv('deployment.env', override=True)
VOICE2_ENDPOINT = os.getenv('MAI_VOICE_2_ENDPOINT', 'https://eastus.tts.speech.microsoft.com/')
VOICE2_KEY = os.getenv('MAI_VOICE_2_KEY')
# Entra ID is used unless USE_ENTRA_AUTH is set to something other than 'true' and a key is available.
USE_ENTRA_AUTH = os.getenv('USE_ENTRA_AUTH', 'true').lower() == 'true' or not VOICE2_KEY
OUT_DIR = Path('./audio')
OUT_DIR.mkdir(parents=True, exist_ok=True)

token_provider = None
if USE_ENTRA_AUTH:
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(),
        'https://cognitiveservices.azure.com/.default',
    )


def headers() -> dict:
    """Build the request headers, including the selected authentication header."""
    h = {
        'Content-Type': 'application/ssml+xml',
        'X-Microsoft-OutputFormat': 'audio-24khz-160kbitrate-mono-mp3',
        'User-Agent': 'mai-voice-2-sample',
    }
    if USE_ENTRA_AUTH:
        h['Authorization'] = f"Bearer {token_provider()}"
    else:
        h['Ocp-Apim-Subscription-Key'] = VOICE2_KEY
    return h


def synthesize_to_file(ssml: str, out_file: str) -> Path:
    """POST the SSML to the endpoint and save the returned audio under OUT_DIR."""
    url = f"{VOICE2_ENDPOINT.rstrip('/')}/cognitiveservices/v1"
    resp = requests.post(url, headers=headers(), data=ssml.encode('utf-8'), timeout=180)
    resp.raise_for_status()  # raise on 4xx/5xx instead of writing an error body to disk
    p = OUT_DIR / out_file
    p.write_bytes(resp.content)
    return p


ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-Harper:MAI-Voice-2">
Hello, this is a sample from MAI Voice 2.
</voice>
</speak>"""
synthesize_to_file(ssml, 'mai_voice2_en.mp3')
On success, the mai_voice2_en.mp3 file containing the synthesized speech is saved to the ./audio directory.
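If you'd rather not depend on the requests package, the same POST can be assembled with the Python standard library. The following sketch only builds the request object for key-based authentication so that you can inspect it before sending it. The build_tts_request function is a hypothetical helper, and the region-based host name is an assumption taken from the default endpoint in the preceding example.

```python
import urllib.request


def build_tts_request(ssml: str, key: str, region: str = "eastus") -> urllib.request.Request:
    """Build (but don't send) the SSML POST for key-based auth. Hypothetical helper."""
    # Assumed host pattern, based on the default endpoint used in the sample above.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-24khz-160kbitrate-mono-mp3",
        "User-Agent": "mai-voice-2-sample",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(url, data=ssml.encode("utf-8"), headers=headers, method="POST")


ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-Harper:MAI-Voice-2">Hello, this is a sample from MAI Voice 2.</voice>'
    "</speak>"
)
request = build_tts_request(ssml, key="<key>")
print(request.get_method(), request.full_url)

# To send the request and save the audio, uncomment the following lines:
# with urllib.request.urlopen(request, timeout=180) as resp:
#     with open("mai_voice2_en.mp3", "wb") as f:
#         f.write(resp.read())
```

Keeping request construction separate from sending makes it possible to check the URL and headers without a network call or a valid key.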
SSML examples
Basic multilingual SSML
The following SSML synthesizes a greeting in Spanish (Mexico) by using es-MX-Valeria:MAI-Voice-2.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="es-MX">
<voice name="es-MX-Valeria:MAI-Voice-2">
Hola, esta es una muestra de MAI Voice 2.
</voice>
</speak>
Expressive control with mstts:express-as
MAI-Voice-2 supports expressive styles by using style and styledegree attributes for fine-grained control:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-Harper:MAI-Voice-2">
<mstts:express-as style="happy" styledegree="1.2">
Welcome to Microsoft Build. MAI Voice 2 supports multilingual expressive synthesis.
</mstts:express-as>
</voice>
</speak>
Voice prompting (gated access)
Voice prompting (personal voice cloning) with MAI-Voice-2 is gated and requires Microsoft approval plus consent safeguards.
Steps to access:
- Apply for limited access approval through Azure AI Custom Neural Voice and Custom Avatar Limited Access Review.
- Upload the consent audio and a reference prompt (10–120 seconds).
- Use the Personal Voice APIs to create the voice profile.
- Synthesize by using the approved voice profile and MAI-Voice-2 model.
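Before you upload a reference prompt, you might check locally that its duration falls inside the 10–120 second window. The following sketch does this for an uncompressed WAV file by using Python's wave module. The WAV format is an assumption for illustration (the accepted upload formats aren't listed here), and wav_duration_seconds and prompt_duration_ok are hypothetical helper names.

```python
import wave

MIN_SECONDS = 10.0   # lower bound stated for reference prompts
MAX_SECONDS = 120.0  # upper bound stated for reference prompts


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def prompt_duration_ok(path: str) -> bool:
    """Return True when the clip length is within the 10-120 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, prompt_duration_ok("prompt.wav") returns False for a 5-second clip, which lets you reject the file before starting the upload.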
Prebuilt voices
MAI-Voice-2 provides locale-specific prebuilt voices across multiple languages.
| Voice Name (ShortName) | Locale | Language | Gender | Supported Styles |
|---|---|---|---|---|
| de-DE-Klaus:MAI-Voice-2 | de-DE | German (Germany) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| de-DE-Mia:MAI-Voice-2 | de-DE | German (Germany) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| en-AU-Lisa:MAI-Voice-2 | en-AU | English (Australia) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| en-US-Ethan:MAI-Voice-2 | en-US | English (United States) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| en-US-Grant:MAI-Voice-2 | en-US | English (United States) | Male | — |
| en-US-Harper:MAI-Voice-2 | en-US | English (United States) | Female | angry, confused, determined, embarrassed, excited, happy, hopeful, joyful, regretful, relieved, sad, shouting, softvoice, whispering |
| en-US-Iris:MAI-Voice-2 | en-US | English (United States) | Female | — |
| en-US-Jasper:MAI-Voice-2 | en-US | English (United States) | Male | — |
| en-US-Olivia:MAI-Voice-2 | en-US | English (United States) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| es-ES-Marta:MAI-Voice-2 | es-ES | Spanish (Spain) | Female | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| es-MX-Alejo:MAI-Voice-2 | es-MX | Spanish (Mexico) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| es-MX-Valeria:MAI-Voice-2 | es-MX | Spanish (Mexico) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| fr-FR-Marc:MAI-Voice-2 | fr-FR | French (France) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| fr-FR-Soleil:MAI-Voice-2 | fr-FR | French (France) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| hi-IN-Arjun:MAI-Voice-2 | hi-IN | Hindi (India) | Male | angry, confused, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, sad, surprised |
| hi-IN-Dhruv:MAI-Voice-2 | hi-IN | Hindi (India) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| hi-IN-Kavya:MAI-Voice-2 | hi-IN | Hindi (India) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| hi-IN-Priya:MAI-Voice-2 | hi-IN | Hindi (India) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| hu-HU-Bence:MAI-Voice-2 | hu-HU | Hungarian (Hungary) | Male | — |
| hu-HU-Levente:MAI-Voice-2 | hu-HU | Hungarian (Hungary) | Male | — |
| hu-HU-Lilla:MAI-Voice-2 | hu-HU | Hungarian (Hungary) | Female | — |
| hu-HU-Réka:MAI-Voice-2 | hu-HU | Hungarian (Hungary) | Female | — |
| it-IT-Luca:MAI-Voice-2 | it-IT | Italian (Italy) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| it-IT-Rosa:MAI-Voice-2 | it-IT | Italian (Italy) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| ko-KR-Hana:MAI-Voice-2 | ko-KR | Korean (Korea) | Female | angry, confused, determined, embarrassed, excited, happy, hopeful, joyful, regretful, relieved, sad, softvoice, surprised |
| ko-KR-Junho:MAI-Voice-2 | ko-KR | Korean (Korea) | Male | angry, confused, determined, embarrassed, excited, happy, hopeful, joyful, relieved, sad, softvoice |
| nl-NL-Fleur:MAI-Voice-2 | nl-NL | Dutch (Netherlands) | Female | — |
| nl-NL-Sander:MAI-Voice-2 | nl-NL | Dutch (Netherlands) | Male | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| pt-BR-Caio:MAI-Voice-2 | pt-BR | Portuguese (Brazil) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| pt-BR-Luana:MAI-Voice-2 | pt-BR | Portuguese (Brazil) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| pt-BR-Pedro:MAI-Voice-2 | pt-BR | Portuguese (Brazil) | Male | confused, determined, embarrassed, excited, happy, hopeful, joyful, regretful, relieved, sad, softvoice, surprised |
| pt-BR-Rafael:MAI-Voice-2 | pt-BR | Portuguese (Brazil) | Male | angry, confused, determined, embarrassed, excited, happy, hopeful, joyful, regretful, relieved, sad, softvoice, surprised |
| pt-PT-Rui:MAI-Voice-2 | pt-PT | Portuguese (Portugal) | Male | angry, confused, determined, embarrassed, excited, happy, hopeful, joyful, regretful, relieved, sad, softvoice, surprised |
| ro-RO-Andrei:MAI-Voice-2 | ro-RO | Romanian (Romania) | Male | — |
| ro-RO-Elena:MAI-Voice-2 | ro-RO | Romanian (Romania) | Female | — |
| ro-RO-Ioana:MAI-Voice-2 | ro-RO | Romanian (Romania) | Female | — |
| ro-RO-Radu:MAI-Voice-2 | ro-RO | Romanian (Romania) | Male | — |
| ru-RU-Lev:MAI-Voice-2 | ru-RU | Russian (Russia) | Male | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| ru-RU-Masha:MAI-Voice-2 | ru-RU | Russian (Russia) | Female | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| th-TH-Krit:MAI-Voice-2 | th-TH | Thai (Thailand) | Male | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| th-TH-Nattapong:MAI-Voice-2 | th-TH | Thai (Thailand) | Male | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| tr-TR-Aydin:MAI-Voice-2 | tr-TR | Turkish (Turkey) | Male | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| tr-TR-Elif:MAI-Voice-2 | tr-TR | Turkish (Turkey) | Female | adventurous, caring, empathy, curious, encouraging, excited, friendly, cheerful, nostalgic, reflective, sad, disappointed, serious |
| zh-CN-Bo:MAI-Voice-2 | zh-CN | Chinese (Mandarin, Simplified) | Male | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
| zh-CN-Lan:MAI-Voice-2 | zh-CN | Chinese (Mandarin, Simplified) | Female | angry, confused, disgusted, embarrassed, excited, fearful, happy, joyful, sad, surprised |
| zh-CN-Mei:MAI-Voice-2 | zh-CN | Chinese (Mandarin, Simplified) | Female | angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering |
Note
The voices listed in the preceding table are the currently published MAI-Voice-2 prebuilt voices. The model card indicates support across 10+ languages. Microsoft adds more locales and voices as they become generally available.
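Because supported styles differ by voice, it can help to validate a style before you place it in mstts:express-as. The following sketch copies three rows from the preceding table into a dictionary and checks a requested style against it. The SUPPORTED_STYLES mapping and the supports_style function are hypothetical names; in a real application, you'd populate the mapping from the full table.

```python
# Styles copied from the prebuilt voices table for three example voices.
SUPPORTED_STYLES = {
    "en-US-Harper:MAI-Voice-2": {
        "angry", "confused", "determined", "embarrassed", "excited", "happy",
        "hopeful", "joyful", "regretful", "relieved", "sad", "shouting",
        "softvoice", "whispering",
    },
    "zh-CN-Lan:MAI-Voice-2": {
        "angry", "confused", "disgusted", "embarrassed", "excited", "fearful",
        "happy", "joyful", "sad", "surprised",
    },
    "en-US-Grant:MAI-Voice-2": set(),  # no styles are listed for this voice
}


def supports_style(voice: str, style: str) -> bool:
    """Return True if the table lists the style for the voice; unknown voices return False."""
    return style in SUPPORTED_STYLES.get(voice, set())


print(supports_style("en-US-Harper:MAI-Voice-2", "happy"))
print(supports_style("zh-CN-Lan:MAI-Voice-2", "whispering"))
```

When the check fails, one option is to omit the mstts:express-as element and let the model choose the speaking style from the text.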
Usage: Available for third-party developers. Microsoft holds full licensing rights for commercial use.