Any way to use a custom speech-to-text model with pronunciation assessment?

Amanda
2025-06-23T15:31:10.3033333+00:00

Hello,

I have trained a custom speech-to-text model on data that improves recognition of disfluencies and hesitation markers like "um" and "uh", and it works pretty well. I would also like to get pronunciation assessment results for the audio I send to this model. Is that possible in some way? I can get phoneme-level (IPA) pronunciation results using the default model, but I need them to match the transcript from my custom model; ideally I would get both at once. FYI, I only need the timestamped phonemes from the pronunciation assessment output, not any of the accuracy-related scores.

Thank you!


Accepted answer
  1. Manas Mohanty, Microsoft External Staff Moderator
    2025-06-23T23:05:39.9433333+00:00

    Hi Amanda,

    I don't think pronunciation assessment can be used with Custom Speech models.

    The example script below builds its configuration from only an endpoint and a key. I'm not sure whether a Custom Speech model endpoint can be used with the SDK here, since there is no SDK support for Custom Speech models yet (only the CLI, the REST API, and the Speech Studio portal).

    https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/scenarios/python/console/language-learning/pronunciation_assessment.py

    import azure.cognitiveservices.speech as speechsdk

    # The linked sample creates its SpeechConfig from just a key and an endpoint.
    config = speechsdk.SpeechConfig(subscription=speech_key, endpoint=speech_endpoint)

    Reference - https://docs.azure.cn/en-us/ai-services/speech-service/how-to-custom-speech-deploy-model?pivots=speech-studio
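
    If you do want to experiment, below is a minimal sketch of how the two pieces would be wired together with the Python SDK. The SpeechConfig.endpoint_id property and PronunciationAssessmentConfig are existing SDK surface, but whether the service actually returns pronunciation assessment results against a Custom Speech deployment is exactly the open question above, so treat this as untested. speech_key, speech_endpoint, custom_endpoint_id, and audio.wav are placeholders you would replace with your own values.

    import json
    import azure.cognitiveservices.speech as speechsdk

    # Placeholders: substitute your own resource key, endpoint,
    # Custom Speech deployment ID, and audio file.
    speech_key = "YOUR_SPEECH_KEY"
    speech_endpoint = "YOUR_SPEECH_ENDPOINT"
    custom_endpoint_id = "YOUR_CUSTOM_SPEECH_DEPLOYMENT_ID"

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, endpoint=speech_endpoint)
    # Point the recognizer at the deployed Custom Speech model.
    speech_config.endpoint_id = custom_endpoint_id

    audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")

    # Unscripted assessment (empty reference text), phoneme-level granularity,
    # and IPA output, since only the timestamped phonemes are needed.
    pron_config = speechsdk.PronunciationAssessmentConfig(
        reference_text="",
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredPointScale,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
    )
    pron_config.phoneme_alphabet = "IPA"

    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    pron_config.apply_to(recognizer)

    result = recognizer.recognize_once_async().get()

    # The detailed JSON result carries per-phoneme Offset/Duration values
    # (in 100-nanosecond ticks), which is the timestamp information asked about.
    detailed = json.loads(result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult))
    for word in detailed["NBest"][0]["Words"]:
        for phoneme in word.get("Phonemes", []):
            print(phoneme["Phoneme"], phoneme["Offset"], phoneme["Duration"])

    If the service ignores pronunciation assessment on the custom endpoint, the fallback is the two-pass approach you described: run the custom model for the transcript and the default model for the phoneme timestamps, then align the two using the word-level offsets.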

    Thank you

