Endpoint with custom model returns different results from Speech Studio

van Boheemen, Matthew 1

I have created a custom model in Speech Studio that uses sample text and structured text. I have uploaded some test samples into Speech Studio and have tested the model against these samples.

I then deployed the custom model as an endpoint and am calling this from a C# application using the C# SDK. The results I receive are different in Speech Studio compared to what I receive from the endpoint. They are considerably worse from the endpoint.

For example, I get the results below in Speech Studio:

[Image: Speech Studio result]

When I run the exact same audio file through my application, I get the following:

XXXX 3904 Turn heading 140

I'm not sure if it is because of the number conversion, but there are other instances where the text results are quite different.

[Image: Speech Studio result]

The same audio produces the following from my application:

XXXX 184 descend to flight of all 320

The display results that I see in Speech Studio are pretty good and if I received them in my application (via the endpoint) then I would be happy.

I am currently using the code below:

var config = SpeechConfig.FromSubscription(subscriptionKey, region);
config.EndpointId = endpointId;

// Create an audio configuration
var audioConfig = AudioConfig.FromWavFileInput(file);

// Create a speech recognizer
using var recognizer = new SpeechRecognizer(config, audioConfig);

// Start recognition
var result = await recognizer.RecognizeOnceAsync();

// Check result
if (result.Reason == ResultReason.RecognizedSpeech)
{
    return result.Text;
}

return null;

What can I do to get the same results as the Display Result from Speech Studio to be returned via the endpoint?

dupammi 6,810 Reputation points Microsoft Vendor

2024-04-12T07:27:41.1566667+00:00
Hi @van Boheemen, Matthew

Thank you for using the Microsoft Q&A forum.

Based on your description, it seems like the results you are receiving from the endpoint are considerably worse than the results you are seeing in Speech Studio. One possible reason for this could be the difference in the audio format used in Speech Studio and the audio format used in your C# application.

Make sure that the audio format used in your C# application is the same as the one used in Speech Studio. As shown in this example, you can customize the audio format, including the audio file type, sample rate, and bit depth, in your C# application as well.

You can use the set_speech_synthesis_output_format() function on the SpeechConfig object to change the audio format. This function expects an enum instance of type SpeechSynthesisOutputFormat. Use the enum to select the output format. For available formats, see the list of audio formats.

Here's an example of how you can set the audio format to Riff24Khz16BitMonoPcm:

speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm)

You can also use SSML to fine-tune the pitch, pronunciation, speaking rate, volume, and other aspects in the text to speech output by submitting your requests from an XML schema. For more information, see Speech Synthesis Markup Language overview.

Please also look at a similar thread here. It could help further in the analysis of your use case.

I hope this helps. Thank you.
van Boheemen, Matthew 1 Reputation point

2024-04-12T08:53:49.0333333+00:00

My understanding is that this setting relates to text to speech. I am looking for speech recognition (i.e. speech to text). Is it still relevant for speech to text?
dupammi 6,810 Reputation points Microsoft Vendor

2024-04-12T11:34:36.3666667+00:00

Hi @van Boheemen, Matthew

Thank you for your response.

Please set the audio format explicitly in your C# code using the AudioConfig object. Additionally, you can adjust the speech recognition parameters, such as the acoustic parameters, in your code to match the ones used in Speech Studio.
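As a rough sketch of the file-side check, the C# snippet below reads the header of a WAV file and prints its format tag, channel count, sample rate, and bit depth, so you can see what your application is actually sending. The helper name ReadWavFormat is hypothetical and is not part of the Speech SDK. The code assumes a standard RIFF/WAVE file and a seekable stream, and it does not reveal which settings Speech Studio uses.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not a Speech SDK API): reads the "fmt " chunk of a
// RIFF/WAVE stream and returns its basic format fields.
// Throws EndOfStreamException if no "fmt " chunk is found.
static (int FormatTag, int Channels, int SampleRate, int BitsPerSample) ReadWavFormat(Stream stream)
{
    using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);

    // RIFF header: "RIFF", a 4-byte size, then "WAVE".
    string riff = Encoding.ASCII.GetString(reader.ReadBytes(4));
    reader.ReadUInt32(); // overall size (unused)
    string wave = Encoding.ASCII.GetString(reader.ReadBytes(4));
    if (riff != "RIFF" || wave != "WAVE")
        throw new InvalidDataException("Not a RIFF/WAVE file.");

    // Walk the chunks until the "fmt " chunk is found.
    while (true)
    {
        string chunkId = Encoding.ASCII.GetString(reader.ReadBytes(4));
        uint chunkSize = reader.ReadUInt32();
        if (chunkId == "fmt ")
        {
            int formatTag = reader.ReadUInt16();       // 1 means uncompressed PCM
            int channels = reader.ReadUInt16();
            int sampleRate = (int)reader.ReadUInt32();
            reader.ReadUInt32();                       // byte rate (unused)
            reader.ReadUInt16();                       // block align (unused)
            int bitsPerSample = reader.ReadUInt16();
            return (formatTag, channels, sampleRate, bitsPerSample);
        }

        // Skip any other chunk; chunks are padded to an even length.
        reader.BaseStream.Seek(chunkSize + (chunkSize & 1), SeekOrigin.Current);
    }
}

// Usage: pass the path of the WAV file as the first command-line argument.
if (args.Length > 0)
{
    using var file = File.OpenRead(args[0]);
    var fmt = ReadWavFormat(file);
    Console.WriteLine(
        $"format tag {fmt.FormatTag}, {fmt.Channels} channel(s), " +
        $"{fmt.SampleRate} Hz, {fmt.BitsPerSample}-bit");
}
```

As far as I know, the Speech SDK's default expectation for WAV input is 16 kHz (or 8 kHz), 16-bit, mono PCM, so a file that reports something else may be worth converting before you compare results.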

If the problem persists after you match the C# parameters with the Speech Studio ones, please raise a support case through the Azure portal.

I hope you understand. Thanks.
van Boheemen, Matthew 1 Reputation point

2024-04-12T23:01:58.47+00:00

Thanks for the response.

How do I find the audio format and speech recognition parameters that Speech Studio is using? I don't see any information about them in the documentation, or any related details in Speech Studio.
dupammi 6,810 Reputation points Microsoft Vendor

2024-04-13T02:35:15.1366667+00:00

Hi @van Boheemen, Matthew

You can refer to the training and testing datasets documentation. It describes the types of training and testing data that you can use for custom speech, including audio only, audio + human-labeled transcripts, plain text, structured text, and pronunciation data.

If the problem persists, please raise a support case through the Azure portal.

Thank you.
van Boheemen, Matthew 1 Reputation point

2024-04-13T04:19:25.8166667+00:00

Thanks.

I have trained a custom model using training data. My understanding was that, once a custom model is created, it would produce the same results whether it is used through the endpoint or through Speech Studio.

Is this the expectation? If so, then I think I will raise a support case.