Who can provide assistance? The time required for speech-to-text processing of the same file varies greatly, by up to around 40%. Is this normal for Azure?


连博10335043 65

See the attachments.

The first test took approximately 8.5 seconds (first_test_log.txt).

The second test, however, took only approximately 5 seconds (Second_test_log.txt).
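To put a number on the variation between runs, the timings can be summarized with a small helper. This is only a sketch: the function name `summarize_latencies` is made up for illustration, and the two sample values are the timings quoted above.

```python
import statistics


def summarize_latencies(samples_s):
    """Return min, max, median and relative spread for latencies in seconds."""
    lo, hi = min(samples_s), max(samples_s)
    return {
        "min_s": lo,
        "max_s": hi,
        "median_s": statistics.median(samples_s),
        # Fraction by which the fastest run is shorter than the slowest run.
        "relative_spread": (hi - lo) / hi,
    }


# The two timings reported in the question, in seconds.
summary = summarize_latencies([8.5, 5.0])
print(f"spread = {summary['relative_spread']:.0%}")
```

Two runs are a very small sample; repeating the request a few dozen times and looking at the median gives a steadier picture than comparing a single fast run with a single slow one.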

navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-01T04:47:06.39+00:00
@连博10335043 Welcome to the Microsoft Q&A forum, and thank you for posting your query here!

First, I see that you are checking the latency using network traces. This is not the right way to check latency.

Please rely on the latency metrics instead, and apply splitting by API Name and Operation Name as shown below:

Suggestion:

If your concern is latency, you should use the fast transcription API.

Fast transcription: Fastest synchronous output for situations with predictable latency.

The fast transcription API transcribes audio files synchronously and faster than real time. Use it when you need the transcript of an audio recording as quickly as possible with predictable latency, for example:

Quick audio or video transcription and subtitles: Quickly get a transcription of an entire video or audio file in one go.

Video translation: Immediately get new subtitles for a video if you have audio in different languages.

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
连博10335043 65 Reputation points

2024-08-01T07:25:56.8733333+00:00

Thank you.

I checked it and found that the latency metric is within 600 ms, but I always receive the results much later than that.

Can you tell me what the latency in the picture above means?
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-01T07:32:53.7966667+00:00
@连博10335043 Thanks for getting back. 600 ms means that the response time is fast.

In Azure Speech Service, the latency metric for speech-to-text operations is primarily measured as user-perceived latency (UPL). This is the time between when a word is spoken and when it appears in the recognition results.

Several factors can affect this latency, including:

Device specifications: CPU speed, architecture, and memory.

CPU load: Other applications running on the device can impact performance.

Memory load: The speech-to-text model consumes between 200 and 300 MB of memory at runtime.

Also make sure that your network connection is not slow.

More info here.
连博10335043 65 Reputation points

2024-08-01T08:01:44.5633333+00:00

Perhaps I didn't express myself clearly.

The above operation was called using the REST API format, and I uploaded a voice file. So there is no concept of time taken for words to be spoken and recognized.

I am more concerned about the latency between the final upload of the file and the return of the result.

Thank you.
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-01T08:37:09.62+00:00

@连博10335043 Thanks for clarifying. Please test using Speech Studio and check whether you encounter the same issue.

Also, please let me know how you are calculating the latency. As I mentioned, using network monitor traces is not a valid test.
连博10335043 65 Reputation points

2024-08-01T08:58:25.7833333+00:00

......................
连博10335043 65 Reputation points

2024-08-01T09:02:17.3833333+00:00

In fact, the delay in Speech Studio is not very noticeable.

I measure latency the same way as the following example code: https://github.com/Azure-Samples/Cognitive-Speech-TTS/blob/master/PronunciationAssessment/Python/sample.py

That is, I calculate the time difference between uploading the last packet of data and receiving the final result.
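The measurement described above can be sketched without any network access. In this illustration the class `TimedChunkReader`, the helper `latency_after_upload`, and the stand-in `fake_send` are all hypothetical names; in a real test, `send` would be the function that performs the HTTP POST and consumes the chunks.

```python
import io
import time


class TimedChunkReader:
    """Iterate over a binary stream in chunks and note when the data ran out."""

    def __init__(self, stream, chunk_size=1024):
        self.stream = stream
        self.chunk_size = chunk_size
        self.upload_finished_at = None

    def __iter__(self):
        while True:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                # The consumer has asked for more data and there is none left.
                self.upload_finished_at = time.perf_counter()
                return
            yield chunk


def latency_after_upload(reader, send):
    """Call send(reader); return (result, seconds from last chunk to send returning)."""
    result = send(reader)
    finished = time.perf_counter()
    return result, finished - reader.upload_finished_at


def fake_send(chunks):
    """Stand-in for the real HTTP call: drains the iterator and counts bytes."""
    return sum(len(chunk) for chunk in chunks)


reader = TimedChunkReader(io.BytesIO(b"\x00" * 2500), chunk_size=1024)
total_bytes, latency = latency_after_upload(reader, fake_send)
print(f"sent {total_bytes} bytes, latency {latency * 1000:.1f} ms")
```

Note that the timestamp marks when the client finished handing over data, not when the server received it, so the measured value still includes network transfer time for whatever was buffered.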
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-01T09:04:45.6266667+00:00

@连博10335043 Please confirm whether you are using the above Python sample or Postman.

Which region is your Azure Speech Service deployed to?

Please share the sample audio file you are using. Is it a .WAV file? You may have to rename it to .txt to share it here.

Please share the resource URI of your Speech service in the format below:

/subscriptions/XXXXXX/resourceGroups/XXXXX/providers/Microsoft.CognitiveServices/accounts/XXXXXX

Share the above details in a private message.
连博10335043 65 Reputation points

2024-08-01T09:04:49.5366667+00:00

In fact, the latency in Speech Studio is not very noticeable.

I measure latency the same way as the following example code: https://github.com/Azure-Samples/Cognitive-Speech-TTS/blob/master/PronunciationAssessment/Python/sample.py

That is, I calculate the time difference between uploading the last packet of data and receiving the final result.
连博10335043 65 Reputation points

2024-08-01T09:15:07.4133333+00:00
I am in China. The resource is deployed in the eastasia region.

My test voice file is voice2024-07-30-09-30-30.txt

My resource URL is

https://

The format is different from yours. When I make a POST request, I add my subscriptionKey to the headers.
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-05T02:24:03.9033333+00:00

@连博10335043 When you run speech to text, could you please share the transcription ID? It is a GUID that looks like this: fee14f44-1be7-4ad8-aef1-693225a3f3f1

Please share it with me. With it, I can look up the backend logs easily.
连博10335043 65 Reputation points

2024-08-06T03:11:37.1033333+00:00

Sorry, I only received the email notification for the private message, but I can't find the private message on this page. I also ran into this occasionally a few days ago.
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-08T13:29:35.5866667+00:00

@连博10335043 Apologies for the late reply. I used the Speech to Text (for short audio) REST API sample code from this article.

I used chunked transfer, which can help reduce recognition latency. It allows the Speech service to begin processing the audio file while it is still being transmitted.
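For readers unfamiliar with chunked transfer: the request body is sent as a series of length-prefixed chunks rather than one block with a known total length, which is what lets the receiver start working before the upload ends. HTTP client libraries normally do this framing themselves, so the sketch below only illustrates the HTTP/1.1 wire format; the function name `encode_chunked` is made up for this example.

```python
def encode_chunked(chunks):
    """Frame byte chunks using HTTP/1.1 chunked transfer coding."""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would signal the end of the body
            # Each chunk is: size in hexadecimal, CRLF, the data, CRLF.
            yield f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    # A zero-size chunk followed by an empty line terminates the body.
    yield b"0\r\n\r\n"


body = b"".join(encode_chunked([b"abc", b"de"]))
print(body)
```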

I used the audio file that you shared, and I also calculated the time difference between uploading the last packet of data and receiving the final result. The latency was only 1.7 seconds. See the screenshot below:

My suggestion is to use chunked transfer and test with the same sample code from the article above.

Hope this helps.
连博10335043 65 Reputation points

2024-08-09T01:11:32.27+00:00

Thank you.

Can you try again with this audio file: voice2024-07-08-16-25-57.txt? The latency may be much longer than 1.7 seconds. If it is, please share the reason.

navba-MSFT 27,550 Microsoft Employee Moderator

@连博10335043 Thanks for getting back. I tried with this new audio file.

It took just 4.1 seconds. See the screenshot below:

User's image

Please try the code below at your end and share a screenshot here.

using System;
using System.Diagnostics;
using System.IO;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        string regionIdentifier = "westeurope"; // e.g., "eastus"
        string requestUri = $"https://{regionIdentifier}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US";
        string host = $"{regionIdentifier}.stt.speech.microsoft.com";
        string audioFile = "voice2024-07-08-16-25-57.wav";
        string subscriptionKey = "62c2eXXXXXXXXXXX74c65c6c";

        var request = (HttpWebRequest)HttpWebRequest.Create(requestUri);
        request.SendChunked = true; // stream the audio with chunked transfer
        request.Accept = "application/json;text/xml";
        request.Method = "POST";
        request.ProtocolVersion = HttpVersion.Version11;
        request.Host = host;
        request.ContentType = "audio/wav; codecs=audio/pcm; samplerate=16000";
        request.Headers["Ocp-Apim-Subscription-Key"] = subscriptionKey;
        request.AllowWriteStreamBuffering = false;

        Stopwatch stopwatch = new Stopwatch();

        // Upload the audio file in chunks of up to 1024 bytes.
        using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
        {
            byte[] buffer = null;
            int bytesRead = 0;
            using (var requestStream = request.GetRequestStream())
            {
                buffer = new byte[checked((uint)Math.Min(1024, (int)fs.Length))];
                while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
                {
                    requestStream.Write(buffer, 0, bytesRead);
                }

                requestStream.Flush();
            }
        }

        stopwatch.Start(); // Start the timer after the last packet is uploaded

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                stopwatch.Stop();

                using (var responseStream = new StreamReader(response.GetResponseStream()))
                {
                    string jsonResponse = responseStream.ReadToEnd();
                    Console.WriteLine(jsonResponse);
                }
            }
        }
        catch (WebException ex)
        {
            stopwatch.Stop(); // Stop the timer in case of an error
            if (ex.Response == null)
            {
                // No HTTP response at all (for example, a connection failure).
                Console.WriteLine($"Error: {ex.Message}");
            }
            else
            {
                using (var errorResponse = (HttpWebResponse)ex.Response)
                using (var responseStream = new StreamReader(errorResponse.GetResponseStream()))
                {
                    string errorText = responseStream.ReadToEnd();
                    Console.WriteLine($"Error: {errorText}");
                }
            }
        }
        Console.WriteLine($"Time taken: {stopwatch.ElapsedMilliseconds} ms");

        Console.ReadLine();
    }
}

Awaiting your reply.

连博10335043 65

Sorry, I don't have a C# environment. I ran my Python code with your subscriptionKey and region. I believe the following Python code is essentially no different from the C# code. The code is as follows:


import json
import time

import requests

subscriptionKey = "62c2eXXXXXXXXXXX74c65c6c"
region = "westeurope"
uploadFinishTime = None  # set by get_chunk() once the audio source is exhausted

# A common wave header with zero audio length.
# Stream data doesn't contain a header, but the API requires one to fetch format
# information, so this header is posted as the first chunk of each query.
WaveHeader16K16BitMono = bytes([ 82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69, 102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0, 128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0, 100, 97, 116, 97, 0, 0, 0, 0 ])


# A generator which reads audio data chunk by chunk.
# audio_source can be any audio input stream which provides a read() method,
# e.g. audio file, microphone, memory stream, etc.
def get_chunk(audio_source, chunk_size=1024):
    global uploadFinishTime
    yield WaveHeader16K16BitMono
    while True:
        # time.sleep(chunk_size / 32000)  # to simulate human speaking rate
        chunk = audio_source.read(chunk_size)
        if not chunk:
            # Record the moment the last chunk has been handed to the HTTP client.
            uploadFinishTime = time.time()
            break
        yield chunk


# Build the request.
url = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=zh-CN&profanity=masked" % region
headers = { 'Accept': 'application/json;text/xml',
            'Connection': 'Keep-Alive',
            'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
            'Ocp-Apim-Subscription-Key': subscriptionKey,
            'Transfer-Encoding': 'chunked',
            'Expect': '100-continue' }

# Send the request with chunked data; the file is closed when the block exits.
with open("./source/voice2024-07-08-16-25-57.wav", 'rb') as audioFile:
    response = requests.post(url=url, data=get_chunk(audioFile), headers=headers)
    getResponseTime = time.time()

resultJson = json.loads(response.text)
print(json.dumps(resultJson, indent=4))
print(resultJson["DisplayText"])
latency = getResponseTime - uploadFinishTime
print("Latency = %sms" % int(latency * 1000))

The result is as follows:

result

My question is: why does this take 4.5 s when the previous file took 1.7 s? Is the difference related to file size, and if so, how exactly?

Thanks.
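The thread does not state how long the two recordings are. One way to compare results across files of different sizes is to divide the latency by the audio duration. The sketch below reads the duration with Python's standard `wave` module; the helper names are hypothetical, and the demo uses two seconds of generated silence rather than either file from this thread.

```python
import io
import wave


def wav_duration_seconds(source):
    """Return the duration of a WAV file (path or file-like object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(latency_s, duration_s):
    """Latency as a fraction of the audio length (lower is better)."""
    return latency_s / duration_s


# Demo input: 2 seconds of 16 kHz, 16-bit mono silence held in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 32000)
buffer.seek(0)

duration = wav_duration_seconds(buffer)
print(f"duration = {duration} s")
```

If the ratio stays roughly the same for both files, the longer wait is in proportion to the longer recording; if it differs widely, something other than audio length is involved.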

navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-12T07:43:52.7333333+00:00
@连博10335043 Could you try the fast transcription API?

It transcribes audio files synchronously and much faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency.

If you want to try it out in AI Studio:

AI Studio -> AI Services -> Speech -> Fast Transcription

If you want to try it with a curl command (REST API), follow this:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/fast-transcription-create

and

https://learn.microsoft.com/en-us/rest/api/speechtotext/transcriptions/transcribe?view=rest-speechtotext-2024-05-15-preview&tabs=HTTP

连博10335043 65 Reputation points

2024-08-15T07:26:28.7366667+00:00

Hello, I'm sorry for replying so late. I want to implement the above function in Python instead of curl. The code is as follows:

import requests
import json
from requests_toolbelt.multipart.encoder import MultipartEncoder

url = 'https://eastus.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe'

data = {'api-version': '2024-05-15-preview'}
fields = {
    'audio': open("./source/voice2024-07-08-16-25-57.wav", 'rb'),
    'definition': "{\"locales\":[\"zh-CN\"]}",
}

headers = {
    # 'Content-Type': 'multipart/form-data',
    'Accept': 'application/json',
    'Ocp-Apim-Subscription-Key': '***********************',
}

r = requests.post(url=url, params=data, headers=headers, files=fields)

print(r.content)

but an error occurs:

'"Locales must be provided."'

Can you help me? Thank you.

连博10335043 65 Reputation points

2024-08-16T08:39:35.26+00:00

Thank you very much. This is very helpful for me.
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-16T09:28:49.7133333+00:00

@连博10335043 For the answer below, please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you. This can be beneficial to other community members.

Accepted answer


Answer 1

@连博10335043 Thanks for getting back. Here is the updated Python code, which uses the fast transcription API:

import time

import requests

url = "https://westeurope.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview"
headers = {
    "Accept": "application/json",
    "Ocp-Apim-Subscription-Key": "62cXXXXXXXXX65c6c"
}

# Open the audio file in a with-block so it is closed after the request.
with open("voiceBig.wav", "rb") as audio_file:
    files = {
        "audio": audio_file,
        # A (None, value, content_type) tuple sends "definition" as a plain
        # form field without a filename.
        "definition": (
            None,
            '{"locales":["zh-CN"], "profanityFilterMode": "Masked", "channels": [0,1]}',
            "application/json"
        )
    }

    # Start the timer
    start_time = time.perf_counter()
    response = requests.post(url, headers=headers, files=files)

    # Stop the timer
    end_time = time.perf_counter()

# Calculate the time taken
time_taken = end_time - start_time

print(f"Response: {response.json()}")
print(f"Time taken: {time_taken} seconds")

Hope this helps.
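The `definition` string in the code above is written by hand. As a small optional variation, it could be generated with `json.dumps` so that quoting mistakes are less likely. The helper names `build_definition` and `definition_part` below are hypothetical; the field names are the ones used in the answer.

```python
import json


def build_definition(locales, profanity_filter_mode="Masked", channels=None):
    """Serialize the transcription definition to a JSON string."""
    definition = {
        "locales": list(locales),
        "profanityFilterMode": profanity_filter_mode,
    }
    if channels is not None:
        definition["channels"] = list(channels)
    return json.dumps(definition)


def definition_part(definition_json):
    """Return a (filename, content, content_type) tuple; filename None means a plain form field."""
    return (None, definition_json, "application/json")


part = definition_part(build_definition(["zh-CN"], channels=[0, 1]))
print(part)
```

The resulting tuple can be used as the value of the `"definition"` key in the `files` dictionary passed to `requests.post`.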

navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-08-16T09:28:27.78+00:00

@连博10335043 For the answer above, please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you. This can be beneficial to other community members.
