Direct Speech Line - Does the Azure Speech2Text service also bill the silences in the audio stream?

Question

Paolo Scomparin 51

Hi, in an Azure Bot with a DirectSpeechLine configuration, we want to monitor the time each user spends speaking, for per-user cost evaluation.

We found two methods for collecting the total number of seconds:

Using the start and stop speech session events of ClientBotConnector with a stopwatch that measures the elapsed time
Using the Duration property on the recognized event of ClientBotConnector

The first method, which is less accurate, collects more seconds than the second.

The second, as described in the documentation, does not include trailing or leading silence.
(https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.speechrecognitionresult?view=azure-dotnet)

So, which of the two should we use to get a count of seconds that comes closest to the one used for billing the service?
Does the Azure Speech2Text service also bill the silences in the audio stream? In other words, are all seconds of raw audio billed?

Best regards,

Paolo

Paolo Scomparin 51 Reputation points

2023-06-07T09:03:57.3766667+00:00

@romungi-MSFT thanks for your suggestions!
We will try to isolate the results of the two methods in a test session and then verify them against the service metric in the Azure console.
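
The comparison described here could be sketched roughly as below. This is only an illustration: the function name `closest_to_metric` is hypothetical, and the numbers are made-up placeholders, not measurements from a real test session. The value passed as `metric_seconds` would be read manually from the service metric.

```python
def closest_to_metric(measured, metric_seconds):
    """Return (method_name, absolute_gap) for the total nearest the metric.

    `measured` maps a method name to the seconds it collected in the test
    session; `metric_seconds` is the value read from the service metric.
    """
    gaps = {name: abs(total - metric_seconds) for name, total in measured.items()}
    best = min(gaps, key=gaps.get)
    return best, gaps[best]


# Placeholder inputs for illustration only (not real measurements).
measured = {"stopwatch": 120.0, "recognized_duration": 95.0}
best, gap = closest_to_metric(measured, metric_seconds=118.0)
print(f"closest method: {best} (off by {gap:.1f} s)")
```

Running the test with a resource used by nobody else keeps other traffic out of the metric, so the comparison stays meaningful.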

Answer 1

@Paolo Scomparin Billing for an Azure Speech resource is based on the AudioSecondsTranscribed metric, which counts the seconds of speech-to-text audio passed to the service. If there is silence in the audio, it is counted in this metric.

The first method you mentioned records the client-side duration of the audio, that is, how long the user holds the speaker button to record. I think this is ideal if you expect users to produce a lot of leading and trailing silence.

The second method is the most accurate measure of the actual speech in each result, but it will miss some of the seconds actually billed for your speech service.
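
As a rough sketch of how the two methods could be tallied per user side by side: the event log format and the name `tally_seconds` below are hypothetical and not part of the Speech SDK, and all values are assumed to be already converted to seconds (in .NET, `Duration` is a `TimeSpan`, so `TotalSeconds` would give that value).

```python
from collections import defaultdict


def tally_seconds(events):
    """Return {user: (stopwatch_seconds, recognized_seconds)}.

    `events` is a hypothetical log of tuples:
      ("session", user, start_s, stop_s)  -> method 1, wall-clock session span
      ("recognized", user, duration_s)    -> method 2, Duration of one result
    """
    stopwatch = defaultdict(float)
    recognized = defaultdict(float)
    for event in events:
        kind, user = event[0], event[1]
        if kind == "session":
            _, _, start_s, stop_s = event
            stopwatch[user] += stop_s - start_s
        elif kind == "recognized":
            recognized[user] += event[2]
    users = set(stopwatch) | set(recognized)
    return {u: (stopwatch[u], recognized[u]) for u in users}


# Made-up events for illustration only.
events = [
    ("session", "alice", 0.0, 10.0),
    ("recognized", "alice", 3.5),
    ("recognized", "alice", 2.5),
    ("session", "bob", 5.0, 9.0),
    ("recognized", "bob", 3.0),
]
totals = tally_seconds(events)
for user, (wall, speech) in sorted(totals.items()):
    print(f"{user}: stopwatch={wall:.1f}s recognized={speech:.1f}s gap={wall - speech:.1f}s")
```

The gap between the two columns is the audio that method 2 leaves out (leading and trailing silence, per the documentation cited in the question), which the metric still counts.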

Ideally, if you can configure a speech resource dedicated to your client, the above-mentioned metric will reflect exactly what that client is billed for. I hope this helps!
