Direct Line Speech - Does the Azure Speech to Text service also bill for silence in the audio stream?

Paolo Scomparin 51 Reputation points
2023-06-06T16:00:09.79+00:00

Hi, in an Azure Bot with a Direct Line Speech configuration, we want to monitor how long each user spends speaking, so that we can evaluate cost per user.

We found two methods for collecting the total number of seconds:

  1. Using the start and stop speech session events of ClientBotConnector, with a stopwatch that measures the elapsed time
  2. Using the Duration property on the recognized event of ClientBotConnector

The first method, which is less accurate, collects more seconds than the second.

The second, as described in the documentation, does not include leading or trailing silence.
(https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.speechrecognitionresult?view=azure-dotnet)

So, which of the two should we use to get a count of seconds that comes closest to the one used for billing the service?
Does the Azure Speech to Text service also bill for silence in the audio stream? In other words, are all seconds of raw audio billed?

Best regards,

Paolo

Azure AI Bot Service
An Azure service that provides an integrated environment for bot development.
Azure AI Speech
An Azure service that integrates speech processing into apps and services.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator
    2023-06-07T07:59:56.4166667+00:00

    @Paolo Scomparin Billing for an Azure Speech resource uses the AudioSecondsTranscribed metric, which counts the speech-to-text audio that is passed to the service. If the audio contains silence, that silence is counted in this metric.

    The first method you mentioned records the client-side duration of the audio, that is, how long the user uses the speaker button to record audio. I think this is ideal if you expect users to have a lot of leading and trailing silence.

    The second method is the most accurate measure of the actual speech in each result, but because it excludes silence it will miss some of the seconds that are actually billed for your speech service.
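
To compare the two methods per user, a minimal sketch along the following lines may help. It is plain Python with no Speech SDK calls: the class name `SpeechUsageTracker` and the `on_session_started`, `on_session_stopped`, and `on_recognized` handlers are hypothetical names that you would call from your own connector event callbacks. It also assumes the recognized duration arrives as 100-nanosecond ticks; if your SDK exposes it as a time span instead, convert that to seconds directly.

```python
import time
from collections import defaultdict

# Assumption: durations are reported in ticks of 100 ns each.
TICKS_PER_SECOND = 10_000_000


class SpeechUsageTracker:
    """Hypothetical helper that accumulates per-user seconds for both methods."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                  # injectable so it can be tested
        self._session_started = {}           # user_id -> session start timestamp
        self.session_seconds = defaultdict(float)     # method 1: stopwatch
        self.recognized_seconds = defaultdict(float)  # method 2: result duration

    def on_session_started(self, user_id):
        # Call from the speech session start event.
        self._session_started[user_id] = self._clock()

    def on_session_stopped(self, user_id):
        # Call from the speech session stop event; ignores unmatched stops.
        started = self._session_started.pop(user_id, None)
        if started is not None:
            self.session_seconds[user_id] += self._clock() - started

    def on_recognized(self, user_id, duration_ticks):
        # Call from the recognized event with the result's duration.
        self.recognized_seconds[user_id] += duration_ticks / TICKS_PER_SECOND

    def unrecognized_seconds(self, user_id):
        # Gap between the methods: roughly silence plus client-side overhead.
        return self.session_seconds[user_id] - self.recognized_seconds[user_id]
```

Since silence is billed, the stopwatch total is likely the closer client-side estimate, and the gap returned by `unrecognized_seconds` gives a rough idea of how much the recognized-duration total would undercount. Neither number is the billed figure itself; only the service-side metric is.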

    Ideally, if you can configure a Speech resource dedicated to your client, the above-mentioned metric will report exactly the usage that is billed for that client. I hope this helps!

    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.
