@Paolo Scomparin Azure bills a Speech resource for speech to text using the AudioSecondsTranscribed metric, which counts the audio that is passed to the service. If the audio contains silence, that silence is counted in this metric as well.
The first method you mentioned records the client-side duration of the audio, that is, how long the user holds the speaker button to record. I think this is the better choice if you expect users to include a lot of leading and trailing silence, because that silence is also sent to the service and billed.
The second method would be the most accurate measure of the actual speech as reported in the result, but it will miss some of the seconds that are actually billed by your Speech service.
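To make the difference concrete, here is a minimal sketch that compares the two measurements. It assumes you have already collected the client-side recording length in seconds and the (offset, duration) pairs of the recognized segments, which the Speech SDK reports in 100-nanosecond ticks. The function names and sample values are hypothetical, and the real billed amount is whatever AudioSecondsTranscribed reports, not this estimate.

```python
# Speech SDK result offsets and durations are expressed in 100-ns ticks.
TICKS_PER_SECOND = 10_000_000


def recognized_seconds(segments):
    """Method 2: total duration of recognized segments.

    `segments` is a list of (offset_ticks, duration_ticks) tuples
    (hypothetical shape, collected from your recognition results).
    """
    return sum(duration for _offset, duration in segments) / TICKS_PER_SECOND


def unaccounted_seconds(client_seconds, segments):
    """Seconds sent to the service (method 1) that method 2 does not see."""
    return max(0.0, client_seconds - recognized_seconds(segments))


# Hypothetical example: a 10 s recording containing two utterances.
segments = [(15_000_000, 30_000_000), (60_000_000, 25_000_000)]
client_seconds = 10.0

print(recognized_seconds(segments))                   # 5.5
print(unaccounted_seconds(client_seconds, segments))  # 4.5
```

In this made-up example the result-based figure would leave 4.5 seconds of silence unaccounted for, even though that audio was still passed to the service.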
Ideally, if you can configure a Speech resource dedicated to your client, the above-mentioned metric will show exactly what that client used and was billed for. I hope this helps!
If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And if you have any further queries, do let us know.