differentiation of speakers - speech to text

Question

Hi!
I want to know if it is possible to differentiate speakers when converting an audio file to a text file. I don't want to define profiles or recognize who is speaking; I just want to know when one person is speaking and when another person starts speaking. If it is possible, please tell me how. I've been reading the materials and taking the Azure courses, but none of them seems to offer what I need.
It would be something like:

person 1: ...
person 2: ...

Answer

@larissa kelmer Is your scenario similar to a call center conversation? There is an API called batch transcription that offers something similar: it can separate speakers (diarization) without voice profiles. It is not available under the free tier of a Speech resource, though. You would need to move to the S0 tier, set up your audio recordings, and call the API. There are some samples available for configuring your Speech resource and the storage locations, which can simplify your setup so you can test the service and check whether it gives the result you need.
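
To make the batch transcription suggestion a bit more concrete, here is a minimal sketch in Python. It only builds the request body with diarization switched on and turns a result file into the "person 1: ... / person 2: ..." layout from the question; it does not call the service. The field names (`contentUrls`, `diarizationEnabled`, `recognizedPhrases`, `speaker`, `nBest`, `display`, `offsetInTicks`) are written from memory of the v3.x batch transcription REST API, so please treat them as assumptions and check them against the current API reference. The helper names `build_transcription_request` and `format_by_speaker` are made up for this example.

```python
import json


def build_transcription_request(content_urls, locale="en-US",
                                display_name="speaker separation test"):
    """Build the JSON body for creating a batch transcription.

    Assumption: the service accepts a "diarizationEnabled" property,
    as in the v3.x batch transcription REST API.
    """
    return {
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": display_name,
        "properties": {"diarizationEnabled": True},
    }


def format_by_speaker(result):
    """Turn a transcription result into 'person N: text' lines.

    Consecutive phrases from the same speaker are merged into one line.
    Assumption: each recognized phrase carries a "speaker" number and an
    "nBest" list whose first entry has a "display" text.
    """
    phrases = sorted(result.get("recognizedPhrases", []),
                     key=lambda p: p.get("offsetInTicks", 0))
    lines = []
    last_speaker = None
    for phrase in phrases:
        n_best = phrase.get("nBest") or []
        if not n_best:
            continue
        text = n_best[0].get("display", "")
        if not text:
            continue
        speaker = phrase.get("speaker")
        if lines and speaker == last_speaker:
            # Same person is still talking: extend the current line.
            lines[-1] += " " + text
        else:
            lines.append(f"person {speaker}: {text}")
            last_speaker = speaker
    return lines


if __name__ == "__main__":
    # The body would be POSTed to the transcriptions endpoint of your
    # S0 Speech resource, with the resource key in the request header.
    body = build_transcription_request(["https://example.invalid/call.wav"])
    print(json.dumps(body, indent=2))

    # Hand-written sample in the assumed result shape, not real output.
    sample = {"recognizedPhrases": [
        {"offsetInTicks": 0, "speaker": 1,
         "nBest": [{"display": "Hello, how can I help?"}]},
        {"offsetInTicks": 30000000, "speaker": 2,
         "nBest": [{"display": "I have a billing question."}]},
    ]}
    print("\n".join(format_by_speaker(sample)))
```

The speaker numbers only say "a different voice"; they are not tied to any identity, which matches the requirement of not defining profiles.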

If you plan to recognize the speakers in the conversation, we would recommend registering voice profiles and using speaker recognition instead.

Answer

@romungi-MSFT Is there any update on this feature? We have a use case in which we need to differentiate the speaker from any babble in the background, so that the background babble doesn't pollute the speaker's input. This would be a VERY useful feature, as it would spare the speaker from having to correct their voice input.

The recognition result could include some indicator that shows whether the recognized text belongs to the speaker or not.

I think this is exactly what @larissa kelmer meant. Batch transcription can be used to transcribe lots of audio into text. But the key point here is to know whether the voice input comes from the speaker or belongs to background noise/babble. A human would do this by subconsciously comparing only the volume and/or pitch of what they hear. So the recognition result could include some probability value saying "this recognized text belongs to the speaker with a probability of X%".
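
Just to illustrate the volume comparison I have in mind, here is a rough Python sketch. It is not something the Speech service returns today, and the score is not a real probability; it is only the loudness of each segment relative to the loudest one. The function names and the input layout (a list of `(text, samples)` pairs with samples as floats) are assumptions made up for this example.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def foreground_scores(segments):
    """Score each (text, samples) segment by loudness, from 0.0 to 1.0.

    1.0 means "as loud as the loudest segment"; quiet background babble
    ends up with a low score. This is a crude heuristic, not a probability.
    """
    levels = [rms(samples) for _, samples in segments]
    loudest = max(levels, default=0.0)
    if loudest == 0.0:
        return [(text, 0.0) for text, _ in segments]
    return [(text, level / loudest)
            for (text, _), level in zip(segments, levels)]


if __name__ == "__main__":
    segments = [
        ("turn left at the next junction", [0.5, -0.5, 0.5, -0.5]),
        ("background babble", [0.1, -0.1, 0.1, -0.1]),
    ]
    for text, score in foreground_scores(segments):
        print(f"{score:.2f}  {text}")
```

An application could then drop or flag text below some threshold, but a proper indicator in the recognition result would be far more reliable than this.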
