differentiation of speakers - speech to text

larissa kelmer 1 Reputation point

I want to know if it is possible to differentiate speakers when converting an audio file to a text file. I don't want to define profiles or recognize who is speaking; I just want to know when one person is speaking and when another person starts speaking. If it is possible, please tell me how. I've been reading through the materials and taking the Azure courses, but none of them seems to offer what I need.
It would be something like:

  • person 1: ...
  • person 2: ...

2 answers

  1. romungi-MSFT 29,711 Reputation points Microsoft Employee

    @larissa kelmer Is your scenario similar to a call center conversation? There is an API called batch transcription which offers something similar, but it is not available under the free tier of a speech resource. You would need to move to the S0 tier, set up your audio recordings, and call the API. There are some samples available for configuring your speech resource and the storage locations, which can simplify your setup so you can test the service and check whether it produces the result you need.
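    A minimal sketch of what that batch transcription request could look like, assuming the Speech-to-text v3.0 REST endpoint with `diarizationEnabled` set in the request properties; the region, subscription key, and audio SAS URL below are placeholders you would replace with your own values:

    ```python
    # Sketch: build a batch transcription request that asks the service to
    # separate (diarize) speakers. Placeholder values throughout.
    import json

    def build_transcription_request(region, subscription_key, audio_sas_url):
        """Return the URL, headers, and JSON body for a batch transcription
        request with speaker diarization enabled."""
        url = (f"https://{region}.api.cognitive.microsoft.com"
               "/speechtotext/v3.0/transcriptions")
        headers = {
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        }
        body = {
            "displayName": "diarization-test",
            "locale": "en-US",
            # SAS URL of the recording in your storage account
            "contentUrls": [audio_sas_url],
            "properties": {
                # Ask the service to label each recognized phrase
                # with a speaker number instead of plain text only.
                "diarizationEnabled": True,
                "wordLevelTimestampsEnabled": True,
            },
        }
        return url, headers, body

    url, headers, body = build_transcription_request(
        "westus2",
        "<your-speech-key>",
        "https://<your-account>.blob.core.windows.net/audio/call.wav",
    )
    print(json.dumps(body, indent=2))
    # The request itself would be sent with e.g.
    #   requests.post(url, headers=headers, json=body)
    # and each phrase in the resulting transcription file then carries a
    # "speaker" field, which gives the "person 1: ... / person 2: ..." split.
    ```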

    If you do plan to identify the individual speakers in the conversation, we would recommend registering voice profiles and using speaker recognition instead.

  2. Andreas Lange 41 Reputation points

    @romungi-MSFT Is there any update on this feature? We have a use case in which we need to differentiate the speaker from any babble in the background, so the background babble doesn't pollute the speaker's input. This would be a VERY useful feature, since it would spare the speaker from having to correct their voice input.

    The recognition result could include some indicator that shows whether the recognized text belongs to the speaker or not.

    I think this is exactly what @larissa kelmer meant. Batch transcription can be used to transcribe lots of audio into text. But the key point here is to know whether the voice input comes from the speaker or belongs to background noise/babble. A human would do this subconsciously, by comparing just the volume and/or pitch of what they hear. So the recognition result could include a probability value saying "this recognized text belongs to the speaker with a probability of X%".