differentiation of speakers - speech to text

larissa kelmer 1 Reputation point

I want to know if it is possible to differentiate speakers when converting an audio file to a text file. I don't want to define profiles or recognize who is speaking; I just want to know when one person is speaking and when another person starts speaking. If it is possible, please tell me how. I've been reading all the materials and taking the Azure courses, but none seems to offer what I need.
It would be something like:

  • person 1: ...
  • person 2: ...

2 answers

  1. romungi-MSFT 41,841 Reputation points Microsoft Employee

    @larissa kelmer Is your scenario similar to a call center conversation? There is an API called batch transcription which offers something similar, but it is not available under the free tier of a speech resource. You would need to move to the S0 tier, set up your audio recordings, and call the API. There are samples available for configuring your speech resource and the storage locations, which can simplify your setup to test the service and check whether the required result is available.
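    As a rough sketch of what this looks like in practice: batch transcription accepts a request body with diarization enabled, and the result labels each recognized phrase with a speaker number, which you can then format as "Speaker 1: ...". The endpoint details, audio URL, and the exact result shape below are simplified assumptions for illustration; check the batch transcription REST API reference for your API version.

    ```python
    # Sketch only: request body and result handling for Azure Speech batch
    # transcription with diarization. URLs and field names are illustrative.

    # Body you would POST to the batch transcription endpoint (v3.x).
    transcription_request = {
        "displayName": "diarization example",
        "locale": "en-US",
        "contentUrls": ["https://example.com/audio/conversation.wav"],  # placeholder
        "properties": {
            "diarizationEnabled": True,  # label each phrase with a speaker number
        },
    }

    def format_by_speaker(recognized_phrases):
        """Turn the 'recognizedPhrases' array of a diarized transcription
        result into 'Speaker N: ...' lines, ordered by time in the audio."""
        phrases = sorted(recognized_phrases, key=lambda p: p.get("offsetInTicks", 0))
        lines = []
        for p in phrases:
            speaker = p.get("speaker", "?")       # speaker number assigned by the service
            text = p["nBest"][0]["display"]       # top recognition hypothesis
            lines.append(f"Speaker {speaker}: {text}")
        return lines

    # Example with the (simplified) shape of a diarized result:
    sample = [
        {"speaker": 1, "offsetInTicks": 0,
         "nBest": [{"display": "Hello, how can I help you?"}]},
        {"speaker": 2, "offsetInTicks": 50_000_000,
         "nBest": [{"display": "I have a question about my bill."}]},
    ]
    for line in format_by_speaker(sample):
        print(line)
    ```

    Note the service identifies speakers by number only (Speaker 1, Speaker 2, ...); it does not know who they are, which matches the scenario in the question.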

    If you plan to recognize specific speakers in the conversation, we would recommend registering voice profiles and using speaker recognition instead.

  2. Andreas Lange 41 Reputation points

    @romungi-MSFT Is there any update on this feature? We have a use case in which we need to differentiate the speaker from any babble in the background, so the background babble doesn't pollute the speaker's input. This would be a VERY useful feature, as it would save the speaker from having to correct their voice inputs.

    The recognition result could include some indicator that shows whether the recognized text belongs to the speaker or not.

    I think this is exactly what @larissa kelmer meant. Batch transcription can be used to transcribe lots of audio into text. But the key point here is to know whether the voice input comes from the speaker or belongs to background noise/babble. A human does this subconsciously by comparing the volume and/or pitch of what they hear. So the recognition result could include a probability value saying "this recognized text belongs to the speaker with a probability of X%".
