Make a call and transcribe in real time

Rakesh Indla 5 Reputation points


My use case is: make a call from a browser to a phone number, and once the call is connected and the conversation starts, it should detect and transcribe both sides of the conversation in real time. Is that possible? If yes, please share code/references.

Azure AI Speech — An Azure service that integrates speech processing into apps and services.
Azure Communication Services — An Azure communication platform for deploying applications across devices and platforms.
Azure AI services — A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. YutongTie-MSFT 46,996 Reputation points

    @Rakesh Indla

Thanks for reaching out to us. From the Azure Speech service side, you can do this with the real-time diarization (preview) feature.

You can run an application for speech-to-text transcription with real-time diarization. Here, diarization means distinguishing between the different speakers participating in the conversation. The Speech service tells you which speaker was speaking during each part of the transcribed speech.

Please take a look at the documentation for the feature.

    To make a call from a web browser to a phone number and transcribe the conversation in real-time, you can use Azure Communication Services and Azure Cognitive Services. Here's a high-level outline of the steps and technologies involved:

    Azure Communication Services:

    Use Azure Communication Services to initiate the call from your web application to a phone number. Azure Communication Services provides capabilities for voice calling.

    Voice Calling from Browser:

    Implement the voice calling functionality in your web application using the Azure Communication Services Web SDK. You'll need to configure the SDK to make outbound calls.

    Azure Cognitive Services - Speech Service:

    Set up Azure Cognitive Services, specifically the Speech Service, for real-time transcription.

    Use the Speech SDK to transcribe the audio from the ongoing call. The Speech Service can convert spoken language into written text.

    Real-Time Transcription:

    As the call progresses, capture the audio and send it to the Speech Service for real-time transcription.

Under the hood, the Speech SDK maintains a WebSocket connection to the Speech service, so you get low-latency streaming transcription without managing the protocol yourself.

    Display Transcription:

    Display the transcribed text in your web application in real-time so that users can see the conversation as it's transcribed.

    Monitoring and Error Handling:

    Implement monitoring and error handling to ensure that the transcription process is reliable and to address any issues that may arise during the call.

    Here's a simplified example of using the Azure Communication Services Web SDK and Azure Cognitive Services Speech SDK in JavaScript to initiate a call and perform real-time transcription:

// --- Server side: mint a user and a VoIP access token with the identity SDK.
// Never expose your ACS connection string in the browser.
import { CommunicationIdentityClient } from "@azure/communication-identity";
const identityClient = new CommunicationIdentityClient("<your-acs-connection-string>");
const { token } = await identityClient.createUserAndToken(["voip"]);

// --- Browser side: place the PSTN call with the Calling SDK.
import { CallClient } from "@azure/communication-calling";
import { AzureCommunicationTokenCredential } from "@azure/communication-common";
const callClient = new CallClient();
const callAgent = await callClient.createCallAgent(new AzureCommunicationTokenCredential(token));
const call = callAgent.startCall(
  [{ phoneNumber: "<phone-number-to-call>" }],
  { alternateCallerId: { phoneNumber: "<your-acs-phone-number>" } } // caller ID, required for PSTN
);

// --- Real-time transcription with the Speech SDK (here: the local microphone leg).
import { SpeechConfig, AudioConfig, SpeechRecognizer } from "microsoft-cognitiveservices-speech-sdk";
const speechConfig = SpeechConfig.fromSubscription("<your-speech-service-subscription-key>", "<your-speech-service-region>");
const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new SpeechRecognizer(speechConfig, audioConfig);

// Interim hypotheses while the speaker is still talking.
recognizer.recognizing = (s, e) => {
  console.log(`Transcribing: ${e.result.text}`);
};
// Finalized results.
recognizer.recognized = (s, e) => {
  console.log(`Transcribed: ${e.result.text}`);
};

// Start continuous speech recognition.
recognizer.startContinuousRecognitionAsync();

// Handle call events and user interface in your application, e.g.:
call.on("stateChanged", () => console.log(`Call state: ${call.state}`));

Please note that this is a simplified example; you'll need to integrate it into your web application and add appropriate error handling, a user interface, and call-management features. Also, ensure that you have the necessary Azure subscriptions and configurations in place for both Azure Communication Services and Azure Cognitive Services.

    I hope this helps.



Please kindly accept the answer if you find it helpful, to support the community. Thanks a lot.
