Make a call and transcribe in real time

Rakesh Indla 5 Reputation points


My use case is: make a call from a browser to a phone number, and once the call is connected and the conversation starts, it should detect and transcribe both sides of the conversation in real time. Is that possible? If yes, please share code/references.

Azure AI Speech — An Azure service that integrates speech processing into apps and services.
Azure Communication Services — An Azure communication platform for deploying applications across devices and platforms.
Azure AI services — A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. YutongTie-MSFT 46,996 Reputation points

    @Rakesh Indla

Thanks for reaching out to us. From the Azure Speech service side, you can do this with the real-time diarization (preview) feature.

You can run an application for speech-to-text transcription with real-time diarization. Here, diarization means distinguishing between the different speakers participating in the conversation. The Speech service tells you which speaker was speaking during each part of the transcribed speech.

Please take a look at the documentation for the feature.

    To make a call from a web browser to a phone number and transcribe the conversation in real-time, you can use Azure Communication Services and Azure Cognitive Services. Here's a high-level outline of the steps and technologies involved:

    Azure Communication Services:

    Use Azure Communication Services to initiate the call from your web application to a phone number. Azure Communication Services provides capabilities for voice calling.

    Voice Calling from Browser:

    Implement the voice calling functionality in your web application using the Azure Communication Services Web SDK. You'll need to configure the SDK to make outbound calls.

    Azure Cognitive Services - Speech Service:

    Set up Azure Cognitive Services, specifically the Speech Service, for real-time transcription.

    Use the Speech SDK to transcribe the audio from the ongoing call. The Speech Service can convert spoken language into written text.

    Real-Time Transcription:

    As the call progresses, capture the audio and send it to the Speech Service for real-time transcription.

Under the hood, the Speech SDK maintains a WebSocket connection to the Speech service, so you get low-latency streaming transcription without managing the protocol yourself.

    Display Transcription:

    Display the transcribed text in your web application in real-time so that users can see the conversation as it's transcribed.

    Monitoring and Error Handling:

    Implement monitoring and error handling to ensure that the transcription process is reliable and to address any issues that may arise during the call.

    Here's a simplified example of using the Azure Communication Services Web SDK and Azure Cognitive Services Speech SDK in JavaScript to initiate a call and perform real-time transcription:

// --- Server side: mint a user and a VoIP access token with the identity SDK.
// Never expose your ACS connection string in the browser.
import { CommunicationIdentityClient } from "@azure/communication-identity";
const identityClient = new CommunicationIdentityClient("<your-acs-connection-string>");
const { token } = await identityClient.createUserAndToken(["voip"]);

// --- Browser side: place the PSTN call with the Calling SDK.
import { CallClient } from "@azure/communication-calling";
import { AzureCommunicationTokenCredential } from "@azure/communication-common";
const callClient = new CallClient();
const callAgent = await callClient.createCallAgent(new AzureCommunicationTokenCredential(token));
const call = callAgent.startCall(
  [{ phoneNumber: "<phone-number-to-call>" }],
  { alternateCallerId: { phoneNumber: "<your-acs-phone-number>" } } // caller ID, required for PSTN
);

// --- Real-time transcription with the Speech SDK (here: the local microphone leg).
import { SpeechConfig, AudioConfig, SpeechRecognizer } from "microsoft-cognitiveservices-speech-sdk";
const speechConfig = SpeechConfig.fromSubscription("<your-speech-service-subscription-key>", "<your-speech-service-region>");
const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new SpeechRecognizer(speechConfig, audioConfig);

// Interim hypotheses while the speaker is still talking.
recognizer.recognizing = (s, e) => {
  console.log(`Transcribing: ${e.result.text}`);
};
// Finalized results.
recognizer.recognized = (s, e) => {
  console.log(`Transcribed: ${e.result.text}`);
};

// Start continuous speech recognition.
recognizer.startContinuousRecognitionAsync();

// Handle call events and user interface in your application, e.g.:
call.on("stateChanged", () => console.log(`Call state: ${call.state}`));

Please note that this is a simplified example; you'll need to integrate it into your web application and add appropriate error handling, a user interface, and call-management features. Also, ensure that you have the necessary Azure subscriptions and configurations in place for both Azure Communication Services and Azure Cognitive Services.

    I hope this helps.



Please kindly accept the answer if you find it helpful, to support the community. Thanks a lot.
