Gather user input with Recognize action

This guide helps you get started with recognizing DTMF and speech input provided by participants through the Azure Communication Services Call Automation SDK.

Prerequisites

For AI features

Technical specifications

The following parameters are available to customize the Recognize function:

| Parameter | Type | Default (if not specified) | Description | Required or Optional |
| --- | --- | --- | --- | --- |
| Prompt (for details on the Play action, refer to this how-to guide) | FileSource, TextSource | Not set | The message you wish to play before recognizing input. | Optional |
| InterToneTimeout | TimeSpan | 2 seconds (min: 1 second; max: 60 seconds) | Limit in seconds that Azure Communication Services waits for the caller to press another digit (inter-digit timeout). | Optional |
| InitialSegmentationSilenceTimeoutInSeconds | Integer | 0.5 second | How long the recognize action waits for input before considering it a timeout. Read more here. | Optional |
| RecognizeInputsType | Enum | dtmf | Type of input that is recognized. Options are dtmf, choices, speech, and speechordtmf. | Required |
| InitialSilenceTimeout | TimeSpan | 5 seconds (min: 0 seconds; max: 300 seconds for DTMF, 20 seconds for Choices, 20 seconds for Speech) | Adjusts how much nonspeech audio is allowed before a phrase before the recognition attempt ends in a "no match" result. Read more here. | Optional |
| MaxTonesToCollect | Integer | No default (min: 1) | Number of digits a developer expects as input from the participant. | Required |
| StopTones | IEnumeration<DtmfTone> | Not set | The digit participants can press to escape out of a batch DTMF event. | Optional |
| InterruptPrompt | Bool | True | Whether the participant can interrupt the playMessage by pressing a digit. | Optional |
| InterruptCallMediaOperation | Bool | True | If this flag is set, it interrupts the current call media operation. For example, if any audio is being played, it interrupts that operation and initiates recognize. | Optional |
| OperationContext | String | Not set | String that developers can pass mid-action, useful for storing context about the events they receive. | Optional |
| Phrases | String | Not set | List of phrases that associate to the label. If any of these are heard, it is considered a successful recognition. | Required |
| Tone | String | Not set | The tone to recognize if the user decides to press a number instead of using speech. | Optional |
| Label | String | Not set | The key value for recognition. | Required |
| Language | String | en-US | The language used for recognizing speech. | Optional |
| EndSilenceTimeout | TimeSpan | 0.5 second | The final pause of the speaker used to detect the final result that gets generated as speech. | Optional |

Note

When recognizeInputsType includes both dtmf and speech, the recognize action acts on the first input type received. For example, if the user presses a keypad digit first, the recognize action treats it as a DTMF event and continues listening for DTMF tones. If the user speaks first, the recognize action treats it as speech recognition and listens for voice input.

Create a new C# application

In the console window of your operating system, use the dotnet command to create a new web application.

dotnet new web -n MyApplication

Install the NuGet package

If you haven't already done so, obtain the NuGet package from here.
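If you prefer the command line, the package can also be added with the .NET CLI. The package name below is the published NuGet ID of the Call Automation SDK:

```shell
dotnet add package Azure.Communication.CallAutomation
```

Run the command from the directory containing your project file.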

Establish a call

By this point you should be familiar with starting calls. If you need to learn more about making a call, follow our quickstart. You can also use the code snippet provided here to understand how to answer a call.

var callAutomationClient = new CallAutomationClient("<Azure Communication Services connection string>");

var answerCallOptions = new AnswerCallOptions("<Incoming call context once call is connected>", new Uri("<https://sample-callback-uri>"))  
{  
    CallIntelligenceOptions = new CallIntelligenceOptions() { CognitiveServicesEndpoint = new Uri("<Azure Cognitive Services Endpoint>") } 
};  

var answerCallResult = await callAutomationClient.AnswerCallAsync(answerCallOptions); 

Call the recognize action

When your application answers the call, you can provide information about recognizing participant input and playing a prompt.

DTMF

var maxTonesToCollect = 3;
String textToPlay = "Welcome to Contoso, please enter 3 DTMF tones.";
var playSource = new TextSource(textToPlay, "en-US-ElizabethNeural");
var recognizeOptions = new CallMediaRecognizeDtmfOptions(targetParticipant, maxTonesToCollect)
{
    InitialSilenceTimeout = TimeSpan.FromSeconds(30),
    Prompt = playSource,
    InterToneTimeout = TimeSpan.FromSeconds(5),
    InterruptPrompt = true,
    StopTones = new DtmfTone[] { DtmfTone.Pound },
};
var recognizeResult = await callAutomationClient.GetCallConnection(callConnectionId)
  .GetCallMedia()
  .StartRecognizingAsync(recognizeOptions);

For speech-to-text flows, the Call Automation recognize action also supports the use of custom speech models. Custom speech models are useful when you're building an application that needs to listen for complex words that the default speech-to-text models may not understand. One example is an application for the telemedicine industry, where your virtual agent needs to recognize medical terms. You can learn more about creating and deploying custom speech models here.

Speech-to-Text Choices

var choices = new List<RecognitionChoice>
{
    new RecognitionChoice("Confirm", new List<string> { "Confirm", "First", "One" })
    {
        Tone = DtmfTone.One
    },
    new RecognitionChoice("Cancel", new List<string> { "Cancel", "Second", "Two" })
    {
        Tone = DtmfTone.Two
    }
};
String textToPlay = "Hello, this is a reminder for your appointment at 2 PM. Say Confirm to confirm your appointment or Cancel to cancel the appointment. Thank you!";

var playSource = new TextSource(textToPlay, "en-US-ElizabethNeural");
var recognizeOptions = new CallMediaRecognizeChoiceOptions(targetParticipant, choices)
{
    InterruptPrompt = true,
    InitialSilenceTimeout = TimeSpan.FromSeconds(30),
    Prompt = playSource,
    OperationContext = "AppointmentReminderMenu",
    // Only add the SpeechModelEndpointId if you have a custom speech model you would like to use
    SpeechModelEndpointId = "YourCustomSpeechModelEndpointId"
};
var recognizeResult = await callAutomationClient.GetCallConnection(callConnectionId)
  .GetCallMedia()
  .StartRecognizingAsync(recognizeOptions);

Speech-to-Text

String textToPlay = "Hi, how can I help you today?";
var playSource = new TextSource(textToPlay, "en-US-ElizabethNeural");
var recognizeOptions = new CallMediaRecognizeSpeechOptions(targetParticipant)
{
    Prompt = playSource,
    EndSilenceTimeout = TimeSpan.FromMilliseconds(1000),
    OperationContext = "OpenQuestionSpeech",
    // Only add the SpeechModelEndpointId if you have a custom speech model you would like to use
    SpeechModelEndpointId = "YourCustomSpeechModelEndpointId"
};
var recognizeResult = await callAutomationClient.GetCallConnection(callConnectionId)
  .GetCallMedia()
  .StartRecognizingAsync(recognizeOptions);

Speech-to-Text or DTMF

var maxTonesToCollect = 1; 
String textToPlay = "Hi, how can I help you today? You can press 0 to speak to an agent."; 
var playSource = new TextSource(textToPlay, "en-US-ElizabethNeural"); 
var recognizeOptions = new CallMediaRecognizeSpeechOrDtmfOptions(targetParticipant, maxTonesToCollect) 
{ 
    Prompt = playSource, 
    EndSilenceTimeout = TimeSpan.FromMilliseconds(1000), 
    InitialSilenceTimeout = TimeSpan.FromSeconds(30), 
    InterruptPrompt = true, 
    OperationContext = "OpenQuestionSpeechOrDtmf",
    //Only add the SpeechModelEndpointId if you have a custom speech model you would like to use
    SpeechModelEndpointId = "YourCustomSpeechModelEndpointId" 
}; 
var recognizeResult = await callAutomationClient.GetCallConnection(callConnectionId) 
    .GetCallMedia() 
    .StartRecognizingAsync(recognizeOptions); 

Note

If parameters aren't set, the defaults are applied where possible.

Receiving recognize event updates

Developers can subscribe to the RecognizeCompleted and RecognizeFailed events on the webhook callback they registered for the call. Use these events to build business logic in your application that determines next steps when one of them occurs.

Example of how you can deserialize the RecognizeCompleted event:

if (acsEvent is RecognizeCompleted recognizeCompleted) 
{ 
    switch (recognizeCompleted.RecognizeResult) 
    { 
        case DtmfResult dtmfResult: 
            //Take action for Recognition through DTMF 
            var tones = dtmfResult.Tones; 
            logger.LogInformation("Recognize completed successfully, tones={tones}", tones); 
            break; 
        case ChoiceResult choiceResult: 
            // Take action for Recognition through Choices 
            var labelDetected = choiceResult.Label; 
            var phraseDetected = choiceResult.RecognizedPhrase; 
            // If choice is detected by phrase, choiceResult.RecognizedPhrase will have the phrase detected, 
            // If choice is detected using dtmf tone, phrase will be null 
            logger.LogInformation("Recognize completed successfully, labelDetected={labelDetected}, phraseDetected={phraseDetected}", labelDetected, phraseDetected);
            break; 
        case SpeechResult speechResult: 
            // Take action for Recognition through Speech 
            var text = speechResult.Speech; 
            logger.LogInformation("Recognize completed successfully, text={text}", text); 
            break; 
        default: 
            logger.LogInformation("Recognize completed successfully, recognizeResult={recognizeResult}", recognizeCompleted.RecognizeResult); 
            break; 
    } 
} 

Example of how you can deserialize the RecognizeFailed event:

if (acsEvent is RecognizeFailed recognizeFailed) 
{ 
    if (MediaEventReasonCode.RecognizeInitialSilenceTimedOut.Equals(recognizeFailed.ReasonCode)) 
    { 
        // Take action for time out 
        logger.LogInformation("Recognition failed: initial silence time out"); 
    } 
    else if (MediaEventReasonCode.RecognizeSpeechOptionNotMatched.Equals(recognizeFailed.ReasonCode)) 
    { 
        // Take action for option not matched 
        logger.LogInformation("Recognition failed: speech option not matched"); 
    } 
    else if (MediaEventReasonCode.RecognizeIncorrectToneDetected.Equals(recognizeFailed.ReasonCode)) 
    { 
        // Take action for incorrect tone 
        logger.LogInformation("Recognition failed: incorrect tone detected"); 
    } 
    else 
    { 
        logger.LogInformation("Recognition failed, result={result}, context={context}", recognizeFailed.ResultInformation?.Message, recognizeFailed.OperationContext); 
    } 
} 

Example of how you can deserialize the RecognizeCanceled event:

if (acsEvent is RecognizeCanceled { OperationContext: "AppointmentReminderMenu" } recognizeCanceled)
{
    logger.LogInformation($"RecognizeCanceled event received for call connection id: {recognizeCanceled.CallConnectionId}");
    // Take action on the canceled recognize operation
    await callConnection.HangUpAsync(forEveryone: true);
}

Create a new Java application

In your terminal or command window, navigate to the directory where you would like to create your Java application. Run the mvn command to generate the Java project from the maven-archetype-quickstart template.

mvn archetype:generate -DgroupId=com.communication.quickstart -DartifactId=communication-quickstart -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false

The mvn command creates a directory with the same name as the artifactId argument. Under this directory, the src/main/java directory contains the project source code, the src/test/java directory contains the test source, and the pom.xml file is the project's Project Object Model, or POM.

Update your application's POM file to use Java 8 or higher.

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

Add package references

In your POM file, add the following reference for the project.

azure-communication-callautomation

<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-communication-callautomation</artifactId>
  <version>1.0.0</version>
</dependency>

Establish a call

By this point you should be familiar with starting calls. If you need to learn more about making a call, follow our quickstart. You can also use the code snippet provided here to understand how to answer a call.

CallIntelligenceOptions callIntelligenceOptions = new CallIntelligenceOptions().setCognitiveServicesEndpoint("https://sample-cognitive-service-resource.cognitiveservices.azure.com/"); 
AnswerCallOptions answerCallOptions = new AnswerCallOptions("<Incoming call context>", "<https://sample-callback-uri>").setCallIntelligenceOptions(callIntelligenceOptions); 
Response<AnswerCallResult> answerCallResult = callAutomationClient
    .answerCallWithResponse(answerCallOptions)
    .block();

Call the recognize action

When your application answers the call, you can provide information about recognizing participant input and playing a prompt.

DTMF

var maxTonesToCollect = 3;
String textToPlay = "Welcome to Contoso, please enter 3 DTMF tones.";
var playSource = new TextSource() 
    .setText(textToPlay) 
    .setVoiceName("en-US-ElizabethNeural");

var recognizeOptions = new CallMediaRecognizeDtmfOptions(targetParticipant, maxTonesToCollect) 
    .setInitialSilenceTimeout(Duration.ofSeconds(30)) 
    .setPlayPrompt(playSource) 
    .setInterToneTimeout(Duration.ofSeconds(5)) 
    .setInterruptPrompt(true) 
    .setStopTones(Arrays.asList(DtmfTone.POUND));

var recognizeResponse = callAutomationClient.getCallConnectionAsync(callConnectionId) 
    .getCallMediaAsync() 
    .startRecognizingWithResponse(recognizeOptions) 
    .block(); 

log.info("Start recognizing result: " + recognizeResponse.getStatusCode()); 

For speech-to-text flows, the Call Automation recognize action also supports the use of custom speech models. Custom speech models are useful when you're building an application that needs to listen for complex words that the default speech-to-text models may not understand. One example is an application for the telemedicine industry, where your virtual agent needs to recognize medical terms. You can learn more about creating and deploying custom speech models here.

Speech-to-Text Choices

var choices = Arrays.asList(
    new RecognitionChoice()
        .setLabel("Confirm")
        .setPhrases(Arrays.asList("Confirm", "First", "One"))
        .setTone(DtmfTone.ONE),
    new RecognitionChoice()
        .setLabel("Cancel")
        .setPhrases(Arrays.asList("Cancel", "Second", "Two"))
        .setTone(DtmfTone.TWO)
);

String textToPlay = "Hello, this is a reminder for your appointment at 2 PM. Say Confirm to confirm your appointment or Cancel to cancel the appointment. Thank you!";
var playSource = new TextSource()
  .setText(textToPlay)
  .setVoiceName("en-US-ElizabethNeural");
var recognizeOptions = new CallMediaRecognizeChoiceOptions(targetParticipant, choices)
  .setInterruptPrompt(true)
  .setInitialSilenceTimeout(Duration.ofSeconds(30))
  .setPlayPrompt(playSource)
  .setOperationContext("AppointmentReminderMenu")
  //Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
  .setSpeechRecognitionModelEndpointId("YourCustomSpeechModelEndpointID"); 
var recognizeResponse = callAutomationClient.getCallConnectionAsync(callConnectionId)
  .getCallMediaAsync()
  .startRecognizingWithResponse(recognizeOptions)
  .block();

Speech-to-Text

String textToPlay = "Hi, how can I help you today?"; 
var playSource = new TextSource() 
    .setText(textToPlay) 
    .setVoiceName("en-US-ElizabethNeural"); 
var recognizeOptions = new CallMediaRecognizeSpeechOptions(targetParticipant, Duration.ofMillis(1000)) 
    .setPlayPrompt(playSource) 
    .setOperationContext("OpenQuestionSpeech")
    //Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    .setSpeechRecognitionModelEndpointId("YourCustomSpeechModelEndpointID");  
var recognizeResponse = callAutomationClient.getCallConnectionAsync(callConnectionId) 
    .getCallMediaAsync() 
    .startRecognizingWithResponse(recognizeOptions) 
    .block(); 

Speech-to-Text or DTMF

var maxTonesToCollect = 1; 
String textToPlay = "Hi, how can I help you today? You can press 0 to speak to an agent."; 
var playSource = new TextSource() 
    .setText(textToPlay) 
    .setVoiceName("en-US-ElizabethNeural"); 
var recognizeOptions = new CallMediaRecognizeSpeechOrDtmfOptions(targetParticipant, maxTonesToCollect, Duration.ofMillis(1000)) 
    .setPlayPrompt(playSource) 
    .setInitialSilenceTimeout(Duration.ofSeconds(30)) 
    .setInterruptPrompt(true) 
    .setOperationContext("OpenQuestionSpeechOrDtmf")
    //Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    .setSpeechRecognitionModelEndpointId("YourCustomSpeechModelEndpointID");  
var recognizeResponse = callAutomationClient.getCallConnectionAsync(callConnectionId) 
    .getCallMediaAsync() 
    .startRecognizingWithResponse(recognizeOptions) 
    .block(); 

Note

If parameters aren't set, the defaults are applied where possible.

Receiving recognize event updates

Developers can subscribe to RecognizeCompleted and RecognizeFailed events on the registered webhook callback. Use this callback with business logic in your application to determine next steps when one of the events occurs.

Example of how you can deserialize the RecognizeCompleted event:

if (acsEvent instanceof RecognizeCompleted) { 
    RecognizeCompleted event = (RecognizeCompleted) acsEvent; 
    RecognizeResult recognizeResult = event.getRecognizeResult().get(); 
    if (recognizeResult instanceof DtmfResult) { 
        // Take action on collect tones 
        DtmfResult dtmfResult = (DtmfResult) recognizeResult; 
        List<DtmfTone> tones = dtmfResult.getTones(); 
        log.info("Recognition completed, tones=" + tones + ", context=" + event.getOperationContext()); 
    } else if (recognizeResult instanceof ChoiceResult) { 
        ChoiceResult collectChoiceResult = (ChoiceResult) recognizeResult; 
        String labelDetected = collectChoiceResult.getLabel(); 
        String phraseDetected = collectChoiceResult.getRecognizedPhrase(); 
        log.info("Recognition completed, labelDetected=" + labelDetected + ", phraseDetected=" + phraseDetected + ", context=" + event.getOperationContext()); 
    } else if (recognizeResult instanceof SpeechResult) { 
        SpeechResult speechResult = (SpeechResult) recognizeResult; 
        String text = speechResult.getSpeech(); 
        log.info("Recognition completed, text=" + text + ", context=" + event.getOperationContext()); 
    } else { 
        log.info("Recognition completed, result=" + recognizeResult + ", context=" + event.getOperationContext()); 
    } 
} 

Example of how you can deserialize the RecognizeFailed event:

if (acsEvent instanceof RecognizeFailed) { 
    RecognizeFailed event = (RecognizeFailed) acsEvent; 
    if (ReasonCode.Recognize.INITIAL_SILENCE_TIMEOUT.equals(event.getReasonCode())) { 
        // Take action for time out 
        log.info("Recognition failed: initial silence time out"); 
    } else if (ReasonCode.Recognize.SPEECH_OPTION_NOT_MATCHED.equals(event.getReasonCode())) { 
        // Take action for option not matched 
        log.info("Recognition failed: speech option not matched"); 
    } else if (ReasonCode.Recognize.DMTF_OPTION_MATCHED.equals(event.getReasonCode())) { 
        // Take action for incorrect tone 
        log.info("Recognition failed: incorrect tone detected"); 
    } else { 
        log.info("Recognition failed, result=" + event.getResultInformation().getMessage() + ", context=" + event.getOperationContext()); 
    } 
} 

Example of how you can deserialize the RecognizeCanceled event:

if (acsEvent instanceof RecognizeCanceled) { 
    RecognizeCanceled event = (RecognizeCanceled) acsEvent; 
    log.info("Recognition canceled, context=" + event.getOperationContext()); 
}

Create a new JavaScript application

Create a new JavaScript application in your project directory. Initialize a new Node.js project with the following command. This creates a package.json file for your project, which is used to manage your project's dependencies.

npm init -y

Install the Azure Communication Services Call Automation package

npm install @azure/communication-call-automation

Create a new JavaScript file in your project directory, for example, name it app.js. You write your JavaScript code in this file. Run your application with Node.js using the following command, which executes the JavaScript code you have written.

node app.js

Establish a call

By this point you should be familiar with starting calls. If you need to learn more about making a call, follow our quickstart. In this quickstart, we create an outbound call.
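As a reference point, a minimal sketch of placing an outbound call with the JavaScript SDK might look like the following. The phone numbers, callback URI, and Cognitive Services endpoint are placeholders you must supply, and exact option names can vary by SDK version; treat this as a sketch rather than a definitive implementation.

```typescript
import { CallAutomationClient, CallInvite, CreateCallOptions } from "@azure/communication-call-automation";

// Placeholders: substitute your own resource values before running.
const callInvite: CallInvite = {
  targetParticipant: { phoneNumber: "<target phone number>" },
  sourceCallIdNumber: { phoneNumber: "<ACS-acquired phone number>" },
};

const createCallOptions: CreateCallOptions = {
  // Needed for speech recognition scenarios; can be omitted for DTMF-only flows.
  callIntelligenceOptions: { cognitiveServicesEndpoint: "<Azure Cognitive Services endpoint>" },
};

const callAutomationClient = new CallAutomationClient("<Azure Communication Services connection string>");
const createCallResult = await callAutomationClient.createCall(
  callInvite,
  "<https://sample-callback-uri>",
  createCallOptions
);
// The connection id is what the recognize snippets below refer to as callConnectionId.
const callConnectionId = createCallResult.callConnectionProperties.callConnectionId;
```
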

Call the recognize action

When your application answers the call, you can provide information about recognizing participant input and playing a prompt.

DTMF

const maxTonesToCollect = 3; 
const textToPlay = "Welcome to Contoso, please enter 3 DTMF tones."; 
const playSource: TextSource = { text: textToPlay, voiceName: "en-US-ElizabethNeural", kind: "textSource" }; 
const recognizeOptions: CallMediaRecognizeDtmfOptions = { 
    maxTonesToCollect: maxTonesToCollect, 
    initialSilenceTimeoutInSeconds: 30, 
    playPrompt: playSource, 
    interToneTimeoutInSeconds: 5, 
    interruptPrompt: true, 
    stopDtmfTones: [ DtmfTone.Pound ], 
    kind: "callMediaRecognizeDtmfOptions" 
}; 

await callAutomationClient.getCallConnection(callConnectionId) 
    .getCallMedia() 
    .startRecognizing(targetParticipant, recognizeOptions); 

For speech-to-text flows, the Call Automation recognize action also supports the use of custom speech models. Custom speech models are useful when you're building an application that needs to listen for complex words that the default speech-to-text models may not understand. One example is an application for the telemedicine industry, where your virtual agent needs to recognize medical terms. You can learn more about creating and deploying custom speech models here.

Speech-to-Text Choices

const choices = [ 
    {  
        label: "Confirm", 
        phrases: [ "Confirm", "First", "One" ], 
        tone: DtmfTone.One 
    }, 
    { 
        label: "Cancel", 
        phrases: [ "Cancel", "Second", "Two" ], 
        tone: DtmfTone.Two 
    } 
]; 

const textToPlay = "Hello, this is a reminder for your appointment at 2 PM. Say Confirm to confirm your appointment or Cancel to cancel the appointment. Thank you!"; 
const playSource: TextSource = { text: textToPlay, voiceName: "en-US-ElizabethNeural", kind: "textSource" }; 
const recognizeOptions: CallMediaRecognizeChoiceOptions = { 
    choices: choices, 
    interruptPrompt: true, 
    initialSilenceTimeoutInSeconds: 30, 
    playPrompt: playSource, 
    operationContext: "AppointmentReminderMenu", 
    kind: "callMediaRecognizeChoiceOptions",
    //Only add the speechRecognitionModelEndpointId if you have a custom speech model you would like to use
    speechRecognitionModelEndpointId: "YourCustomSpeechEndpointId"
}; 

await callAutomationClient.getCallConnection(callConnectionId) 
    .getCallMedia() 
    .startRecognizing(targetParticipant, recognizeOptions); 

Speech-to-Text

const textToPlay = "Hi, how can I help you today?"; 
const playSource: TextSource = { text: textToPlay, voiceName: "en-US-ElizabethNeural", kind: "textSource" }; 
const recognizeOptions: CallMediaRecognizeSpeechOptions = { 
    endSilenceTimeoutInSeconds: 1, 
    playPrompt: playSource, 
    operationContext: "OpenQuestionSpeech", 
    kind: "callMediaRecognizeSpeechOptions",
    //Only add the speechRecognitionModelEndpointId if you have a custom speech model you would like to use
    speechRecognitionModelEndpointId: "YourCustomSpeechEndpointId"
}; 

await callAutomationClient.getCallConnection(callConnectionId) 
    .getCallMedia() 
    .startRecognizing(targetParticipant, recognizeOptions); 

Speech-to-Text or DTMF

const maxTonesToCollect = 1; 
const textToPlay = "Hi, how can I help you today? You can press 0 to speak to an agent."; 
const playSource: TextSource = { text: textToPlay, voiceName: "en-US-ElizabethNeural", kind: "textSource" }; 
const recognizeOptions: CallMediaRecognizeSpeechOrDtmfOptions = { 
    maxTonesToCollect: maxTonesToCollect, 
    endSilenceTimeoutInSeconds: 1, 
    playPrompt: playSource, 
    initialSilenceTimeoutInSeconds: 30, 
    interruptPrompt: true, 
    operationContext: "OpenQuestionSpeechOrDtmf", 
    kind: "callMediaRecognizeSpeechOrDtmfOptions",
    //Only add the speechRecognitionModelEndpointId if you have a custom speech model you would like to use
    speechRecognitionModelEndpointId: "YourCustomSpeechEndpointId"
}; 

await callAutomationClient.getCallConnection(callConnectionId) 
    .getCallMedia() 
    .startRecognizing(targetParticipant, recognizeOptions); 

Note

If parameters aren't set, the defaults are applied where possible.

Receiving recognize event updates

Developers can subscribe to the RecognizeCompleted and RecognizeFailed events on the webhook callback they registered for the call. Use these events to build business logic in your application that determines next steps when one of them occurs.
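The snippets below reference `event` and `eventData`; these come from the array of events that Call Automation POSTs to your registered callback URI, where each element carries a `type` string and a `data` payload. A minimal sketch of unpacking such a payload (the sample event body here is illustrative, not an exact wire format):

```typescript
// Each callback event has a `type` (e.g. "Microsoft.Communication.RecognizeCompleted")
// and a `data` payload with fields such as recognitionType and operationContext.
interface AcsCallbackEvent {
  type: string;
  data: Record<string, any>;
}

// Normalize the request body (single event or an array) into a flat list.
function unpackCallbackEvents(body: unknown): AcsCallbackEvent[] {
  const events = Array.isArray(body) ? body : [body];
  return events.map((e: any) => ({ type: e.type, data: e.data }));
}

// Illustrative payload shaped like a RecognizeCompleted callback:
const sample = [{
  type: "Microsoft.Communication.RecognizeCompleted",
  data: {
    recognitionType: "dtmf",
    dtmfResult: { tones: ["one", "two", "three"] },
    operationContext: "MainMenu",
  },
}];

for (const { type, data } of unpackCallbackEvents(sample)) {
  if (type === "Microsoft.Communication.RecognizeCompleted") {
    console.log("tones=%s, context=%s", data.dtmfResult?.tones, data.operationContext);
  }
}
```

In an HTTP handler, `body` would be the parsed JSON of the callback request, and `event.data` is what the deserialization examples below call `eventData`.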

Example of how you can deserialize the RecognizeCompleted event:

if (event.type === "Microsoft.Communication.RecognizeCompleted") { 
    const eventData = event.data; 
    if (eventData.recognitionType === "dtmf") { 
        const tones = eventData.dtmfResult.tones; 
        console.log("Recognition completed, tones=%s, context=%s", tones, eventData.operationContext); 
    } else if (eventData.recognitionType === "choices") { 
        const labelDetected = eventData.choiceResult.label; 
        const phraseDetected = eventData.choiceResult.recognizedPhrase; 
        console.log("Recognition completed, labelDetected=%s, phraseDetected=%s, context=%s", labelDetected, phraseDetected, eventData.operationContext); 
    } else if (eventData.recognitionType === "speech") { 
        const text = eventData.speechResult.speech; 
        console.log("Recognition completed, text=%s, context=%s", text, eventData.operationContext); 
    } else { 
        console.log("Recognition completed: data=%s", JSON.stringify(eventData, null, 2)); 
    } 
} 

Example of how you can deserialize the RecognizeFailed event:

if (event.type === "Microsoft.Communication.RecognizeFailed") {
    const eventData = event.data;
    console.log("Recognize failed: data=%s", JSON.stringify(eventData, null, 2));
}

Example of how you can deserialize the RecognizeCanceled event:

if (event.type === "Microsoft.Communication.RecognizeCanceled") {
    const eventData = event.data;
    console.log("Recognize canceled, context=%s", eventData.operationContext);
}


Technical specifications

The following parameters are available to customize the Recognize function:

| Parameter | Type | Default (if not specified) | Description | Required or Optional |
| --- | --- | --- | --- | --- |
| Prompt (for details on the Play action, refer to this how-to guide) | FileSource, TextSource | Not set | The message you wish to play before recognizing input. | Optional |
| InterToneTimeout | TimeSpan | 2 seconds. Min: 1 second, Max: 60 seconds | Limit in seconds that Azure Communication Services waits for the caller to press another digit (inter-digit timeout). | Optional |
| InitialSegmentationSilenceTimeoutInSeconds | Integer | 0.5 seconds | How long the recognize action waits for input before considering it a timeout. Read more here. | Optional |
| RecognizeInputsType | Enum | dtmf | Type of input that is recognized. Options are dtmf, choices, speech, and speechordtmf. | Required |
| InitialSilenceTimeout | TimeSpan | 5 seconds. Min: 0 seconds, Max: 300 seconds (DTMF), 20 seconds (Choices), 20 seconds (Speech) | Initial silence timeout adjusts how much nonspeech audio is allowed before a phrase before the recognition attempt ends in a "no match" result. Read more here. | Optional |
| MaxTonesToCollect | Integer | No default. Min: 1 | Number of digits a developer expects as input from the participant. | Required |
| StopTones | IEnumeration<DtmfTone> | Not set | The digits participants can press to escape out of a batch DTMF event. | Optional |
| InterruptPrompt | Bool | True | Whether the participant can interrupt the playMessage by pressing a digit. | Optional |
| InterruptCallMediaOperation | Bool | True | If this flag is set, it interrupts the current call media operation. For example, if any audio is being played, it interrupts that operation and initiates recognize. | Optional |
| OperationContext | String | Not set | String that developers can pass with the action, useful for storing context about the events they receive. | Optional |
| Phrases | String | Not set | List of phrases associated with the label; hearing any of these counts as a successful recognition. | Required |
| Tone | String | Not set | The tone to recognize if the user decides to press a number instead of using speech. | Optional |
| Label | String | Not set | The key value for recognition. | Required |
| Language | String | en-US | The language that is used for recognizing speech. | Optional |
| EndSilenceTimeout | TimeSpan | 0.5 seconds | The final pause of the speaker, used to detect the final result that gets generated as speech. | Optional |
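As a guard against sending out-of-range values, you can check timeouts against the documented ranges before calling recognize. The following is a minimal sketch of a hypothetical client-side helper (not part of the SDK), using the bounds from the table above:

```python
# Hypothetical client-side validation against the documented timeout ranges.
# The bounds come from the parameter table; the helper itself is not an SDK API.
MAX_INITIAL_SILENCE_SECONDS = {"dtmf": 300, "choices": 20, "speech": 20}

def validate_recognize_timeouts(input_type, initial_silence_timeout, inter_tone_timeout=2):
    """Raise ValueError when a timeout falls outside its documented range."""
    max_initial = MAX_INITIAL_SILENCE_SECONDS.get(input_type.lower(), 20)
    if not 0 <= initial_silence_timeout <= max_initial:
        raise ValueError(f"initial_silence_timeout must be 0-{max_initial}s for {input_type}")
    if not 1 <= inter_tone_timeout <= 60:
        raise ValueError("inter_tone_timeout must be 1-60 seconds")
    return True
```

Running a check like this before each recognize call surfaces configuration mistakes in your own logs rather than as failed recognize attempts.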

Note

In situations where both dtmf and speech are included in recognizeInputsType, the recognize action acts on the first input type received. If the user presses a keypad number first, the recognize action treats it as a DTMF event and continues listening for DTMF tones; if the user speaks first, the recognize action treats it as speech recognition and listens for voice input.

Create a new Python application

Set up a Python virtual environment for your project

python -m venv play-audio-app

Activate your virtual environment

On Windows, use the following command:

.\play-audio-app\Scripts\activate

On Unix, use the following command:

source play-audio-app/bin/activate

Install the Azure Communication Services Call Automation package

pip install azure-communication-callautomation

Create your application file in your project directory, for example, name it app.py. You write your Python code in this file.

Run your application using Python with the following command. This executes the Python code you have written.

python app.py

Establish a call

By this point, you should be familiar with starting calls. If you need to learn more about making a call, follow our quickstart. In this quickstart, we create an outbound call.

Call the recognize action

When your application answers the call, you can provide information about recognizing participant input and playing a prompt.

DTMF

# Assumes: from azure.communication.callautomation import TextSource, RecognizeInputType, DtmfTone
max_tones_to_collect = 3 
text_to_play = "Welcome to Contoso, please enter 3 DTMF." 
play_source = TextSource(text=text_to_play, voice_name="en-US-ElizabethNeural") 
call_automation_client.get_call_connection(call_connection_id).start_recognizing_media( 
    dtmf_max_tones_to_collect=max_tones_to_collect, 
    input_type=RecognizeInputType.DTMF, 
    target_participant=target_participant, 
    initial_silence_timeout=30, 
    play_prompt=play_source, 
    dtmf_inter_tone_timeout=5, 
    interrupt_prompt=True, 
    dtmf_stop_tones=[ DtmfTone.POUND ]) 
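When a stop tone such as pound ends collection, the stop tone itself usually isn't part of the digits your application wants. The sketch below is a hypothetical post-processing helper (not an SDK API), assuming tones arrive as lowercase names such as "one" and "pound" as they do in the RecognizeCompleted payload:

```python
# Hypothetical helper: drop trailing stop tones from a collected DTMF sequence.
STOP_TONES = {"pound", "asterisk"}

def strip_stop_tones(tones):
    """Return the collected tones without any trailing stop tone."""
    while tones and tones[-1] in STOP_TONES:
        tones = tones[:-1]
    return tones
```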

For speech-to-text flows, the Call Automation recognize action also supports the use of custom speech models. Custom speech models can be useful when you're building an application that needs to listen for complex words that the default speech-to-text models may not understand. A good example is an application for the telemedicine industry, where your virtual agent needs to recognize medical terms. You can learn more about creating and deploying custom speech models here.

Speech-to-Text Choices

choices = [ 
    RecognitionChoice( 
        label="Confirm", 
        phrases=[ "Confirm", "First", "One" ], 
        tone=DtmfTone.ONE 
    ), 
    RecognitionChoice( 
        label="Cancel", 
        phrases=[ "Cancel", "Second", "Two" ], 
        tone=DtmfTone.TWO 
    ) 
] 
text_to_play = "Hello, this is a reminder for your appointment at 2 PM. Say Confirm to confirm your appointment, or Cancel to cancel it. Thank you!" 
play_source = TextSource(text=text_to_play, voice_name="en-US-ElizabethNeural") 
call_automation_client.get_call_connection(call_connection_id).start_recognizing_media( 
    input_type=RecognizeInputType.CHOICES, 
    target_participant=target_participant, 
    choices=choices, 
    interrupt_prompt=True, 
    initial_silence_timeout=30, 
    play_prompt=play_source, 
    operation_context="AppointmentReminderMenu",
    # Only add the speech_recognition_model_endpoint_id if you have a custom speech model you would like to use
    speech_recognition_model_endpoint_id="YourCustomSpeechModelEndpointId")  
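Once recognition completes, the event carries the label of the matched choice, and your application branches on it. A minimal sketch, using the "Confirm" and "Cancel" labels defined above with hypothetical, application-defined actions:

```python
# Illustrative dispatch on a recognized choice label. The labels match the
# RecognitionChoice objects defined above; the action strings are hypothetical.
def handle_appointment_choice(label):
    """Map a recognized choice label to an application-defined action."""
    actions = {
        "Confirm": "appointment confirmed",
        "Cancel": "appointment canceled",
    }
    # Fall back gracefully for labels this menu did not define.
    return actions.get(label, "unrecognized choice")
```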

Speech-to-Text

text_to_play = "Hi, how can I help you today?" 
play_source = TextSource(text=text_to_play, voice_name="en-US-ElizabethNeural") 
call_automation_client.get_call_connection(call_connection_id).start_recognizing_media( 
    input_type=RecognizeInputType.SPEECH, 
    target_participant=target_participant, 
    end_silence_timeout=1, 
    play_prompt=play_source, 
    operation_context="OpenQuestionSpeech",
    # Only add the speech_recognition_model_endpoint_id if you have a custom speech model you would like to use
    speech_recognition_model_endpoint_id="YourCustomSpeechModelEndpointId") 

Speech-to-Text or DTMF

max_tones_to_collect = 1 
text_to_play = "Hi, how can I help you today? You can also press 0 to speak to an agent." 
play_source = TextSource(text=text_to_play, voice_name="en-US-ElizabethNeural") 
call_automation_client.get_call_connection(call_connection_id).start_recognizing_media( 
    dtmf_max_tones_to_collect=max_tones_to_collect, 
    input_type=RecognizeInputType.SPEECH_OR_DTMF, 
    target_participant=target_participant, 
    end_silence_timeout=1, 
    play_prompt=play_source, 
    initial_silence_timeout=30, 
    interrupt_prompt=True, 
    operation_context="OpenQuestionSpeechOrDtmf",
    # Only add the speech_recognition_model_endpoint_id if you have a custom speech model you would like to use
    speech_recognition_model_endpoint_id="YourCustomSpeechModelEndpointId")  
app.logger.info("Start recognizing") 

Note

If parameters aren't set, the defaults are applied where possible.

Receiving recognize event updates

Developers can subscribe to the RecognizeCompleted and RecognizeFailed events on the webhook callback they registered for the call. Use these events to build business logic in your application that determines the next steps when one of them occurs.

Example of how you can deserialize the RecognizeCompleted event:

if event.type == "Microsoft.Communication.RecognizeCompleted": 
    app.logger.info("Recognize completed: data=%s", event.data) 
    if event.data['recognitionType'] == "dtmf": 
        tones = event.data['dtmfResult']['tones'] 
        app.logger.info("Recognition completed, tones=%s, context=%s", tones, event.data.get('operationContext')) 
    elif event.data['recognitionType'] == "choices": 
        labelDetected = event.data['choiceResult']['label'] 
        phraseDetected = event.data['choiceResult']['recognizedPhrase'] 
        app.logger.info("Recognition completed, labelDetected=%s, phraseDetected=%s, context=%s", labelDetected, phraseDetected, event.data.get('operationContext')) 
    elif event.data['recognitionType'] == "speech": 
        text = event.data['speechResult']['speech'] 
        app.logger.info("Recognition completed, text=%s, context=%s", text, event.data.get('operationContext')) 
    else: 
        app.logger.info("Recognition completed: data=%s", event.data) 
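The tones in dtmfResult are reported as names rather than digit characters. The sketch below is a hypothetical converter (not an SDK API), assuming lowercase tone names such as "one" and "pound" as shown in the handler above:

```python
# Hypothetical mapping from DTMF tone names to keypad characters.
TONE_TO_CHAR = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "pound": "#", "asterisk": "*",
}

def tones_to_string(tones):
    """Convert a list of tone names into the digit string the caller entered."""
    return "".join(TONE_TO_CHAR[t] for t in tones)
```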

Example of how you can deserialize the RecognizeFailed event:

if event.type == "Microsoft.Communication.RecognizeFailed": 
    app.logger.info("Recognize failed: data=%s", event.data) 

Example of how you can deserialize the RecognizeCanceled event:

if event.type == "Microsoft.Communication.RecognizeCanceled":
    # Handle the RecognizeCanceled event according to your application logic
    app.logger.info("Recognize canceled, context=%s", event.data.get('operationContext'))

Event codes

| Status | Code | Subcode | Message |
| --- | --- | --- | --- |
| RecognizeCompleted | 200 | 8531 | Action completed, max digits received. |
| RecognizeCompleted | 200 | 8514 | Action completed as stop tone was detected. |
| RecognizeCompleted | 400 | 8508 | Action failed, the operation was canceled. |
| RecognizeCompleted | 400 | 8532 | Action failed, inter-digit silence timeout reached. |
| RecognizeCanceled | 400 | 8508 | Action failed, the operation was canceled. |
| RecognizeFailed | 400 | 8510 | Action failed, initial silence timeout reached. |
| RecognizeFailed | 400 | 8532 | Action failed, inter-digit silence timeout reached. |
| RecognizeFailed | 400 | 8565 | Action failed, bad request to Azure AI services. Check input parameters. |
| RecognizeFailed | 400 | 8565 | Action failed, bad request to Azure AI services. Unable to process the payload provided; check the play source input. |
| RecognizeFailed | 401 | 8565 | Action failed, Azure AI services authentication error. |
| RecognizeFailed | 403 | 8565 | Action failed, forbidden request to Azure AI services; the free subscription used by the request ran out of quota. |
| RecognizeFailed | 408 | 8565 | Action failed, request to Azure AI services timed out. |
| RecognizeFailed | 429 | 8565 | Action failed, requests exceeded the number of allowed concurrent requests for the Azure AI services subscription. |
| RecognizeFailed | 500 | 8511 | Action failed, encountered failure while trying to play the prompt. |
| RecognizeFailed | 500 | 8512 | Unknown internal server error. |
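In a RecognizeFailed handler you can branch on these codes to decide whether to reprompt the caller or retry later. The sketch below is a hypothetical retry policy (not an SDK feature), assuming your handler has extracted the status code and subcode from the event payload:

```python
# Hypothetical retry policy keyed on the status codes and subcodes in the
# event codes table above.
def classify_recognize_failure(code, sub_code):
    """Return a coarse category for a RecognizeFailed code/subcode pair."""
    if sub_code in (8510, 8532):
        return "reprompt"      # caller silence or inter-digit timeout
    if code in (408, 429) or sub_code in (8511, 8512):
        return "retry-later"   # timeout, throttling, or internal error
    return "give-up"           # bad request, auth error, quota exhausted, etc.
```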

Known limitations

  • In-band DTMF is not supported; use RFC 2833 DTMF instead.
  • Text-to-Speech text prompts support a maximum of 400 characters. If your prompt is longer than this, we suggest using SSML for Text-to-Speech-based play actions.
  • For scenarios where you exceed your Speech service quota limit, you can request to increase this limit by following the steps outlined here.
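The 400-character limit applies to plain-text prompts. A hypothetical pre-flight check (not an SDK feature) can flag prompts that should be delivered as SSML instead:

```python
MAX_TTS_PROMPT_CHARS = 400  # documented limit for plain Text-to-Speech prompts

def prompt_needs_ssml(text):
    """Return True when a plain-text prompt exceeds the Text-to-Speech limit."""
    return len(text) > MAX_TTS_PROMPT_CHARS
```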

Clean up resources

If you want to clean up and remove a Communication Services subscription, you can delete the resource or resource group. Deleting the resource group also deletes any other resources associated with it. Learn more about cleaning up resources.

Next Steps