Quickstart: Recognize and verify who is speaking

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this quickstart, you learn basic design patterns for speaker recognition by using the Speech SDK, including:

  • Text-dependent and text-independent verification.
  • Speaker identification to identify a voice sample among a group of voices.
  • Deleting voice profiles.

For a high-level look at speaker recognition concepts, see the Overview article. See the Reference node in the left pane for a list of the supported platforms.

Important

Microsoft limits access to speaker recognition. Apply to use it through the Azure AI Speaker Recognition Limited Access Review form. After approval, you can access the Speaker Recognition APIs.

Prerequisites

Install the Speech SDK

Before you start, you must install the Speech SDK for your platform.

Import dependencies

To run the examples in this article, include the following using statements at the top of your script:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

Create a speech configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechConfig instance. In this example, you create a SpeechConfig instance by using a subscription key and region. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

Important

Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

public class Program 
{
    static async Task Main(string[] args)
    {
        // replace with your own subscription key 
        string subscriptionKey = "YourSubscriptionKey";
        // replace with your own subscription region 
        string region = "YourSubscriptionRegion";
        var config = SpeechConfig.FromSubscription(subscriptionKey, region);
    }
}
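
For example, rather than hard-coding test credentials, a minimal variant of the same boilerplate reads them from environment variables. The SPEECH_KEY and SPEECH_REGION variable names here are illustrative, not required by the SDK:

public class Program 
{
    static async Task Main(string[] args)
    {
        // SPEECH_KEY and SPEECH_REGION are example variable names; set them before running.
        string subscriptionKey = Environment.GetEnvironmentVariable("SPEECH_KEY");
        string region = Environment.GetEnvironmentVariable("SPEECH_REGION");
        var config = SpeechConfig.FromSubscription(subscriptionKey, region);
    }
}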

Text-dependent verification

Speaker verification is the act of confirming that a speaker matches a known, or enrolled, voice. The first step is to enroll a voice profile so that the service has something to compare future voice samples against. In this example, you enroll the profile by using a text-dependent strategy, which requires a specific passphrase to use for enrollment and verification. See the reference docs for a list of supported passphrases.

Start by creating the following function in your Program class to enroll a voice profile:

public static async Task VerificationEnroll(SpeechConfig config, Dictionary<string, string> profileMapping)
{
    using (var client = new VoiceProfileClient(config))
    using (var profile = await client.CreateProfileAsync(VoiceProfileType.TextDependentVerification, "en-us"))
    {
        var phraseResult = await client.GetActivationPhrasesAsync(VoiceProfileType.TextDependentVerification, "en-us");
        using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())
        {
            Console.WriteLine($"Enrolling profile id {profile.Id}.");
            // give the profile a human-readable display name
            profileMapping.Add(profile.Id, "Your Name");

            VoiceProfileEnrollmentResult result = null;
            while (result is null || result.RemainingEnrollmentsCount > 0)
            {
                Console.WriteLine($"Speak the passphrase, \"${phraseResult.Phrases[0]}\"");
                result = await client.EnrollProfileAsync(profile, audioInput);
                Console.WriteLine($"Remaining enrollments needed: {result.RemainingEnrollmentsCount}");
                Console.WriteLine("");
            }
            
            if (result.Reason == ResultReason.EnrolledVoiceProfile)
            {
                await SpeakerVerify(config, profile, profileMapping);
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = VoiceProfileEnrollmentCancellationDetails.FromResult(result);
                Console.WriteLine($"CANCELED {profile.Id}: ErrorCode={cancellation.ErrorCode} ErrorDetails={cancellation.ErrorDetails}");
            }
        }
    }
}

In this function, await client.CreateProfileAsync() is what actually creates the new voice profile. After it's created, you specify how you'll input audio samples. This example uses AudioConfig.FromDefaultMicrophoneInput() to capture audio from your default input device. Next, you enroll audio samples in a while loop that tracks the number of remaining samples required for enrollment. In each iteration, client.EnrollProfileAsync(profile, audioInput) prompts you to speak the passphrase into your microphone and adds the sample to the voice profile.

After enrollment is finished, call await SpeakerVerify(config, profile, profileMapping) to verify against the profile you just created. Add another function to define SpeakerVerify.

public static async Task SpeakerVerify(SpeechConfig config, VoiceProfile profile, Dictionary<string, string> profileMapping)
{
    var speakerRecognizer = new SpeakerRecognizer(config, AudioConfig.FromDefaultMicrophoneInput());
    var model = SpeakerVerificationModel.FromProfile(profile);

    Console.WriteLine("Speak the passphrase to verify: \"My voice is my passport, please verify me.\"");
    var result = await speakerRecognizer.RecognizeOnceAsync(model);
    Console.WriteLine($"Verified voice profile for speaker {profileMapping[result.ProfileId]}, score is {result.Score}");
}

In this function, you pass the VoiceProfile object you just created to initialize a model to verify against. Next, await speakerRecognizer.RecognizeOnceAsync(model) prompts you to speak the passphrase again. This time, the service validates it against your voice profile and returns a similarity score that ranges from 0.0 to 1.0. The result object also returns Accept or Reject, based on whether the passphrase matches.
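
For example, a minimal sketch of a verification call that checks the outcome before trusting the score might look like the following. It assumes the result's Reason property reports ResultReason.RecognizedSpeaker on a successful match, and the 0.8 threshold is purely illustrative:

public static async Task SpeakerVerifyWithThreshold(SpeechConfig config, VoiceProfile profile, Dictionary<string, string> profileMapping)
{
    var speakerRecognizer = new SpeakerRecognizer(config, AudioConfig.FromDefaultMicrophoneInput());
    var model = SpeakerVerificationModel.FromProfile(profile);

    Console.WriteLine("Speak the passphrase to verify: \"My voice is my passport, verify me.\"");
    var result = await speakerRecognizer.RecognizeOnceAsync(model);

    // 0.8 is an illustrative threshold; tune it for your own scenario.
    const double threshold = 0.8;
    if (result.Reason == ResultReason.RecognizedSpeaker && result.Score >= threshold)
    {
        Console.WriteLine($"Accepted speaker {profileMapping[result.ProfileId]} with score {result.Score}");
    }
    else
    {
        Console.WriteLine($"Not verified. Reason: {result.Reason}, score: {result.Score}");
    }
}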

Next, modify your Main() function to call the new functions you created. Also, note that you create a Dictionary<string, string> to pass by reference through your function calls. The reason for this is that the service doesn't allow storing a human-readable name with a created VoiceProfile; for privacy, it stores only an ID. In the VerificationEnroll function, you add to this dictionary an entry with the newly created ID, along with a text name. In application development scenarios where you need to display a human-readable name, you must store this mapping somewhere because the service can't store it (a simple way to persist it is sketched after the Main() example that follows).

static async Task Main(string[] args)
{
    string subscriptionKey = "YourSubscriptionKey";
    string region = "westus";
    var config = SpeechConfig.FromSubscription(subscriptionKey, region);

    // persist profileMapping if you want to store a record of who the profile is
    var profileMapping = new Dictionary<string, string>();
    await VerificationEnroll(config, profileMapping);

    Console.ReadLine();
}
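
If you need the name-to-profile mapping to survive across runs, one minimal approach is to serialize the dictionary to a local JSON file, as sketched here. It assumes you add using System.IO; and using System.Text.Json; directives; the profiles.json file name is arbitrary:

// Requires: using System.IO; and using System.Text.Json;
public static void SaveProfileMapping(Dictionary<string, string> profileMapping, string path = "profiles.json")
{
    File.WriteAllText(path, JsonSerializer.Serialize(profileMapping));
}

public static Dictionary<string, string> LoadProfileMapping(string path = "profiles.json")
{
    // Start with an empty mapping if nothing has been saved yet.
    if (!File.Exists(path)) return new Dictionary<string, string>();
    return JsonSerializer.Deserialize<Dictionary<string, string>>(File.ReadAllText(path));
}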

Run the script. You're prompted to speak the phrase "My voice is my passport, verify me" three times for enrollment, and one more time for verification. The result returned is the similarity score, which you can use to create your own custom thresholds for verification.

Enrolling profile id 87-2cef-4dff-995b-dcefb64e203f.
Speak the passphrase, "My voice is my passport, verify me."
Remaining enrollments needed: 2

Speak the passphrase, "My voice is my passport, verify me."
Remaining enrollments needed: 1

Speak the passphrase, "My voice is my passport, verify me."
Remaining enrollments needed: 0

Speak the passphrase to verify: "My voice is my passport, verify me."
Verified voice profile for speaker Your Name, score is 0.915581

Text-independent verification

In contrast to text-dependent verification, text-independent verification doesn't require three audio samples but does require 20 seconds of total audio.

Make a couple of simple changes to your VerificationEnroll function to switch to text-independent verification. First, change the verification type to VoiceProfileType.TextIndependentVerification. Next, change the while loop to track result.RemainingEnrollmentsSpeechLength, which continues to prompt you to speak until 20 seconds of audio are captured.

public static async Task VerificationEnroll(SpeechConfig config, Dictionary<string, string> profileMapping)
{
    using (var client = new VoiceProfileClient(config))
    using (var profile = await client.CreateProfileAsync(VoiceProfileType.TextIndependentVerification, "en-us"))
    {
        var phraseResult = await client.GetActivationPhrasesAsync(VoiceProfileType.TextIndependentVerification, "en-us");
        using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())
        {
            Console.WriteLine($"Enrolling profile id {profile.Id}.");
            // give the profile a human-readable display name
            profileMapping.Add(profile.Id, "Your Name");

            VoiceProfileEnrollmentResult result = null;
            while (result is null || result.RemainingEnrollmentsSpeechLength > TimeSpan.Zero)
            {
                Console.WriteLine($"Speak the activation phrase, \"${phraseResult.Phrases[0]}\"");
                result = await client.EnrollProfileAsync(profile, audioInput);
                Console.WriteLine($"Remaining enrollment audio time needed: {result.RemainingEnrollmentsSpeechLength}");
                Console.WriteLine("");
            }
            
            if (result.Reason == ResultReason.EnrolledVoiceProfile)
            {
                await SpeakerVerify(config, profile, profileMapping);
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = VoiceProfileEnrollmentCancellationDetails.FromResult(result);
                Console.WriteLine($"CANCELED {profile.Id}: ErrorCode={cancellation.ErrorCode} ErrorDetails={cancellation.ErrorDetails}");
            }
        }
    }
}

Run the program again, and the similarity score is returned.

Enrolling profile id 4tt87d4-f2d3-44ae-b5b4-f1a8d4036ee9.
Speak the activation phrase, "<FIRST ACTIVATION PHRASE>"
Remaining enrollment audio time needed: 00:00:15.3200000

Speak the activation phrase, "<FIRST ACTIVATION PHRASE>"
Remaining enrollment audio time needed: 00:00:09.8100008

Speak the activation phrase, "<FIRST ACTIVATION PHRASE>"
Remaining enrollment audio time needed: 00:00:05.1900000

Speak the activation phrase, "<FIRST ACTIVATION PHRASE>"
Remaining enrollment audio time needed: 00:00:00.8700000

Speak the activation phrase, "<FIRST ACTIVATION PHRASE>"
Remaining enrollment audio time needed: 00:00:00

Speak the passphrase to verify: "My voice is my passport, verify me."
Verified voice profile for speaker Your Name, score is 0.849409

Speaker identification

Speaker identification is used to determine who is speaking from a given group of enrolled voices. The process is similar to text-independent verification. The main difference is that you can identify a speaker from among multiple enrolled voice profiles at once rather than verifying against a single profile.

Create a function IdentificationEnroll to enroll multiple voice profiles. The enrollment process for each profile is the same as the enrollment process for text-independent verification. The process requires 20 seconds of audio for each profile. This function accepts a list of strings profileNames and will create a new voice profile for each name in the list. The function returns a list of VoiceProfile objects, which you use in the next function for identifying a speaker.

public static async Task<List<VoiceProfile>> IdentificationEnroll(SpeechConfig config, List<string> profileNames, Dictionary<string, string> profileMapping)
{
    List<VoiceProfile> voiceProfiles = new List<VoiceProfile>();
    using (var client = new VoiceProfileClient(config))
    {
        var phraseResult = await client.GetActivationPhrasesAsync(VoiceProfileType.TextIndependentVerification, "en-us");
        foreach (string name in profileNames)
        {
            using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())
            {
                var profile = await client.CreateProfileAsync(VoiceProfileType.TextIndependentIdentification, "en-us");
                Console.WriteLine($"Creating voice profile for {name}.");
                profileMapping.Add(profile.Id, name);

                VoiceProfileEnrollmentResult result = null;
                while (result is null || result.RemainingEnrollmentsSpeechLength > TimeSpan.Zero)
                {
                    Console.WriteLine($"Speak the activation phrase, \"${phraseResult.Phrases[0]}\" to add to the profile enrollment sample for {name}.");
                    result = await client.EnrollProfileAsync(profile, audioInput);
                    Console.WriteLine($"Remaining enrollment audio time needed: {result.RemainingEnrollmentsSpeechLength}");
                    Console.WriteLine("");
                }
                voiceProfiles.Add(profile);
            }
        }
    }
    return voiceProfiles;
}

Create the following function SpeakerIdentification to submit an identification request. The main difference in this function compared to a speaker verification request is the use of SpeakerIdentificationModel.FromProfiles(), which accepts a list of VoiceProfile objects.

public static async Task SpeakerIdentification(SpeechConfig config, List<VoiceProfile> voiceProfiles, Dictionary<string, string> profileMapping) 
{
    var speakerRecognizer = new SpeakerRecognizer(config, AudioConfig.FromDefaultMicrophoneInput());
    var model = SpeakerIdentificationModel.FromProfiles(voiceProfiles);

    Console.WriteLine("Speak some text to identify who it is from your list of enrolled speakers.");
    var result = await speakerRecognizer.RecognizeOnceAsync(model);
    Console.WriteLine($"The most similar voice profile is {profileMapping[result.ProfileId]} with similarity score {result.Score}");
}

Change your Main() function to the following. You create a list of strings profileNames, which you pass to your IdentificationEnroll() function. You're prompted to create a new voice profile for each name in this list, so you can add more names to create more profiles for friends or colleagues.

static async Task Main(string[] args)
{
    // replace with your own subscription key 
    string subscriptionKey = "YourSubscriptionKey";
    // replace with your own subscription region 
    string region = "YourSubscriptionRegion";
    var config = SpeechConfig.FromSubscription(subscriptionKey, region);

    // persist profileMapping if you want to store a record of who the profile is
    var profileMapping = new Dictionary<string, string>();
    var profileNames = new List<string>() { "Your name", "A friend's name" };
    
    var enrolledProfiles = await IdentificationEnroll(config, profileNames, profileMapping);
    await SpeakerIdentification(config, enrolledProfiles, profileMapping);

    foreach (var profile in enrolledProfiles)
    {
        profile.Dispose();
    }
    Console.ReadLine();
}

Run the script. You're prompted to speak to enroll voice samples for the first profile. After the enrollment is finished, you're prompted to repeat this process for each name in the profileNames list. After all enrollments are finished, you're prompted to have anyone speak. The service then attempts to identify this person from among your enrolled voice profiles.

This example returns only the closest match and its similarity score. To get the full response that includes the top five similarity scores, add string json = result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult) to your SpeakerIdentification function.
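
For example, a minimal sketch of the SpeakerIdentification function with that line added:

public static async Task SpeakerIdentification(SpeechConfig config, List<VoiceProfile> voiceProfiles, Dictionary<string, string> profileMapping) 
{
    var speakerRecognizer = new SpeakerRecognizer(config, AudioConfig.FromDefaultMicrophoneInput());
    var model = SpeakerIdentificationModel.FromProfiles(voiceProfiles);

    Console.WriteLine("Speak some text to identify who it is from your list of enrolled speakers.");
    var result = await speakerRecognizer.RecognizeOnceAsync(model);
    Console.WriteLine($"The most similar voice profile is {profileMapping[result.ProfileId]} with similarity score {result.Score}");

    // Full service response, including the top five similarity scores.
    string json = result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);
    Console.WriteLine(json);
}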

Change audio input type

The examples in this article use the default device microphone as input for audio samples. In scenarios where you need to use audio files instead of microphone input, change any instance of AudioConfig.FromDefaultMicrophoneInput() to AudioConfig.FromWavFileInput("path/to/your/file.wav") to switch to a file input. You can also mix inputs: for example, use a microphone for enrollment and files for verification.
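
For example, to run verification against a recorded sample instead of the microphone (inside an async method such as SpeakerVerify; the file path is a placeholder):

// Use a .wav file instead of the default microphone for verification.
// "path/to/your/file.wav" is a placeholder; point it to your own recording.
var audioInput = AudioConfig.FromWavFileInput("path/to/your/file.wav");
var speakerRecognizer = new SpeakerRecognizer(config, audioInput);
var model = SpeakerVerificationModel.FromProfile(profile);
var result = await speakerRecognizer.RecognizeOnceAsync(model);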

Delete voice profile enrollments

To delete an enrolled profile, use the DeleteProfileAsync() function on the VoiceProfileClient object. The following example function shows how to delete a voice profile from a known voice profile ID:

public static async Task DeleteProfile(SpeechConfig config, string profileId) 
{
    using (var client = new VoiceProfileClient(config))
    {
        var profile = new VoiceProfile(profileId);
        await client.DeleteProfileAsync(profile);
    }
}
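
For example, to remove the profiles created in the speaker identification example, you could replace the Dispose-only loop at the end of Main() with the following sketch (enrolledProfiles comes from the earlier example):

// Remove each enrolled profile from the service, then release the local objects.
foreach (var profile in enrolledProfiles)
{
    await DeleteProfile(config, profile.Id);
    profile.Dispose();
}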

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this quickstart, you learn basic design patterns for speaker recognition by using the Speech SDK, including:

  • Text-dependent and text-independent verification.
  • Speaker identification to identify a voice sample among a group of voices.
  • Deleting voice profiles.

For a high-level look at speaker recognition concepts, see the Overview article. See the Reference node in the left pane for a list of the supported platforms.

Important

Microsoft limits access to speaker recognition. Apply to use it through the Azure AI Speaker Recognition Limited Access Review form. After approval, you can access the Speaker Recognition APIs.

Prerequisites

Install the Speech SDK

Before you start, you must install the Speech SDK for your platform.

Import dependencies

To run the examples in this article, add the following statements at the top of your .cpp file:

#include <iostream>
#include <stdexcept>
// Note: Install the NuGet package Microsoft.CognitiveServices.Speech.
#include <speechapi_cxx.h>

using namespace std;
using namespace Microsoft::CognitiveServices::Speech;

// Note: Change the locale if desired.
auto profile_locale = "en-us";
auto audio_config = Audio::AudioConfig::FromDefaultMicrophoneInput();
auto ticks_per_second = 10000000;

Create a speech configuration

To call the Speech service by using the Speech SDK, create an instance of the SpeechConfig class. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Important

Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

shared_ptr<SpeechConfig> GetSpeechConfig()
{
    auto subscription_key = "PASTE_YOUR_SPEECH_SUBSCRIPTION_KEY_HERE";
    auto region = "PASTE_YOUR_SPEECH_ENDPOINT_REGION_HERE";
    auto config = SpeechConfig::FromSubscription(subscription_key, region);
    return config;
}

Text-dependent verification

Speaker verification is the act of confirming that a speaker matches a known, or enrolled, voice. The first step is to enroll a voice profile so that the service has something to compare future voice samples against. In this example, you enroll the profile by using a text-dependent strategy, which requires a specific passphrase to use for enrollment and verification. See the reference docs for a list of supported passphrases.

TextDependentVerification function

Start by creating the TextDependentVerification function:

void TextDependentVerification(shared_ptr<VoiceProfileClient> client, shared_ptr<SpeakerRecognizer> recognizer)
{
    std::cout << "Text Dependent Verification:\n\n";
    // Create the profile.
    auto profile = client->CreateProfileAsync(VoiceProfileType::TextDependentVerification, profile_locale).get();
    std::cout << "Created profile ID: " << profile->GetId() << "\n";
    AddEnrollmentsToTextDependentProfile(client, profile);
    SpeakerVerify(profile, recognizer);
    // Delete the profile.
    client->DeleteProfileAsync(profile);
}

This function creates a VoiceProfile object with the CreateProfileAsync method. There are three types of VoiceProfile:

  • TextIndependentIdentification
  • TextDependentVerification
  • TextIndependentVerification

In this case, you pass VoiceProfileType::TextDependentVerification to CreateProfileAsync.

You then call two helper functions that you'll define next, AddEnrollmentsToTextDependentProfile and SpeakerVerify. Finally, call DeleteProfileAsync to clean up the profile.

AddEnrollmentsToTextDependentProfile function

Define the following function to enroll a voice profile:

void AddEnrollmentsToTextDependentProfile(shared_ptr<VoiceProfileClient> client, shared_ptr<VoiceProfile> profile)
{
    shared_ptr<VoiceProfileEnrollmentResult> enroll_result = nullptr;
    auto phraseResult = client->GetActivationPhrasesAsync(profile->GetType(), profile_locale).get();
    auto phrases = phraseResult->GetPhrases();
    while (enroll_result == nullptr || enroll_result->GetEnrollmentInfo(EnrollmentInfoType::RemainingEnrollmentsCount) > 0)
    {
        if (phrases != nullptr && phrases->size() > 0)
        {
            std::cout << "Please say the passphrase, \"" << phrases->at(0) << "\"\n";
            enroll_result = client->EnrollProfileAsync(profile, audio_config).get();
            std::cout << "Remaining enrollments needed: " << enroll_result->GetEnrollmentInfo(EnrollmentInfoType::RemainingEnrollmentsCount) << ".\n";
        }
        else
        {
            std::cout << "No passphrases received, enrollment not attempted.\n\n";
            // Without a passphrase there's nothing to enroll, so exit the loop.
            break;
        }
    }
    std::cout << "Enrollment completed.\n\n";
}

In this function, you enroll audio samples in a while loop that tracks the number of remaining samples required for enrollment. In each iteration, EnrollProfileAsync prompts you to speak the passphrase into your microphone, and it adds the sample to the voice profile.

SpeakerVerify function

Define SpeakerVerify as follows:

void SpeakerVerify(shared_ptr<VoiceProfile> profile, shared_ptr<SpeakerRecognizer> recognizer)
{
    shared_ptr<SpeakerVerificationModel> model = SpeakerVerificationModel::FromProfile(profile);
    std::cout << "Speak the passphrase to verify: \"My voice is my passport, verify me.\"\n";
    shared_ptr<SpeakerRecognitionResult> result = recognizer->RecognizeOnceAsync(model).get();
    std::cout << "Verified voice profile for speaker: " << result->ProfileId << ". Score is: " << result->GetScore() << ".\n\n";
}

In this function, you create a SpeakerVerificationModel object with the SpeakerVerificationModel::FromProfile method, passing in the VoiceProfile object you created earlier.

Next, SpeakerRecognizer::RecognizeOnceAsync prompts you to speak the passphrase again. This time, the service validates it against your voice profile and returns a similarity score that ranges from 0.0 to 1.0. The SpeakerRecognitionResult object also returns Accept or Reject based on whether the passphrase matches.

Text-independent verification

In contrast to text-dependent verification, text-independent verification doesn't require three audio samples but does require 20 seconds of total audio.

TextIndependentVerification function

Start by creating the TextIndependentVerification function:

void TextIndependentVerification(shared_ptr<VoiceProfileClient> client, shared_ptr<SpeakerRecognizer> recognizer)
{
    std::cout << "Text Independent Verification:\n\n";
    // Create the profile.
    auto profile = client->CreateProfileAsync(VoiceProfileType::TextIndependentVerification, profile_locale).get();
    std::cout << "Created profile ID: " << profile->GetId() << "\n";
    AddEnrollmentsToTextIndependentProfile(client, profile);
    SpeakerVerify(profile, recognizer);
    // Delete the profile.
    client->DeleteProfileAsync(profile);
}

Like the TextDependentVerification function, this function creates a VoiceProfile object with the CreateProfileAsync method.

In this case, you pass VoiceProfileType::TextIndependentVerification to CreateProfileAsync.

You then call two helper functions: AddEnrollmentsToTextIndependentProfile, which you'll define next, and SpeakerVerify, which you defined already. Finally, call DeleteProfileAsync to clean up the profile.

AddEnrollmentsToTextIndependentProfile

Define the following function to enroll a voice profile:

void AddEnrollmentsToTextIndependentProfile(shared_ptr<VoiceProfileClient> client, shared_ptr<VoiceProfile> profile)
{
    shared_ptr<VoiceProfileEnrollmentResult> enroll_result = nullptr;
    auto phraseResult = client->GetActivationPhrasesAsync(profile->GetType(), profile_locale).get();
    auto phrases = phraseResult->GetPhrases();
    while (enroll_result == nullptr || enroll_result->GetEnrollmentInfo(EnrollmentInfoType::RemainingEnrollmentsSpeechLength) > 0)
    {
        if (phrases != nullptr && phrases->size() > 0)
        {
            std::cout << "Please say the activation phrase, \"" << phrases->at(0) << "\"\n";
            enroll_result = client->EnrollProfileAsync(profile, audio_config).get();
            std::cout << "Remaining audio time needed: " << enroll_result->GetEnrollmentInfo(EnrollmentInfoType::RemainingEnrollmentsSpeechLength) / ticks_per_second << " seconds.\n";
        }
        else
        {
            std::cout << "No activation phrases received, enrollment not attempted.\n\n";
            // Without an activation phrase there's nothing to enroll, so exit the loop.
            break;
        }
    }
    std::cout << "Enrollment completed.\n\n";
}

In this function, you enroll audio samples in a while loop that tracks the remaining seconds of audio required for enrollment. In each iteration, EnrollProfileAsync prompts you to speak into your microphone, and it adds the sample to the voice profile.

Speaker identification

Speaker identification is used to determine who is speaking from a given group of enrolled voices. The process is similar to text-independent verification. The main difference is that you can identify a speaker from among multiple enrolled voice profiles at once rather than verifying against a single profile.

TextIndependentIdentification function

Start by creating the TextIndependentIdentification function:

void TextIndependentIdentification(shared_ptr<VoiceProfileClient> client, shared_ptr<SpeakerRecognizer> recognizer)
{
    std::cout << "Speaker Identification:\n\n";
    // Create the profile.
    auto profile = client->CreateProfileAsync(VoiceProfileType::TextIndependentIdentification, profile_locale).get();
    std::cout << "Created profile ID: " << profile->GetId() << "\n";
    AddEnrollmentsToTextIndependentProfile(client, profile);
    SpeakerIdentify(profile, recognizer);
    // Delete the profile.
    client->DeleteProfileAsync(profile);
}

Like the TextDependentVerification and TextIndependentVerification functions, this function creates a VoiceProfile object with the CreateProfileAsync method.

In this case, you pass VoiceProfileType::TextIndependentIdentification to CreateProfileAsync.

You then call two helper functions: AddEnrollmentsToTextIndependentProfile, which you defined already, and SpeakerIdentify, which you'll define next. Finally, call DeleteProfileAsync to clean up the profile.

SpeakerIdentify function

Define the SpeakerIdentify function as follows:

void SpeakerIdentify(shared_ptr<VoiceProfile> profile, shared_ptr<SpeakerRecognizer> recognizer)
{
    shared_ptr<SpeakerIdentificationModel> model = SpeakerIdentificationModel::FromProfiles({ profile });
    // Note: We need at least four seconds of audio after pauses are subtracted.
    std::cout << "Please speak for at least ten seconds to identify who it is from your list of enrolled speakers.\n";
    shared_ptr<SpeakerRecognitionResult> result = recognizer->RecognizeOnceAsync(model).get();
    std::cout << "The most similar voice profile is: " << result->ProfileId << " with similarity score: " << result->GetScore() << ".\n\n";
}

In this function, you create a SpeakerIdentificationModel object with the SpeakerIdentificationModel::FromProfiles method. SpeakerIdentificationModel::FromProfiles accepts a list of VoiceProfile objects. In this case, you pass in the VoiceProfile object you created earlier. If you want, you can pass in multiple VoiceProfile objects, each enrolled with audio samples from a different voice.

Next, SpeakerRecognizer::RecognizeOnceAsync prompts you to speak again. This time, the service compares your voice to the enrolled voice profiles and returns the most similar one.

Main function

Finally, define the main function as follows:

int main()
{
    auto speech_config = GetSpeechConfig();
    auto client = VoiceProfileClient::FromConfig(speech_config);
    auto recognizer = SpeakerRecognizer::FromConfig(speech_config, audio_config);
    TextDependentVerification(client, recognizer);
    TextIndependentVerification(client, recognizer);
    TextIndependentIdentification(client, recognizer);
    std::cout << "End of quickstart.\n";
}

This function calls the functions you defined previously. First, it creates a VoiceProfileClient object and a SpeakerRecognizer object.

auto speech_config = GetSpeechConfig();
auto client = VoiceProfileClient::FromConfig(speech_config);
auto recognizer = SpeakerRecognizer::FromConfig(speech_config, audio_config);

The VoiceProfileClient object is used to create, enroll, and delete voice profiles. The SpeakerRecognizer object is used to validate speech samples against one or more enrolled voice profiles.

Change audio input type

The examples in this article use the default device microphone as input for audio samples. In scenarios where you need to use audio files instead of microphone input, change the following line:

auto audio_config = Audio::AudioConfig::FromDefaultMicrophoneInput();

to:

auto audio_config = Audio::AudioConfig::FromWavFileInput("path/to/your/file.wav");

Or replace any use of audio_config with Audio::AudioConfig::FromWavFileInput. You can also have mixed inputs by using a microphone for enrollment and files for verification, for example.

Reference documentation | Package (Go) | Additional Samples on GitHub

In this quickstart, you learn basic design patterns for speaker recognition by using the Speech SDK, including:

  • Text-dependent and text-independent verification.
  • Speaker identification to identify a voice sample among a group of voices.
  • Deleting voice profiles.

For a high-level look at speaker recognition concepts, see the Overview article. See the Reference node in the left pane for a list of the supported platforms.

Important

Microsoft limits access to speaker recognition. Apply to use it through the Azure AI Speaker Recognition Limited Access Review form. After approval, you can access the Speaker Recognition APIs.

Prerequisites

Set up the environment

Install the Speech SDK for Go. Check the SDK installation guide for any other requirements.

Perform independent identification

Follow these steps to create a new Go module.

  1. Open a command prompt where you want the new module, and create a new file named independent-identification.go.

  2. Replace the contents of independent-identification.go with the following code.

    package main
    
    import (
        "bufio"
        "fmt"
        "os"
        "time"
    
        "github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
        "github.com/Microsoft/cognitive-services-speech-sdk-go/common"
        "github.com/Microsoft/cognitive-services-speech-sdk-go/speaker"
        "github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
    )
    
    func GetNewVoiceProfileFromClient(client *speaker.VoiceProfileClient, expectedType common.VoiceProfileType) *speaker.VoiceProfile {
        future := client.CreateProfileAsync(expectedType, "en-US")
        outcome := <-future
        if outcome.Failed() {
            fmt.Println("Got an error creating profile: ", outcome.Error.Error())
            return nil
        }
        profile := outcome.Profile
        _, err := profile.Id()
        if err != nil {
            fmt.Println("Unexpected error creating profile id: ", err)
            return nil
        }
        profileType, err := profile.Type()
        if err != nil {
            fmt.Println("Unexpected error getting profile type: ", err)
            return nil
        }
        if profileType != expectedType {
            fmt.Println("Profile type does not match expected type")
            return nil
        }
        return profile
    }
    
    func EnrollProfile(client *speaker.VoiceProfileClient, profile *speaker.VoiceProfile, audioConfig *audio.AudioConfig) {
        enrollmentReason, currentReason := common.EnrollingVoiceProfile, common.EnrollingVoiceProfile
        var currentResult *speaker.VoiceProfileEnrollmentResult
        expectedEnrollmentCount := 1
        for currentReason == enrollmentReason {
            fmt.Println(`Please speak the following phrase: "I'll talk for a few seconds so you can recognize my voice in the future."`)
            enrollFuture := client.EnrollProfileAsync(profile, audioConfig)
            enrollOutcome := <-enrollFuture
            if enrollOutcome.Failed() {
                fmt.Println("Got an error enrolling profile: ", enrollOutcome.Error.Error())
                return
            }
            currentResult = enrollOutcome.Result
            currentReason = currentResult.Reason
            if currentResult.EnrollmentsCount != expectedEnrollmentCount {
                fmt.Println("Unexpected enrollments for profile: ", currentResult.RemainingEnrollmentsCount)
            }
            expectedEnrollmentCount += 1
        }
        if currentReason != common.EnrolledVoiceProfile {
            fmt.Println("Unexpected result enrolling profile: ", currentResult)
        }
    }
    
    func DeleteProfile(client *speaker.VoiceProfileClient, profile *speaker.VoiceProfile) {
        deleteFuture := client.DeleteProfileAsync(profile)
        deleteOutcome := <-deleteFuture
        if deleteOutcome.Failed() {
            fmt.Println("Got an error deleting profile: ", deleteOutcome.Error.Error())
            return
        }
        result := deleteOutcome.Result
        if result.Reason != common.DeletedVoiceProfile {
            fmt.Println("Unexpected result deleting profile: ", result)
        }
    }
    
    func main() {
        subscription :=  "YourSubscriptionKey"
        region := "YourServiceRegion"
        config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer config.Close()
        client, err := speaker.NewVoiceProfileClientFromConfig(config)
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer client.Close()
        audioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer audioConfig.Close()
        <-time.After(10 * time.Second)
        // Voice profile type 1 is text-independent identification.
        expectedType := common.VoiceProfileType(1)
    
        profile := GetNewVoiceProfileFromClient(client, expectedType)
        if profile == nil {
            fmt.Println("Error creating profile")
            return
        }
        defer profile.Close()
    
        EnrollProfile(client, profile, audioConfig)
    
        profiles := []*speaker.VoiceProfile{profile}
        model, err := speaker.NewSpeakerIdentificationModelFromProfiles(profiles)
        if err != nil {
            fmt.Println("Error creating Identification model: ", err)
        }
        if model == nil {
            fmt.Println("Error creating Identification model: nil model")
            return
        }
        identifyAudioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer identifyAudioConfig.Close()
        speakerRecognizer, err := speaker.NewSpeakerRecognizerFromConfig(config, identifyAudioConfig)
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        identifyFuture := speakerRecognizer.IdentifyOnceAsync(model)
        identifyOutcome := <-identifyFuture
        if identifyOutcome.Failed() {
            fmt.Println("Got an error identifying profile: ", identifyOutcome.Error.Error())
            return
        }
        identifyResult := identifyOutcome.Result
        if identifyResult.Reason != common.RecognizedSpeakers {
            fmt.Println("Got an unexpected result identifying profile: ", identifyResult)
        }
        expectedID, _ := profile.Id()
        if identifyResult.ProfileID != expectedID {
            fmt.Println("Got an unexpected profile id identifying profile: ", identifyResult.ProfileID)
        }
        if identifyResult.Score < 1.0 {
            fmt.Println("Got an unexpected score identifying profile: ", identifyResult.Score)
        }
    
        DeleteProfile(client, profile)
        bufio.NewReader(os.Stdin).ReadBytes('\n')
    }
    
  3. In independent-identification.go, replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

Run the following commands to create a go.mod file that links to components hosted on GitHub:

go mod init independent-identification
go get github.com/Microsoft/cognitive-services-speech-sdk-go

Now build and run the code:

go build
go run independent-identification

Perform independent verification

Follow these steps to create a new Go module.

  1. Open a command prompt where you want the new module, and create a new file named independent-verification.go.

  2. Replace the contents of independent-verification.go with the following code.

    package main
    
    import (
        "bufio"
        "fmt"
        "os"
        "time"
    
        "github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
        "github.com/Microsoft/cognitive-services-speech-sdk-go/common"
        "github.com/Microsoft/cognitive-services-speech-sdk-go/speaker"
        "github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
    )
    
    func GetNewVoiceProfileFromClient(client *speaker.VoiceProfileClient, expectedType common.VoiceProfileType) *speaker.VoiceProfile {
        future := client.CreateProfileAsync(expectedType, "en-US")
        outcome := <-future
        if outcome.Failed() {
            fmt.Println("Got an error creating profile: ", outcome.Error.Error())
            return nil
        }
        profile := outcome.Profile
        _, err := profile.Id()
        if err != nil {
            fmt.Println("Unexpected error creating profile id: ", err)
            return nil
        }
        profileType, err := profile.Type()
        if err != nil {
            fmt.Println("Unexpected error getting profile type: ", err)
            return nil
        }
        if profileType != expectedType {
            fmt.Println("Profile type does not match expected type")
            return nil
        }
        return profile
    }
    
    func EnrollProfile(client *speaker.VoiceProfileClient, profile *speaker.VoiceProfile, audioConfig *audio.AudioConfig) {
        enrollmentReason, currentReason := common.EnrollingVoiceProfile, common.EnrollingVoiceProfile
        var currentResult *speaker.VoiceProfileEnrollmentResult
        expectedEnrollmentCount := 1
        for currentReason == enrollmentReason {
            fmt.Println(`Please speak the following phrase: "I'll talk for a few seconds so you can recognize my voice in the future."`)
            enrollFuture := client.EnrollProfileAsync(profile, audioConfig)
            enrollOutcome := <-enrollFuture
            if enrollOutcome.Failed() {
                fmt.Println("Got an error enrolling profile: ", enrollOutcome.Error.Error())
                return
            }
            currentResult = enrollOutcome.Result
            currentReason = currentResult.Reason
            if currentResult.EnrollmentsCount != expectedEnrollmentCount {
                fmt.Println("Unexpected enrollments for profile: ", currentResult.RemainingEnrollmentsCount)
            }
            expectedEnrollmentCount += 1
        }
        if currentReason != common.EnrolledVoiceProfile {
            fmt.Println("Unexpected result enrolling profile: ", currentResult)
        }
    }
    
    func DeleteProfile(client *speaker.VoiceProfileClient, profile *speaker.VoiceProfile) {
        deleteFuture := client.DeleteProfileAsync(profile)
        deleteOutcome := <-deleteFuture
        if deleteOutcome.Failed() {
            fmt.Println("Got an error deleting profile: ", deleteOutcome.Error.Error())
            return
        }
        result := deleteOutcome.Result
        if result.Reason != common.DeletedVoiceProfile {
            fmt.Println("Unexpected result deleting profile: ", result)
        }
    }
    
    func main() {
        subscription :=  "YourSubscriptionKey"
        region := "YourServiceRegion"
        config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer config.Close()
        client, err := speaker.NewVoiceProfileClientFromConfig(config)
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer client.Close()
        audioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer audioConfig.Close()
        <-time.After(10 * time.Second)
        // Voice profile type 3 is text-independent verification.
        expectedType := common.VoiceProfileType(3)
    
        profile := GetNewVoiceProfileFromClient(client, expectedType)
        if profile == nil {
            fmt.Println("Error creating profile")
            return
        }
        defer profile.Close()
    
        EnrollProfile(client, profile, audioConfig)
    
        model, err := speaker.NewSpeakerVerificationModelFromProfile(profile)
        if err != nil {
            fmt.Println("Error creating Verification model: ", err)
        }
        if model == nil {
            fmt.Println("Error creating Verification model: nil model")
            return
        }
        verifyAudioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        defer verifyAudioConfig.Close()
        speakerRecognizer, err := speaker.NewSpeakerRecognizerFromConfig(config, verifyAudioConfig)
        if err != nil {
            fmt.Println("Got an error: ", err)
            return
        }
        verifyFuture := speakerRecognizer.VerifyOnceAsync(model)
        verifyOutcome := <-verifyFuture
        if verifyOutcome.Failed() {
            fmt.Println("Got an error verifying profile: ", verifyOutcome.Error.Error())
            return
        }
        verifyResult := verifyOutcome.Result
        if verifyResult.Reason != common.RecognizedSpeaker {
            fmt.Println("Got an unexpected result verifying profile: ", verifyResult)
        }
        expectedID, _ := profile.Id()
        if verifyResult.ProfileID != expectedID {
            fmt.Println("Got an unexpected profile id verifying profile: ", verifyResult.ProfileID)
        }
        if verifyResult.Score < 1.0 {
            fmt.Println("Got an unexpected score verifying profile: ", verifyResult.Score)
        }
    
        DeleteProfile(client, profile)
        bufio.NewReader(os.Stdin).ReadBytes('\n')
    }
    
  3. In independent-verification.go, replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region.

Run the following commands to create a go.mod file that links to components hosted on GitHub:

go mod init independent-verification
go get github.com/Microsoft/cognitive-services-speech-sdk-go

Now build and run the code:

go build
go run independent-verification

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Additional Samples on GitHub

The Speech SDK for Java does support speaker recognition, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Java reference and samples linked from the beginning of this article.

Reference documentation | Package (npm) | Additional Samples on GitHub | Library source code

In this quickstart, you learn basic design patterns for speaker recognition by using the Speech SDK, including:

  • Text-dependent and text-independent verification.
  • Speaker identification to identify a voice sample among a group of voices.
  • Deleting voice profiles.

For a high-level look at speaker recognition concepts, see the Overview article. See the Reference node in the left pane for a list of the supported platforms.

Important

Microsoft limits access to speaker recognition. Apply to use it through the Azure AI Speaker Recognition Limited Access Review form. After approval, you can access the Speaker Recognition APIs.

Prerequisites

Install the Speech SDK

Before you start, you must install the Speech SDK for JavaScript.

Depending on the target environment, use one of the following:

Download and extract the Speech SDK for JavaScript microsoft.cognitiveservices.speech.sdk.bundle.js file. Place it in a folder accessible to your HTML file.

<script src="microsoft.cognitiveservices.speech.sdk.bundle.js"></script>

Tip

If you're targeting a web browser and using the <script> tag, the sdk prefix isn't needed. The sdk prefix is an alias used to name the require module.

Import dependencies

To run the examples in this article, add the following statements at the top of your .js file:

"use strict";

/* To run this sample, install:
npm install microsoft-cognitiveservices-speech-sdk
*/
var sdk = require("microsoft-cognitiveservices-speech-sdk");
var fs = require("fs");

// Note: Change the locale if desired.
const profile_locale = "en-us";

/* Note: passphrase_files and verify_file should contain paths to audio files that contain "My voice is my passport, verify me."
You can obtain these files from:
https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/fa6428a0837779cbeae172688e0286625e340942/quickstart/javascript/node/speaker-recognition/verification
*/ 
const passphrase_files = ["myVoiceIsMyPassportVerifyMe01.wav", "myVoiceIsMyPassportVerifyMe02.wav", "myVoiceIsMyPassportVerifyMe03.wav"];
const verify_file = "myVoiceIsMyPassportVerifyMe04.wav";
/* Note: identify_file should contain a path to an audio file that uses the same voice as the other files, but contains different speech. You can obtain this file from:
https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/fa6428a0837779cbeae172688e0286625e340942/quickstart/javascript/node/speaker-recognition/identification
*/
const identify_file = "aboutSpeechSdk.wav";

var subscription_key = 'PASTE_YOUR_SPEECH_SUBSCRIPTION_KEY_HERE';
var region = 'PASTE_YOUR_SPEECH_ENDPOINT_REGION_HERE';

const ticks_per_second = 10000000;

These statements import the required libraries and define your Speech service subscription key and region. They also specify paths to audio files that you'll use in the following tasks.

Important

Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

Create a helper function

Add the following helper function to read audio files into streams for use by the Speech service:

function GetAudioConfigFromFile (file)
{
    return sdk.AudioConfig.fromWavFileInput(fs.readFileSync(file));
}

In this function, you use the fs.readFileSync and AudioConfig.fromWavFileInput methods to create an AudioConfig object from an audio file. This AudioConfig object represents an audio stream. You'll use several of these AudioConfig objects during the following tasks.

Text-dependent verification

Speaker verification is the act of confirming that a speaker matches a known, or enrolled, voice. The first step is to enroll a voice profile so that the service has something to compare future voice samples against. In this example, you enroll the profile by using a text-dependent strategy, which requires a specific passphrase to use for enrollment and verification. See the reference docs for a list of supported passphrases.

TextDependentVerification function

Start by creating the TextDependentVerification function.

async function TextDependentVerification(client, speech_config)
{
    console.log ("Text Dependent Verification:\n");
    var profile = null;
    try {
        const type = sdk.VoiceProfileType.TextDependentVerification;
        // Create the profile.
        profile = await client.createProfileAsync(type, profile_locale);
        console.log ("Created profile ID: " + profile.profileId);
        // Get the activation phrases
        await GetActivationPhrases(type, profile_locale);
        await AddEnrollmentsToTextDependentProfile(client, profile, passphrase_files);
        const audio_config = GetAudioConfigFromFile(verify_file);
        const recognizer = new sdk.SpeakerRecognizer(speech_config, audio_config);
        await SpeakerVerify(profile, recognizer);
    }
    catch (error) {
        console.log ("Error:\n" + error);
    }
    finally {
        if (profile !== null) {
            console.log ("Deleting profile ID: " + profile.profileId);
            const deleteResult = await client.deleteProfileAsync (profile);
        }
    }
}

This function creates a VoiceProfile object with the VoiceProfileClient.createProfileAsync method. There are three types of VoiceProfile:

  • TextIndependentIdentification
  • TextDependentVerification
  • TextIndependentVerification

In this case, you pass VoiceProfileType.TextDependentVerification to VoiceProfileClient.createProfileAsync.

You then call two helper functions that you'll define next, AddEnrollmentsToTextDependentProfile and SpeakerVerify. Finally, call VoiceProfileClient.deleteProfileAsync to remove the profile.

AddEnrollmentsToTextDependentProfile function

Define the following function to enroll a voice profile:

async function AddEnrollmentsToTextDependentProfile(client, profile, audio_files)
{
    try {
        for (const file of audio_files) {
            console.log ("Adding enrollment to text dependent profile...");
            const audio_config = GetAudioConfigFromFile(file);
            const result = await client.enrollProfileAsync(profile, audio_config);
            if (result.reason === sdk.ResultReason.Canceled) {
                throw(JSON.stringify(sdk.VoiceProfileEnrollmentCancellationDetails.fromResult(result)));
            }
            else {
                console.log ("Remaining enrollments needed: " + result.privDetails["remainingEnrollmentsCount"] + ".");
            }
        };
        console.log ("Enrollment completed.\n");
    } catch (error) {
        console.log ("Error adding enrollments: " + error);
    }
}

In this function, you call the GetAudioConfigFromFile function you defined earlier to create AudioConfig objects from audio samples. These audio samples contain a passphrase, such as "My voice is my passport, verify me." You then enroll these audio samples by using the VoiceProfileClient.enrollProfileAsync method.

SpeakerVerify function

Define SpeakerVerify as follows:

async function SpeakerVerify(profile, recognizer)
{
    try {
        const model = sdk.SpeakerVerificationModel.fromProfile(profile);
        const result = await recognizer.recognizeOnceAsync(model);
        console.log ("Verified voice profile for speaker: " + result.profileId + ". Score is: " + result.score + ".\n");
    } catch (error) {
        console.log ("Error verifying speaker: " + error);
    }
}

In this function, you create a SpeakerVerificationModel object with the SpeakerVerificationModel.FromProfile method, passing in the VoiceProfile object you created earlier.

Next, you call the SpeakerRecognizer.recognizeOnceAsync method to validate an audio sample that contains the same passphrase as the audio samples you enrolled previously. SpeakerRecognizer.recognizeOnceAsync returns a SpeakerRecognitionResult object, whose score property contains a similarity score that ranges from 0.0 to 1.0. The SpeakerRecognitionResult object also contains a reason property of type ResultReason. If the verification was successful, the reason property should have the value RecognizedSpeaker.

Text-independent verification

In contrast to text-dependent verification, text-independent verification:

  • Doesn't require a certain passphrase to be spoken. Anything can be spoken.
  • Doesn't require three audio samples but does require 20 seconds of total audio.

TextIndependentVerification function

Start by creating the TextIndependentVerification function.

async function TextIndependentVerification(client, speech_config)
{
    console.log ("Text Independent Verification:\n");
    var profile = null;
    try {
        const type = sdk.VoiceProfileType.TextIndependentVerification;
        // Create the profile.
        profile = await client.createProfileAsync(type, profile_locale);
        console.log ("Created profile ID: " + profile.profileId);
        // Get the activation phrases
        await GetActivationPhrases(type, profile_locale);
        await AddEnrollmentsToTextIndependentProfile(client, profile, [identify_file]);
        const audio_config = GetAudioConfigFromFile(passphrase_files[0]);
        const recognizer = new sdk.SpeakerRecognizer(speech_config, audio_config);
        await SpeakerVerify(profile, recognizer);
    }
    catch (error) {
        console.log ("Error:\n" + error);
    }
    finally {
        if (profile !== null) {
            console.log ("Deleting profile ID: " + profile.profileId);
            const deleteResult = await client.deleteProfileAsync (profile);
        }
    }
}

Like the TextDependentVerification function, this function creates a VoiceProfile object with the VoiceProfileClient.createProfileAsync method.

In this case, you pass VoiceProfileType.TextIndependentVerification to createProfileAsync.

You then call two helper functions: AddEnrollmentsToTextIndependentProfile, which you'll define next, and SpeakerVerify, which you defined already. Finally, call VoiceProfileClient.deleteProfileAsync to remove the profile.

AddEnrollmentsToTextIndependentProfile

Define the following function to enroll a voice profile:

async function AddEnrollmentsToTextIndependentProfile(client, profile, audio_files)
{
    try {
        for (const file of audio_files) {
            console.log ("Adding enrollment to text independent profile...");
            const audio_config = GetAudioConfigFromFile(file);
            const result = await client.enrollProfileAsync (profile, audio_config);
            if (result.reason === sdk.ResultReason.Canceled) {
                throw(JSON.stringify(sdk.VoiceProfileEnrollmentCancellationDetails.fromResult(result)));
            }
            else {
                console.log ("Remaining audio time needed: " + (result.privDetails["remainingEnrollmentsSpeechLength"] / ticks_per_second) + " seconds.");
            }
        }
        console.log ("Enrollment completed.\n");
    } catch (error) {
        console.log ("Error adding enrollments: " + error);
    }
}

In this function, you call the GetAudioConfigFromFile function you defined earlier to create AudioConfig objects from audio samples. You then enroll these audio samples by using the VoiceProfileClient.enrollProfileAsync method.

Speaker identification

Speaker identification is used to determine who is speaking from a given group of enrolled voices. The process is similar to text-independent verification. The main difference is that you can identify a speaker from among multiple enrolled voice profiles at once rather than verifying against a single profile.

TextIndependentIdentification function

Start by creating the TextIndependentIdentification function.

async function TextIndependentIdentification(client, speech_config)
{
    console.log ("Text Independent Identification:\n");
    var profile = null;
    try {
        const type = sdk.VoiceProfileType.TextIndependentIdentification;
        // Create the profile.
        profile = await client.createProfileAsync(type, profile_locale);
        console.log ("Created profile ID: " + profile.profileId);
        // Get the activation phrases
        await GetActivationPhrases(type, profile_locale);
        await AddEnrollmentsToTextIndependentProfile(client, profile, [identify_file]);
        const audio_config = GetAudioConfigFromFile(passphrase_files[0]);
        const recognizer = new sdk.SpeakerRecognizer(speech_config, audio_config);
        await SpeakerIdentify(profile, recognizer);
    }
    catch (error) {
        console.log ("Error:\n" + error);
    }
    finally {
        if (profile !== null) {
            console.log ("Deleting profile ID: " + profile.profileId);
            const deleteResult = await client.deleteProfileAsync (profile);
        }
    }
}

Like the TextDependentVerification and TextIndependentVerification functions, this function creates a VoiceProfile object with the VoiceProfileClient.createProfileAsync method.

In this case, you pass VoiceProfileType.TextIndependentIdentification to VoiceProfileClient.createProfileAsync.

You then call two helper functions: AddEnrollmentsToTextIndependentProfile, which you defined already, and SpeakerIdentify, which you'll define next. Finally, call VoiceProfileClient.deleteProfileAsync to remove the profile.

SpeakerIdentify function

Define the SpeakerIdentify function as follows:

async function SpeakerIdentify(profile, recognizer)
{
    try {
        const model = sdk.SpeakerIdentificationModel.fromProfiles([profile]);
        const result = await recognizer.recognizeOnceAsync(model);
        console.log ("The most similar voice profile is: " + result.profileId + " with similarity score: " + result.score + ".\n");
    } catch (error) {
        console.log ("Error identifying speaker: " + error);
    }
}

In this function, you create a SpeakerIdentificationModel object with the SpeakerIdentificationModel.fromProfiles method, passing in the VoiceProfile object you created earlier.

Next, you call the SpeakerRecognizer.recognizeOnceAsync method and pass in this model. SpeakerRecognizer.recognizeOnceAsync tries to identify the voice in the recognizer's audio input based on the VoiceProfile objects you used to create the SpeakerIdentificationModel. It returns a SpeakerRecognitionResult object, whose profileId property identifies the matching VoiceProfile, if any, while the score property contains a similarity score that ranges from 0.0 to 1.0.
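
The quickstart passes a single profile, but identification is most useful with several candidates. As a sketch only (profileA and profileB are hypothetical VoiceProfile objects that you've already created and enrolled), identifying among multiple profiles looks like this:

// Sketch: build one identification model from several enrolled profiles.
async function IdentifyAmongProfiles(speech_config, audio_config, profileA, profileB)
{
    const model = sdk.SpeakerIdentificationModel.fromProfiles([profileA, profileB]);
    const recognizer = new sdk.SpeakerRecognizer(speech_config, audio_config);
    const result = await recognizer.recognizeOnceAsync(model);
    console.log ("Best match: " + result.profileId + " with similarity score: " + result.score + ".\n");
}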

Main function

Finally, define the main function as follows:

async function main() {
    const speech_config = sdk.SpeechConfig.fromSubscription(subscription_key, region);
    const client = new sdk.VoiceProfileClient(speech_config);

    await TextDependentVerification(client, speech_config);
    await TextIndependentVerification(client, speech_config);
    await TextIndependentIdentification(client, speech_config);
    console.log ("End of quickstart.");
}
main();

This function creates a VoiceProfileClient object, which is used to create, enroll, and delete voice profiles. Then it calls the functions you defined previously.
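
If you've collected all of this code in a single script, you can run it with Node.js. The file name here is only an example; use whatever you named your script:

node index.js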

Reference documentation | Package (Download) | Additional Samples on GitHub

The Speech SDK for Objective-C doesn't support speaker recognition. Please select another programming language, or see the Objective-C reference and samples linked at the beginning of this article.

Reference documentation | Package (Download) | Additional Samples on GitHub

The Speech SDK for Swift doesn't support speaker recognition. Please select another programming language, or see the Swift reference and samples linked at the beginning of this article.

Reference documentation | Package (PyPi) | Additional Samples on GitHub

The Speech SDK for Python doesn't support speaker recognition. Please select another programming language, or see the Python reference and samples linked at the beginning of this article.

Speech to text REST API reference | Speech to text REST API for short audio reference | Additional Samples on GitHub

In this quickstart, you learn basic design patterns for speaker recognition by using the Speech SDK, including:

  • Text-dependent and text-independent verification.
  • Speaker identification to identify a voice sample among a group of voices.
  • Deleting voice profiles.

For a high-level look at speaker recognition concepts, see the Overview article. See the Reference node in the left pane for a list of the supported platforms.

Important

Microsoft limits access to speaker recognition. Apply to use it through the Azure AI Speaker Recognition Limited Access Review form. After approval, you can access the Speaker Recognition APIs.

Prerequisites

Text-dependent verification

Speaker verification is the act of confirming that a speaker matches a known, or enrolled, voice. The first step is to enroll a voice profile so that the service has something to compare future voice samples against. In this example, you enroll the profile by using a text-dependent strategy, which requires a specific passphrase to use for enrollment and verification. See the reference docs for a list of supported passphrases.

Start by creating a voice profile. You'll need to insert your Speech service subscription key and endpoint into each of the curl commands in this article.

Important

Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

# Note Change locale if needed.
curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-dependent/profiles?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: application/json' \
--data-raw '{
    "locale": "en-us"
}'

There are three types of voice profile:

  • Text-dependent verification
  • Text-independent verification
  • Text-independent identification

In this case, you create a text-dependent verification voice profile. You should receive the following response:

{
    "remainingEnrollmentsCount": 3,
    "locale": "en-us",
    "createdDateTime": "2020-09-29T14:54:29.683Z",
    "enrollmentStatus": "Enrolling",
    "modelVersion": null,
    "profileId": "714ce523-de76-4220-b93f-7c1cc1882d6e",
    "lastUpdatedDateTime": null,
    "enrollmentsCount": 0,
    "enrollmentsLength": 0.0,
    "enrollmentSpeechLength": 0.0
}

Next, you enroll the voice profile. For the --data-binary parameter value, specify an audio file on your computer that contains one of the supported passphrases, such as "My voice is my passport, verify me." You can record an audio file with an app like Windows Voice Recorder. Or you can generate it by using text to speech.
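
As a rough sketch, generating a passphrase recording with the text to speech REST API might look like the following command. The region, voice name, and output format are assumptions; substitute values that are valid for your Speech resource.

# Sketch only: synthesize the passphrase to a WAV file by using text to speech.
# The region, voice name, and output format shown here are examples.
curl --location --request POST 'https://INSERT_REGION_HERE.tts.speech.microsoft.com/cognitiveservices/v1' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: riff-16khz-16bit-mono-pcm' \
--header 'User-Agent: curl' \
--data-raw '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"><voice name="en-US-JennyNeural">My voice is my passport, verify me.</voice></speak>' \
--output passphrase.wav

After you have a suitable recording, run the enrollment command: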

curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-dependent/profiles/INSERT_PROFILE_ID_HERE/enrollments?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'

You should receive the following response:

{
    "remainingEnrollmentsCount": 2,
    "passPhrase": "my voice is my passport verify me",
    "profileId": "714ce523-de76-4220-b93f-7c1cc1882d6e",
    "enrollmentStatus": "Enrolling",
    "enrollmentsCount": 1,
    "enrollmentsLength": 3.5,
    "enrollmentsSpeechLength": 2.88,
    "audioLength": 3.5,
    "audioSpeechLength": 2.88
}

This response tells you that you need to enroll two more audio samples.

After you enroll a total of three audio samples, you should receive the following response:

{
    "remainingEnrollmentsCount": 0,
    "passPhrase": "my voice is my passport verify me",
    "profileId": "714ce523-de76-4220-b93f-7c1cc1882d6e",
    "enrollmentStatus": "Enrolled",
    "enrollmentsCount": 3,
    "enrollmentsLength": 10.5,
    "enrollmentsSpeechLength": 8.64,
    "audioLength": 3.5,
    "audioSpeechLength": 2.88
}

Now you're ready to verify an audio sample against the voice profile. This audio sample should contain the same passphrase as the samples you used to enroll the voice profile.

curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-dependent/profiles/INSERT_PROFILE_ID_HERE:verify?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'

You should receive the following response:

{
    "recognitionResult": "Accept",
    "score": 1.0
}

Accept means the passphrase matched and the verification was successful. The response also contains a similarity score that ranges from 0.0 to 1.0.

To finish, delete the voice profile.

curl --location --request DELETE \
'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-dependent/profiles/INSERT_PROFILE_ID_HERE?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE'

There's no response.
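
If you want to confirm the deletion, you can request the profile again; the service should no longer find it. This check isn't part of the quickstart, and the expected error behavior is an assumption:

# Optional sketch: a GET for the deleted profile should fail (for example, with a 404-style error).
curl --location --request GET \
'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-dependent/profiles/INSERT_PROFILE_ID_HERE?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE'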

Text-independent verification

In contrast to text-dependent verification, text-independent verification:

  • Doesn't require a specific passphrase to be spoken. Anything can be spoken.
  • Doesn't require three audio samples but does require 20 seconds of total audio.

Start by creating a text-independent verification profile.

curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-independent/profiles?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: application/json' \
--data-raw '{
    "locale": "en-us"
}'

You should receive the following response:

{
    "profileStatus": "Inactive",
    "remainingEnrollmentsSpeechLength": 20.0,
    "profileId": "3f85dca9-ffc9-4011-bf21-37fad2beb4d2",
    "locale": "en-us",
    "enrollmentStatus": "Enrolling",
    "createdDateTime": "2020-09-29T16:08:52.409Z",
    "lastUpdatedDateTime": null,
    "enrollmentsCount": 0,
    "enrollmentsLength": 0.0,
    "enrollmentSpeechLength": 0.0
    "modelVersion": null,
}

Next, enroll the voice profile. Again, instead of submitting three audio samples, you need to submit audio samples that contain a total of 20 seconds of audio.

curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-independent/profiles/INSERT_PROFILE_ID_HERE/enrollments?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'

After you've submitted enough audio samples, you should receive the following response:

{
    "remainingEnrollmentsSpeechLength": 0.0,
    "profileId": "3f85dca9-ffc9-4011-bf21-37fad2beb4d2",
    "enrollmentStatus": "Enrolled",
    "enrollmentsCount": 1,
    "enrollmentsLength": 33.16,
    "enrollmentsSpeechLength": 29.21,
    "audioLength": 33.16,
    "audioSpeechLength": 29.21
}

Now you're ready to verify an audio sample against the voice profile. Again, this audio sample doesn't need to contain a passphrase. It can contain any speech, but it must contain a total of at least four seconds of audio.

curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-independent/profiles/INSERT_PROFILE_ID_HERE:verify?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'

You should receive the following response:

{
    "recognitionResult": "Accept",
    "score": 0.9196669459342957
}

Accept means the verification was successful. The response also contains a similarity score that ranges from 0.0 to 1.0.

To finish, delete the voice profile.

curl --location --request DELETE 'INSERT_ENDPOINT_HERE/speaker-recognition/verification/text-independent/profiles/INSERT_PROFILE_ID_HERE?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE'

There's no response.

Speaker identification

Speaker identification is used to determine who is speaking from a given group of enrolled voices. The process is similar to text-independent verification. The main difference is that identification compares a voice sample against multiple enrolled profiles at once, rather than verifying it against a single profile.

Start by creating a text-independent identification profile.

# Note Change locale if needed.
curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/identification/text-independent/profiles?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: application/json' \
--data-raw '{
    "locale": "en-us"
}'

You should receive the following response:

{
    "profileStatus": "Inactive",
    "remainingEnrollmentsSpeechLengthInSec": 20.0,
    "profileId": "de99ab38-36c8-4b82-b137-510907c61fe8",
    "locale": "en-us",
    "enrollmentStatus": "Enrolling",
    "createdDateTime": "2020-09-22T17:25:48.642Z",
    "lastUpdatedDateTime": null,
    "enrollmentsCount": 0,
    "enrollmentsLengthInSec": 0.0,
    "enrollmentsSpeechLengthInSec": 0.0,
    "modelVersion": null
}

Next, you enroll the voice profile. Again, you need to submit audio samples that contain a total of 20 seconds of audio. These samples don't need to contain a passphrase.

curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/identification/text-independent/profiles/INSERT_PROFILE_ID_HERE/enrollments?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'

After you've submitted enough audio samples, you should receive the following response:

{
    "remainingEnrollmentsSpeechLength": 0.0,
    "profileId": "de99ab38-36c8-4b82-b137-510907c61fe8",
    "enrollmentStatus": "Enrolled",
    "enrollmentsCount": 2,
    "enrollmentsLength": 36.69,
    "enrollmentsSpeechLength": 31.95,
    "audioLength": 33.16,
    "audioSpeechLength": 29.21
}

Now you're ready to identify an audio sample by using the voice profile. The identify request accepts a comma-delimited list of candidate voice profile IDs. In this case, you pass in the ID of the voice profile you created previously. If you want, you can pass in multiple voice profile IDs, where each voice profile is enrolled with audio samples from a different voice; a sketch with two IDs follows the sample response below.

# Profile IDs: comma-separated list
curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/identification/text-independent/profiles:identifySingleSpeaker?api-version=2021-09-05&profileIds=INSERT_PROFILE_ID_HERE' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'

You should receive the following response:

Success:
{
    "identifiedProfile": {
        "profileId": "de99ab38-36c8-4b82-b137-510907c61fe8",
        "score": 0.9083486
    },
    "profilesRanking": [
        {
            "profileId": "de99ab38-36c8-4b82-b137-510907c61fe8",
            "score": 0.9083486
        }
    ]
}

The response contains the ID of the voice profile that most closely matches the audio sample you submitted. It also contains a list of candidate voice profiles, ranked in order of similarity.
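
If you enrolled more than one profile, pass every candidate ID in the profileIds query parameter. The following sketch uses two placeholder IDs; everything else is unchanged from the previous identify command:

# Sketch: identify among two enrolled profiles by passing both IDs, comma separated.
curl --location --request POST 'INSERT_ENDPOINT_HERE/speaker-recognition/identification/text-independent/profiles:identifySingleSpeaker?api-version=2021-09-05&profileIds=INSERT_FIRST_PROFILE_ID_HERE,INSERT_SECOND_PROFILE_ID_HERE' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_FILE_PATH_HERE'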

To finish, delete the voice profile.

curl --location --request DELETE \
'INSERT_ENDPOINT_HERE/speaker-recognition/identification/text-independent/profiles/INSERT_PROFILE_ID_HERE?api-version=2021-09-05' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE'

There's no response.

The Speech CLI does support speaker recognition, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts.

Next steps