Azure VoiceLive client library for JavaScript - version 1.0.0-beta.1

Azure VoiceLive is a managed service that enables low-latency, high-quality speech-to-speech interactions for voice agents. The service consolidates speech recognition, generative AI, and text-to-speech functionalities into a single, unified interface, providing an end-to-end solution for creating seamless voice-driven experiences.

Use the client library to:

  • Create real-time voice assistants and conversational agents
  • Build speech-to-speech applications with minimal latency
  • Integrate advanced conversational features like noise suppression and echo cancellation
  • Leverage multiple AI models (GPT-4o, GPT-4o-mini, Phi) for different use cases
  • Implement function calling and tool integration for dynamic responses
  • Create avatar-enabled voice interactions with visual components

Note: This package supports both browser and Node.js environments. WebSocket connections are used for real-time communication.

Getting started

Currently supported environments

  • LTS versions of Node.js
  • Latest versions of Safari, Chrome, Edge, and Firefox

Prerequisites

  • An Azure subscription
  • An Azure AI Foundry resource

Install the package

Install the Azure VoiceLive client library using npm:

npm install @azure/ai-voicelive

Install the identity library

VoiceLive clients authenticate using the Azure Identity Library. Install it as well:

npm install @azure/identity

Configure TypeScript

TypeScript users need to have Node type definitions installed:

npm install @types/node

You also need to enable compilerOptions.allowSyntheticDefaultImports in your tsconfig.json. Note that if you have enabled compilerOptions.esModuleInterop, allowSyntheticDefaultImports is enabled by default.
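
A minimal tsconfig.json excerpt with the relevant flags might look like the following; only the two options discussed above are shown, and either one is enough since esModuleInterop implies allowSyntheticDefaultImports:

{
  "compilerOptions": {
    "allowSyntheticDefaultImports": true,
    "esModuleInterop": true
  }
}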

JavaScript Bundle

To use this client library in the browser, first you need to use a bundler. For details on how to do this, please refer to our bundling documentation.

Key concepts

VoiceLiveClient

The primary interface for establishing connections to the Azure VoiceLive service. Use this client to authenticate and create sessions for real-time voice interactions.

VoiceLiveSession

Represents an active WebSocket connection for real-time voice communication. This class handles bidirectional communication, allowing you to send audio input and receive audio output, text transcriptions, and other events in real-time.

Session Configuration

The service uses session configuration to control various aspects of voice interaction:

  • Turn Detection: Configure how the service detects when users start and stop speaking
  • Audio Processing: Enable noise suppression and echo cancellation
  • Voice Selection: Choose from standard Azure voices, high-definition voices, or custom voices
  • Model Selection: Select the AI model (GPT-4o, GPT-4o-mini, Phi variants) that best fits your needs

Models and Capabilities

The VoiceLive API supports multiple AI models with different capabilities:

Model                        | Description                            | Use Case
gpt-4o-realtime-preview      | GPT-4o with real-time audio processing | High-quality conversational AI
gpt-4o-mini-realtime-preview | Lightweight GPT-4o variant             | Fast, efficient interactions
phi4-mm-realtime             | Phi model with multimodal support      | Cost-effective voice applications

Conversational Enhancements

The VoiceLive API provides Azure-specific enhancements (see the configuration sketch after this list):

  • Azure Semantic VAD: Advanced voice activity detection that removes filler words
  • Noise Suppression: Reduces environmental background noise
  • Echo Cancellation: Prevents the assistant's own audio output from being picked up as user input
  • End-of-Turn Detection: Allows natural pauses without premature interruption
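
As a rough illustration, the sketch below shows how these enhancements might be enabled through the session configuration on an existing session (see the Examples section for how to create one). The option and value names used here (azure_semantic_vad, inputAudioNoiseReduction, inputAudioEchoCancellation) are assumptions inferred from the feature names above rather than confirmed API surface; check the package's TypeScript definitions for the exact shapes.

// Hedged sketch: enabling the Azure-specific enhancements on an existing session.
// Property and value names are assumptions -- verify them against the
// @azure/ai-voicelive type definitions before relying on them.
await session.updateSession({
  turnDetection: {
    type: "azure_semantic_vad", // assumed identifier for Azure Semantic VAD
    threshold: 0.5,
    prefixPaddingMs: 300,
    silenceDurationMs: 500,
  },
  inputAudioNoiseReduction: {
    type: "azure_deep_noise_suppression", // assumed noise-suppression option
  },
  inputAudioEchoCancellation: {
    type: "server_echo_cancellation", // assumed echo-cancellation option
  },
});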

Authenticating with Azure Active Directory

The VoiceLive service relies on Azure Active Directory to authenticate requests to its APIs. The @azure/identity package provides a variety of credential types that your application can use to do this. The README for @azure/identity provides more details and samples to get you started.

To interact with the Azure VoiceLive service, you need to create an instance of the VoiceLiveClient class, passing it a service endpoint and a credential object. The examples shown in this document use the DefaultAzureCredential credential type, which is appropriate for most scenarios, including local development and production environments. We recommend using a managed identity for authentication in production environments.

You can find more information on different ways of authenticating and their corresponding credential types in the Azure Identity documentation.

Here's a quick example. Import DefaultAzureCredential and VoiceLiveClient, then use them to create the client:

import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";

const credential = new DefaultAzureCredential();

// Build the URL to reach your AI Foundry resource
const endpoint = "https://your-resource.cognitiveservices.azure.com";

// Create the VoiceLive client
const client = new VoiceLiveClient(endpoint, credential);

Authenticating with an API key

For development scenarios, you can also authenticate using an API key:

import { AzureKeyCredential } from "@azure/core-auth";
import { VoiceLiveClient } from "@azure/ai-voicelive";

const endpoint = "https://your-resource.cognitiveservices.azure.com";
const credential = new AzureKeyCredential("your-api-key");

const client = new VoiceLiveClient(endpoint, credential);

Examples

The following sections provide code snippets that cover some of the common tasks using Azure VoiceLive. The scenarios covered here consist of:

  • Creating a basic voice assistant
  • Configuring session options
  • Handling real-time events
  • Implementing function calling

Creating a basic voice assistant

This example shows how to create a simple voice assistant that can handle speech-to-speech interactions:

import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";

const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";

// Create the client
const client = new VoiceLiveClient(endpoint, credential);

// Create and connect a session
const session = await client.startSession("gpt-4o-mini-realtime-preview");

// Configure session for voice conversation
await session.updateSession({
  modalities: ["text", "audio"],
  instructions: "You are a helpful AI assistant. Respond naturally and conversationally.",
  voice: {
    type: "azure-standard",
    name: "en-US-AvaNeural",
  },
  turnDetection: {
    type: "server_vad",
    threshold: 0.5,
    prefixPaddingMs: 300,
    silenceDurationMs: 500,
  },
  inputAudioFormat: "pcm16",
  outputAudioFormat: "pcm16",
});

Configuring session options

You can customize various aspects of the voice interaction:

import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";

const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const client = new VoiceLiveClient(endpoint, credential);
const session = await client.startSession("gpt-4o-realtime-preview");

// Advanced session configuration
await session.updateSession({
  modalities: ["audio", "text"],
  instructions: "You are a customer service representative. Be helpful and professional.",
  voice: {
    type: "azure-custom",
    name: "your-custom-voice-name",
    endpointId: "your-custom-voice-endpoint",
  },
  turnDetection: {
    type: "server_vad",
    threshold: 0.6,
    prefixPaddingMs: 200,
    silenceDurationMs: 300,
  },
  inputAudioFormat: "pcm16",
  outputAudioFormat: "pcm16",
});

Handling real-time events

The VoiceLive client provides event-driven communication for real-time interactions:

import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";

const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const client = new VoiceLiveClient(endpoint, credential);
const session = await client.startSession("gpt-4o-mini-realtime-preview");

// Set up event handlers using subscription pattern
const subscription = session.subscribe({
  onResponseAudioDelta: async (event, context) => {
    // Handle incoming audio chunks
    const audioData = event.delta;
    // Play audio using Web Audio API or other audio system
    playAudioChunk(audioData);
  },

  onResponseTextDelta: async (event, context) => {
    // Handle incoming text deltas
    console.log("Assistant:", event.delta);
  },

  onInputAudioTranscriptionCompleted: async (event, context) => {
    // Handle user speech transcription
    console.log("User said:", event.transcript);
  },
});

// Send audio data from microphone
function sendAudioChunk(audioBuffer: ArrayBuffer) {
  session.sendAudio(audioBuffer);
}
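
The playAudioChunk function referenced above is not part of the SDK; how you play audio depends on your environment. Below is a hedged, browser-only sketch using the Web Audio API. It assumes the deltas arrive as 16-bit little-endian mono PCM and that a 24 kHz sample rate matches your session's output format; both are assumptions, so adjust to your configuration, and decode base64 to bytes first if your deltas arrive as strings.

// Hypothetical browser playback helper for raw PCM16 audio deltas.
// Assumes mono, 16-bit little-endian samples at 24 kHz -- adjust to match
// your session's outputAudioFormat.
const audioContext = new AudioContext({ sampleRate: 24000 });
let playbackTime = 0;

function playAudioChunk(pcm16: ArrayBuffer | Uint8Array): void {
  const bytes = pcm16 instanceof Uint8Array ? pcm16 : new Uint8Array(pcm16);

  // Copy into an aligned buffer and reinterpret as 16-bit samples.
  const samples = new Int16Array(bytes.slice().buffer);

  // Convert to the Float32 range [-1, 1] expected by the Web Audio API.
  const floats = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    floats[i] = samples[i] / 32768;
  }

  const buffer = audioContext.createBuffer(1, floats.length, audioContext.sampleRate);
  buffer.copyToChannel(floats, 0);

  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);

  // Schedule chunks back to back so playback stays gapless.
  playbackTime = Math.max(playbackTime, audioContext.currentTime);
  source.start(playbackTime);
  playbackTime += buffer.duration;
}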

Implementing function calling

Enable your voice assistant to call external functions and tools:

import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";

const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const client = new VoiceLiveClient(endpoint, credential);
const session = await client.startSession("gpt-4o-mini-realtime-preview");

// Define available functions
const tools = [
  {
    type: "function",
    name: "get_weather",
    description: "Get current weather for a location",
    parameters: {
      type: "object",
      properties: {
        location: {
          type: "string",
          description: "The city and state or country",
        },
      },
      required: ["location"],
    },
  },
];

// Configure session with tools
await session.updateSession({
  modalities: ["audio", "text"],
  instructions:
    "You can help users with weather information. Use the get_weather function when needed.",
  tools: tools,
  toolChoice: "auto",
});

// Handle function calls
const subscription = session.subscribe({
  onResponseFunctionCallArgumentsDone: async (event, context) => {
    if (event.name === "get_weather") {
      const args = JSON.parse(event.arguments);
      const weatherData = await getWeatherData(args.location);

      // Send function result back
      await session.addConversationItem({
        type: "function_call_output",
        callId: event.callId,
        output: JSON.stringify(weatherData),
      });

      // Request response generation
      await session.sendEvent({
        type: "response.create",
      });
    }
  },
});
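
The getWeatherData helper called above is application code, not part of the SDK. A minimal hypothetical implementation might look like the following; the URL and response shape are placeholders for your own data source.

// Hypothetical helper backing the get_weather tool above.
// The endpoint URL and response shape are placeholders -- substitute your own data source.
interface WeatherData {
  location: string;
  temperatureC: number;
  conditions: string;
}

async function getWeatherData(location: string): Promise<WeatherData> {
  const response = await fetch(
    `https://example.com/api/weather?location=${encodeURIComponent(location)}`,
  );
  if (!response.ok) {
    throw new Error(`Weather lookup failed: ${response.status}`);
  }
  return (await response.json()) as WeatherData;
}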

Troubleshooting

Common errors and exceptions

Authentication Errors: If you receive authentication errors, verify that:

  • Your Azure AI Foundry resource is correctly configured
  • Your API key or credential has the necessary permissions
  • The endpoint URL is correct and accessible

WebSocket Connection Issues: VoiceLive uses WebSocket connections. Ensure that:

  • Your network allows WebSocket connections
  • Firewall rules permit connections to *.cognitiveservices.azure.com
  • Browser policies allow WebSocket and microphone access (for browser usage)

Audio Issues: For audio-related problems (see the quick check after this list):

  • Verify microphone permissions in the browser
  • Check that audio formats (PCM16, PCM24) are supported
  • Ensure proper audio context setup for playback
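
As a quick browser-side check, the sketch below requests microphone access and confirms that an AudioContext can start. In most browsers it must be called from a user gesture, such as a button click.

// Quick browser diagnostic: confirm microphone access and audio playback are available.
async function checkAudioSetup(): Promise<void> {
  // Prompts for microphone permission if it has not been granted yet.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  console.log(
    "Microphone tracks:",
    stream.getAudioTracks().map((track) => track.label),
  );

  // AudioContext playback generally requires a prior user gesture.
  const audioContext = new AudioContext();
  await audioContext.resume();
  console.log("AudioContext state:", audioContext.state);
}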

Logging

Enabling logging may help uncover useful information about failures. In order to see a log of WebSocket messages and responses, set the AZURE_LOG_LEVEL environment variable to info. Alternatively, logging can be enabled at runtime by calling setLogLevel in the @azure/logger:

import { setLogLevel } from "@azure/logger";

setLogLevel("info");

For more detailed instructions on how to enable logs, you can look at the @azure/logger package docs.

Next steps

You can find more code samples through the following links:

Contributing

If you'd like to contribute to this library, please read the contributing guide to learn more about how to build and test the code.