Azure VoiceLive is a managed service that enables low-latency, high-quality speech-to-speech interactions for voice agents. The service consolidates speech recognition, generative AI, and text-to-speech functionalities into a single, unified interface, providing an end-to-end solution for creating seamless voice-driven experiences.
Use the client library to:
- Create real-time voice assistants and conversational agents
- Build speech-to-speech applications with minimal latency
- Integrate advanced conversational features like noise suppression and echo cancellation
- Leverage multiple AI models (GPT-4o, GPT-4o-mini, Phi) for different use cases
- Implement function calling and tool integration for dynamic responses
- Create avatar-enabled voice interactions with visual components
Note: This package supports both browser and Node.js environments. WebSocket connections are used for real-time communication.
Getting started
Currently supported environments
- LTS versions of Node.js
- Latest versions of Safari, Chrome, Edge and Firefox
Prerequisites
- An Azure subscription
- An Azure AI Foundry resource with Voice Live API access
Install the package
Install the Azure VoiceLive client library using npm:
npm install @azure/ai-voicelive
Install the identity library
VoiceLive clients authenticate using the Azure Identity Library. Install it as well:
npm install @azure/identity
Configure TypeScript
TypeScript users need to have Node type definitions installed:
npm install @types/node
You also need to enable compilerOptions.allowSyntheticDefaultImports in your tsconfig.json. Note that if you have enabled compilerOptions.esModuleInterop, allowSyntheticDefaultImports is enabled by default.
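For reference, a minimal tsconfig.json with this option enabled looks like:

```json
{
  "compilerOptions": {
    "allowSyntheticDefaultImports": true
  }
}
```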
JavaScript Bundle
To use this client library in the browser, first you need to use a bundler. For details on how to do this, please refer to our bundling documentation.
Key concepts
VoiceLiveClient
The primary interface for establishing connections to the Azure VoiceLive service. Use this client to authenticate and create sessions for real-time voice interactions.
VoiceLiveSession
Represents an active WebSocket connection for real-time voice communication. This class handles bidirectional communication, allowing you to send audio input and receive audio output, text transcriptions, and other events in real time.
Session Configuration
The service uses session configuration to control various aspects of voice interaction:
- Turn Detection: Configure how the service detects when users start and stop speaking
- Audio Processing: Enable noise suppression and echo cancellation
- Voice Selection: Choose from standard Azure voices, high-definition voices, or custom voices
- Model Selection: Select the AI model (GPT-4o, GPT-4o-mini, Phi variants) that best fits your needs
Models and Capabilities
The VoiceLive API supports multiple AI models with different capabilities:
| Model | Description | Use Case |
|---|---|---|
| `gpt-4o-realtime-preview` | GPT-4o with real-time audio processing | High-quality conversational AI |
| `gpt-4o-mini-realtime-preview` | Lightweight GPT-4o variant | Fast, efficient interactions |
| `phi4-mm-realtime` | Phi model with multimodal support | Cost-effective voice applications |
Conversational Enhancements
The VoiceLive API provides Azure-specific enhancements:
- Azure Semantic VAD: Advanced voice activity detection that removes filler words
- Noise Suppression: Reduces environmental background noise
- Echo Cancellation: Removes echo from the model's own voice
- End-of-Turn Detection: Allows natural pauses without premature interruption
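These enhancements are enabled through the session configuration. The sketch below shows the general shape only; the property names and values (azure_semantic_vad, removeFillerWords, inputAudioNoiseReduction, inputAudioEchoCancellation) are assumptions modeled on the session options used in the Examples section, so verify them against the package's type definitions:

```typescript
// Sketch only: enabling Azure conversational enhancements on an existing
// session (see the Examples section for how a session is created).
// All property names and string values below are assumptions -- check the
// SDK's type definitions for the authoritative shapes.
await session.updateSession({
  turnDetection: {
    type: "azure_semantic_vad", // assumed value for Azure Semantic VAD
    removeFillerWords: true, // assumed option
  },
  inputAudioNoiseReduction: { type: "azure_deep_noise_suppression" }, // assumed
  inputAudioEchoCancellation: { type: "server_echo_cancellation" }, // assumed
});
```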
Authenticating with Azure Active Directory
The VoiceLive service relies on Azure Active Directory to authenticate requests to its APIs. The @azure/identity package provides a variety of credential types that your application can use to do this. The README for @azure/identity provides more details and samples to get you started.
To interact with the Azure VoiceLive service, create an instance of the VoiceLiveClient class with a service endpoint and a credential object. The examples shown in this document use a credential object named DefaultAzureCredential, which is appropriate for most scenarios, including local development and production environments. We recommend using a managed identity for authentication in production environments.
You can find more information on different ways of authenticating and their corresponding credential types in the Azure Identity documentation.
Here's a quick example. First, import DefaultAzureCredential and VoiceLiveClient:
import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";
const credential = new DefaultAzureCredential();
// Build the URL to reach your AI Foundry resource
const endpoint = "https://your-resource.cognitiveservices.azure.com";
// Create the VoiceLive client
const client = new VoiceLiveClient(endpoint, credential);
Authentication with API Key
For development scenarios, you can also authenticate using an API key:
import { AzureKeyCredential } from "@azure/core-auth";
import { VoiceLiveClient } from "@azure/ai-voicelive";
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const credential = new AzureKeyCredential("your-api-key");
const client = new VoiceLiveClient(endpoint, credential);
Examples
The following sections provide code snippets that cover some of the common tasks using Azure VoiceLive. The scenarios covered here consist of:
- Creating a basic voice assistant
- Configuring session options
- Handling real-time events
- Implementing function calling
Creating a basic voice assistant
This example shows how to create a simple voice assistant that can handle speech-to-speech interactions:
import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";
const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
// Create the client
const client = new VoiceLiveClient(endpoint, credential);
// Create and connect a session
const session = await client.startSession("gpt-4o-mini-realtime-preview");
// Configure session for voice conversation
await session.updateSession({
modalities: ["text", "audio"],
instructions: "You are a helpful AI assistant. Respond naturally and conversationally.",
voice: {
type: "azure-standard",
name: "en-US-AvaNeural",
},
turnDetection: {
type: "server_vad",
threshold: 0.5,
prefixPaddingMs: 300,
silenceDurationMs: 500,
},
inputAudioFormat: "pcm16",
outputAudioFormat: "pcm16",
});
Configuring session options
You can customize various aspects of the voice interaction:
import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";
const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const client = new VoiceLiveClient(endpoint, credential);
const session = await client.startSession("gpt-4o-realtime-preview");
// Advanced session configuration
await session.updateSession({
modalities: ["audio", "text"],
instructions: "You are a customer service representative. Be helpful and professional.",
voice: {
type: "azure-custom",
name: "your-custom-voice-name",
endpointId: "your-custom-voice-endpoint",
},
turnDetection: {
type: "server_vad",
threshold: 0.6,
prefixPaddingMs: 200,
silenceDurationMs: 300,
},
inputAudioFormat: "pcm16",
outputAudioFormat: "pcm16",
});
Handling real-time events
The VoiceLive client provides event-driven communication for real-time interactions:
import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";
const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const client = new VoiceLiveClient(endpoint, credential);
const session = await client.startSession("gpt-4o-mini-realtime-preview");
// Set up event handlers using subscription pattern
const subscription = session.subscribe({
onResponseAudioDelta: async (event, context) => {
// Handle incoming audio chunks
const audioData = event.delta;
// Play audio using Web Audio API or other audio system
playAudioChunk(audioData);
},
onResponseTextDelta: async (event, context) => {
// Handle incoming text deltas
console.log("Assistant:", event.delta);
},
onInputAudioTranscriptionCompleted: async (event, context) => {
// Handle user speech transcription
console.log("User said:", event.transcript);
},
});
// Send audio data from microphone
function sendAudioChunk(audioBuffer: ArrayBuffer) {
session.sendAudio(audioBuffer);
}
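The snippet above assumes a playAudioChunk helper, which this package does not provide. Here is a minimal sketch using the standard Web Audio API; it assumes the delta arrives as raw mono PCM16 bytes, and the 24 kHz sample rate is an assumption (match it to your outputAudioFormat, and decode from base64 first if that is what your SDK version delivers):

```typescript
// Minimal playAudioChunk sketch using the Web Audio API (browser only).
// Assumes mono PCM16 audio; the 24 kHz sample rate is an assumption --
// match it to the session's outputAudioFormat.
const audioContext = new AudioContext({ sampleRate: 24000 });
let playbackTime = 0;

function playAudioChunk(audioData: ArrayBuffer): void {
  // Convert 16-bit PCM samples to the [-1, 1] float range Web Audio expects
  const pcm16 = new Int16Array(audioData);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    float32[i] = pcm16[i] / 32768;
  }

  const buffer = audioContext.createBuffer(1, float32.length, audioContext.sampleRate);
  buffer.copyToChannel(float32, 0);

  // Schedule each chunk immediately after the previous one for gapless playback
  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);
  playbackTime = Math.max(playbackTime, audioContext.currentTime);
  source.start(playbackTime);
  playbackTime += buffer.duration;
}
```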
Implementing function calling
Enable your voice assistant to call external functions and tools:
import { DefaultAzureCredential } from "@azure/identity";
import { VoiceLiveClient } from "@azure/ai-voicelive";
const credential = new DefaultAzureCredential();
const endpoint = "https://your-resource.cognitiveservices.azure.com";
const client = new VoiceLiveClient(endpoint, credential);
const session = await client.startSession("gpt-4o-mini-realtime-preview");
// Define available functions
const tools = [
{
type: "function",
name: "get_weather",
description: "Get current weather for a location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "The city and state or country",
},
},
required: ["location"],
},
},
];
// Configure session with tools
await session.updateSession({
modalities: ["audio", "text"],
instructions:
"You can help users with weather information. Use the get_weather function when needed.",
tools: tools,
toolChoice: "auto",
});
// Handle function calls
const subscription = session.subscribe({
onResponseFunctionCallArgumentsDone: async (event, context) => {
if (event.name === "get_weather") {
const args = JSON.parse(event.arguments);
const weatherData = await getWeatherData(args.location);
// Send function result back
await session.addConversationItem({
type: "function_call_output",
callId: event.callId,
output: JSON.stringify(weatherData),
});
// Request response generation
await session.sendEvent({
type: "response.create",
});
}
},
});
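The handler above calls a getWeatherData helper that is not part of the SDK. A hypothetical stand-in, purely to make the example self-contained:

```typescript
// Hypothetical getWeatherData helper used above -- the name and return shape
// are illustrative only; a real implementation would call a weather service.
async function getWeatherData(
  location: string,
): Promise<{ location: string; temperatureC: number; condition: string }> {
  return { location, temperatureC: 21, condition: "sunny" };
}
```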
Troubleshooting
Common errors and exceptions
Authentication Errors: If you receive authentication errors, verify that:
- Your Azure AI Foundry resource is correctly configured
- Your API key or credential has the necessary permissions
- The endpoint URL is correct and accessible
WebSocket Connection Issues: VoiceLive uses WebSocket connections. Ensure that:
- Your network allows WebSocket connections
- Firewall rules permit connections to *.cognitiveservices.azure.com
- Browser policies allow WebSocket and microphone access (for browser usage)
Audio Issues: For audio-related problems:
- Verify microphone permissions in the browser (see the sketch after this list)
- Check that audio formats (PCM16, PCM24) are supported
- Ensure proper audio context setup for playback
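For the microphone permissions check, a minimal sketch of requesting access in the browser follows; the requestMicrophone name is illustrative, and converting the captured stream to PCM16 for session.sendAudio (for example, via an AudioWorklet) is application-specific and omitted here:

```typescript
// Minimal sketch: request microphone access in the browser before streaming
// audio to the session. Downstream PCM16 capture is app-specific and omitted.
async function requestMicrophone(): Promise<MediaStream> {
  try {
    return await navigator.mediaDevices.getUserMedia({ audio: true });
  } catch (err) {
    // Usually a NotAllowedError when the user denies the permission prompt
    console.error("Microphone access denied or unavailable:", err);
    throw err;
  }
}
```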
Logging
Enabling logging may help uncover useful information about failures. To see a log of WebSocket messages and responses, set the AZURE_LOG_LEVEL environment variable to info. Alternatively, logging can be enabled at runtime by calling setLogLevel from @azure/logger:
import { setLogLevel } from "@azure/logger";
setLogLevel("info");
For more detailed instructions on how to enable logs, you can look at the @azure/logger package docs.
Next steps
More code samples for this package are available in the Azure SDK for JavaScript repository.
Contributing
If you'd like to contribute to this library, please read the contributing guide to learn more about how to build and test the code.