
Quickstart: Voice Live API for real-time voice agents (Preview)

Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see the Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

Tip

You don't need to deploy an audio model with your Azure AI Foundry resource to use the Voice Live API. The Voice Live API is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live API overview documentation.

Microsoft Entra ID prerequisites

For the recommended keyless authentication with Microsoft Entra ID, you need to:

  • Install the Azure CLI, which is required for keyless authentication with Microsoft Entra ID.
  • Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment, or from the Azure CLI as shown below.
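
The following Azure CLI command is a sketch of the role assignment; the assignee and the scope segments (subscription ID, resource group, and resource name) are placeholders that you replace with your own values:

    az role assignment create --assignee "user@example.com" --role "Cognitive Services User" --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<resource-name>"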

Set up

  1. Create a new folder voice-live-quickstart and go to the quickstart folder with the following command:

    mkdir voice-live-quickstart && cd voice-live-quickstart
    
  2. Create a virtual environment. If you already have Python 3.10 or higher installed, you can create a virtual environment with the following commands:

    py -3 -m venv .venv
    .venv\scripts\activate
    

    Activating the Python environment means that when you run python or pip from the command line, you use the Python interpreter contained in the .venv folder of your application. You can use the deactivate command to exit the Python virtual environment and reactivate it later when needed.
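
    The activation commands above are for Windows. On Linux and macOS, the equivalent commands are:

    python3 -m venv .venv
    source .venv/bin/activate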

    Tip

    We recommend that you create and activate a new Python environment to install the packages you need for this tutorial. Don't install packages into your global Python installation. Always use a virtual or conda environment when installing Python packages; otherwise you can break your global installation of Python.

  3. Create a file named requirements.txt. Add the following packages to the file:

    aiohttp==3.11.18
    azure-core==1.34.0
    azure-identity==1.22.0
    certifi==2025.4.26
    cffi==1.17.1
    cryptography==44.0.3
    numpy==2.2.5
    pycparser==2.22
    python-dotenv==1.1.0
    requests==2.32.3
    sounddevice==0.5.1
    typing_extensions==4.13.2
    urllib3==2.4.0
    websockets==15.0.1
    
  4. Install the packages:

    pip install -r requirements.txt
    
  5. To use the recommended keyless authentication with Microsoft Entra ID, install the azure-identity package with the following command:

    pip install azure-identity
    

Retrieve resource information

You need to retrieve the following information to authenticate your application with your Azure AI Foundry resource:

Variable name                   Value
AZURE_VOICE_LIVE_ENDPOINT       This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal.
VOICE_LIVE_MODEL                The model that you want to use, for example gpt-4o or gpt-4o-mini-realtime-preview. For more information about model availability, see the Voice Live API overview documentation.
AZURE_VOICE_LIVE_API_VERSION    The API version that you want to use, for example 2025-05-01-preview.

Learn more about keyless authentication and setting environment variables.
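
Because the quickstart script loads environment variables with python-dotenv (the load_dotenv() call), one convenient option is to put these values in a .env file in the same folder as the script. The values below are examples only; substitute your own endpoint:

    AZURE_VOICE_LIVE_ENDPOINT=https://your-endpoint.azure.com/
    VOICE_LIVE_MODEL=gpt-4o
    AZURE_VOICE_LIVE_API_VERSION=2025-05-01-preview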

Start a conversation

  1. Create the voice-live-quickstart.py file with the following code:

    from __future__ import annotations
    
    import os
    import uuid
    import json
    import asyncio
    import base64
    import logging
    import threading
    import numpy as np
    import sounddevice as sd
    
    from collections import deque
    from dotenv import load_dotenv
    # The async credential is required because get_token() is awaited below.
    from azure.identity.aio import DefaultAzureCredential
    from typing_extensions import AsyncIterator
    from websockets.asyncio.client import connect as ws_connect
    from websockets.asyncio.client import ClientConnection as AsyncWebsocket
    from websockets.asyncio.client import HeadersLike
    from websockets.typing import Data
    from websockets.exceptions import WebSocketException
    
    # This is the main function to run the Voice Live API client.
    
    async def main() -> None:
        # Set environment variables or edit the corresponding values here.
        endpoint = os.environ.get("AZURE_VOICE_LIVE_ENDPOINT") or "https://your-endpoint.azure.com/"
        model = os.environ.get("VOICE_LIVE_MODEL") or "gpt-4o"
        api_version = os.environ.get("AZURE_VOICE_LIVE_API_VERSION") or "2025-05-01-preview"
        api_key = os.environ.get("AZURE_VOICE_LIVE_API_KEY") or "your_api_key"
    
        # For the recommended keyless authentication, get and
        # use the Microsoft Entra token instead of api_key:
        scopes = "https://cognitiveservices.azure.com/.default"
        credential = DefaultAzureCredential()
        token = await credential.get_token(scopes)
        # Close the credential's transport so aiohttp doesn't report an
        # unclosed client session at shutdown.
        await credential.close()
    
        client = AsyncAzureVoiceLive(
            azure_endpoint = endpoint,
            api_version = api_version,
            token = token.token,
            #api_key = api_key,
        )
        async with client.connect(model = model) as connection:
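            # Configure the session: semantic voice activity detection with
            # end-of-utterance detection, deep noise suppression, server echo
            # cancellation, and an Azure HD neural voice for the responses.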
            session_update = {
                "type": "session.update",
                "session": {
                    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
                    "turn_detection": {
                        "type": "azure_semantic_vad",
                        "threshold": 0.3,
                        "prefix_padding_ms": 200,
                        "silence_duration_ms": 200,
                        "remove_filler_words": False,
                        "end_of_utterance_detection": {
                            "model": "semantic_detection_v1",
                            "threshold": 0.01,
                            "timeout": 2,
                        },
                    },
                    "input_audio_noise_reduction": {
                        "type": "azure_deep_noise_suppression"
                    },
                    "input_audio_echo_cancellation": {
                        "type": "server_echo_cancellation"
                    },
                    "voice": {
                        "name": "en-US-Ava:DragonHDLatestNeural",
                        "type": "azure-standard",
                        "temperature": 0.8,
                    },
                },
                "event_id": ""
            }
            await connection.send(json.dumps(session_update))
            print("Session created: ", json.dumps(session_update))
    
            send_task = asyncio.create_task(listen_and_send_audio(connection))
            receive_task = asyncio.create_task(receive_audio_and_playback(connection))
            keyboard_task = asyncio.create_task(read_keyboard_and_quit())
    
            print("Starting the chat ...")
            await asyncio.wait([send_task, receive_task, keyboard_task], return_when=asyncio.FIRST_COMPLETED)
    
            send_task.cancel()
            receive_task.cancel()
            print("Chat done.")
    
    # --- End of Main Function ---
    
    logger = logging.getLogger(__name__)
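    # This sample streams 24 kHz, 16-bit, mono PCM in both directions.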
    AUDIO_SAMPLE_RATE = 24000
    
    class AsyncVoiceLiveConnection:
        _connection: AsyncWebsocket
    
        def __init__(self, url: str, additional_headers: HeadersLike) -> None:
            self._url = url
            self._additional_headers = additional_headers
            self._connection = None
    
        async def __aenter__(self) -> AsyncVoiceLiveConnection:
            try:
                self._connection = await ws_connect(self._url, additional_headers=self._additional_headers)
            except WebSocketException as e:
                raise ValueError(f"Failed to establish a WebSocket connection: {e}")
            return self
    
        async def __aexit__(self, exc_type, exc_value, traceback) -> None:
            if self._connection:
                await self._connection.close()
                self._connection = None
    
        enter = __aenter__
        close = __aexit__
    
        async def __aiter__(self) -> AsyncIterator[Data]:
            async for data in self._connection:
                yield data
    
        async def recv(self) -> Data:
            return await self._connection.recv()
    
        async def recv_bytes(self) -> bytes:
            return await self._connection.recv()
    
        async def send(self, message: Data) -> None:
            await self._connection.send(message)
    
    class AsyncAzureVoiceLive:
        def __init__(
            self,
            *,
            azure_endpoint: str | None = None,
            api_version: str | None = None,
            token: str | None = None,
            api_key: str | None = None,
        ) -> None:
    
            self._azure_endpoint = azure_endpoint
            self._api_version = api_version
            self._token = token
            self._api_key = api_key
            self._connection = None
    
        def connect(self, model: str) -> AsyncVoiceLiveConnection:
            if self._connection is not None:
                raise ValueError("Already connected to the Voice Live API.")
            if not model:
                raise ValueError("Model name is required.")
    
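            # Build the realtime endpoint URL; Voice Live is served over a
            # WebSocket, so the https scheme becomes wss.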
            url = f"{self._azure_endpoint.rstrip('/')}/voice-live/realtime?api-version={self._api_version}&model={model}"
            url = url.replace("https://", "wss://")
    
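            # Prefer the Microsoft Entra bearer token when one is provided;
            # otherwise fall back to the api-key header.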
            auth_header = {"Authorization": f"Bearer {self._token}"} if self._token else {"api-key": self._api_key}
            request_id = uuid.uuid4()
            headers = {"x-ms-client-request-id": str(request_id), **auth_header}
    
            self._connection = AsyncVoiceLiveConnection(
                url,
                additional_headers=headers,
            )
            return self._connection
    
    class AudioPlayerAsync:
        def __init__(self):
            self.queue = deque()
            self.lock = threading.Lock()
            self.stream = sd.OutputStream(
                callback=self.callback,
                samplerate=AUDIO_SAMPLE_RATE,
                channels=1,
                dtype=np.int16,
                blocksize=2400,
            )
            self.playing = False
    
        def callback(self, outdata, frames, time, status):
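            # PortAudio invokes this callback on its own thread whenever the
            # output device needs samples; fill outdata from the queue and
            # pad with silence when the queue runs dry.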
            if status:
                logger.warning(f"Stream status: {status}")
            with self.lock:
                data = np.empty(0, dtype=np.int16)
                while len(data) < frames and len(self.queue) > 0:
                    item = self.queue.popleft()
                    frames_needed = frames - len(data)
                    data = np.concatenate((data, item[:frames_needed]))
                    if len(item) > frames_needed:
                        self.queue.appendleft(item[frames_needed:])
                if len(data) < frames:
                    data = np.concatenate((data, np.zeros(frames - len(data), dtype=np.int16)))
            outdata[:] = data.reshape(-1, 1)
    
        def add_data(self, data: bytes):
            with self.lock:
                np_data = np.frombuffer(data, dtype=np.int16)
                self.queue.append(np_data)
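                # Don't start playback until a few chunks are buffered, so
                # the output stream doesn't underrun immediately.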
                if not self.playing and len(self.queue) > 10:
                    self.start()
    
        def start(self):
            if not self.playing:
                self.playing = True
                self.stream.start()
    
        def stop(self):
            with self.lock:
                self.queue.clear()
            self.playing = False
            self.stream.stop()
    
        def terminate(self):
            with self.lock:
                self.queue.clear()
            self.stream.stop()
            self.stream.close()
    
    async def listen_and_send_audio(connection: AsyncVoiceLiveConnection) -> None:
        logger.info("Starting audio stream ...")
    
        stream = sd.InputStream(channels=1, samplerate=AUDIO_SAMPLE_RATE, dtype="int16")
        try:
            stream.start()
            # Send 20 ms of audio per message (480 samples at 24 kHz).
            read_size = int(AUDIO_SAMPLE_RATE * 0.02)
            while True:
                if stream.read_available >= read_size:
                    data, _ = stream.read(read_size)
                    audio = base64.b64encode(data).decode("utf-8")
                    param = {"type": "input_audio_buffer.append", "audio": audio, "event_id": ""}
                    data_json = json.dumps(param)
                    await connection.send(data_json)
                else:
                    # Yield to the event loop while the microphone fills its
                    # buffer; without this the loop starves the other tasks.
                    await asyncio.sleep(0.001)
        except Exception as e:
            logger.error(f"Audio stream interrupted. {e}")
        finally:
            stream.stop()
            stream.close()
            logger.info("Audio stream closed.")
    
    async def receive_audio_and_playback(connection: AsyncVoiceLiveConnection) -> None:
        last_audio_item_id = None
        audio_player = AudioPlayerAsync()
    
        logger.info("Starting audio playback ...")
        try:
            # Iterate events until the connection closes or the task is
            # cancelled. (Wrapping the iterator in `while True` would spin
            # forever once the connection closes normally.)
            async for raw_event in connection:
                event = json.loads(raw_event)
                print(f"Received event: {event.get('type')}")

                if event.get("type") == "session.created":
                    session = event.get("session")
                    logger.info(f"Session created: {session.get('id')}")

                elif event.get("type") == "response.audio.delta":
                    if event.get("item_id") != last_audio_item_id:
                        last_audio_item_id = event.get("item_id")

                    bytes_data = base64.b64decode(event.get("delta", ""))
                    audio_player.add_data(bytes_data)

                elif event.get("type") == "error":
                    error_details = event.get("error", {})
                    error_type = error_details.get("type", "Unknown")
                    error_code = error_details.get("code", "Unknown")
                    error_message = error_details.get("message", "No message provided")
                    raise ValueError(f"Error received: Type={error_type}, Code={error_code}, Message={error_message}")
    
        except Exception as e:
            logger.error(f"Error in audio playback: {e}")
        finally:
            audio_player.terminate()
            logger.info("Playback done.")
    
    async def read_keyboard_and_quit() -> None:
        print("Press 'q' and Enter to quit the chat.")
        while True:
            # Run input() in a thread to avoid blocking the event loop
            user_input = await asyncio.to_thread(input)
            if user_input.strip().lower() == 'q':
                print("Quitting the chat...")
                break
    
    if __name__ == "__main__":
        try:
            logging.basicConfig(
                filename='voicelive.log',
                filemode="w",
                level=logging.DEBUG,
                format='%(asctime)s:%(name)s:%(levelname)s:%(message)s'
            )
            load_dotenv()
            asyncio.run(main())
        except Exception as e:
            print(f"Error: {e}")
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Run the Python file.

    python voice-live-quickstart.py
    
  4. The Voice Live API starts to return audio with the model's initial response. You can interrupt the model by speaking. Enter 'q' to quit the conversation.
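
The script captures and plays audio through your system's default input and output devices. If the model can't hear you, or you can't hear the model, you can list the devices that sounddevice detects with its query_devices helper:

    python -c "import sounddevice; print(sounddevice.query_devices())"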

Output

The output of the script is printed to the console. You see messages indicating the status of the connection, the audio stream, and playback. The audio plays back through your speakers or headphones.

Session update sent: {"type": "session.update", "session": {"instructions": "You are a helpful AI assistant responding in natural, engaging language.", "turn_detection": {"type": "azure_semantic_vad", "threshold": 0.3, "prefix_padding_ms": 200, "silence_duration_ms": 200, "remove_filler_words": false, "end_of_utterance_detection": {"model": "semantic_detection_v1", "threshold": 0.01, "timeout": 2}}, "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"}, "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}, "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard", "temperature": 0.8}}, "event_id": ""}
Starting the chat ...
Received event: session.created
Press 'q' and Enter to quit the chat.
Received event: session.updated
Received event: input_audio_buffer.speech_started
Received event: input_audio_buffer.speech_stopped
Received event: input_audio_buffer.committed
Received event: conversation.item.input_audio_transcription.completed
Received event: conversation.item.created
Received event: response.created
Received event: response.output_item.added
Received event: conversation.item.created
Received event: response.content_part.added
Received event: response.audio_transcript.delta
Received event: response.audio_transcript.delta
Received event: response.audio_transcript.delta
REDACTED FOR BREVITY
Received event: response.audio.delta
Received event: response.audio.delta
Received event: response.audio.delta
q
Received event: response.audio.delta
Received event: response.audio.delta
Received event: response.audio.delta
Received event: response.audio.delta
Received event: response.audio.delta
Quitting the chat...
Received event: response.audio.delta
Received event: response.audio.delta
REDACTED FOR BREVITY
Received event: response.audio.delta
Received event: response.audio.delta
Chat done.

When you run the script, it creates a log file named voicelive.log in the same directory as the script.

logging.basicConfig(
    filename='voicelive.log',
    filemode="w",
    level=logging.DEBUG,
    format='%(asctime)s:%(name)s:%(levelname)s:%(message)s'
)

The log file contains information about the connection to the Voice Live API, including the request and response data. You can view the log file to see the details of the conversation.

2025-05-09 06:56:06,821:websockets.client:DEBUG:= connection is CONNECTING
2025-05-09 06:56:07,101:websockets.client:DEBUG:> GET /voice-live/realtime?api-version=2025-05-01-preview&model=gpt-4o HTTP/1.1
<REDACTED FOR BREVITY>
2025-05-09 06:56:07,551:websockets.client:DEBUG:= connection is OPEN
2025-05-09 06:56:07,551:websockets.client:DEBUG:< TEXT '{"event_id":"event_5a7NVdtNBVX9JZVuPc9nYK","typ...es":null,"agent":null}}' [1475 bytes]
2025-05-09 06:56:07,552:websockets.client:DEBUG:> TEXT '{"type": "session.update", "session": {"turn_de....8}}, "event_id": null}' [551 bytes]
2025-05-09 06:56:07,557:__main__:INFO:Starting audio stream ...
2025-05-09 06:56:07,810:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,824:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,844:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,874:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,874:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,905:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,926:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,954:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,954:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...///7/", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,974:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:08,004:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:08,035:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:08,035:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
<REDACTED FOR BREVITY>
2025-05-09 06:56:42,957:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAP//", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:42,984:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+/wAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,005:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": .../////", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,034:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+////", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,034:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...CAAMA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,055:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...CAAIA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,084:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,114:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...9//3/", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,114:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...DAAMA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,134:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAIA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,165:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,184:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+//7/", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,214:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": .../////", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,214:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+/wAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,245:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAIA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,264:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAP//", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,295:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,295:websockets.client:DEBUG:> CLOSE 1000 (OK) [2 bytes]
2025-05-09 06:56:43,297:websockets.client:DEBUG:= connection is CLOSING
2025-05-09 06:56:43,346:__main__:INFO:Audio stream closed.
2025-05-09 06:56:43,388:__main__:INFO:Playback done.
2025-05-09 06:56:44,512:websockets.client:DEBUG:< CLOSE 1000 (OK) [2 bytes]
2025-05-09 06:56:44,514:websockets.client:DEBUG:< EOF
2025-05-09 06:56:44,514:websockets.client:DEBUG:> EOF
2025-05-09 06:56:44,514:websockets.client:DEBUG:= connection is CLOSED
2025-05-09 06:56:44,514:websockets.client:DEBUG:x closing TCP connection