Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see the Supplemental Terms of Use for Microsoft Azure Previews.
Prerequisites

- An Azure subscription. Create one for free.
- Python 3.8 or later. We recommend Python 3.10 or later; Python 3.8 is the minimum supported version. If you don't have a suitable version of Python installed, you can follow the instructions in the VS Code Python Tutorial for the easiest way of installing Python on your operating system.
- An Azure AI Foundry resource created in one of the supported regions. For more information about region availability, see the Voice Live API overview documentation.
Tip

To use the Voice Live API, you don't need to deploy an audio model with your Azure AI Foundry resource. The Voice Live API is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live API overview documentation.
Microsoft Entra ID prerequisites

For the recommended keyless authentication with Microsoft Entra ID, you need to:

- Install the Azure CLI used for keyless authentication with Microsoft Entra ID.
- Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment, or from the command line as shown in the sketch after this list.
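If you prefer the command line, a role assignment sketch with the Azure CLI follows. The subscription ID, resource group, and resource name are placeholders that you need to replace with your own values:

```console
az role assignment create \
    --assignee "<your-user-email-or-object-id>" \
    --role "Cognitive Services User" \
    --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<foundry-resource-name>"
```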
Set up

Create a new folder voice-live-quickstart and go to the quickstart folder with the following command:

```console
mkdir voice-live-quickstart && cd voice-live-quickstart
```
Create a virtual environment. If you already have Python 3.10 or higher installed, you can create a virtual environment using the following commands:

```console
py -3 -m venv .venv
.venv\scripts\activate
```
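The commands above are for Windows. On macOS or Linux, the equivalent commands are:

```console
python3 -m venv .venv
source .venv/bin/activate
```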
Activating the Python environment means that when you run python or pip from the command line, you use the Python interpreter contained in the .venv folder of your application. You can use the deactivate command to exit the Python virtual environment, and can later reactivate it when needed.

Tip

We recommend that you create and activate a new Python environment to use to install the packages you need for this tutorial. Don't install packages into your global Python installation. You should always use a virtual or conda environment when installing Python packages; otherwise, you can break your global installation of Python.
Create a file named requirements.txt. Add the following packages to the file:

```text
aiohttp==3.11.18
azure-core==1.34.0
azure-identity==1.22.0
certifi==2025.4.26
cffi==1.17.1
cryptography==44.0.3
numpy==2.2.5
pycparser==2.22
python-dotenv==1.1.0
requests==2.32.3
sounddevice==0.5.1
typing_extensions==4.13.2
urllib3==2.4.0
websockets==15.0.1
```
Install the packages:

```console
pip install -r requirements.txt
```
For the recommended keyless authentication with Microsoft Entra ID, install the azure-identity package with:

```console
pip install azure-identity
```
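Before running the full sample, you can sanity-check that DefaultAzureCredential can acquire a token for the same Cognitive Services scope the quickstart uses. This is a minimal sketch, assuming you already signed in with az login; the file name check_auth.py is just a suggestion, not part of the quickstart:

```python
# check_auth.py - hypothetical helper to verify keyless authentication works.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Same scope the quickstart requests for the Voice Live API.
token = credential.get_token("https://cognitiveservices.azure.com/.default")
print("Token acquired; expires at (Unix time):", token.expires_on)
```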
Retrieve resource information

You need to retrieve the following information to authenticate your application with your Azure AI Foundry resource:

| Variable name | Value |
|---|---|
| `AZURE_VOICE_LIVE_ENDPOINT` | This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal. |
| `VOICE_LIVE_MODEL` | The model you want to use. For example, `gpt-4o` or `gpt-4o-mini-realtime-preview`. For more information about model availability, see the Voice Live API overview documentation. |
| `AZURE_VOICE_LIVE_API_VERSION` | The API version you want to use. For example, `2025-05-01-preview`. |
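The sample code calls load_dotenv(), so instead of exporting these values in your shell you can keep them in a .env file in the quickstart folder. A minimal sketch (the endpoint is a placeholder; AZURE_VOICE_LIVE_API_KEY is only needed if you don't use the recommended keyless authentication):

```text
AZURE_VOICE_LIVE_ENDPOINT=https://your-endpoint.azure.com/
VOICE_LIVE_MODEL=gpt-4o
AZURE_VOICE_LIVE_API_VERSION=2025-05-01-preview
# Only needed for API-key authentication:
# AZURE_VOICE_LIVE_API_KEY=your-api-key
```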
Start a conversation

Create the voice-live-quickstart.py file with the following code:

```python
from __future__ import annotations

import os
import uuid
import json
import asyncio
import base64
import logging
import threading
import numpy as np
import sounddevice as sd

from collections import deque
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials_async import AsyncTokenCredential
from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider
from typing import Dict, Union, Literal, Set
from typing_extensions import AsyncIterator, TypedDict, Required
from websockets.asyncio.client import connect as ws_connect
from websockets.asyncio.client import ClientConnection as AsyncWebsocket
from websockets.asyncio.client import HeadersLike
from websockets.typing import Data
from websockets.exceptions import WebSocketException

# This is the main function to run the Voice Live API client.

async def main() -> None:
    # Set environment variables or edit the corresponding values here.
    endpoint = os.environ.get("AZURE_VOICE_LIVE_ENDPOINT") or "https://your-endpoint.azure.com/"
    model = os.environ.get("VOICE_LIVE_MODEL") or "gpt-4o"
    api_version = os.environ.get("AZURE_VOICE_LIVE_API_VERSION") or "2025-05-01-preview"
    api_key = os.environ.get("AZURE_VOICE_LIVE_API_KEY") or "your_api_key"

    # For the recommended keyless authentication, get and
    # use the Microsoft Entra token instead of api_key:
    scopes = "https://cognitiveservices.azure.com/.default"
    credential = DefaultAzureCredential()
    token = await credential.get_token(scopes)

    client = AsyncAzureVoiceLive(
        azure_endpoint = endpoint,
        api_version = api_version,
        token = token.token,
        #api_key = api_key,
    )
    async with client.connect(model = model) as connection:
        session_update = {
            "type": "session.update",
            "session": {
                "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
                "turn_detection": {
                    "type": "azure_semantic_vad",
                    "threshold": 0.3,
                    "prefix_padding_ms": 200,
                    "silence_duration_ms": 200,
                    "remove_filler_words": False,
                    "end_of_utterance_detection": {
                        "model": "semantic_detection_v1",
                        "threshold": 0.01,
                        "timeout": 2,
                    },
                },
                "input_audio_noise_reduction": {
                    "type": "azure_deep_noise_suppression"
                },
                "input_audio_echo_cancellation": {
                    "type": "server_echo_cancellation"
                },
                "voice": {
                    "name": "en-US-Ava:DragonHDLatestNeural",
                    "type": "azure-standard",
                    "temperature": 0.8,
                },
            },
            "event_id": ""
        }
        await connection.send(json.dumps(session_update))
        print("Session created: ", json.dumps(session_update))

        # Run microphone capture, audio playback, and keyboard input concurrently.
        send_task = asyncio.create_task(listen_and_send_audio(connection))
        receive_task = asyncio.create_task(receive_audio_and_playback(connection))
        keyboard_task = asyncio.create_task(read_keyboard_and_quit())

        print("Starting the chat ...")

        await asyncio.wait([send_task, receive_task, keyboard_task], return_when=asyncio.FIRST_COMPLETED)

        send_task.cancel()
        receive_task.cancel()
        print("Chat done.")

# --- End of Main Function ---

logger = logging.getLogger(__name__)

AUDIO_SAMPLE_RATE = 24000

class AsyncVoiceLiveConnection:
    _connection: AsyncWebsocket

    def __init__(self, url: str, additional_headers: HeadersLike) -> None:
        self._url = url
        self._additional_headers = additional_headers
        self._connection = None

    async def __aenter__(self) -> AsyncVoiceLiveConnection:
        try:
            self._connection = await ws_connect(self._url, additional_headers=self._additional_headers)
        except WebSocketException as e:
            raise ValueError(f"Failed to establish a WebSocket connection: {e}")
        return self

    async def __aexit__(self, exc_type, exc_value, traceback) -> None:
        if self._connection:
            await self._connection.close()
            self._connection = None

    enter = __aenter__
    close = __aexit__

    async def __aiter__(self) -> AsyncIterator[Data]:
        async for data in self._connection:
            yield data

    async def recv(self) -> Data:
        return await self._connection.recv()

    async def recv_bytes(self) -> bytes:
        return await self._connection.recv()

    async def send(self, message: Data) -> None:
        await self._connection.send(message)

class AsyncAzureVoiceLive:
    def __init__(
        self,
        *,
        azure_endpoint: str | None = None,
        api_version: str | None = None,
        token: str | None = None,
        api_key: str | None = None,
    ) -> None:
        self._azure_endpoint = azure_endpoint
        self._api_version = api_version
        self._token = token
        self._api_key = api_key
        self._connection = None

    def connect(self, model: str) -> AsyncVoiceLiveConnection:
        if self._connection is not None:
            raise ValueError("Already connected to the Voice Live API.")
        if not model:
            raise ValueError("Model name is required.")

        url = f"{self._azure_endpoint.rstrip('/')}/voice-live/realtime?api-version={self._api_version}&model={model}"
        url = url.replace("https://", "wss://")

        auth_header = {"Authorization": f"Bearer {self._token}"} if self._token else {"api-key": self._api_key}
        request_id = uuid.uuid4()
        headers = {"x-ms-client-request-id": str(request_id), **auth_header}

        self._connection = AsyncVoiceLiveConnection(
            url,
            additional_headers=headers,
        )
        return self._connection

class AudioPlayerAsync:
    def __init__(self):
        self.queue = deque()
        self.lock = threading.Lock()
        self.stream = sd.OutputStream(
            callback=self.callback,
            samplerate=AUDIO_SAMPLE_RATE,
            channels=1,
            dtype=np.int16,
            blocksize=2400,
        )
        self.playing = False

    def callback(self, outdata, frames, time, status):
        if status:
            logger.warning(f"Stream status: {status}")
        with self.lock:
            data = np.empty(0, dtype=np.int16)
            # Drain queued audio until this callback's buffer is full.
            while len(data) < frames and len(self.queue) > 0:
                item = self.queue.popleft()
                frames_needed = frames - len(data)
                data = np.concatenate((data, item[:frames_needed]))
                if len(item) > frames_needed:
                    self.queue.appendleft(item[frames_needed:])
            # Pad with silence if there isn't enough queued audio.
            if len(data) < frames:
                data = np.concatenate((data, np.zeros(frames - len(data), dtype=np.int16)))
        outdata[:] = data.reshape(-1, 1)

    def add_data(self, data: bytes):
        with self.lock:
            np_data = np.frombuffer(data, dtype=np.int16)
            self.queue.append(np_data)
            if not self.playing and len(self.queue) > 10:
                self.start()

    def start(self):
        if not self.playing:
            self.playing = True
            self.stream.start()

    def stop(self):
        with self.lock:
            self.queue.clear()
        self.playing = False
        self.stream.stop()

    def terminate(self):
        with self.lock:
            self.queue.clear()
        self.stream.stop()
        self.stream.close()

async def listen_and_send_audio(connection: AsyncVoiceLiveConnection) -> None:
    logger.info("Starting audio stream ...")

    stream = sd.InputStream(channels=1, samplerate=AUDIO_SAMPLE_RATE, dtype="int16")
    try:
        stream.start()
        read_size = int(AUDIO_SAMPLE_RATE * 0.02)  # 20 ms of audio per chunk
        while True:
            if stream.read_available >= read_size:
                data, _ = stream.read(read_size)
                audio = base64.b64encode(data).decode("utf-8")
                param = {"type": "input_audio_buffer.append", "audio": audio, "event_id": ""}
                data_json = json.dumps(param)
                await connection.send(data_json)
    except Exception as e:
        logger.error(f"Audio stream interrupted. {e}")
    finally:
        stream.stop()
        stream.close()
        logger.info("Audio stream closed.")

async def receive_audio_and_playback(connection: AsyncVoiceLiveConnection) -> None:
    last_audio_item_id = None
    audio_player = AudioPlayerAsync()

    logger.info("Starting audio playback ...")
    try:
        while True:
            async for raw_event in connection:
                event = json.loads(raw_event)
                print(f"Received event:", {event.get("type")})
                if event.get("type") == "session.created":
                    session = event.get("session")
                    logger.info(f"Session created: {session.get('id')}")
                elif event.get("type") == "response.audio.delta":
                    if event.get("item_id") != last_audio_item_id:
                        last_audio_item_id = event.get("item_id")
                    bytes_data = base64.b64decode(event.get("delta", ""))
                    audio_player.add_data(bytes_data)
                elif event.get("type") == "error":
                    error_details = event.get("error", {})
                    error_type = error_details.get("type", "Unknown")
                    error_code = error_details.get("code", "Unknown")
                    error_message = error_details.get("message", "No message provided")
                    raise ValueError(f"Error received: Type={error_type}, Code={error_code}, Message={error_message}")
    except Exception as e:
        logger.error(f"Error in audio playback: {e}")
    finally:
        audio_player.terminate()
        logger.info("Playback done.")

async def read_keyboard_and_quit() -> None:
    print("Press 'q' and Enter to quit the chat.")
    while True:
        # Run input() in a thread to avoid blocking the event loop
        user_input = await asyncio.to_thread(input)
        if user_input.strip().lower() == 'q':
            print("Quitting the chat...")
            break

if __name__ == "__main__":
    try:
        logging.basicConfig(
            filename='voicelive.log',
            filemode="w",
            level=logging.DEBUG,
            format='%(asctime)s:%(name)s:%(levelname)s:%(message)s'
        )
        load_dotenv()
        asyncio.run(main())
    except Exception as e:
        print(f"Error: {e}")
```
Sign in to Azure with the following command:

```console
az login
```

Run the Python file:

```console
python voice-live-quickstart.py
```
The Voice Live API starts to return audio with the model's initial response. You can interrupt the model by speaking. Enter "q" to quit the conversation.
Output

The output of the script is printed to the console. You see messages that indicate the status of the connection, the audio stream, and playback. The audio plays through your speakers or headphones.
```text
Session created: {"type": "session.update", "session": {"instructions": "You are a helpful AI assistant responding in natural, engaging language.","turn_detection": {"type": "azure_semantic_vad", "threshold": 0.3, "prefix_padding_ms": 200, "silence_duration_ms": 200, "remove_filler_words": false, "end_of_utterance_detection": {"model": "semantic_detection_v1", "threshold": 0.1, "timeout": 4}}, "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"}, "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}, "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard", "temperature": 0.8}}, "event_id": ""}
Starting the chat ...
Received event: {'session.created'}
Press 'q' and Enter to quit the chat.
Received event: {'session.updated'}
Received event: {'input_audio_buffer.speech_started'}
Received event: {'input_audio_buffer.speech_stopped'}
Received event: {'input_audio_buffer.committed'}
Received event: {'conversation.item.input_audio_transcription.completed'}
Received event: {'conversation.item.created'}
Received event: {'response.created'}
Received event: {'response.output_item.added'}
Received event: {'conversation.item.created'}
Received event: {'response.content_part.added'}
Received event: {'response.audio_transcript.delta'}
Received event: {'response.audio_transcript.delta'}
Received event: {'response.audio_transcript.delta'}
REDACTED FOR BREVITY
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
q
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
Quitting the chat...
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
REDACTED FOR BREVITY
Received event: {'response.audio.delta'}
Received event: {'response.audio.delta'}
Chat done.
```
The script that you ran creates a log file named voicelive.log in the same directory as the script:

```python
logging.basicConfig(
    filename='voicelive.log',
    filemode="w",
    level=logging.DEBUG,
    format='%(asctime)s:%(name)s:%(levelname)s:%(message)s'
)
```
The log file contains information about the connection to the Voice Live API, including the request and response data. You can view the log file to see the details of the conversation.
```text
2025-05-09 06:56:06,821:websockets.client:DEBUG:= connection is CONNECTING
2025-05-09 06:56:07,101:websockets.client:DEBUG:> GET /voice-live/realtime?api-version=2025-05-01-preview&model=gpt-4o HTTP/1.1
<REDACTED FOR BREVITY>
2025-05-09 06:56:07,551:websockets.client:DEBUG:= connection is OPEN
2025-05-09 06:56:07,551:websockets.client:DEBUG:< TEXT '{"event_id":"event_5a7NVdtNBVX9JZVuPc9nYK","typ...es":null,"agent":null}}' [1475 bytes]
2025-05-09 06:56:07,552:websockets.client:DEBUG:> TEXT '{"type": "session.update", "session": {"turn_de....8}}, "event_id": null}' [551 bytes]
2025-05-09 06:56:07,557:__main__:INFO:Starting audio stream ...
2025-05-09 06:56:07,810:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,824:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,844:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,874:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,874:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,905:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,926:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,954:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,954:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...///7/", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:07,974:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:08,004:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:08,035:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:08,035:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
<REDACTED FOR BREVITY>
2025-05-09 06:56:42,957:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAP//", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:42,984:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+/wAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,005:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": .../////", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,034:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+////", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,034:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...CAAMA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,055:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...CAAIA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,084:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,114:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...9//3/", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,114:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...DAAMA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,134:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAIA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,165:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,184:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+//7/", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,214:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": .../////", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,214:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+/wAA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,245:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAIA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,264:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAP//", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,295:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAEA", "event_id": ""}' [1346 bytes]
2025-05-09 06:56:43,295:websockets.client:DEBUG:> CLOSE 1000 (OK) [2 bytes]
2025-05-09 06:56:43,297:websockets.client:DEBUG:= connection is CLOSING
2025-05-09 06:56:43,346:__main__:INFO:Audio stream closed.
2025-05-09 06:56:43,388:__main__:INFO:Playback done.
2025-05-09 06:56:44,512:websockets.client:DEBUG:< CLOSE 1000 (OK) [2 bytes]
2025-05-09 06:56:44,514:websockets.client:DEBUG:< EOF
2025-05-09 06:56:44,514:websockets.client:DEBUG:> EOF
2025-05-09 06:56:44,514:websockets.client:DEBUG:= connection is CLOSED
2025-05-09 06:56:44,514:websockets.client:DEBUG:x closing TCP connection
2025-05-09 06:56:44,514:asyncio:ERROR:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x00000266DD8E5400>
```
Related content

- Learn more about how to use the Voice Live API
- See the Azure OpenAI Realtime API reference