リアルタイム文字起こし機能をアプリケーションに追加する

2025-06-06

このガイドは、Call Automation SDK を介してリアルタイム文字起こし機能を提供する Azure Communication Services の、さまざまな使用方法について理解を深めるのに役立ちます。

前提条件

アクティブなサブスクリプションを持つ Azure アカウント。詳細については、アカウントの無料作成に関するページを参照してください。
Azure Communication Services リソース。「Azure Communication Services リソースを作成する」を参照してください。
Azure AI サービスを作成し、Azure Communication Services リソースに接続します。
Azure AI サービスリソースのカスタムサブドメインを作成します。
Call Automation SDK を使用して新しい Web サービスアプリケーションを作成する。

WebSocket サーバーを設定する

Azure Communication Services を利用するには、お使いのサーバーアプリケーションにリアルタイムで文字起こしをストリーミングする WebSocket サーバーが設定されている必要があります。 WebSocket は、単一の TCP 接続で全二重通信チャネルを提供する標準化されたプロトコルです。必要に応じて、WebSocket 接続経由で文字起こしを受信するアプリケーションを作成できる Azure サービス Azure WebApps を使用できます。このクイックスタートに従ってください。

通話を確立する

このクイックスタートでは、呼び出しの開始に既に慣れていることを前提としています。通話の開始と確立の詳細については、クイックスタートに従ってください。このクイックスタートでは、着信呼び出しと発信呼び出しの両方に対して文字起こしを開始するプロセスについて説明します。

リアルタイム文字起こしを使用する場合、文字起こしを開始するタイミングと方法に関するオプションがいくつかあります。

オプション 1 - 呼び出しの応答時または作成時に開始する

オプション 2 - 呼び出しが進行中に文字起こしを開始する

オプション 3 - Azure Communication Services Rooms 通話に接続するときに文字起こしを開始する

このチュートリアルでは、呼び出しの進行中または Rooms 通話への接続時に文字起こしを開始するオプション 2 と 3 について説明します。既定では、"startTranscription" は呼び出しの応答時または作成時に false に設定されています。

呼び出しを作成して文字起こしの詳細を指定する

ACS の TranscriptionOptions を定義して文字起こしを開始するタイミングを指定し、文字起こしのロケールと、文字起こしを送信するための Web ソケット接続を指定します。

var createCallOptions = new CreateCallOptions(callInvite, callbackUri)
{
    CallIntelligenceOptions = new CallIntelligenceOptions() { CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint) },
    TranscriptionOptions = new TranscriptionOptions(new Uri(""), "en-US", false, TranscriptionTransport.Websocket)
};
CreateCallResult createCallResult = await callAutomationClient.CreateCallAsync(createCallOptions);

Rooms 通話に接続し、文字起こしの詳細を指定する

ACS ルームに接続している場合、文字起こしを使用するには、次のように文字起こしオプションを構成します。

var transcriptionOptions = new TranscriptionOptions(
    transportUri: new Uri(""),
    locale: "en-US", 
    startTranscription: false,
    transcriptionTransport: TranscriptionTransport.Websocket,
    //Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    SpeechRecognitionModelEndpointId = "YourCustomSpeechRecognitionModelEndpointId"
);

var connectCallOptions = new ConnectCallOptions(new RoomCallLocator("roomId"), callbackUri)
{
    CallIntelligenceOptions = new CallIntelligenceOptions() 
    { 
        CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint) 
    },
    TranscriptionOptions = transcriptionOptions
};

var connectResult = await client.ConnectCallAsync(connectCallOptions);

文字起こしを開始する

文字起こしを開始する準備ができたら、Call Automation を明示的に呼び出して、呼び出しの文字起こしを開始できます。

// Start transcription with options
StartTranscriptionOptions options = new StartTranscriptionOptions()
{
    OperationContext = "startMediaStreamingContext",
    //Locale = "en-US",
    //Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    SpeechRecognitionModelEndpointId = "YourCustomSpeechRecognitionModelEndpointId"
};

await callMedia.StartTranscriptionAsync(options);

// Alternative: Start transcription without options
// await callMedia.StartTranscriptionAsync();

その他のヘッダー:

トレースの x-ms-call-correlation-id と x-ms-call-connection-idを向上させるために、関連付け ID と呼び出し接続 ID が WebSocket ヘッダーに含まれるようになりました。

文字起こしストリームの受信

文字起こしが始まると、WebSocket は文字起こしメタデータのペイロードを最初のパケットとして受け取ります。

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

文字起こしデータの受信

メタデータの次に Web ソケットが受信するパケットは、文字起こしされたオーディオの TranscriptionData になります。

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Web ソケットサーバーでの文字起こしストリームの処理

using WebServerApi;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

var app = builder.Build();
app.UseWebSockets();
app.Map("/ws", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
        await HandleWebSocket.Echo(webSocket);
    }
    else
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
    }
});

app.Run();

WebSocket ハンドラーでのコードの更新

using Azure.Communication.CallAutomation;
using System.Net.WebSockets;
using System.Text;

namespace WebServerApi
{
    public class HandleWebSocket
    {
        public static async Task Echo(WebSocket webSocket)
        {
            var buffer = new byte[1024 * 4];
            var receiveResult = await webSocket.ReceiveAsync(
                new ArraySegment(buffer), CancellationToken.None);

            while (!receiveResult.CloseStatus.HasValue)
            {
                string msg = Encoding.UTF8.GetString(buffer, 0, receiveResult.Count);
                var response = StreamingDataParser.Parse(msg);

                if (response != null)
                {
                    if (response is AudioMetadata audioMetadata)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("MEDIA SUBSCRIPTION ID-->"+audioMetadata.MediaSubscriptionId);
                        Console.WriteLine("ENCODING-->"+audioMetadata.Encoding);
                        Console.WriteLine("SAMPLE RATE-->"+audioMetadata.SampleRate);
                        Console.WriteLine("CHANNELS-->"+audioMetadata.Channels);
                        Console.WriteLine("LENGTH-->"+audioMetadata.Length);
                        Console.WriteLine("***************************************************************************************");
                    }
                    if (response is AudioData audioData)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("DATA-->"+audioData.Data);
                        Console.WriteLine("TIMESTAMP-->"+audioData.Timestamp);
                        Console.WriteLine("IS SILENT-->"+audioData.IsSilent);
                        Console.WriteLine("***************************************************************************************");
                    }

                    if (response is TranscriptionMetadata transcriptionMetadata)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("TRANSCRIPTION SUBSCRIPTION ID-->"+transcriptionMetadata.TranscriptionSubscriptionId);
                        Console.WriteLine("LOCALE-->"+transcriptionMetadata.Locale);
                        Console.WriteLine("CALL CONNECTION ID--?"+transcriptionMetadata.CallConnectionId);
                        Console.WriteLine("CORRELATION ID-->"+transcriptionMetadata.CorrelationId);
                        Console.WriteLine("***************************************************************************************");
                    }
                    if (response is TranscriptionData transcriptionData)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("TEXT-->"+transcriptionData.Text);
                        Console.WriteLine("FORMAT-->"+transcriptionData.Format);
                        Console.WriteLine("OFFSET-->"+transcriptionData.Offset);
                        Console.WriteLine("DURATION-->"+transcriptionData.Duration);
                        Console.WriteLine("PARTICIPANT-->"+transcriptionData.Participant.RawId);
                        Console.WriteLine("CONFIDENCE-->"+transcriptionData.Confidence);

                        foreach (var word in transcriptionData.Words)
                        {
                            Console.WriteLine("TEXT-->"+word.Text);
                            Console.WriteLine("OFFSET-->"+word.Offset);
                            Console.WriteLine("DURATION-->"+word.Duration);
                        }
                        Console.WriteLine("***************************************************************************************");
                    }
                }

                await webSocket.SendAsync(
                    new ArraySegment(buffer, 0, receiveResult.Count),
                    receiveResult.MessageType,
                    receiveResult.EndOfMessage,
                    CancellationToken.None);

                receiveResult = await webSocket.ReceiveAsync(
                    new ArraySegment(buffer), CancellationToken.None);
            }

            await webSocket.CloseAsync(
                receiveResult.CloseStatus.Value,
                receiveResult.CloseStatusDescription,
                CancellationToken.None);
        }
    }
}

文字起こしを更新する

アプリケーションでユーザーが希望する言語を選択できる場合は、その言語で文字起こしをキャプチャすることもできます。これを行うには、Call Automation SDK を使用して文字起こしロケールを更新できます。

UpdateTranscriptionOptions updateTranscriptionOptions = new UpdateTranscriptionOptions(locale)
{
OperationContext = "UpdateTranscriptionContext",
//Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
SpeechRecognitionModelEndpointId = "YourCustomSpeechRecognitionModelEndpointId"
};

await client.GetCallConnection(callConnectionId).GetCallMedia().UpdateTranscriptionAsync(updateTranscriptionOptions);

文字起こしを停止する

アプリケーションで文字起こしの聞き取りを停止する必要がある場合は、StopTranscription 要求を使用して、Call Automation から Web ソケットへの文字起こしデータの送信を停止することができます。

StopTranscriptionOptions stopOptions = new StopTranscriptionOptions()
{
    OperationContext = "stopTranscription"
};

await callMedia.StopTranscriptionAsync(stopOptions);

呼び出しを作成して文字起こしの詳細を指定する

ACS の TranscriptionOptions を定義して、文字起こしを開始するタイミング、文字起こしのロケール、文字起こしを送信するための Web ソケット接続を指定します。

CallInvite callInvite = new CallInvite(target, caller); 

CallIntelligenceOptions callIntelligenceOptions = new CallIntelligenceOptions()
    .setCognitiveServicesEndpoint(appConfig.getCognitiveServiceEndpoint()); 

TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
    appConfig.getWebSocketUrl(), 
    TranscriptionTransport.WEBSOCKET, 
    "en-US", 
    false,
    "your-endpoint-id-here" // speechRecognitionEndpointId
); 

CreateCallOptions createCallOptions = new CreateCallOptions(callInvite, appConfig.getCallBackUri());
createCallOptions.setCallIntelligenceOptions(callIntelligenceOptions); 
createCallOptions.setTranscriptionOptions(transcriptionOptions); 

Response result = client.createCallWithResponse(createCallOptions, Context.NONE); 
return result.getValue().getCallConnectionProperties().getCallConnectionId();

Rooms 通話に接続し、文字起こしの詳細を指定する

ACS ルームに接続している場合、文字起こしを使用するには、次のように文字起こしオプションを構成します。

TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
    appConfig.getWebSocketUrl(), 
    TranscriptionTransport.WEBSOCKET, 
    "en-US", 
    false,
    "your-endpoint-id-here" // speechRecognitionEndpointId
);

ConnectCallOptions connectCallOptions = new ConnectCallOptions(new RoomCallLocator("roomId"), appConfig.getCallBackUri())
    .setCallIntelligenceOptions(
        new CallIntelligenceOptions()
            .setCognitiveServicesEndpoint(appConfig.getCognitiveServiceEndpoint())
    )
    .setTranscriptionOptions(transcriptionOptions);

ConnectCallResult connectCallResult = Objects.requireNonNull(client
    .connectCallWithResponse(connectCallOptions)
    .block())
    .getValue();

文字起こしを開始する

文字起こしを開始する準備ができたら、Call Automation を明示的に呼び出して、呼び出しの文字起こしを開始できます。

//Option 1: Start transcription with options
StartTranscriptionOptions transcriptionOptions = new StartTranscriptionOptions()
    .setOperationContext("startMediaStreamingContext"); 

client.getCallConnection(callConnectionId)
    .getCallMedia()
    .startTranscriptionWithResponse(transcriptionOptions, Context.NONE); 

// Alternative: Start transcription without options
// client.getCallConnection(callConnectionId)
//     .getCallMedia()
//     .startTranscription();

その他のヘッダー:

文字起こしストリームの受信

文字起こしが始まると、WebSocket は文字起こしメタデータのペイロードを最初のパケットとして受け取ります。

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

文字起こしデータの受信

メタデータの次に Web ソケットが受信するパケットは、文字起こしされたオーディオの TranscriptionData になります。

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Web ソケットサーバーでの文字起こしストリームの処理

package com.example;

import org.glassfish.tyrus.server.Server;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class App {
    public static void main(String[] args) {
        Server server = new Server("localhost", 8081, "/ws", null, WebSocketServer.class);

        try {
            server.start();
            System.out.println("Web socket running on port 8081...");
            System.out.println("wss://localhost:8081/ws/server");
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            reader.readLine();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            server.stop();
        }
    }
}

WebSocket ハンドラーでのコードの更新

package com.example;

import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

import com.azure.communication.callautomation.models.streaming.StreamingData;
import com.azure.communication.callautomation.models.streaming.StreamingDataParser;
import com.azure.communication.callautomation.models.streaming.media.AudioData;
import com.azure.communication.callautomation.models.streaming.media.AudioMetadata;
import com.azure.communication.callautomation.models.streaming.transcription.TranscriptionData;
import com.azure.communication.callautomation.models.streaming.transcription.TranscriptionMetadata;
import com.azure.communication.callautomation.models.streaming.transcription.Word;

@ServerEndpoint("/server")
public class WebSocketServer {
    @OnMessage
    public void onMessage(String message, Session session) {
        StreamingData data = StreamingDataParser.parse(message);

        if (data instanceof AudioMetadata) {
            AudioMetadata audioMetaData = (AudioMetadata) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("SUBSCRIPTION ID: --> " + audioMetaData.getMediaSubscriptionId());
            System.out.println("ENCODING: --> " + audioMetaData.getEncoding());
            System.out.println("SAMPLE RATE: --> " + audioMetaData.getSampleRate());
            System.out.println("CHANNELS: --> " + audioMetaData.getChannels());
            System.out.println("LENGTH: --> " + audioMetaData.getLength());
            System.out.println("----------------------------------------------------------------");
        }

        if (data instanceof AudioData) {
            AudioData audioData = (AudioData) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("DATA: --> " + audioData.getData());
            System.out.println("TIMESTAMP: --> " + audioData.getTimestamp());
            System.out.println("IS SILENT: --> " + audioData.isSilent());
            System.out.println("----------------------------------------------------------------");
        }

        if (data instanceof TranscriptionMetadata) {
            TranscriptionMetadata transcriptionMetadata = (TranscriptionMetadata) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("TRANSCRIPTION SUBSCRIPTION ID: --> " + transcriptionMetadata.getTranscriptionSubscriptionId());
            System.out.println("IS SILENT: --> " + transcriptionMetadata.getLocale());
            System.out.println("CALL CONNECTION ID: --> " + transcriptionMetadata.getCallConnectionId());
            System.out.println("CORRELATION ID: --> " + transcriptionMetadata.getCorrelationId());
            System.out.println("----------------------------------------------------------------");
        }

        if (data instanceof TranscriptionData) {
            TranscriptionData transcriptionData = (TranscriptionData) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("TEXT: --> " + transcriptionData.getText());
            System.out.println("FORMAT: --> " + transcriptionData.getFormat());
            System.out.println("CONFIDENCE: --> " + transcriptionData.getConfidence());
            System.out.println("OFFSET: --> " + transcriptionData.getOffset());
            System.out.println("DURATION: --> " + transcriptionData.getDuration());
            System.out.println("RESULT STATUS: --> " + transcriptionData.getResultStatus());
            for (Word word : transcriptionData.getWords()) {
                System.out.println("Text: --> " + word.getText());
                System.out.println("Offset: --> " + word.getOffset());
                System.out.println("Duration: --> " + word.getDuration());
            }
            System.out.println("----------------------------------------------------------------");
        }
    }
}

文字起こしを更新する

UpdateTranscriptionOptions transcriptionOptions = new UpdateTranscriptionOptions()
    .setLocale(newLocale)
    .setOperationContext("transcriptionContext")
    .setSpeechRecognitionEndpointId("your-endpoint-id-here");

client.getCallConnection(callConnectionId)
    .getCallMedia()
    .updateTranscriptionWithResponse(transcriptionOptions, Context.NONE);

文字起こしを停止する

// Option 1: Stop transcription with options
StopTranscriptionOptions stopTranscriptionOptions = new StopTranscriptionOptions()
    .setOperationContext("stopTranscription");

client.getCallConnection(callConnectionId)
    .getCallMedia()
    .stopTranscriptionWithResponse(stopTranscriptionOptions, Context.NONE);

// Alternative: Stop transcription without options
// client.getCallConnection(callConnectionId)
//     .getCallMedia()
//     .stopTranscription();

呼び出しを作成して文字起こしの詳細を指定する

const transcriptionOptions = {
    transportUrl: "",
    transportType: "websocket",
    locale: "en-US",
    startTranscription: false,
    speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
};

const options = {
    callIntelligenceOptions: {
        cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
    },
    transcriptionOptions: transcriptionOptions
};

console.log("Placing outbound call...");
acsClient.createCall(callInvite, process.env.CALLBACK_URI + "/api/callbacks", options);

Rooms 通話に接続し、文字起こしの詳細を指定する

ACS ルームに接続している場合、文字起こしを使用するには、次のように文字起こしオプションを構成します。

const transcriptionOptions = {
    transportUri: "",
    locale: "en-US",
    transcriptionTransport: "websocket",
    startTranscription: false,
    speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
};

const callIntelligenceOptions = {
    cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
};

const connectCallOptions = {
    callIntelligenceOptions: callIntelligenceOptions,
    transcriptionOptions: transcriptionOptions
};

const callLocator = {
    id: roomId,
    kind: "roomCallLocator"
};

const connectResult = await client.connectCall(callLocator, callBackUri, connectCallOptions);

文字起こしを開始する

文字起こしを開始する準備ができたら、Call Automation を明示的に呼び出して、呼び出しの文字起こしを開始できます。

const startTranscriptionOptions = {
    locale: "en-AU",
    operationContext: "startTranscriptionContext"
};

// Start transcription with options
await callMedia.startTranscription(startTranscriptionOptions);

// Alternative: Start transcription without options
// await callMedia.startTranscription();

その他のヘッダー:

文字起こしストリームの受信

文字起こしが始まると、WebSocket は文字起こしメタデータのペイロードを最初のパケットとして受け取ります。

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

文字起こしデータの受信

メタデータの次に Web ソケットが受信するパケットは、文字起こしされたオーディオの TranscriptionData になります。

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Web ソケットサーバーでの文字起こしストリームの処理

import WebSocket from 'ws';
import { streamingData } from '@azure/communication-call-automation/src/util/streamingDataParser';

const wss = new WebSocket.Server({ port: 8081 });

wss.on('connection', (ws) => {
    console.log('Client connected');

    ws.on('message', (packetData) => {
        const decoder = new TextDecoder();
        const stringJson = decoder.decode(packetData);
        console.log("STRING JSON => " + stringJson);

        const response = streamingData(packetData);
        
        if ('locale' in response) {
            console.log("Transcription Metadata");
            console.log(response.callConnectionId);
            console.log(response.correlationId);
            console.log(response.locale);
            console.log(response.subscriptionId);
        }
        
        if ('text' in response) {
            console.log("Transcription Data");
            console.log(response.text);
            console.log(response.format);
            console.log(response.confidence);
            console.log(response.offset);
            console.log(response.duration);
            console.log(response.resultStatus);

            if ('phoneNumber' in response.participant) {
                console.log(response.participant.phoneNumber);
            }

            response.words.forEach((word) => {
                console.log(word.text);
                console.log(word.duration);
                console.log(word.offset);
            });
        }
    });

    ws.on('close', () => {
        console.log('Client disconnected');
    });
});

console.log('WebSocket server running on port 8081');

文字起こしを更新する

アプリケーションでユーザーが希望する言語を選択できる場合は、その言語で文字起こしをキャプチャすることもできます。このタスクを行うために、Call Automation SDK を使用して文字起こしロケールを更新できます。

async function updateTranscriptionAsync() {
  const options: UpdateTranscriptionOptions = {
operationContext: "updateTranscriptionContext",
speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
  };
  await acsClient
.getCallConnection(callConnectionId)
.getCallMedia()
.updateTranscription("en-au", options);
}

文字起こしを停止する

const stopTranscriptionOptions = {
    operationContext: "stopTranscriptionContext"
};

// Stop transcription with options
await callMedia.stopTranscription(stopTranscriptionOptions);

// Alternative: Stop transcription without options
// await callMedia.stopTranscription();

呼び出しを作成して文字起こしの詳細を指定する

transcription_options = TranscriptionOptions(
    transport_url="WEBSOCKET_URI_HOST",
    transport_type=TranscriptionTransportType.WEBSOCKET,
    locale="en-US",
    start_transcription=False,
    #Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    speech_recognition_model_endpoint_id = "YourCustomSpeechRecognitionModelEndpointId"
);
)

call_connection_properties = call_automation_client.create_call(
    target_participant,
    CALLBACK_EVENTS_URI,
    cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
    source_caller_id_number=source_caller,
    transcription=transcription_options
)

Rooms 通話に接続し、文字起こしの詳細を指定する

ACS ルームに接続している場合、文字起こしを使用するには、次のように文字起こしオプションを構成します。

transcription_options = TranscriptionOptions(
    transport_url="",
    transport_type=TranscriptionTransportType.WEBSOCKET,
    locale="en-US",
    start_transcription=False,
    #Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    speech_recognition_model_endpoint_id = "YourCustomSpeechRecognitionModelEndpointId"
)

connect_result = client.connect_call(
    room_id="roomid",
    CALLBACK_EVENTS_URI,
    cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
    operation_context="connectCallContext",
    transcription=transcription_options
)

文字起こしを開始する

文字起こしを開始する準備ができたら、Call Automation を明示的に呼び出して、呼び出しの文字起こしを開始できます。

# Start transcription without options
call_connection_client.start_transcription()

# Option 1: Start transcription with locale and operation context
# call_connection_client.start_transcription(locale="en-AU", operation_context="startTranscriptionContext")

# Option 2: Start transcription with operation context
# call_connection_client.start_transcription(operation_context="startTranscriptionContext")

その他のヘッダー:

文字起こしストリームの受信

文字起こしが始まると、WebSocket は文字起こしメタデータのペイロードを最初のパケットとして受け取ります。

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

文字起こしデータの受信

メタデータの次に WebSocket が受信するパケットは、文字起こしされたオーディオの TranscriptionData になります。

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Web ソケットサーバーでの文字起こしストリームの処理

import asyncio
import json
import websockets
from azure.communication.callautomation._shared.models import identifier_from_raw_id

async def handle_client(websocket, path):
    print("Client connected")
    try:
        async for message in websocket:
            json_object = json.loads(message)
            kind = json_object['kind']
            if kind == 'TranscriptionMetadata':
                print("Transcription metadata")
                print("-------------------------")
                print("Subscription ID:", json_object['transcriptionMetadata']['subscriptionId'])
                print("Locale:", json_object['transcriptionMetadata']['locale'])
                print("Call Connection ID:", json_object['transcriptionMetadata']['callConnectionId'])
                print("Correlation ID:", json_object['transcriptionMetadata']['correlationId'])
            if kind == 'TranscriptionData':
                participant = identifier_from_raw_id(json_object['transcriptionData']['participantRawID'])
                word_data_list = json_object['transcriptionData']['words']
                print("Transcription data")
                print("-------------------------")
                print("Text:", json_object['transcriptionData']['text'])
                print("Format:", json_object['transcriptionData']['format'])
                print("Confidence:", json_object['transcriptionData']['confidence'])
                print("Offset:", json_object['transcriptionData']['offset'])
                print("Duration:", json_object['transcriptionData']['duration'])
                print("Participant:", participant.raw_id)
                print("Result Status:", json_object['transcriptionData']['resultStatus'])
                for word in word_data_list:
                    print("Word:", word['text'])
                    print("Offset:", word['offset'])
                    print("Duration:", word['duration'])
            
    except websockets.exceptions.ConnectionClosedOK:
        print("Client disconnected")
    except websockets.exceptions.ConnectionClosedError as e:
        print("Connection closed with error: %s", e)
    except Exception as e:
        print("Unexpected error: %s", e)

start_server = websockets.serve(handle_client, "localhost", 8081)

print('WebSocket server running on port 8081')

asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()

文字起こしを更新する

await call_automation_client.get_call_connection(
    call_connection_id=call_connection_id
).update_transcription(
    operation_context="UpdateTranscriptionContext",
    locale="en-au",
    #Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
    speech_recognition_model_endpoint_id = "YourCustomSpeechRecognitionModelEndpointId"
)

文字起こしを停止する

# Stop transcription without options
call_connection_client.stop_transcription()

# Alternative: Stop transcription with operation context
# call_connection_client.stop_transcription(operation_context="stopTranscriptionContext")

イベントコード

イベント	コード	サブコード	メッセージ
書き起こしが開始されました	200	0	アクションは正常に完了しました。
トランスクリプション停止	200	0	アクションは正常に完了しました。
書き起こしが更新されました	200	0	アクションは正常に完了しました。
転写に失敗しました	400	8581	アクションが失敗しました。StreamUrl が無効です。
転写に失敗しました	400	8565	Cognitive Services への不正な要求により、アクションが失敗しました。入力パラメーターを確認してください。
転写に失敗しました	400	8565	Cognitive Services への要求がタイムアウトしたため、アクションが失敗しました。後でもう一度やり直すか、サービスに関する問題がないか確認してください。
転写に失敗しました	400	8605	文字起こし用のカスタム音声認識モデルはサポートされていません。
転写に失敗しました	400	8523	無効な要求です。ロケールがありません。
転写に失敗しました	400	8523	無効な要求。リージョン情報を含むロケールのみがサポートされています。
転写に失敗しました	405	8520	現時点では、文字起こし機能はサポートされていません。
転写に失敗しました	405	8520	UpdateTranscription は、Connect インターフェイスで作成された接続ではサポートされていません。
転写に失敗しました	400	8528	アクションが無効です。呼び出しは既に終了しています。
転写に失敗しました	405	8520	現時点では、文字起こし機能の更新はサポートされていません。
転写に失敗しました	405	8522	呼び出しのセットアップ中に文字起こし URL が設定されていない場合、要求は許可されません。
転写に失敗しました	405	8522	呼び出しのセットアップ中に Cognitive Services 構成が設定されていない場合、要求は許可されません。
転写に失敗しました	400	8501	呼び出しが確立済みの状態でない場合、アクションは無効です。
転写に失敗しました	401	8565	Cognitive Services の認証エラーにより、アクションが失敗しました。認証情報の入力を確認し、正しいことを確認してください。
転写に失敗しました	403	8565	Cognitive Services への禁止された要求により、アクションが失敗しました。サブスクリプションの状態がアクティブであることを確認します。
転写に失敗しました	429	8565	アクションが失敗しました。要求が Cognitive Services サブスクリプションで許可されている同時要求の数を超えました。
転写に失敗しました	500	8578	アクションが失敗しました。WebSocket 接続を確立できませんでした。
転写に失敗しました	500	8580	アクションが失敗しました。文字起こしサービスがシャットダウンされました。
転写に失敗しました	500	8579	アクションが失敗しました。文字起こしが取り消されました。
転写に失敗しました	500	9999	不明な内部サーバーエラー。

既知の問題

クライアント SDK startTranscription = True を使用する ACS ユーザーとの 1 対 1 の通話は、現在サポートされていません。

次の方法で共有

リアルタイム文字起こし機能をアプリケーションに追加する

前提条件

WebSocket サーバーを設定する

通話を確立する

呼び出しを作成して文字起こしの詳細を指定する

Rooms 通話に接続し、文字起こしの詳細を指定する

文字起こしを開始する

その他のヘッダー:

文字起こしストリームの受信

文字起こしデータの受信

Web ソケット サーバーでの文字起こしストリームの処理

WebSocket ハンドラーでのコードの更新

文字起こしを更新する

文字起こしを停止する

呼び出しを作成して文字起こしの詳細を指定する

Rooms 通話に接続し、文字起こしの詳細を指定する

文字起こしを開始する

その他のヘッダー:

文字起こしストリームの受信

文字起こしデータの受信

Web ソケット サーバーでの文字起こしストリームの処理

WebSocket ハンドラーでのコードの更新

文字起こしを更新する

文字起こしを停止する

呼び出しを作成して文字起こしの詳細を指定する

Rooms 通話に接続し、文字起こしの詳細を指定する

文字起こしを開始する

その他のヘッダー:

文字起こしストリームの受信

文字起こしデータの受信

Web ソケット サーバーでの文字起こしストリームの処理

文字起こしを更新する

文字起こしを停止する

呼び出しを作成して文字起こしの詳細を指定する

Rooms 通話に接続し、文字起こしの詳細を指定する

文字起こしを開始する

その他のヘッダー:

文字起こしストリームの受信

文字起こしデータの受信

Web ソケット サーバーでの文字起こしストリームの処理

文字起こしを更新する

文字起こしを停止する

イベント コード

既知の問題

フィードバック

その他のリソース

Web ソケットサーバーでの文字起こしストリームの処理

Web ソケットサーバーでの文字起こしストリームの処理

Web ソケットサーバーでの文字起こしストリームの処理

Web ソケットサーバーでの文字起こしストリームの処理

イベントコード