Using the Azure OpenAI Realtime API, I sometimes get mismatches between the audio and audio_transcript responses
I set up a text-input, audio-output Realtime API session as follows:
// Connect to the Azure OpenAI realtime deployment over WebSocket.
this.translationWs = await OpenAIRealtimeWS.azure(
  new AzureOpenAI({
    apiKey: process.env.AZURE_OPENAI_API_KEY,
    endpoint: process.env.AZURE_OPENAI_ENDPOINT,
    apiVersion: "2024-10-01-preview",
    deployment: "gpt-realtime",
  })
);

// Configure the session as soon as the socket opens.
this.translationWs.socket.on("open", () => {
  this.sendTranslationMessage({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      instructions: myCustomInstructions,
      output_audio_format: "g711_ulaw",
      temperature: 0.6,
    },
  });
});

// Forward each audio chunk as it arrives.
this.translationWs.on("response.audio.delta", (data) => {
  processAudio(data.delta);
});

// Log the final transcript of the spoken response.
this.translationWs.on("response.audio_transcript.done", async (data) => {
  console.log("translation: ", data.transcript);
});
Then, after confirming that I have received the session.updated event, I start sending input text (I instructed it to translate some text):
this.translationWs.send({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_text",
        text: myCustomText,
      },
    ],
  },
});
this.translationWs.send({
  type: "response.create",
  response: { modalities: ["text", "audio"], conversation: "none" },
});
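The wait for session.updated mentioned above can be sketched roughly as follows. This is only a minimal sketch, not code from the SDK: waitForEvent is a hypothetical helper, and it assumes the realtime client exposes Node-style on(event, handler) and off(event, handler) methods.

```javascript
// Hypothetical helper (not part of the openai SDK): resolves with the event
// payload once `client` emits `eventName`, or rejects after `timeoutMs`.
// Assumption: `client` has Node-style on(event, handler) / off(event, handler).
function waitForEvent(client, eventName, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const onEvent = (event) => {
      clearTimeout(timer);
      client.off(eventName, onEvent); // stop listening after the first match
      resolve(event);
    };
    const timer = setTimeout(() => {
      client.off(eventName, onEvent);
      reject(new Error(`Timed out waiting for ${eventName}`));
    }, timeoutMs);
    client.on(eventName, onEvent);
  });
}
```

The promise would need to be created before session.update is sent and awaited afterwards, so that the event cannot arrive before the listener is registered.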
Sometimes, the audio I hear from the "response.audio.delta" events does not match the text I get from "response.audio_transcript.done".
The audio often leaves out part of the transcript. Here are some examples:
transcript: "I'm checking. Yes, I found it. Lowering it to minimum. Okay, much better. Now I can barely hear myself."
audio: "I'm checking. Yes, I found it. Lowering it to minimum. Okay, much better."
transcript: "One moment, brief pause. Yeah, you're right, it's turned off. I didn't know that existed."
audio: "One moment, brief pause. Yeah, you're right, it's turned off."
Another error is that the audio sometimes contains a randomly translated part of the transcript. Here are some examples:
transcript: "Exactly, it used to work automatically before, but since the last update I have to press the translate button each time."
audio: "Exacto, it used to work automatically before, but since the last update I have to press the translate button each time."
Sometimes I get both errors mentioned above at once:
transcript: "Please send it to *@ *presa.com. That’s the best one."
audio: "Envíalo por favor a **@**presa.com."
In all cases, the transcript is the correct response, so I need the audio to always match the transcript.
Note: Email addresses were redacted on the support side.
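To check whether the missing words are absent from the audio stream itself rather than lost during playback, the amount of audio received per response can be measured and compared with the transcript. The following is a minimal sketch under stated assumptions: createAudioTracker is a hypothetical helper (not part of the SDK), each "response.audio.delta" event is assumed to carry a response_id and a base64-encoded delta, and the session is assumed to use g711_ulaw output as configured above (8-bit samples at 8 kHz, so 8000 bytes per second).

```javascript
// G.711 mu-law audio: 8-bit samples at 8 kHz, i.e. 8000 bytes per second.
const ULAW_BYTES_PER_SECOND = 8000;

// Hypothetical diagnostic helper: sums the decoded audio bytes per response
// so the audio duration can be compared with the final transcript.
function createAudioTracker() {
  const bytesByResponse = new Map();
  return {
    // Call from the "response.audio.delta" handler.
    // Assumption: `event.delta` is base64 audio, `event.response_id` a string.
    addDelta(event) {
      const bytes = Buffer.from(event.delta, "base64").length;
      const total = (bytesByResponse.get(event.response_id) || 0) + bytes;
      bytesByResponse.set(event.response_id, total);
    },
    // Call once all audio for the response has arrived.
    report(responseId, transcript) {
      const bytes = bytesByResponse.get(responseId) || 0;
      return {
        responseId,
        transcript,
        audioSeconds: bytes / ULAW_BYTES_PER_SECOND,
      };
    },
  };
}
```

If the reported duration is clearly too short for the transcript (as in the truncated examples above), that would suggest the audio is already incomplete when it arrives, not cut off by processAudio.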