Using the Azure OpenAI Realtime API, I sometimes get mismatches between the audio and audio_transcript responses

Tim 5 Reputation points
2025-10-24T22:52:59.3033333+00:00

I set up a text-input, audio-output Realtime API session as follows:

    this.translationWs = await OpenAIRealtimeWS.azure(
      new AzureOpenAI({
        apiKey: process.env.AZURE_OPENAI_API_KEY,
        endpoint: process.env.AZURE_OPENAI_ENDPOINT,
        apiVersion: "2024-10-01-preview",
        deployment: "gpt-realtime",
      })
    );

    this.translationWs.socket.on("open", () => {
      this.sendTranslationMessage({
        type: "session.update",
        session: {
          modalities: ["text", "audio"],
          instructions: myCustomInstructions,
          output_audio_format: "g711_ulaw",
          temperature: 0.6,
        },
      });
    });
    this.translationWs.on("response.audio.delta", (data) => {
      processAudio(data.delta);
    });

    this.translationWs.on("response.audio_transcript.done", async (data) => {
      console.log("translation: ", data.transcript);
    });

Then, after confirming that I have received the session.updated event, I start sending input text (the model is instructed to translate it):

      this.translationWs.send({
        type: "conversation.item.create",
        item: {
          type: "message",
          role: "user",
          content: [
            {
              type: "input_text",
              text: myCustomText,
            },
          ],
        },
      });
      this.translationWs.send({
        type: "response.create",
        response: { modalities: ["text", "audio"], conversation: "none" },
      }); 
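
For readers reproducing this, the "wait for session.updated before sending" step can be sketched roughly as below. This is a minimal sketch, not the code from the post: `waitForEvent` is a hypothetical helper, and it assumes the realtime client exposes Node-style `once`/`off` methods (a plain `EventEmitter` stands in for it here).

```typescript
import { EventEmitter } from "node:events";

// Hypothetical helper (not part of any SDK): resolves with the event payload
// the first time `eventName` fires, or rejects after `timeoutMs` milliseconds.
function waitForEvent<T>(
  emitter: EventEmitter,
  eventName: string,
  timeoutMs = 5000
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let timer: ReturnType<typeof setTimeout> | undefined;

    const onEvent = (payload: T): void => {
      clearTimeout(timer); // event arrived in time, cancel the timeout
      resolve(payload);
    };

    timer = setTimeout(() => {
      emitter.off(eventName, onEvent); // stop listening once we give up
      reject(new Error(`Timed out waiting for ${eventName}`));
    }, timeoutMs);

    emitter.once(eventName, onEvent);
  });
}
```

Under the same assumption, the send sequence would then start with something like `await waitForEvent(ws, "session.updated")` before the first `conversation.item.create`.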

Sometimes the audio I hear from "response.audio.delta" does not match the text I get from "response.audio_transcript.done".

The audio often omits part of the transcript. Here are some examples:

transcript: "I'm checking. Yes, I found it. Lowering it to minimum. Okay, much better. Now I can barely hear myself."

audio: "I'm checking. Yes, I found it. Lowering it to minimum. Okay, much better."

transcript: "One moment, brief pause. Yeah, you're right, it's turned off. I didn't know that existed."

audio: "One moment, brief pause. Yeah, you're right, it's turned off."

Another error is that the audio sometimes renders part of the transcript in a different language, seemingly at random. Here are some examples:

transcript: "Exactly, it used to work automatically before, but since the last update I have to press the translate button each time."

audio: "Exacto, it used to work automatically before, but since the last update I have to press the translate button each time."

Sometimes I get both errors mentioned above at once:

transcript: "Please send it to *@ *presa.com. That’s the best one."

audio: "Envíalo por favor a **@**presa.com."

In all cases, the transcript is the correct response, so I need the audio to always match the transcript.
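
One rough way to quantify the omissions is to total the audio bytes per response and compare the implied duration with the transcript length. The sketch below is a diagnostic only, under stated assumptions: the event payloads carry `response_id`, a base64 `delta`, and `transcript` fields; the `delta` is standard padded base64; and `g711_ulaw` audio is 8 kHz with one byte per sample. `ResponseAudioAudit` and `base64ByteLength` are hypothetical names. It does not detect the language-switching error, only audio that is unusually short for its transcript.

```typescript
// Assumed event shapes; only the fields used here are declared.
interface AudioDeltaEvent { response_id: string; delta: string }
interface TranscriptDoneEvent { response_id: string; transcript: string }

interface ResponseStats { audioBytes: number; transcript: string | null }

// G.711 mu-law: 8000 samples per second, one byte per sample.
const ULAW_BYTES_PER_SECOND = 8000;

// Decoded size of a padded base64 string, computed without decoding it.
function base64ByteLength(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return (b64.length / 4) * 3 - padding;
}

class ResponseAudioAudit {
  private stats = new Map<string, ResponseStats>();

  private entry(id: string): ResponseStats {
    let s = this.stats.get(id);
    if (!s) {
      s = { audioBytes: 0, transcript: null };
      this.stats.set(id, s);
    }
    return s;
  }

  onAudioDelta(e: AudioDeltaEvent): void {
    this.entry(e.response_id).audioBytes += base64ByteLength(e.delta);
  }

  onTranscriptDone(e: TranscriptDoneEvent): void {
    this.entry(e.response_id).transcript = e.transcript;
  }

  // Returns audio duration, transcript word count, and words per second,
  // or null if the response is unknown or has no transcript yet.
  report(id: string): { seconds: number; words: number; wordsPerSecond: number } | null {
    const s = this.stats.get(id);
    if (!s || s.transcript === null) return null;
    const seconds = s.audioBytes / ULAW_BYTES_PER_SECOND;
    const words = s.transcript.trim().split(/\s+/).filter(Boolean).length;
    return { seconds, words, wordsPerSecond: seconds > 0 ? words / seconds : Infinity };
  }
}
```

Wired into the handlers above, `onAudioDelta` would be called from the "response.audio.delta" listener and `onTranscriptDone` from "response.audio_transcript.done". A response whose words-per-second figure is far above that of known-good responses is a candidate for truncated audio; what counts as "far above" would need to be calibrated against real sessions.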

Note: Email addresses were redacted on the support side.

Azure OpenAI in Foundry Models
