Hello, this is VMS.
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand your issue concerns Azure GPT Realtime and your question about conversation history.
@Sridhar M has already provided solid direction toward resolving the issue. To build on that, the points below cover how to avoid the invalid “messages array inside content” pattern, where to put durable preferences, and how to handle server acks and the final response trigger. The snippets and links below should give you a clue about what you can do:
- For full manual control (canonical flow without VAD), observe the comments in the snippet below; it is minimal JavaScript for the steps:

```js
// 1) Connect the WebSocket (see the Azure WS doc for the auth/URL shape).
const ws = new WebSocket(url); // url and auth depend on your Azure resource

ws.onmessage = (evt) => {
  const event = JSON.parse(evt.data);
  if (event.type === 'session.created' || event.type === 'session.updated') {
    // 2) Optional: inject durable instructions
    ws.send(JSON.stringify({
      type: 'session.update',
      session: { instructions: 'User is vegan. Always provide vegan options.' }
    }));

    // 3) Replay history (ONE item per prior turn)
    const history = [
      { role: 'user', text: 'I am a vegan. Remember this.' },
      { role: 'assistant', text: 'Got it. I will only suggest vegan options.' }
    ];
    for (const turn of history) {
      ws.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'message',
          role: turn.role, // 'system' | 'user' | 'assistant'
          content: [{ type: 'input_text', text: turn.text }]
        }
      }));
    }

    // 4) Current user message
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text: 'What should I eat today?' }]
      }
    }));

    // 5) Trigger the response (no VAD)
    ws.send(JSON.stringify({
      type: 'response.create',
      response: { modalities: ['text', 'audio'] }
    }));
  }
};
```

In short:
1. Connect (WebRTC or WebSocket) and wait for session ready (`session.created`/`session.updated`).
2. Set durable preferences as session instructions (optional but recommended): `session.update` with `session.instructions = "User is vegan. Always provide vegan options."`
3. Replay your compact history: send one `conversation.item.create` per turn with text parts only. Expect `conversation.item.created` for each.
   https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create
4. Send the current user turn via `conversation.item.create`.
5. Trigger generation with `response.create` (and set modalities as needed).
6. Stream outputs (`response.output_*`, `response.done`) and render the audio/text.

The schema and flow are backed by the Azure Realtime audio events reference and the connection guides:
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio-websockets?view=foundry-classic

- For server turn-taking (canonical flow with VAD enabled), the critical rule is: inject context before you start microphone streaming. With VAD, the server commits the user’s speech and can auto-create a response at the end of speech; if the memory isn’t in the conversation by then, it won’t be used (see the sketch below).
  - https://platform.openai.com/docs/guides/realtime-vad

  Azure Realtime recommends WebRTC for low latency and built-in media handling in the browser.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio-webrtc?view=foundry-classic
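To make the ordering concrete, here is a minimal sketch of the VAD-enabled sequence. It assumes `ws` is an already-authenticated Realtime WebSocket and `startMicrophone(ws)` is your own audio-capture helper (both names are illustrative, not part of the API):

```js
// Minimal sketch (VAD enabled): memory goes in BEFORE any audio streams.
ws.onmessage = (evt) => {
  const event = JSON.parse(evt.data);
  if (event.type === 'session.created') {
    // 1) Durable preference + server-side turn detection
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        instructions: 'User is vegan. Always provide vegan options.',
        turn_detection: { type: 'server_vad' }
      }
    }));
  } else if (event.type === 'session.updated') {
    // 2) Replay compact history (one conversation.item.create per turn)
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text: 'I am a vegan. Remember this.' }]
      }
    }));
    // 3) Only now start streaming microphone audio; VAD commits speech and
    //    auto-creates the response, so no explicit response.create is needed.
    startMicrophone(ws);
  }
};
```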
- To encode prior turns correctly: one event per historical turn (no arrays of “message” objects inside `content`). The valid shape is:

```json
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "assistant",
    "content": [{ "type": "input_text", "text": "Here are vegan options..." }]
  }
}
```

This is exactly how the docs define message creation and history population.
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic

- On managing context size and what’s actually supported: there is no server command to “keep only the last N items”. Use your app to decide what to inject, e.g., a short summary plus the last 2–4 turns, then replay that subset each session (see the sketch below). The Azure docs enumerate the available client events; there is no general “truncate conversation to N items” call.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
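Since there is no server-side trim, a small client-side helper can pick the subset to replay. A sketch, assuming you persist `summary` and `turns` yourself (both names are illustrative):

```js
// Build a compact context to replay each session: an optional summary
// item plus the last N raw turns. All names here are illustrative.
function buildReplayItems(summary, turns, lastN = 4) {
  const items = [];
  if (summary) {
    items.push({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'system',
        content: [{ type: 'input_text', text: `Summary of earlier chat: ${summary}` }]
      }
    });
  }
  for (const turn of turns.slice(-lastN)) {
    items.push({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: turn.role, // 'user' | 'assistant'
        content: [{ type: 'input_text', text: turn.text }]
      }
    });
  }
  return items;
}

// Usage:
// buildReplayItems('User is vegan.', savedTurns).forEach(e => ws.send(JSON.stringify(e)));
```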
- `conversation.item.truncate` is only for assistant audio interruption, not for token budget management. If the user interrupts speech output, call (see the sketch after the links below):

```json
{
  "type": "conversation.item.truncate",
  "item_id": "<assistant_item_id>",
  "content_index": 0,
  "audio_end_ms": 1200
}
```

- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
- https://v03.api.js.langchain.com/interfaces/_langchain_openai.OpenAIClient.Beta.Realtime.ConversationItemTruncateEvent.html
- https://docs.rs/openai-openapi-types/latest/openai_openapi_types/struct.RealtimeClientEventConversationItemTruncate.html
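As a sketch of the interruption case: track which assistant item is currently playing and how many milliseconds of its audio were actually heard (both are your own bookkeeping, not API fields), then send the truncate event on barge-in:

```js
// Remember the assistant item currently being played back.
let currentAssistantItemId = null;
ws.onmessage = (evt) => {
  const event = JSON.parse(evt.data);
  if (event.type === 'response.output_item.added' && event.item?.role === 'assistant') {
    currentAssistantItemId = event.item.id;
  }
};

// Call this when the user barges in; playedMs = audio actually heard so far.
function onUserInterrupt(playedMs) {
  if (!currentAssistantItemId) return;
  ws.send(JSON.stringify({
    type: 'conversation.item.truncate',
    item_id: currentAssistantItemId,
    content_index: 0,
    audio_end_ms: playedMs // e.g. 1200
  }));
}
```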
For an example (end‑to‑end):
- Session 1 > User: “I am vegan.” > Assistant acknowledges. Store (a) a durable preference “vegan” and (b) optionally the last couple of turns.
- Session 2 (VAD ON) > correct sequence:
  1. Connect; wait for `session.updated`.
  2. `session.update` with `instructions: "User is vegan. Always provide vegan options."`.
  3. (Optional) Replay 1–3 compact text turns with `conversation.item.create`.
  4. Start the microphone. VAD will commit your speech and auto-trigger the assistant’s reply, which now respects the vegan preference (a code sketch of both sessions follows the note below).
NOTE: If you start audio before steps 2–3, the model can respond without the vegan context by design.
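If it helps, here is how the two sessions might look in code. `localStorage` is used purely to illustrate where the durable preference could live between sessions; any persistent store works:

```js
// Session 1: after the assistant acknowledges, persist the preference
// and a couple of compact turns (localStorage is just for illustration).
localStorage.setItem('memory', JSON.stringify({
  preference: 'User is vegan. Always provide vegan options.',
  turns: [
    { role: 'user', text: 'I am a vegan. Remember this.' },
    { role: 'assistant', text: 'Got it. I will only suggest vegan options.' }
  ]
}));

// Session 2: rehydrate BEFORE starting the microphone (steps 2-3 above).
const memory = JSON.parse(localStorage.getItem('memory') || '{}');
ws.send(JSON.stringify({
  type: 'session.update',
  session: { instructions: memory.preference || '' }
}));
for (const turn of memory.turns || []) {
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: turn.role,
      content: [{ type: 'input_text', text: turn.text }]
    }
  }));
}
// Only now: startMicrophone(ws);
```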
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close out the thread here by upvoting and accepting the answer if it was helpful.