Azure GPT Realtime. Question on history

It is VMS 270 Reputation points
2026-01-21T08:20:40.2566667+00:00

Hello

I have Azure GPT Realtime working well in a production-ready deployment.

I have a question about carrying over the history of a user's previous conversations:

    // After the session is configured, data can be sent to the session.
    realtimeClient.send({
        type: 'conversation.item.create',
        item: {
            type: 'message',
            role: 'user',
            content: [
                { type: 'input_text', text: 'Please assist the user.' }
            ]
        }
    });

I realize that by default, every session with Realtime GPT is a "new" session. If possible, I would like to add the user's history, or at least send it as text to the Azure Realtime endpoint, so that the model "knows" the history. Is this even possible? I tried adding it as a message array; I got no errors back from the Realtime API, but I don't see it working.

Any thoughts on this?

Thanks

Foundry Tools

Answer accepted by question author

  1. Sina Salam 28,606 Reputation points Volunteer Moderator
    2026-01-27T13:46:02.67+00:00

    Hello It is VMS,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your issue concerns conversation history with Azure GPT Realtime.

    @Sridhar M has already provided solid direction toward resolving the issue. To round it out, the points below cover how to avoid the invalid "messages array inside content" pattern, where to put durable preferences, and the server acks and final trigger that are easy to forget. The snippets and links below show what you can do:

    1. Canonical flow without VAD (full manual control):
         // 1) connect WS … (see Azure WS doc for auth/URL shape)
         let contextInjected = false;
         ws.onmessage = (evt) => {
           const event = JSON.parse(evt.data);
           // Run setup exactly once: 'session.updated' also fires in response to
           // our own session.update, so a bare type check would replay history twice.
           if (event.type === 'session.created' && !contextInjected) {
             contextInjected = true;
             // 2) Optional: inject durable instructions
             ws.send(JSON.stringify({
               type: 'session.update',
               session: { instructions: 'User is vegan. Always provide vegan options.' }
             }));
             // 3) Replay history (ONE conversation.item.create per prior turn)
             const history = [
               { role: 'user', text: 'I am a vegan. Remember this.' },
               { role: 'assistant', text: 'Got it. I will only suggest vegan options.' }
             ];
             for (const turn of history) {
               ws.send(JSON.stringify({
                 type: 'conversation.item.create',
                 item: {
                   type: 'message',
                   role: turn.role, // 'system' | 'user' | 'assistant'
                   // user/system parts use 'input_text'; assistant parts use 'text'
                   content: [{ type: turn.role === 'assistant' ? 'text' : 'input_text', text: turn.text }]
                 }
               }));
             }
             // 4) Current user message
             ws.send(JSON.stringify({
               type: 'conversation.item.create',
               item: {
                 type: 'message',
                 role: 'user',
                 content: [{ type: 'input_text', text: 'What should I eat today?' }]
               }
             }));
             // 5) Trigger the response explicitly (no VAD to do it for us)
             ws.send(JSON.stringify({ type: 'response.create', response: { modalities: ['text', 'audio'] } }));
           }
         };
      
      Observe the comments in the snippet. It is minimal JS for the following steps:
      
      - Connect (WebRTC or WebSocket) and wait for session ready (session.created / session.updated).
      - Set durable preferences as session instructions (optional but recommended): `session.update` with `session.instructions = "User is vegan. Always provide vegan options."` https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create
      - Replay your compact history: send one `conversation.item.create` per turn with text parts only. Expect a `conversation.item.created` ack for each.
      - Send the current user turn via `conversation.item.create`.
      - Trigger generation with `response.create` (set modalities as needed).
      - Stream the outputs (`response.output_*`, `response.done`) and render audio/text, as in the sketch below. The schema and flow are backed by the Azure audio events reference and connection guides: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic and https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio-websockets?view=foundry-classic
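      A minimal sketch of the receiving side (event names vary slightly across Realtime API versions; `response.output_text.delta` is assumed here, while older versions use `response.text.delta`, and the `appendToTranscriptUI` helper is hypothetical):
      
         // Handle server events after replaying history. Event names are
         // assumed from the newer Realtime API; verify against your version.
         function handleServerEvent(event) {
           switch (event.type) {
             case 'conversation.item.created':
               // Ack for each replayed history item; useful to verify the replay.
               console.log('item added:', event.item && event.item.id);
               break;
             case 'response.output_text.delta':
               // Incremental text; append to the UI as it streams.
               appendToTranscriptUI(event.delta); // hypothetical UI helper
               break;
             case 'response.done':
               console.log('response complete');
               break;
             case 'error':
               // Schema mistakes (e.g., arrays nested inside content) surface here.
               console.error('server error:', event.error);
               break;
           }
         }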
      
    2. Canonical flow with VAD enabled (server turn‑taking). Critical rule: inject context before you start microphone streaming. With VAD, the server commits the user's speech and can auto‑create a response at the end of speech; if the memory isn't in the conversation by then, it won't be used. - https://platform.openai.com/docs/guides/realtime-vad Azure Realtime recommends WebRTC for low latency and built‑in media handling in the browser. - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio-webrtc?view=foundry-classic A minimal ordering sketch follows below.
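      The ordering sketch, assuming hypothetical `replayHistory` and `startMicrophone` helpers (they are not SDK functions):
      
         // Correct ordering with server VAD: context first, microphone last.
         ws.send(JSON.stringify({
           type: 'session.update',
           session: {
             instructions: 'User is vegan. Always provide vegan options.',
             turn_detection: { type: 'server_vad' } // server commits turns and can auto-respond
           }
         }));
         replayHistory(ws);   // one conversation.item.create per prior turn (see item 1)
         startMicrophone(ws); // ONLY start streaming audio after context is in place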
    3. To encode prior turns correctly: one event per historical turn (no arrays of "message" objects inside content). The valid shape for a prior assistant turn is:
         {
           "type": "conversation.item.create",
           "item": {
             "type": "message",
             "role": "assistant",
             "content": [{ "type": "text", "text": "Here are vegan options..." }]
           }
         }
      
      This is how the docs define message creation and history population; note that user/system parts use "input_text" while assistant parts use "text". - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
    4. On managing context size and what's actually supported: there is no server command to "keep only the last N items". Use your app to decide what to inject, e.g., a short summary + the last 2–4 turns, and replay that subset each session (see the trimming sketch after this item). The Azure docs enumerate the available client events; there is no general "truncate conversation to N items" call. - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic conversation.item.truncate is only for assistant audio interruption, not for token‑budget management. If the user interrupts speech output, call:
         {
           "type": "conversation.item.truncate",
           "item_id": "<assistant_item_id>",
           "content_index": 0,
           "audio_end_ms": 1200
         }
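      
      The trimming sketch: a stored summary plus the last few turns (the `summarizeTurns` helper is hypothetical, e.g. an offline summarization call; the Realtime API itself has no trim command):
      
         // Build the subset of history to replay at the start of a session.
         // 'summarizeTurns' is a hypothetical helper, not a Realtime API call.
         function buildReplaySubset(allTurns, lastN = 4) {
           const older = allTurns.slice(0, -lastN);
           const recent = allTurns.slice(-lastN);
           const subset = [];
           if (older.length > 0) {
             subset.push({ role: 'system', text: 'Conversation so far: ' + summarizeTurns(older) });
           }
           return subset.concat(recent); // replay via one conversation.item.create each
         }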
      

    For an example (end‑to‑end):

    1. Session 1: User says "I am vegan." > Assistant acknowledges. Store (a) a durable preference "vegan" and (b) optionally the last couple of turns.
    2. Session 2 (VAD ON) > correct sequence:
      1. Connect; wait for session.updated.
      2. session.update with instructions: "User is vegan. Always provide vegan options.".
      3. (Optional) Replay 1–3 compact text turns with conversation.item.create.
      4. Start the microphone. VAD will commit your speech and auto‑trigger the assistant's reply, which now respects the vegan preference.

    NOTE: If you start audio before steps 2–3, the model can respond without the vegan context by design.
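
    A small sketch of the storage-to-session bridge (the `loadUserProfile` helper is hypothetical, backed by your own database):

       // Session 2 startup: turn stored facts from session 1 into instructions.
       const profile = await loadUserProfile(userId); // e.g. { preferences: ['vegan'], recentTurns: [...] }
       ws.send(JSON.stringify({
         type: 'session.update',
         session: {
           instructions: 'Known user preferences: ' + profile.preferences.join(', ') +
                         '. Always respect these preferences.'
         }
       }));
       // Then replay profile.recentTurns via conversation.item.create (steps 3–4 above),
       // and only then start the microphone.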

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread here by upvoting and accepting this as the answer if it is helpful.


1 additional answer

  1. Anonymous
    2026-01-21T09:34:52.8133333+00:00

    Hi It is VMS,

    Realtime GPT can "know" prior conversation history in a new session only if you resend that history as conversation items into the new session. The model does not automatically remember anything across sessions; you must replay context yourself. The official Realtime docs describe that a session has one conversation, and that you add items via conversation.item.create. (Microsoft Learn)

    The key limitation that often causes confusion (audio history): conversation.item.create can be used to populate conversation history, but Microsoft's Realtime audio events reference explicitly notes a limitation: it can't populate assistant audio messages. So you can replay history as text items (user text + assistant text transcripts), but you cannot inject "past assistant audio" into the history as audio content. (Microsoft Learn)

    Realtime is event-driven; it doesn't take a single "messages array" payload like Chat Completions. Instead, the docs show that you add items one by one using conversation.item.create events, and the server confirms each with conversation.item.created. If you tried sending an array inside one create call, it can trigger schema/validation errors because the event expects a single item (not an array of items); see the replay sketch below. (Microsoft Learn)
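
    A minimal ack-aware replay sketch over a raw WebSocket (the `waitForItemCreated` helper is illustrative, not an SDK function):

    // Send one conversation.item.create per turn and wait for the server's
    // conversation.item.created ack before sending the next item.
    function waitForItemCreated(ws) {
      return new Promise((resolve) => {
        const onMessage = (evt) => {
          const event = JSON.parse(evt.data);
          if (event.type === 'conversation.item.created') {
            ws.removeEventListener('message', onMessage);
            resolve(event.item);
          }
        };
        ws.addEventListener('message', onMessage);
      });
    }

    async function replayHistory(ws, turns) {
      for (const turn of turns) {
        ws.send(JSON.stringify({
          type: 'conversation.item.create',
          item: {
            type: 'message',
            role: turn.role, // 'system' | 'user' | 'assistant'
            // assistant parts use 'text'; user/system parts use 'input_text'
            content: [{ type: turn.role === 'assistant' ? 'text' : 'input_text', text: turn.text }]
          }
        }));
        await waitForItemCreated(ws); // ack confirms the item was actually added
      }
    }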

    To preload history, send multiple conversation.item.create events, one event per turn, in the same order as the original conversation (system → user → assistant → user → assistant …). The Realtime docs define this event as the mechanism to "add a new item to the conversation's context" and note that it can populate history. (Microsoft Learn)

    Triggering the model after replaying history (commonly missed): after you replay the history items, you still need to trigger generation using response.create (unless you're using server VAD auto-response for live audio). The Learn doc explains that to get a response you send response.create, and the server returns response.created plus streamed deltas. (Microsoft Learn)

    Because assistant audio can't be populated via conversation.item.create, replay assistant turns as text (for example, the transcript of what the assistant said), as in the sketch below. The limitation is explicitly called out in the reference ("can't populate assistant audio messages"), so treat history as text-only context even if your live session is audio+text. (Microsoft Learn)
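
    A minimal sketch of converting a stored assistant audio turn into a replayable text item (the `storedTurn` shape is illustrative, coming from your own storage):

    // Past assistant audio cannot be injected; replay its transcript as text.
    const storedTurn = { role: 'assistant', transcript: 'Here are vegan options...' };

    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'assistant',
        // assistant history items take 'text' content parts
        content: [{ type: 'text', text: storedTurn.transcript }]
      }
    }));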

    Minimal event template (conceptual, aligned to the docs): you send repeated events shaped as type: "conversation.item.create", with an item containing type: "message", a role (system/user/assistant), and a content array with text parts. This is consistent with the Learn examples showing message items created with text content, followed by response.create. (Microsoft Learn)

    Why you may see "it works but the model ignores history": two typical causes are (1) the history was not actually added (the server returned an error event because previous_item_id was invalid or the schema mismatched), or (2) you added items but never triggered response.create. The Realtime "conversation sequence and items" guidance and the event reference emphasize the server acknowledgement events (conversation.item.created) and the explicit response generation step. (Microsoft Learn)

    Practical, production-friendly approach (what most teams do): persist the full conversation history in your own storage (DB/Redis/Cosmos), then on each new Realtime session replay a trimmed subset (e.g., the last N turns, or a summary + the last few turns) using conversation.item.create. This aligns with the documented Realtime mechanism (history is client-provided via items); a capture sketch follows below. (Microsoft Learn)
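
    A minimal capture sketch for the storage side, assuming input audio transcription is enabled via session.update and a hypothetical saveTurn(userId, turn) persistence helper (event names follow the audio events reference; verify them against your API version):

    // Capture both sides of the live session into your own storage so a
    // later session can replay them. 'saveTurn' is a hypothetical helper.
    ws.addEventListener('message', (evt) => {
      const event = JSON.parse(evt.data);

      // User side: transcript of committed input audio (requires
      // input_audio_transcription to be enabled in the session config).
      if (event.type === 'conversation.item.input_audio_transcription.completed') {
        saveTurn(userId, { role: 'user', text: event.transcript });
      }

      // Assistant side: final transcript of the spoken reply.
      if (event.type === 'response.audio_transcript.done') {
        saveTurn(userId, { role: 'assistant', text: event.transcript });
      }
    });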

    Example:

    You mentioned trying to add the history as a message array without success; the array is fine as a local data structure, but each history entry must be sent as its own conversation.item.create event. Message objects cannot be nested inside another item's content array:

    const historyMessages = [
        {
            type: 'message',
            role: 'user',
            content: [{ type: 'input_text', text: 'Previous user message 1' }]
        },
        {
            type: 'message',
            role: 'assistant',
            content: [{ type: 'text', text: 'Previous assistant response 1' }]
        }
    ];
    
    // Send ONE conversation.item.create event per history item
    for (const item of historyMessages) {
        realtimeClient.send({
            type: 'conversation.item.create',
            item: item
        });
    }
    
    // Then send the current message as its own item
    realtimeClient.send({
        type: 'conversation.item.create',
        item: {
            type: 'message',
            role: 'user',
            content: [{ type: 'input_text', text: 'Please assist the user.' }]
        }
    });
    
    // Finally, trigger generation (unless server VAD will do it)
    realtimeClient.send({ type: 'response.create' });
    

    Make sure that your implementation conforms to the expected input format for the Realtime API.


    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.
