Hello, this is VMS.
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand your issue concerns Azure GPT Realtime and your question about conversation history.
@Sridhar M has already provided solid direction toward resolving the issue. To build on that, the points below cover how to avoid the invalid “messages array inside content” pattern, where to put durable preferences, and how to handle server acks and the final response trigger. The snippets and links below should give you a clue about what you can do:
- For full manual control (canonical flow without VAD), observe the comments in the snippet below; it is minimal JavaScript for the steps:

```js
// 1) Connect the WebSocket (see the Azure WS doc for the auth/URL shape).
const ws = new WebSocket(url); // url and auth depend on your Azure resource

ws.onmessage = (evt) => {
  const event = JSON.parse(evt.data);
  if (event.type === 'session.created' || event.type === 'session.updated') {
    // 2) Optional: inject durable instructions
    ws.send(JSON.stringify({
      type: 'session.update',
      session: { instructions: 'User is vegan. Always provide vegan options.' }
    }));

    // 3) Replay history (ONE item per prior turn)
    const history = [
      { role: 'user', text: 'I am a vegan. Remember this.' },
      { role: 'assistant', text: 'Got it. I will only suggest vegan options.' }
    ];
    for (const turn of history) {
      ws.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'message',
          role: turn.role, // 'system' | 'user' | 'assistant'
          content: [{ type: 'input_text', text: turn.text }]
        }
      }));
    }

    // 4) Current user message
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text: 'What should I eat today?' }]
      }
    }));

    // 5) Trigger the response (no VAD)
    ws.send(JSON.stringify({
      type: 'response.create',
      response: { modalities: ['text', 'audio'] }
    }));
  }
};
```

In short:
1. Connect (WebRTC or WebSocket) and wait for session ready (`session.created`/`session.updated`).
2. Set durable preferences as session instructions (optional but recommended): `session.update` with `session.instructions = "User is vegan. Always provide vegan options."`
3. Replay your compact history: send one `conversation.item.create` per turn with text parts only. Expect `conversation.item.created` for each.
   https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create
4. Send the current user turn via `conversation.item.create`.
5. Trigger generation with `response.create` (and set modalities as needed).
6. Stream outputs (`response.output_*`, `response.done`) and render the audio/text.

The schema and flow are backed by the Azure Realtime audio events reference and the connection guides:
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio-websockets?view=foundry-classic

- For server turn-taking (canonical flow with VAD enabled), the critical rule is: inject context before you start microphone streaming. With VAD, the server commits the user’s speech and can auto-create a response at the end of speech; if the memory isn’t in the conversation by then, it won’t be used (see the sketch below).
  - https://platform.openai.com/docs/guides/realtime-vad

  Azure Realtime recommends WebRTC for low latency and built-in media handling in the browser.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio-webrtc?view=foundry-classic
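To make the ordering concrete, here is a minimal sketch of the VAD-enabled sequence. It assumes `ws` is an already-authenticated Realtime WebSocket and `startMicrophone(ws)` is your own audio-capture helper (both names are illustrative, not part of the API):

```js
// Minimal sketch (VAD enabled): memory goes in BEFORE any audio streams.
ws.onmessage = (evt) => {
  const event = JSON.parse(evt.data);
  if (event.type === 'session.created') {
    // 1) Durable preference + server-side turn detection
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        instructions: 'User is vegan. Always provide vegan options.',
        turn_detection: { type: 'server_vad' }
      }
    }));
  } else if (event.type === 'session.updated') {
    // 2) Replay compact history (one conversation.item.create per turn)
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text: 'I am a vegan. Remember this.' }]
      }
    }));
    // 3) Only now start streaming microphone audio; VAD commits speech and
    //    auto-creates the response, so no explicit response.create is needed.
    startMicrophone(ws);
  }
};
```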
- To encode prior turns correctly: one event per historical turn (no arrays of “message” objects inside `content`). The valid shape is:

```json
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "assistant",
    "content": [{ "type": "input_text", "text": "Here are vegan options..." }]
  }
}
```

This is exactly how the docs define message creation and history population.
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic

- On managing context size and what’s actually supported: there is no server command to “keep only the last N items”. Use your app to decide what to inject, e.g., a short summary plus the last 2–4 turns, then replay that subset each session (see the sketch below). The Azure docs enumerate the available client events; there is no general “truncate conversation to N items” call.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
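Since there is no server-side trim, a small client-side helper can pick the subset to replay. A sketch, assuming you persist `summary` and `turns` yourself (both names are illustrative):

```js
// Build a compact context to replay each session: an optional summary
// item plus the last N raw turns. All names here are illustrative.
function buildReplayItems(summary, turns, lastN = 4) {
  const items = [];
  if (summary) {
    items.push({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'system',
        content: [{ type: 'input_text', text: `Summary of earlier chat: ${summary}` }]
      }
    });
  }
  for (const turn of turns.slice(-lastN)) {
    items.push({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: turn.role, // 'user' | 'assistant'
        content: [{ type: 'input_text', text: turn.text }]
      }
    });
  }
  return items;
}

// Usage:
// buildReplayItems('User is vegan.', savedTurns).forEach(e => ws.send(JSON.stringify(e)));
```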
- `conversation.item.truncate` is only for assistant audio interruption, not for token budget management. If the user interrupts speech output, call (see the sketch after the links below):

```json
{
  "type": "conversation.item.truncate",
  "item_id": "<assistant_item_id>",
  "content_index": 0,
  "audio_end_ms": 1200
}
```

- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-reference?view=foundry-classic
- https://v03.api.js.langchain.com/interfaces/_langchain_openai.OpenAIClient.Beta.Realtime.ConversationItemTruncateEvent.html
- https://docs.rs/openai-openapi-types/latest/openai_openapi_types/struct.RealtimeClientEventConversationItemTruncate.html
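As a sketch of the interruption case: track which assistant item is currently playing and how many milliseconds of its audio were actually heard (both are your own bookkeeping, not API fields), then send the truncate event on barge-in:

```js
// Remember the assistant item currently being played back.
let currentAssistantItemId = null;
ws.onmessage = (evt) => {
  const event = JSON.parse(evt.data);
  if (event.type === 'response.output_item.added' && event.item?.role === 'assistant') {
    currentAssistantItemId = event.item.id;
  }
};

// Call this when the user barges in; playedMs = audio actually heard so far.
function onUserInterrupt(playedMs) {
  if (!currentAssistantItemId) return;
  ws.send(JSON.stringify({
    type: 'conversation.item.truncate',
    item_id: currentAssistantItemId,
    content_index: 0,
    audio_end_ms: playedMs // e.g. 1200
  }));
}
```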
For an example (end‑to‑end):
- Session 1 > User: “I am vegan.” > Assistant acknowledges. Store (a) a durable preference “vegan” and (b) optionally the last couple of turns.
- Session 2 (VAD ON) > correct sequence:
  1. Connect; wait for `session.updated`.
  2. `session.update` with `instructions: "User is vegan. Always provide vegan options."`.
  3. (Optional) Replay 1–3 compact text turns with `conversation.item.create`.
  4. Start the microphone. VAD will commit your speech and auto-trigger the assistant’s reply, which now respects the vegan preference (a code sketch of both sessions follows the note below).
NOTE: If you start audio before steps 2–3, the model can respond without the vegan context by design.
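If it helps, here is how the two sessions might look in code. `localStorage` is used purely to illustrate where the durable preference could live between sessions; any persistent store works:

```js
// Session 1: after the assistant acknowledges, persist the preference
// and a couple of compact turns (localStorage is just for illustration).
localStorage.setItem('memory', JSON.stringify({
  preference: 'User is vegan. Always provide vegan options.',
  turns: [
    { role: 'user', text: 'I am a vegan. Remember this.' },
    { role: 'assistant', text: 'Got it. I will only suggest vegan options.' }
  ]
}));

// Session 2: rehydrate BEFORE starting the microphone (steps 2-3 above).
const memory = JSON.parse(localStorage.getItem('memory') || '{}');
ws.send(JSON.stringify({
  type: 'session.update',
  session: { instructions: memory.preference || '' }
}));
for (const turn of memory.turns || []) {
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: turn.role,
      content: [{ type: 'input_text', text: turn.text }]
    }
  }));
}
// Only now: startMicrophone(ws);
```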
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close out the thread here by upvoting and accepting the answer if it was helpful.