Share via

How to configure Thread Truncation / Compaction Strategy for Azure AI Foundry Agents?

hok yan tam 45 Reputation points
2026-05-11T01:14:23.2533333+00:00

Environment Setup:

  • Platform: New Azure AI Foundry (portal)
  • Architecture: RAG-based Agent (Foundry IQ / Vector Knowledge Base)
  • Deployment Target: Microsoft Teams (via the built-in "Publish to Teams" connector)

The Problem: I am building a RAG agent in Azure AI new Foundry IQ. Because conversations are durable and store all items, the conversation_id is persisting both the message items and the massive tool_output items (the raw chunks retrieved from the knowledge base) across every turn.

In my testing, this unbounded history accumulates rapidly. After just a few questions, the accumulated context window balloons, leading to a higher and higher token usage.

Currently in the Foundry Playground, I can manually reset the chat to clear the context, but this isn't a setting for me to change the memory setting. I assume we need to get this setting done before deploying to Teams or Copilot.

Any guidance on how to properly implement a "sliding window" or other strategies instead of putting all the stuff as input for the following questions would be greatly appreciated!

Foundry IQ
Foundry IQ

Knowledge index in Microsoft Foundry that lets AI agents retrieve grounded information from organization’s data

0 comments No comments

Answer accepted by question author

Anshika Varshney 12,365 Reputation points Microsoft External Staff Moderator
2026-05-14T15:06:05.7833333+00:00

Hey hok yan tam,

You’re right to look for a way to control thread size, but currently Azure AI Foundry Agents (service-based threads) work a bit differently.

At the moment, there is no built-in setting in the Foundry portal to automatically truncate, compact, or limit thread history.

The service is designed to persist the entire conversation history for each thread, and it does not provide a native option to automatically trim older messages.

What you can do instead

  1. Control context at request time (recommended)

Even though the full thread is stored, you can control what is actually sent to the model.

  • When you send a new request, you can:
    • only include the latest N messages
    • or manually build a smaller message list

This is the most common approach and gives you full control over tokens and context size.

  1. Implement your own truncation or summarization logic

Since the service does not auto-truncate, you can:

  • Keep only last few turns in your app logic
  • Or summarize older messages and keep a short summary instead

This helps reduce token usage and keeps conversations efficient.

  1. Manage threads manually

For long-running conversations:

  • You can create a new thread periodically
  • Or delete old threads using the API if no longer needed

This helps control storage and avoids very large histories.

Important note

  • The Foundry portal thread view always shows full history
  • Any truncation you do in your code will not change what is stored, only what is sent to the model

Summary

  • No built-in truncation or compaction setting in Foundry Agents today
  • You must handle it at application level
  • Best practice is to:
    • send only recent messages
    • or use summarization for older context

I Hope this helps. Do let me know if you have any further queries.

Thankyou!

Was this answer helpful?

1 person found this answer helpful.

Answer accepted by question author

kagiyama yutaka 3,430 Reputation points
2026-05-11T12:30:05.58+00:00

I think Foundry just doesn’t give us any option for truncation or a sliding‑window right now. we just trim the history on our side before the next turn.

Was this answer helpful?

1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. AI answer

    2026-05-11T01:14:37.97+00:00

    Azure AI Foundry agents (Foundry IQ) use service-managed conversation state. For these agents, context window management and truncation are handled by the service, and custom compaction strategies cannot be configured.

    Compaction strategies such as sliding windows, truncation, tool-result compaction, or summarization apply only to agents that manage their own conversation history in memory. Examples of agents where compaction is relevant include those built with the agent framework that maintain an in-memory message list and send that full history to the model on each call.

    For service-managed agents, including:

    • Foundry Agents
    • Responses API with store enabled (default)
    • Copilot Studio agents

    configuring a compaction strategy has no effect. The service already manages context, including truncation, to stay within token limits.

    In this architecture (Foundry IQ RAG agent published to Teams), conversation history and tool outputs are stored and managed by the service. There is no supported way to inject a custom sliding-window or compaction strategy into that pipeline. Token and context management are handled automatically by the platform.

    If custom compaction behavior is required (for example, sliding window, tool-result summarization, or truncation), the supported approach is to build an in-memory agent using the agent framework and apply compaction strategies there, rather than relying on service-managed context.


    References:

    AI-generated content may be incorrect. Read our transparency notes for more information.

    Was this answer helpful?

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.