How to Feed Large, Time-Varying Context to Server-Side LLMs for Summarization and Manage Token Limits When Sending Evolving Context

Ankita Inaniya 0 Reputation points Microsoft Employee
2025-05-07T06:36:49.7933333+00:00

We have a use case where data from one of our APIs evolves over a two-day period. This data is shown to customers on the UI, and we want to pass the same data to a Copilot model so users can ask questions or request summaries based on that context.

However, the dataset is quite large, and since we’re using a client-server architecture with the LLM hosted on the server, we’re hitting token size limitations when trying to send the full context to the model.

One approach we discussed is having the client break the data into chunks and send them incrementally during chat initialization—essentially streaming the context to the server-side LLM until the full payload is delivered.

We’d like to understand: Is this a recommended approach for handling large context inputs, or are there better alternatives for managing token limits on the client side?

Azure OpenAI Service

1 answer

  1. SriLakshmi C 5,050 Reputation points Microsoft External Staff Moderator
    2025-05-07T11:49:40.73+00:00

    Hello @Ankita Inaniya,

    To manage large, evolving datasets when interacting with a server-side large language model (LLM), especially for summarization and question answering, streaming the context from the client to the server in chunks is a practical and scalable approach. It helps work around context-window limits, such as the 4,096-token limit of the base gpt-35-turbo model, by breaking the input into smaller, manageable units instead of sending one large payload. Note that the server still needs to condense or index the chunks as they arrive, since the model's context window ultimately applies to whatever is placed in the final prompt.
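
    For illustration, here is a minimal Python sketch of client-side chunking by token count. The tiktoken usage, the 3,000-token budget, and the model name are assumptions for the example rather than a prescribed implementation; a production version would also split on semantic boundaries (sections, records) so each chunk stays coherent, and would post each chunk to your own server session endpoint instead of printing it.

    ```python
    # Minimal sketch: split a large text payload into chunks that each fit a token budget.
    # Assumes the tiktoken package is installed; budget and model name are illustrative.
    import tiktoken


    def chunk_by_tokens(text: str, max_tokens: int = 3000, model: str = "gpt-3.5-turbo") -> list[str]:
        """Split text into chunks of at most max_tokens tokens each."""
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(text)
        chunks = []
        for start in range(0, len(tokens), max_tokens):
            chunks.append(encoding.decode(tokens[start:start + max_tokens]))
        return chunks


    if __name__ == "__main__":
        # Toy payload; in practice this would be the evolving API data shown in the UI.
        payload_text = "record 1 ... record N"
        for i, chunk in enumerate(chunk_by_tokens(payload_text)):
            print(f"chunk {i}: {len(chunk)} characters")
    ```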

    When implementing this approach, it's important to ensure that each chunk contains coherent and contextually meaningful information to maintain the relevancy of responses. Additionally, maintaining the conversational state is crucial; this can be achieved through a session manager that tracks user interactions and updates context as new data is streamed. Incorporating dynamic truncation techniques can also help prioritize and retain the most relevant information, especially as the conversation evolves and token constraints become tighter.
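
    As a rough sketch of what a session manager with dynamic truncation could look like on the server, the example below keeps per-session chat history and, when building a prompt, walks the history newest-first so the most recent turns survive when the token budget is exceeded. The 8,000-token budget, the class and method names, and the cl100k_base encoding choice are assumptions for illustration, not part of any Azure SDK.

    ```python
    # Sketch of a server-side session manager that trims history to a token budget.
    from collections import defaultdict

    import tiktoken

    _ENCODING = tiktoken.get_encoding("cl100k_base")


    def count_tokens(text: str) -> int:
        """Approximate token count using the cl100k_base encoding."""
        return len(_ENCODING.encode(text))


    class SessionManager:
        """Tracks per-session chat history and truncates it to fit a token budget."""

        def __init__(self, max_context_tokens: int = 8000):
            self.max_context_tokens = max_context_tokens
            self.sessions = defaultdict(list)  # session_id -> list of {"role", "content"}

        def add_message(self, session_id: str, role: str, content: str) -> None:
            self.sessions[session_id].append({"role": role, "content": content})

        def build_prompt(self, session_id: str, system_prompt: str) -> list[dict]:
            """Return the system prompt plus as many recent messages as fit the budget."""
            budget = self.max_context_tokens - count_tokens(system_prompt)
            kept = []
            # Walk newest-first so the most recent turns are retained when truncating.
            for message in reversed(self.sessions[session_id]):
                cost = count_tokens(message["content"])
                if cost > budget:
                    break
                kept.append(message)
                budget -= cost
            return [{"role": "system", "content": system_prompt}] + list(reversed(kept))
    ```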

    Alternatively, for extremely large datasets, a summarization step before sending context to the LLM can be beneficial, reducing input size while preserving key information. Regardless of the approach, careful memory and performance monitoring is recommended, as increased context size may impact latency and resource utilization.
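
    A simple way to do that summarization pass is map-reduce style: summarize each chunk, then combine the partial summaries into a compact context that accompanies user questions. The sketch below uses the Azure OpenAI Python SDK (openai >= 1.x); the endpoint, API version, and deployment name are placeholders for your own resource, and the chunks could come from a splitter like the one sketched earlier.

    ```python
    # Sketch of a "summarize first" pass with the Azure OpenAI Python SDK (openai >= 1.x).
    # Endpoint, API key, API version, and deployment name are placeholders.
    import os

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )


    def summarize_large_context(chunks: list[str], deployment: str = "gpt-35-turbo") -> str:
        """Summarize each chunk, then join the partial summaries (map-reduce style)."""
        partial_summaries = []
        for chunk in chunks:
            response = client.chat.completions.create(
                model=deployment,
                messages=[
                    {"role": "system", "content": "Summarize the following data concisely, preserving key facts."},
                    {"role": "user", "content": chunk},
                ],
            )
            partial_summaries.append(response.choices[0].message.content)
        # The combined summary becomes the compact context sent along with user questions.
        return "\n".join(partial_summaries)
    ```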

    Chunk-based context streaming offers a flexible and efficient method to manage large inputs within token constraints, while supporting real-time updates and sustained conversational relevance in client-server LLM applications.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?".

    Thank you!

