Azure OpenAI Assistants API (Preview)

2025-07-02

Note

The Azure AI Foundry Agent Service is now Generally Available, which provides more tools and better enterprise features. We recommend using the new service for the latest feature updates and improvements.

Assistants, a feature of Azure OpenAI in Azure AI Foundry Models, is available in public preview starting in the 2024-02-15-preview API version. Assistants API makes it easier for developers to create applications with sophisticated copilot-like experiences that can sift through data, suggest solutions, and automate tasks.

Assistants can call Azure OpenAI’s models with specific instructions to tune their personality and capabilities.
Assistants can access multiple tools in parallel. These can be both Azure OpenAI-hosted tools like code interpreter and file search, or tools you build, host, and access through function calling.
Assistants can access persistent Threads. Threads simplify AI application development by storing message history and truncating it when the conversation gets too long for the model's context length. You create a Thread once, and append Messages to it as your users reply.
Assistants can access files in several formats. Either as part of their creation or as part of Threads between Assistants and users. When using tools, Assistants can also create files (such as images or spreadsheets) and cite files they reference in the Messages they create.

Overview

Previously, building custom AI assistants needed heavy lifting even for experienced developers. While the chat completions API is lightweight and powerful, it's inherently stateless, which means that developers had to manage conversation state and chat threads, tool integrations, retrieval documents and indexes, and execute code manually.

The Assistants API, as the stateful evolution of the chat completion API, provides a solution for these challenges. Assistants API supports persistent automatically managed threads. This means that as a developer you no longer need to develop conversation state management systems and work around a model’s context window constraints. The Assistants API will automatically handle the optimizations to keep the thread below the max context window of your chosen model. Once you create a Thread, you can append new messages to it as users respond. Assistants can also access multiple tools in parallel, if needed. These tools include:

Tip

There is no additional pricing or quota for using Assistants unless you use the code interpreter or file search tools.

Assistants API is built on the same capabilities that power OpenAI’s GPT product. Some possible use cases range from AI-powered product recommender, sales analyst app, coding assistant, employee Q&A chatbot, and more. Start building on the no-code Assistants playground on the Azure AI Foundry portal or start building with the API.

Important

Retrieving untrusted data using Function calling, Code Interpreter or File Search with file input, and Assistant Threads functionalities could compromise the security of your Assistant, or the application that uses the Assistant. Learn about mitigation approaches here.

Using assistants

For information on using assistants, see the following reference documentation.

Available models

To see a list of Azure OpenAI models that you can use with assistants, see the Models article.

Assistants playground

Before using assistants, you need:

A compatible model deployed. For more information about model deployment, see the resource deployment guide.
An Azure AI project in Azure AI Foundry portal.

Assistants components

Component	Description
Assistant	Custom AI that uses Azure OpenAI models in conjunction with tools.
Thread	A conversation session between an Assistant and a user. Threads store Messages and automatically handle truncation to fit content into a model’s context.
Message	A message created by an Assistant or a user. Messages can include text, images, and other files. Messages are stored as a list on the Thread.
Run	Activation of an Assistant to begin running based on the contents of the Thread. The Assistant uses its configuration and the Thread’s Messages to perform tasks by calling models and tools. As part of a Run, the Assistant appends Messages to the Thread.
Run Step	A detailed list of steps the Assistant took as part of a Run. An Assistant can call tools or create Messages during it’s run. Examining Run Steps allows you to understand how the Assistant is getting to its final results.

Assistants data access

Currently, assistants, threads, messages, and files created for Assistants are scoped at the Azure OpenAI resource level. Therefore, anyone with access to the Azure OpenAI resource or API key access is able to read/write assistants, threads, messages, and files.

We strongly recommend the following data access controls:

Implement authorization. Before performing reads or writes on assistants, threads, messages, and files, ensure that the end-user is authorized to do so.
Restrict Azure OpenAI resource and API key access. Carefully consider who has access to Azure OpenAI resources where assistants are being used and associated API keys.
Routinely audit which accounts/individuals have access to the Azure OpenAI resource. API keys and resource level access enable a wide range of operations including reading and modifying messages and files.
Enable diagnostic settings to allow long-term tracking of certain aspects of the Azure OpenAI resource's activity log.

Parameters

The Assistants API has support for several parameters that let you customize the Assistants' output. The tool_choice parameter lets you force the Assistant to use a specified tool. You can also create messages with the assistant role to create custom conversation histories in Threads. temperature, top_p, response_format let you further tune responses. For more information, see the reference documentation.

Context window management

Assistants automatically truncate text to ensure it stays within the model's maximum context length. You can customize this behavior by specifying the maximum tokens you'd like a run to utilize and/or the maximum number of recent messages you'd like to include in a run.

Max completion and max prompt tokens

To control the token usage in a single Run, set max_prompt_tokens and max_completion_tokens when you create the Run. These limits apply to the total number of tokens used in all completions throughout the Run's lifecycle.

For example, initiating a Run with max_prompt_tokens set to 500 and max_completion_tokens set to 1000 means the first completion will truncate the thread to 500 tokens and cap the output at 1,000 tokens. If only 200 prompt tokens and 300 completion tokens are used in the first completion, the second completion will have available limits of 300 prompt tokens and 700 completion tokens.

If a completion reaches the max_completion_tokens limit, the Run will terminate with a status of incomplete, and details will be provided in the incomplete_details field of the Run object.

When using the File Search tool, we recommend setting the max_prompt_tokens to no less than 20,000. For longer conversations or multiple interactions with File Search, consider increasing this limit to 50,000, or ideally, removing the max_prompt_tokens limits altogether to get the highest quality results.

Truncation strategy

You can also specify a truncation strategy to control how your thread should be rendered into the model's context window. Using a truncation strategy of type auto will use OpenAI's default truncation strategy. Using a truncation strategy of type last_messages will allow you to specify the number of the most recent messages to include in the context window.