Azure AI Foundry Models available for serverless API deployment

2025-05-19

The Azure AI model catalog offers a large selection of Azure AI Foundry Models from a wide range of providers. You have various options for deploying models from the model catalog. This article lists Azure AI Foundry Models that can be deployed via serverless API deployment. For some of these models, you can also host them on your infrastructure for deployment via managed compute.

Important

Models that are in preview are marked as preview on their model cards in the model catalog.

To perform inferencing with the models, some models such as Nixtla's TimeGEN-1 and Cohere rerank require you to use custom APIs from the model providers. Others support inferencing using the Model Inference API. You can find more details about individual models by reviewing their model cards in the model catalog for Azure AI Foundry portal.

AI21 Labs

The Jamba family models are AI21's production-grade Mamba-based large language model (LLM) which uses AI21's hybrid Mamba-Transformer architecture. It's an instruction-tuned version of AI21's hybrid structured state space model (SSM) transformer Jamba model. The Jamba family models are built for reliable commercial use with respect to quality and performance.

Model	Type	Capabilities
AI21-Jamba-1.5-Mini	chat-completion	- Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
AI21-Jamba-1.5-Large	chat-completion	- Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs

See this model collection in Azure AI Foundry portal.

Azure OpenAI

Azure OpenAI in Foundry Models offers a diverse set of models with different capabilities and price points. These models include:

State-of-the-art models designed to tackle reasoning and problem-solving tasks with increased focus and capability
Models that can understand and generate natural language and code
Models that can transcribe and translate speech to text

Model	Type	Capabilities
o3-mini	chat-completion	- Input: text and image (200,000 tokens) - Output: text (100,000 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
o1	chat-completion (with images)	- Input: text and image (200,000 tokens) - Output: text (100,000 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
o1-preview	chat-completion	- Input: text (128,000 tokens) - Output: text (32,768 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
o1-mini	chat-completion	- Input: text (128,000 tokens) - Output: text (65,536 tokens) - Tool calling: No - Response formats: Text
gpt-4o-realtime-preview	real-time	- Input: control, text, and audio (131,072 tokens) - Output: text and audio (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON
gpt-4o	chat-completion (with image and audio content)	- Input: text, image, and audio (131,072 tokens) - Output: text (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
gpt-4o-mini	chat-completion (with image and audio content)	- Input: text, image, and audio (131,072 tokens) - Output: text (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
text-embedding-3-large	embeddings	- Input: text (8,191 tokens) - Output: Vector (3,072 dim.)
text-embedding-3-small	embeddings	- Input: text (8,191 tokens) - Output: Vector (1,536 dim.)

See this model collection in Azure AI Foundry portal.

Cohere

The Cohere family of models includes various models optimized for different use cases, including rerank, chat completions, and embeddings models.

Cohere command and embed

The following table lists the Cohere models that you can inference via the Model Inference API.

Model	Type	Capabilities
Cohere-command-A	chat-completion	- Input: text (256,000 tokens) - Output: text (8,000 tokens) - Tool calling: Yes - Response formats: Text
Cohere-command-r-plus-08-2024	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-command-r-08-2024	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-command-r-plus (deprecated)	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-command-r (deprecated)	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-embed-v-4	embeddings image-embeddings	- Input: image, text - Output: image, text (128,000 tokens) - Tool calling: Yes - Response formats: image, text
Cohere-embed-v3-english	embeddings image-embeddings	- Input: text (512 tokens) - Output: Vector (1,024 dim.)
Cohere-embed-v3-multilingual	embeddings image-embeddings	- Input: text (512 tokens) - Output: Vector (1,024 dim.)

Inference examples: Cohere command and embed

For more examples of how to use Cohere models, see the following examples:

Description	Language	Sample
Web requests	Bash	Command-R Command-R+ cohere-embed.ipynb
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
OpenAI SDK (experimental)	Python	Link
LangChain	Python	Link
Cohere SDK	Python	Command Embed
LiteLLM SDK	Python	Link

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Description	Packages	Sample
Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain	`langchain`, `langchain_cohere`	cohere_faiss_langchain_embed.ipynb
Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain	`langchain`, `langchain_cohere`	command_faiss_langchain.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain	`langchain`, `langchain_cohere`	cohere-aisearch-langchain-rag.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK	`cohere`, `azure_search_documents`	cohere-aisearch-rag.ipynb
Command R+ tool/function calling, using LangChain	`cohere`, `langchain`, `langchain_cohere`	command_tools-langchain.ipynb

Cohere rerank

The following table lists the Cohere rerank models. To perform inferencing with these rerank models, you're required to use Cohere's custom rerank APIs that are listed in the table.

Model	Type	Inference API
Cohere-rerank-v3.5	rerank text classification	Cohere's v2/rerank API
Cohere-rerank-v3-english (deprecated)	rerank text classification	Cohere's v2/rerank API Cohere's v1/rerank API
Cohere-rerank-v3-multilingual (deprecated)	rerank text classification	Cohere's v2/rerank API Cohere's v1/rerank API

Pricing for Cohere rerank models

Queries, not to be confused with a user's query, is a pricing meter that refers to the cost associated with the tokens used as input for inference of a Cohere Rerank model. Cohere counts a single search unit as a query with up to 100 documents to be ranked. Documents longer than 500 tokens (for Cohere-rerank-v3.5) or longer than 4096 tokens (for Cohere-rerank-v3-English and Cohere-rerank-v3-multilingual) when including the length of the search query are split up into multiple chunks, where each chunk counts as a single document.

See the Cohere model collection in Azure AI Foundry portal.

Core42

Core42 includes autoregressive bi-lingual LLMs for Arabic & English with state-of-the-art capabilities in Arabic.

Model	Type	Capabilities
jais-30b-chat	chat-completion	- Input: text (8,192 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON

See this model collection in Azure AI Foundry portal.

Inference examples: Core42

For more examples of how to use Jais models, see the following examples:

Description	Language	Sample
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link

DeepSeek

DeepSeek family of models includes DeepSeek-R1, which excels at reasoning tasks using a step-by-step training process, such as language, scientific reasoning, and coding tasks, DeepSeek-V3-0324, a Mixture-of-Experts (MoE) language model, and more.

Model	Type	Capabilities
DeepSeek-R1-0528	chat-completion with reasoning content	- Input: text (163,840 tokens) - Output: text (163,840 tokens) - Languages: `en` and `zh` - Tool calling: No - Response formats: Text
DeekSeek-V3-0324	chat-completion	- Input: text (131,072 tokens) - Output: (131,072 tokens) - Tool calling: No - Response formats: Text, JSON
DeepSeek-V3 (Legacy)	chat-completion	- Input: text (131,072 tokens) - Output: text (131,072 tokens) - Tool calling: No - Response formats: Text, JSON
DeepSeek-R1	chat-completion with reasoning content	- Input: text (163,840 tokens) - Output: text (163,840 tokens) - Tool calling: No - Response formats: Text.

For a tutorial on DeepSeek-R1, see Tutorial: Get started with DeepSeek-R1 reasoning model in Foundry Models.

See this model collection in Azure AI Foundry portal.

Inference examples: DeepSeek

For more examples of how to use DeepSeek models, see the following examples:

Description	Language	Sample
Azure AI Inference package for Python	Python	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for Java	Java	Link

Model	Type	Capabilities
Llama-4-Scout-17B-16E-Instruct	chat-completion	- Input: text and image (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Llama 4-Maverick-17B-128E-Instruct-FP8	chat-completion	- Input: text and image (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Llama-3.3-70B-Instruct	chat-completion	- Input: text (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Llama-3.2-90B-Vision-Instruct	chat-completion (with images)	- Input: text and image (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Llama-3.2-11B-Vision-Instruct	chat-completion (with images)	- Input: text and image (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3.1-8B-Instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3.1-405B-Instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3.1-70B-Instruct (deprecated)	chat-completion	- Input: text (131,072 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3-8B-Instruct (deprecated)	chat-completion	- Input: text (8,192 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3-70B-Instruct (deprecated)	chat-completion	- Input: text (8,192 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text

Microsoft

Microsoft models include various model groups such as MAI models, Phi models, healthcare AI models, and more. To see all the available Microsoft models, view the Microsoft model collection in Azure AI Foundry portal.

Model	Type	Capabilities
MAI-DS-R1	chat-completion with reasoning content	- Input: text (163,840 tokens) - Output: text (163,840 tokens) - Tool calling: No - Response formats: Text.
Phi-4-reasoning	chat-completion with reasoning content	- Input: text (32768 tokens) - Output: text (32768 tokens) - Tool calling: No - Response formats: Text
Phi-4-mini-reasoning	chat-completion with reasoning content	- Input: text (128,000 tokens) - Output: text (128,000 tokens) - Tool calling: No - Response formats: Text
Phi-4-multimodal-instruct	chat-completion (with image and audio content)	- Input: text, images, and audio (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-4-mini-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-4	chat-completion	- Input: text (16,384 tokens) - Output: text (16,384 tokens) - Tool calling: No - Response formats: Text
Phi-3.5-mini-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3.5-MoE-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3.5-vision-instruct	chat-completion (with images)	- Input: text and image (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-mini-128k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-mini-4k-instruct	chat-completion	- Input: text (4,096 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-small-128k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-small-8k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-medium-128k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-medium-4k-instruct	chat-completion	- Input: text (4,096 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text

Inference examples: Microsoft models

For more examples of how to use Microsoft models, see the following examples:

Description	Language	Sample
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
LangChain	Python	Link
Llama-Index	Python	Link

See the Microsoft model collection in Azure AI Foundry portal.

Mistral AI

Mistral AI offers two categories of models, namely:

Premium models: These include Mistral Large, Mistral Small, Mistral-OCR-2503, Mistral Medium 3 (25.05), and Ministral 3B models, and are available as serverless APIs with pay-as-you-go token-based billing.
Open models: These include Mistral-small-2503, Codestral, and Mistral Nemo (that are available as serverless APIs with pay-as-you-go token-based billing), and Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01(that are available to download and run on self-hosted managed endpoints).

Model	Type	Capabilities
Codestral-2501	chat-completion	- Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Ministral-3B	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-Nemo	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-Large-2411	chat-completion	- Input: text (128,000 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-large-2407 (deprecated)	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-large (deprecated)	chat-completion	- Input: text (32,768 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-medium-2505	chat-completion	- Input: text (128,000 tokens), image - Output: text (128,000 tokens) - Tool calling: No - Response formats: Text, JSON
Mistral-OCR-2503	image to text	- Input: image or PDF pages (1,000 pages, max 50MB PDF file) - Output: text - Tool calling: No - Response formats: Text, JSON, Markdown
Mistral-small-2503	chat-completion (with images)	- Input: text and images (131,072 tokens), image-based tokens are 16px x 16px blocks of the original images - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-small	chat-completion	- Input: text (32,768 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON

See this model collection in Azure AI Foundry portal.

Inference examples: Mistral

For more examples of how to use Mistral models, see the following examples and tutorials:

Description	Language	Sample
CURL request	Bash	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
Python web requests	Python	Link
OpenAI SDK (experimental)	Python	Mistral - OpenAI SDK sample
LangChain	Python	Mistral - LangChain sample
Mistral AI	Python	Mistral - Mistral AI sample
LiteLLM	Python	Mistral - LiteLLM sample

Nixtla

Nixtla's TimeGEN-1 is a generative pre-trained forecasting and anomaly detection model for time series data. TimeGEN-1 can produce accurate forecasts for new time series without training, using only historical values and exogenous covariates as inputs.

To perform inferencing, TimeGEN-1 requires you to use Nixtla's custom inference API.

Model	Type	Capabilities	Inference API
TimeGEN-1	Forecasting	- Input: Time series data as JSON or dataframes (with support for multivariate input) - Output: Time series data as JSON - Tool calling: No - Response formats: JSON	Forecast client to interact with Nixtla's API

Estimate the number of tokens needed

Before you create a TimeGEN-1 deployment, it's useful to estimate the number of tokens that you plan to consume and be billed for. One token corresponds to one data point in your input dataset or output dataset.

Suppose you have the following input time series dataset:

Unique_id	Timestamp	Target Variable	Exogenous Variable 1	Exogenous Variable 2
BE	2016-10-22 00:00:00	70.00	49593.0	57253.0
BE	2016-10-22 01:00:00	37.10	46073.0	51887.0

To determine the number of tokens, multiply the number of rows (in this example, two) and the number of columns used for forecasting—not counting the unique_id and timestamp columns (in this example, three) to get a total of six tokens.

Given the following output dataset:

Unique_id	Timestamp	Forecasted Target Variable
BE	2016-10-22 02:00:00	46.57
BE	2016-10-22 03:00:00	48.57

You can also determine the number of tokens by counting the number of data points returned after data forecasting. In this example, the number of tokens is two.

Estimate pricing based on tokens

There are four pricing meters that determine the price you pay. These meters are as follows:

Pricing Meter	Description
paygo-inference-input-tokens	Costs associated with the tokens used as input for inference when finetune_steps = 0
paygo-inference-output-tokens	Costs associated with the tokens used as output for inference when finetune_steps = 0
paygo-finetuned-model-inference-input-tokens	Costs associated with the tokens used as input for inference when finetune_steps > 0
paygo-finetuned-model-inference-output-tokens	Costs associated with the tokens used as output for inference when finetune_steps > 0

See the Nixtla model collection in Azure AI Foundry portal.

NTT DATA

tsuzumi is an autoregressive language optimized transformer. The tuned versions use supervised fine-tuning (SFT). tsuzumi handles both Japanese and English language with high efficiency.

Model	Type	Capabilities
tsuzumi-7b	chat-completion	- Input: text (8,192 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text

Stability AI

The Stability AI collection of image generation models include Stable Image Core, Stable Image Ultra and Stable Diffusion 3.5 Large. Stable Diffusion 3.5 Large allows for an image and text input.

Model	Type	Capabilities
Stable Diffusion 3.5 Large	Image generation	- Input: text and image (1000 tokens and 1 image) - Output: 1 Image - Tool calling: No - Response formats: Image (PNG and JPG)
Stable Image Core	Image generation	- Input: text (1000 tokens) - Output: 1 Image - Tool calling: No - Response formats: Image (PNG and JPG)
Stable Image Ultra	Image generation	- Input: text (1000 tokens) - Output: 1 Image - Tool calling: No - Response formats: Image (PNG and JPG)

xAI

xAI's Grok 3 and Grok 3 Mini models are designed to excel in various enterprise domains. Grok 3, a non-reasoning model pre-trained by the Colossus datacenter, is tailored for business use cases such as data extraction, coding, and text summarization, with exceptional instruction-following capabilities. It supports a 131,072 token context window, allowing it to handle extensive inputs while maintaining coherence and depth, and is particularly adept at drawing connections across domains and languages. On the other hand, Grok 3 Mini is a lightweight reasoning model trained to tackle agentic, coding, mathematical, and deep science problems with test-time compute. It also supports a 131,072 token context window for understanding codebases and enterprise documents, and excels at using tools to solve complex logical problems in novel environments, offering raw reasoning traces for user inspection with adjustable thinking budgets.

Model	Type	Capabilities
grok-3	chat-completion	- Input: text (131,072 tokens) - Output: text (131,072 tokens) - Languages: `en` - Tool calling: yes - Response formats: text
grok-3-mini	chat-completion	- Input: text (131,072 tokens) - Output: text (131,072 tokens) - Languages: `en` - Tool calling: yes - Response formats: text

Inference examples: Stability AI

Stability AI models deployed via serverless API deployment implement the Model Inference API on the route /image/generations. For examples of how to use Stability AI models, see the following examples:

Share via

Azure AI Foundry Models available for serverless API deployment

AI21 Labs

Azure OpenAI

Cohere

Cohere command and embed

Inference examples: Cohere command and embed

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Cohere rerank

Pricing for Cohere rerank models

Core42

Inference examples: Core42

DeepSeek

Inference examples: DeepSeek

Meta

Inference examples: Meta Llama

Microsoft

Inference examples: Microsoft models

Mistral AI

Inference examples: Mistral

Nixtla

Estimate the number of tokens needed

Estimate pricing based on tokens

NTT DATA

Stability AI

xAI

Inference examples: Stability AI

Related content

Feedback

Additional resources