Serverless API inference examples for Foundry Models

Note

This document refers to the Microsoft Foundry (classic) portal.

🔍 View the Microsoft Foundry (new) documentation to learn about the new portal.

The Foundry model catalog offers a large selection of Microsoft Foundry Models from a wide range of providers. You have various options for deploying models from the model catalog. This article lists inference examples for serverless API deployments.

Important

Models that are in preview are marked as preview on their model cards in the model catalog.

To perform inferencing with the models, some models such as Nixtla's TimeGEN-1 and Cohere rerank require you to use custom APIs from the model providers. Others support inferencing using the Model Inference API. You can find more details about individual models by reviewing their model cards in the model catalog for Foundry portal.

Cohere

The Cohere family of models includes various models optimized for different use cases, including rerank, chat completions, and embeddings models.

Inference examples: Cohere command and embed

The following table provides links to examples of how to use Cohere models.

Description	Language	Sample
Web requests	Bash	Command-R Command-R+ cohere-embed.ipynb
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
OpenAI SDK (experimental)	Python	Link
LangChain	Python	Link
Cohere SDK	Python	Command Embed
LiteLLM SDK	Python	Link

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Description	Packages	Sample
Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain	`langchain`, `langchain_cohere`	cohere_faiss_langchain_embed.ipynb
Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain	`langchain`, `langchain_cohere`	command_faiss_langchain.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain	`langchain`, `langchain_cohere`	cohere-aisearch-langchain-rag.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK	`cohere`, `azure_search_documents`	cohere-aisearch-rag.ipynb
Command R+ tool/function calling, using LangChain	`cohere`, `langchain`, `langchain_cohere`	command_tools-langchain.ipynb

Cohere rerank

To perform inferencing with Cohere rerank models, you're required to use Cohere's custom rerank APIs. For more information on the Cohere rerank model and its capabilities, see Cohere rerank.

Pricing for Cohere rerank models

Queries, not to be confused with a user's query, is a pricing meter that refers to the cost associated with the tokens used as input for inference of a Cohere Rerank model. Cohere counts a single search unit as a query with up to 100 documents to be ranked. Documents longer than 500 tokens (for Cohere-rerank-v3.5) or longer than 4096 tokens (for Cohere-rerank-v3-English and Cohere-rerank-v3-multilingual) when including the length of the search query are split up into multiple chunks, where each chunk counts as a single document.

See the Cohere model collection in Foundry portal.

Core42

The following table provides links to examples of how to use Jais models.

Description	Language	Sample
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link

DeepSeek

DeepSeek family of models includes DeepSeek-R1, which excels at reasoning tasks using a step-by-step training process, such as language, scientific reasoning, and coding tasks, DeepSeek-V3-0324, a Mixture-of-Experts (MoE) language model, and more.

The following table provides links to examples of how to use DeepSeek models.

Description	Language	Sample
Azure AI Inference package for Python	Python	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for Java	Java	Link

Microsoft

Microsoft models include various model groups such as MAI models, Phi models, healthcare AI models, and more. To see all the available Microsoft models, view the Microsoft model collection in Foundry portal.

The following table provides links to examples of how to use Microsoft models.

Description	Language	Sample
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
LangChain	Python	Link
Llama-Index	Python	Link

See the Microsoft model collection in Foundry portal.

Mistral AI

Mistral AI offers two categories of models, namely:

Premium models: These include Mistral Large, Mistral Small, Mistral-OCR-2503, Mistral Medium 3 (25.05), and Ministral 3B models, and are available as serverless APIs with pay-as-you-go token-based billing.
Open models: These include Mistral-small-2503, Codestral, and Mistral Nemo (that are available as serverless APIs with pay-as-you-go token-based billing), and Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01(that are available to download and run on self-hosted managed endpoints).

The following table provides links to examples of how to use Mistral models.

Description	Language	Sample
CURL request	Bash	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
Python web requests	Python	Link
OpenAI SDK (experimental)	Python	Mistral - OpenAI SDK sample
LangChain	Python	Mistral - LangChain sample
Mistral AI	Python	Mistral - Mistral AI sample
LiteLLM	Python	Mistral - LiteLLM sample

Nixtla

Nixtla's TimeGEN-1 is a generative pre-trained forecasting and anomaly detection model for time series data. TimeGEN-1 can produce accurate forecasts for new time series without training, using only historical values and exogenous covariates as inputs.

To perform inferencing, TimeGEN-1 requires you to use Nixtla's custom inference API. For more information on the TimeGEN-1 model and its capabilities, see Nixtla.

Estimate the number of tokens needed

Before you create a TimeGEN-1 deployment, it's useful to estimate the number of tokens that you plan to consume and be billed for. One token corresponds to one data point in your input dataset or output dataset.

Suppose you have the following input time series dataset:

Unique_id	Timestamp	Target Variable	Exogenous Variable 1	Exogenous Variable 2
BE	2016-10-22 00:00:00	70.00	49593.0	57253.0
BE	2016-10-22 01:00:00	37.10	46073.0	51887.0

To determine the number of tokens, multiply the number of rows (in this example, two) and the number of columns used for forecasting—not counting the unique_id and timestamp columns (in this example, three) to get a total of six tokens.

Given the following output dataset:

Unique_id	Timestamp	Forecasted Target Variable
BE	2016-10-22 02:00:00	46.57
BE	2016-10-22 03:00:00	48.57

You can also determine the number of tokens by counting the number of data points returned after data forecasting. In this example, the number of tokens is two.

Estimate pricing based on tokens

There are four pricing meters that determine the price you pay. These meters are as follows:

Pricing Meter	Description
paygo-inference-input-tokens	Costs associated with the tokens used as input for inference when finetune_steps = 0
paygo-inference-output-tokens	Costs associated with the tokens used as output for inference when finetune_steps = 0
paygo-finetuned-model-inference-input-tokens	Costs associated with the tokens used as input for inference when finetune_steps > 0
paygo-finetuned-model-inference-output-tokens	Costs associated with the tokens used as output for inference when finetune_steps > 0

See the Nixtla model collection in Foundry portal.

Stability AI

Stability AI models deployed via serverless API deployment implement the Model Inference API on the route /image/generations. For examples of how to use Stability AI models, see the following examples:

Gretel Navigator

Gretel Navigator employs a compound AI architecture specifically engineered for synthetic data, by combining top open-source small language models (SLMs) fine-tuned across more than 10 industry domains. This purpose-built system creates diverse, domain-specific datasets at scales of hundreds to millions of examples. The system also preserves complex statistical relationships and offers increased speed and accuracy compared to manual data creation.

Description	Language	Sample
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link

Feedback

Was this page helpful?

Last updated on 2025-12-09