How to use Cohere Embed V3 models with Azure AI Foundry

Cohere embedding models

The Cohere family of models for embeddings includes the following models:

Cohere Embed v3 - English
Cohere Embed v3 - Multilingual

Cohere Embed English is a multimodal (text and image) representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace (massive text embed) MTEB benchmark and on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes:

Embed English has 1,024 dimensions
Context window of the model is 512 tokens
Embed English accepts images as a base64 encoded data url

Image embeddings consume a fixed number of tokens per image—1,000 tokens per image—which translates to a price of $0.0001 per image embedded. The size or resolution of the image doesn't affect the number of tokens consumed, provided the image is within the accepted dimensions, file size, and formats.

Prerequisites

To use Cohere Embed V3 models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to serverless APIs

Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.

Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to deploy the model as a serverless API.

Deploy the model to serverless API endpoints

The inference package installed

You can consume predictions from this model by using the azure-ai-inference package with Python. To install this package, you need the following prerequisites:

Python 3.8 or later installed, including pip.
The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Once you have these prerequisites, install the Azure AI inference package with the following command:

pip install azure-ai-inference

Read more about the Azure AI inference package and reference.

Tip

Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check Cohere documentation.

Work with embeddings

In this section, you use the Azure AI model inference API with an embeddings model.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

import os
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

model = EmbeddingsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
)

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method:

model_info = model.get_model_info()

The response is as follows:

print("Model name:", model_info.model_name)
print("Model type:", model_info.model_type)
print("Model provider name:", model_info.model_provider)

Model name: Cohere-embed-v3-english
Model type": embeddings
Model provider name": Cohere

Create embeddings

Create an embedding request to see the output of the model.

response = model.embed(
    input=["The ultimate answer to the question of life"],
)

Tip

The context window for Cohere Embed V3 models is 512. Make sure that you don't exceed this limit when creating embeddings.

The response is as follows, where you can see the model's usage statistics:

import numpy as np

for embed in response.data:
    print("Embeding of size:", np.asarray(embed.embedding).shape)

print("Model:", response.model)
print("Usage:", response.usage)

It can be useful to compute embeddings in input batches. The parameter inputs can be a list of strings, where each string is a different input. In turn the response is a list of embeddings, where each embedding corresponds to the input in the same position.

response = model.embed(
    input=[
        "The ultimate answer to the question of life", 
        "The largest planet in our solar system is Jupiter",
    ],
)

The response is as follows, where you can see the model's usage statistics:

import numpy as np

for embed in response.data:
    print("Embeding of size:", np.asarray(embed.embedding).shape)

print("Model:", response.model)
print("Usage:", response.usage)

Tip

Cohere Embed V3 models can take batches of 1024 at a time. When creating batches, make sure that you don't exceed this limit.

Create different types of embeddings

Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns.

The following example shows how to create embeddings that are used to create an embedding for a document that will be stored in a vector database:

from azure.ai.inference.models import EmbeddingInputType

response = model.embed(
    input=["The answer to the ultimate question of life, the universe, and everything is 42"],
    input_type=EmbeddingInputType.DOCUMENT,
)

When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance.

from azure.ai.inference.models import EmbeddingInputType

response = model.embed(
    input=["What's the ultimate meaning of life?"],
    input_type=EmbeddingInputType.QUERY,
)

Cohere Embed V3 models can optimize the embeddings based on its use case.

Cohere embedding models

The Cohere family of models for embeddings includes the following models:

Cohere Embed v3 - English
Cohere Embed v3 - Multilingual

Cohere Embed English is a multimodal (text and image) representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace (massive text embed) MTEB benchmark and on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes:

Embed English has 1,024 dimensions
Context window of the model is 512 tokens
Embed English accepts images as a base64 encoded data url

Image embeddings consume a fixed number of tokens per image—1,000 tokens per image—which translates to a price of $0.0001 per image embedded. The size or resolution of the image doesn't affect the number of tokens consumed, provided the image is within the accepted dimensions, file size, and formats.

Prerequisites

To use Cohere Embed V3 models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to serverless APIs

Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.

Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to deploy the model as a serverless API.

Deploy the model to serverless API endpoints

The inference package installed

You can consume predictions from this model by using the @azure-rest/ai-inference package from npm. To install this package, you need the following prerequisites:

LTS versions of Node.js with npm.
The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command:

npm install @azure-rest/ai-inference

Tip

Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check Cohere documentation.

Work with embeddings

In this section, you use the Azure AI model inference API with an embeddings model.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new ModelClient(
    process.env.AZURE_INFERENCE_ENDPOINT, 
    new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL)
);

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method:

await client.path("/info").get()

The response is as follows:

console.log("Model name: ", model_info.body.model_name);
console.log("Model type: ", model_info.body.model_type);
console.log("Model provider name: ", model_info.body.model_provider_name);

Model name: Cohere-embed-v3-english
Model type": embeddings
Model provider name": Cohere

Create embeddings

Create an embedding request to see the output of the model.

var response = await client.path("/embeddings").post({
    body: {
        input: ["The ultimate answer to the question of life"],
    }
});

Tip

The context window for Cohere Embed V3 models is 512. Make sure that you don't exceed this limit when creating embeddings.

The response is as follows, where you can see the model's usage statistics:

if (isUnexpected(response)) {
    throw response.body.error;
}

console.log(response.embedding);
console.log(response.body.model);
console.log(response.body.usage);

It can be useful to compute embeddings in input batches. The parameter inputs can be a list of strings, where each string is a different input. In turn the response is a list of embeddings, where each embedding corresponds to the input in the same position.

var response = await client.path("/embeddings").post({
    body: {
        input: [
            "The ultimate answer to the question of life", 
            "The largest planet in our solar system is Jupiter",
        ],
    }
});

The response is as follows, where you can see the model's usage statistics:

if (isUnexpected(response)) {
    throw response.body.error;
}

console.log(response.embedding);
console.log(response.body.model);
console.log(response.body.usage);

Tip

Cohere Embed V3 models can take batches of 1024 at a time. When creating batches, make sure that you don't exceed this limit.

Create different types of embeddings

Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns.

The following example shows how to create embeddings that are used to create an embedding for a document that will be stored in a vector database:

var response = await client.path("/embeddings").post({
    body: {
        input: ["The answer to the ultimate question of life, the universe, and everything is 42"],
        input_type: "document",
    }
});

When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance.

var response = await client.path("/embeddings").post({
    body: {
        input: ["What's the ultimate meaning of life?"],
        input_type: "query",
    }
});

Cohere Embed V3 models can optimize the embeddings based on its use case.

Cohere embedding models

The Cohere family of models for embeddings includes the following models:

Cohere Embed v3 - English
Cohere Embed v3 - Multilingual

Cohere Embed English is a multimodal (text and image) representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace (massive text embed) MTEB benchmark and on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes:

Embed English has 1,024 dimensions
Context window of the model is 512 tokens
Embed English accepts images as a base64 encoded data url

Image embeddings consume a fixed number of tokens per image—1,000 tokens per image—which translates to a price of $0.0001 per image embedded. The size or resolution of the image doesn't affect the number of tokens consumed, provided the image is within the accepted dimensions, file size, and formats.

Prerequisites

To use Cohere Embed V3 models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to serverless APIs

Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.

Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to deploy the model as a serverless API.

Deploy the model to serverless API endpoints

A REST client

Models deployed with the Azure AI model inference API can be consumed using any REST client. To use the REST client, you need the following prerequisites:

To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Tip

Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check Cohere documentation.

Work with embeddings

In this section, you use the Azure AI model inference API with an embeddings model.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method:

GET /info HTTP/1.1
Host: <ENDPOINT_URI>
Authorization: Bearer <TOKEN>
Content-Type: application/json

The response is as follows:

{
    "model_name": "Cohere-embed-v3-english",
    "model_type": "embeddings",
    "model_provider_name": "Cohere"
}

Create embeddings

Create an embedding request to see the output of the model.

{
    "input": [
        "The ultimate answer to the question of life"
    ]
}

Tip

The context window for Cohere Embed V3 models is 512. Make sure that you don't exceed this limit when creating embeddings.

The response is as follows, where you can see the model's usage statistics:

{
    "id": "0ab1234c-d5e6-7fgh-i890-j1234k123456",
    "object": "list",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": [
                0.017196655,
                // ...
                -0.000687122,
                -0.025054932,
                -0.015777588
            ]
        }
    ],
    "model": "Cohere-embed-v3-english",
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 0,
        "total_tokens": 9
    }
}

It can be useful to compute embeddings in input batches. The parameter inputs can be a list of strings, where each string is a different input. In turn the response is a list of embeddings, where each embedding corresponds to the input in the same position.

{
    "input": [
        "The ultimate answer to the question of life", 
        "The largest planet in our solar system is Jupiter"
    ]
}

The response is as follows, where you can see the model's usage statistics:

{
    "id": "0ab1234c-d5e6-7fgh-i890-j1234k123456",
    "object": "list",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": [
                0.017196655,
                // ...
                -0.000687122,
                -0.025054932,
                -0.015777588
            ]
        },
        {
            "index": 1,
            "object": "embedding",
            "embedding": [
                0.017196655,
                // ...
                -0.000687122,
                -0.025054932,
                -0.015777588
            ]
        }
    ],
    "model": "Cohere-embed-v3-english",
    "usage": {
        "prompt_tokens": 19,
        "completion_tokens": 0,
        "total_tokens": 19
    }
}

Tip

Cohere Embed V3 models can take batches of 1024 at a time. When creating batches, make sure that you don't exceed this limit.

Create different types of embeddings

Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns.

The following example shows how to create embeddings that are used to create an embedding for a document that will be stored in a vector database:

{
    "input": [
        "The answer to the ultimate question of life, the universe, and everything is 42"
    ],
    "input_type": "document"
}

When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance.

{
    "input": [
        "What's the ultimate meaning of life?"
    ],
    "input_type": "query"
}

Cohere Embed V3 models can optimize the embeddings based on its use case.

Description	Language	Sample
Web requests	Bash	cohere-embed.ipynb
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
OpenAI SDK (experimental)	Python	Link
LangChain	Python	Link
Cohere SDK	Python	Link
LiteLLM SDK	Python	Link

Description	Packages	Sample
Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain	`langchain`, `langchain_cohere`	cohere_faiss_langchain_embed.ipynb
Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain	`langchain`, `langchain_cohere`	command_faiss_langchain.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain	`langchain`, `langchain_cohere`	cohere-aisearch-langchain-rag.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK	`cohere`, `azure_search_documents`	cohere-aisearch-rag.ipynb
Command R+ tool/function calling, using LangChain	`cohere`, `langchain`, `langchain_cohere`	command_tools-langchain.ipynb

Share via

Cohere embedding models

Prerequisites

A model deployment

The inference package installed

Work with embeddings

Create a client to consume the model

Get the model's capabilities

Create embeddings

Create different types of embeddings

Cohere embedding models

Prerequisites

A model deployment

The inference package installed

Work with embeddings

Create a client to consume the model

Get the model's capabilities

Create embeddings

Create different types of embeddings

Cohere embedding models

Prerequisites

A model deployment

A REST client

Work with embeddings

Create a client to consume the model

Get the model's capabilities

Create embeddings

Create different types of embeddings

More inference examples

Retrieval Augmented Generation (RAG) and tool use samples

Cost and quota considerations for Cohere family of models deployed as serverless API endpoints

Feedback

Additional resources

Share via

How to use Cohere Embed V3 models with Azure AI Foundry

Cohere embedding models

Prerequisites

A model deployment

The inference package installed

Work with embeddings

Create a client to consume the model

Get the model's capabilities

Create embeddings

Create different types of embeddings

Cohere embedding models

Prerequisites

A model deployment

The inference package installed

Work with embeddings

Create a client to consume the model

Get the model's capabilities

Create embeddings

Create different types of embeddings

Cohere embedding models

Prerequisites

A model deployment

A REST client

Work with embeddings

Create a client to consume the model

Get the model's capabilities

Create embeddings

Create different types of embeddings

More inference examples

Retrieval Augmented Generation (RAG) and tool use samples

Cost and quota considerations for Cohere family of models deployed as serverless API endpoints

Related content

Feedback

Additional resources