Important
This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Azure Databricks previews.
In this article, you learn how to write query requests for Databricks-hosted foundation models served by model services in Unity AI Gateway, organized by model type: chat, vision, audio and video, and reasoning.
Requirements
- See Requirements.
- Install the appropriate package on your cluster based on the querying client option you choose.
Note
The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with an endpoint name. See Discover foundation models for a list of available foundation models and their model service and endpoint names.
Chat
Chat models are foundation models optimized for chat and general-purpose tasks.
The examples in this section show how to query a model service using the different client options.
OpenAI Chat Completions
To use the OpenAI client, specify the model service name as the model input. The following example assumes you have an Azure Databricks API token and openai installed on your compute. You also need your Azure Databricks workspace instance to connect the OpenAI client to Azure Databricks.
import os
from openai import OpenAI

# Connect the OpenAI client to the workspace through Unity AI Gateway.
client = OpenAI(
    api_key=os.environ.get('DATABRICKS_TOKEN'),
    base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)

# The model argument is the model service name.
response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is a mixture of experts model?",
        }
    ],
    max_tokens=256
)
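As a minimal sketch, the base URL used in the example above could be derived from a workspace host as follows. The helper name gateway_base_url and the sample host are hypothetical; only the /ai-gateway/mlflow/v1 path comes from the example.

```python
from urllib.parse import urlsplit


def gateway_base_url(workspace: str) -> str:
    """Build the OpenAI-compatible base URL for a workspace.

    `workspace` may be a bare host name or a full URL (hypothetical inputs).
    """
    # Add a scheme if the caller passed only a host name.
    if "://" not in workspace:
        workspace = f"https://{workspace}"
    host = urlsplit(workspace).netloc
    if not host:
        raise ValueError(f"cannot parse a workspace host from {workspace!r}")
    # Any path on the input is dropped; only the host is kept.
    return f"https://{host}/ai-gateway/mlflow/v1"


# Hypothetical workspace host, used only for illustration.
print(gateway_base_url("adb-1234567890123456.7.azuredatabricks.net/"))
```

The result can be passed as base_url when constructing the OpenAI client.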
As an example, the following is the expected request format for a chat model when using the REST API.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
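The request and response formats above can be handled with plain dictionaries. The following is a rough sketch, not part of the product API: the helper names build_chat_request and check_usage are hypothetical, and the sample response contains only the usage fields shown above.

```python
import json


def build_chat_request(prompt: str, max_tokens: int = 100, temperature: float = 0.1) -> dict:
    """Return a request body shaped like the example request format."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def check_usage(response: dict) -> int:
    """Check that the usage counters add up and return the total token count."""
    usage = response["usage"]
    expected = usage["prompt_tokens"] + usage["completion_tokens"]
    if usage["total_tokens"] != expected:
        raise ValueError("usage counters do not add up")
    return usage["total_tokens"]


body = build_chat_request("What is a mixture of experts model?")
print(json.dumps(body, indent=2))

# Usage values copied from the example response format.
sample_response = {"usage": {"prompt_tokens": 7, "completion_tokens": 74, "total_tokens": 81}}
print(check_usage(sample_response))
```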
OpenAI Responses
Important
The Responses API is only compatible with OpenAI models.
To use the OpenAI Responses API, specify the model service name as the model input. The following example assumes you have an Azure Databricks API token and openai installed on your compute. You also need your Azure Databricks workspace instance to connect the OpenAI client to Azure Databricks.
import os
from openai import OpenAI

# Connect the OpenAI client to the workspace through Unity AI Gateway.
client = OpenAI(
    api_key=os.environ.get('DATABRICKS_TOKEN'),
    base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)

# The Responses API takes input and max_output_tokens instead of messages and max_tokens.
response = client.responses.create(
    model="system.ai.gpt-5",
    input=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is a mixture of experts model?",
        }
    ],
    max_output_tokens=256
)
As an example, the following is the expected request format when using the OpenAI Responses API. The URL path for this API is /serving-endpoints/responses.
{
"model": "databricks-gpt-5",
"input": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_output_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the Responses API:
{
"id": "resp_abc123",
"object": "response",
"created_at": 1698824353,
"model": "databricks-gpt-5",
"output": [
{
"type": "message",
"role": "assistant",
"content": []
}
],
"usage": {
"input_tokens": 7,
"output_tokens": 74,
"total_tokens": 81
}
}
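The example response above shows an empty content array. As a hedged sketch, the following helper collects the generated text from a response of that shape. It assumes that text is returned in content items of type output_text with a text field, as in the OpenAI Responses API; the helper name extract_output_text and the sample values are hypothetical.

```python
def extract_output_text(response: dict) -> str:
    """Concatenate the text of all output_text items in a Responses API payload."""
    parts = []
    for item in response.get("output", []):
        # Skip output items that are not assistant messages.
        if item.get("type") != "message":
            continue
        for block in item.get("content", []):
            # Assumption: generated text is carried in "output_text" content items.
            if block.get("type") == "output_text":
                parts.append(block.get("text", ""))
    return "".join(parts)


# Hypothetical payload following the response format above.
sample = {
    "id": "resp_abc123",
    "object": "response",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "A mixture of experts model routes tokens."}],
        }
    ],
}
print(extract_output_text(sample))
```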
REST API
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.claude-sonnet-4-5",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?"
}
]
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
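For readers who prefer Python's standard library to curl, the following sketch builds an equivalent HTTP request without sending it. The function name build_chat_http_request, the sample host, and the sample token are hypothetical. The Basic authorization header mirrors what curl's -u token:$DATABRICKS_TOKEN option produces.

```python
import base64
import json
import urllib.request


def build_chat_http_request(workspace_host: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl command above."""
    # curl's `-u token:<value>` sends HTTP Basic auth with the literal user name "token".
    credentials = base64.b64encode(f"token:{token}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"https://{workspace_host}/ai-gateway/mlflow/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host and token, used only for illustration.
request = build_chat_http_request(
    "example.cloud.databricks.com",
    "dapi-example-token",
    {
        "model": "system.ai.claude-sonnet-4-5",
        "messages": [{"role": "user", "content": "What is a mixture of experts model?"}],
    },
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is omitted here because it requires a real workspace and token.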
Databricks Python SDK
This code must be run in a notebook in your workspace. See Use the Databricks SDK for Python from an Azure Databricks notebook.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole
w = WorkspaceClient()
response = w.serving_endpoints.query(
name="system.ai.claude-sonnet-4-5",
messages=[
ChatMessage(
role=ChatMessageRole.SYSTEM, content="You are a helpful assistant."
),
ChatMessage(
role=ChatMessageRole.USER, content="What is a mixture of experts model?"
),
],
max_tokens=128,
)
print(f"RESPONSE:\n{response.choices[0].message.content}")
Vision
Query Databricks-hosted vision models through model services in Unity AI Gateway to understand and analyze images with a unified API.
OpenAI client
To use the OpenAI client, specify the model service name as the model input.
from openai import OpenAI
import base64
import requests
# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
client = OpenAI(
api_key=API_TOKEN,
base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)
# Download and encode image
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
resp = requests.get(image_url)
resp.raise_for_status()
image_data = base64.b64encode(resp.content).decode("utf-8")
# OpenAI request
completion = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "what's in this image?"},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
},
],
}
],
)
print(completion.choices[0].message.content)
The Chat Completions API supports multiple image inputs, allowing the model to analyze each image and synthesize information from all inputs to generate a response to the prompt.
from openai import OpenAI
import base64
import requests
# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
client = OpenAI(
api_key=API_TOKEN,
base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)
# Download and encode multiple images
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp1 = requests.get(image1_url)
resp1.raise_for_status()
image1_data = base64.b64encode(resp1.content).decode("utf-8")
image2_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp2 = requests.get(image2_url)
resp2.raise_for_status()
image2_data = base64.b64encode(resp2.content).decode("utf-8")
# OpenAI request
completion = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What are in these images? Is there any difference between them?"},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image1_data}"},
},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image2_data}"},
},
],
}
],
)
print(completion.choices[0].message.content)
Input image requirements
| Model(s) | Supported formats | Multiple images per request | Image size limitations | Image resizing recommendations | Image quality considerations |
|---|---|---|---|---|---|
| databricks-gpt-5 | | Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A | |
| databricks-gpt-5-mini | | Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A | |
| databricks-gpt-5-nano | | Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A | |
| databricks-gemma-3-12b | | Up to 5 images for API requests | File size limit: 10 MB total across all images per API request | N/A | N/A |
| databricks-llama-4-maverick | | Up to 5 images for API requests | File size limit: 10 MB total across all images per API request | N/A | N/A |
| | | | | For optimal performance, resize images before uploading if they are too large. | |
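The per-request limits in the table above can be checked on the client before a request is sent. The following is a minimal sketch under two stated assumptions: that "10 MB" means 10 × 1024 × 1024 bytes, and that the limit applies to the base64-encoded size of the images. The helper name check_image_batch is hypothetical.

```python
MAX_TOTAL_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" is interpreted as 10 MiB


def check_image_batch(images: list, max_images: int, max_total_bytes: int = MAX_TOTAL_BYTES) -> int:
    """Validate a batch of raw image bytes and return the base64-encoded size."""
    if len(images) > max_images:
        raise ValueError(f"{len(images)} images exceed the limit of {max_images}")
    # Base64 expands every 3 input bytes (rounded up) into 4 output characters.
    encoded_size = sum(4 * ((len(image) + 2) // 3) for image in images)
    if encoded_size > max_total_bytes:
        raise ValueError(f"encoded size {encoded_size} exceeds {max_total_bytes} bytes")
    return encoded_size


# Five small placeholder images against the 5-image limit from the table.
print(check_image_batch([b"\x00" * 300] * 5, max_images=5))
```

Pass max_images=500 for the models that the table lists with a 500-image limit.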
Image to token conversion
Each image in a request to a foundation model adds to your token usage. See the pricing calculator to estimate image pricing based on the token usage and model you are using.
Limitations of image understanding
The following are image understanding limitations for the supported Databricks-hosted foundation models:
| Model | Limitations |
|---|---|
| The following Claude models are supported: | The following are the limits for Claude models on Databricks: |
Audio and video
Send audio and video inputs to Gemini foundation models served by Unity AI Gateway on Azure Databricks. You can provide media as a URL or as base64-encoded inline data using the Chat Completions API or the Google Gemini API.
You can provide audio and video inputs using two methods:
- URL: Pass a publicly accessible URL to the media file. For video, YouTube URLs are also supported.
- Base64 inline data: Encode the file as a base64 string and pass it as a data URI (for example, data:video/mp4;base64,<encoded_data>).
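The base64 method can be sketched with the standard library as follows. The helper name to_data_uri is hypothetical, and the two-byte payload stands in for real media bytes.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw media bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes; a real call would pass the contents of a media file.
print(to_data_uri(b"hi", "video/mp4"))  # data:video/mp4;base64,aGk=
```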
Chat Completions API
The chat completions API allows you to pass video and audio input. Use the video_url and audio_url content types in the messages array to pass media inputs. Each content item includes a url field that accepts either a web URL or a base64 data URI.
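As an illustration of the content item shape described above, the following sketch builds video_url and audio_url items and wraps them in a user message. The helper names media_item and user_message are hypothetical; the dictionary keys follow the description in the preceding paragraph.

```python
def media_item(kind: str, url: str) -> dict:
    """Build a video_url or audio_url content item for the messages array."""
    if kind not in ("video", "audio"):
        raise ValueError("kind must be 'video' or 'audio'")
    key = f"{kind}_url"
    # The url field accepts either a web URL or a base64 data URI.
    return {"type": key, key: {"url": url}}


def user_message(prompt: str, media: list) -> dict:
    """Combine a text prompt and media items into one user message."""
    return {"role": "user", "content": [{"type": "text", "text": prompt}, *media]}


message = user_message(
    "Summarize what happens in this video.",
    [media_item("video", "https://example.com/sample-video.mp4")],
)
print(message)
```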
The following examples show video and audio input using the Chat Completions API.
Python
import os
import base64
from openai import OpenAI
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')
client = OpenAI(
api_key=DATABRICKS_TOKEN,
base_url=DATABRICKS_BASE_URL
)
# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="system.ai.gemini-3-1-pro",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Summarize what happens in these videos."},
{
"type": "video_url",
"video_url": {"url": "https://example.com/sample-video.mp4"}
},
{
"type": "video_url",
"video_url": {"url": f"data:video/mp4;base64,{video_b64}"}
},
]
}],
max_tokens=1024
)
print(response.choices[0].message.content)
import os
import base64
from openai import OpenAI
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')
client = OpenAI(
api_key=DATABRICKS_TOKEN,
base_url=DATABRICKS_BASE_URL
)
# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="system.ai.gemini-3-1-pro",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio and summarize the key points."},
{
"type": "audio_url",
"audio_url": {"url": "https://example.com/sample-audio.mp3"}
},
{
"type": "audio_url",
"audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}"}
},
]
}],
max_tokens=1024
)
print(response.choices[0].message.content)
REST API
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gemini-3-1-pro",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Summarize what happens in these videos."},
{
"type": "video_url",
"video_url": {"url": "https://example.com/sample-video.mp4"}
},
{
"type": "video_url",
"video_url": {"url": "data:video/mp4;base64,<base64_encoded_data>"}
}
]
}],
"max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gemini-3-1-pro",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio and summarize the key points."},
{
"type": "audio_url",
"audio_url": {"url": "https://example.com/sample-audio.mp3"}
},
{
"type": "audio_url",
"audio_url": {"url": "data:audio/mp3;base64,<base64_encoded_data>"}
}
]
}],
"max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
Google Gemini API
Use the Google Gemini API to pass media as inlineData (base64-encoded) or fileData (URL reference) within the parts array.
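The parts array can also be assembled as plain JSON-style dictionaries. The following is a sketch only: the helper names file_part, inline_part, and build_body are hypothetical, and the field names (fileData, inlineData, mimeType, fileUri, data) are taken from the REST examples in this section.

```python
import base64


def file_part(mime_type: str, uri: str) -> dict:
    """Reference media by URL using fileData."""
    return {"fileData": {"mimeType": mime_type, "fileUri": uri}}


def inline_part(mime_type: str, data: bytes) -> dict:
    """Embed media as base64-encoded inlineData."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"inlineData": {"mimeType": mime_type, "data": encoded}}


def build_body(prompt: str, parts: list) -> dict:
    """Build a generateContent request body with one user turn."""
    return {"contents": [{"role": "user", "parts": [{"text": prompt}, *parts]}]}


body = build_body(
    "Summarize what happens in these videos.",
    [
        file_part("video/mp4", "https://example.com/sample-video.mp4"),
        inline_part("video/mp4", b"placeholder bytes"),  # stands in for real file contents
    ],
)
print(body["contents"][0]["parts"][1])
```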
The following examples show video and audio input using the Google Gemini API.
Python
from google import genai
from google.genai import types
import base64
import os
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
client = genai.Client(
api_key="databricks",
http_options=types.HttpOptions(
base_url="https://<workspace-url>/ai-gateway/gemini",
headers={
"Authorization": f"Bearer {DATABRICKS_TOKEN}",
},
),
)
# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.models.generate_content(
model="system.ai.gemini-3-1-pro",
contents=[
types.Content(
role="user",
parts=[
types.Part(text="Summarize what happens in these videos."),
types.Part(
file_data=types.FileData(
mime_type="video/mp4",
file_uri="https://example.com/sample-video.mp4",
)
),
types.Part(
inline_data=types.Blob(
mime_type="video/mp4",
data=video_b64,
)
),
],
),
],
config=types.GenerateContentConfig(
max_output_tokens=1024,
),
)
print(response.text)
from google import genai
from google.genai import types
import base64
import os
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
client = genai.Client(
api_key="databricks",
http_options=types.HttpOptions(
base_url="https://<workspace-url>/ai-gateway/gemini",
headers={
"Authorization": f"Bearer {DATABRICKS_TOKEN}",
},
),
)
# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.models.generate_content(
model="system.ai.gemini-3-1-pro",
contents=[
types.Content(
role="user",
parts=[
types.Part(text="Transcribe this audio and summarize the key points."),
types.Part(
file_data=types.FileData(
mime_type="audio/mp3",
file_uri="https://example.com/sample-audio.mp3",
)
),
types.Part(
inline_data=types.Blob(
mime_type="audio/mp3",
data=audio_b64,
)
),
],
),
],
config=types.GenerateContentConfig(
max_output_tokens=1024,
),
)
print(response.text)
REST API
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"role": "user",
"parts": [
{"text": "Summarize what happens in these videos."},
{
"fileData": {
"mimeType": "video/mp4",
"fileUri": "https://example.com/sample-video.mp4"
}
},
{
"inlineData": {
"mimeType": "video/mp4",
"data": "<base64_encoded_data>"
}
}
]
}]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"role": "user",
"parts": [
{"text": "Transcribe this audio and summarize the key points."},
{
"fileData": {
"mimeType": "audio/mp3",
"fileUri": "https://example.com/sample-audio.mp3"
}
},
{
"inlineData": {
"mimeType": "audio/mp3",
"data": "<base64_encoded_data>"
}
}
]
}]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent
Limitations
- Multiple audio or video inputs can be included in a single request, but large files increase latency and token usage.
Reasoning
Reasoning models are foundation models optimized for reasoning tasks. The Databricks Foundation Model API provides a unified API for interacting with all foundation models, including reasoning models. Reasoning gives foundation models enhanced capabilities to tackle complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.
Types of reasoning models
There are two types of reasoning models: reasoning-only and hybrid. The following table describes how each type controls reasoning:
| Reasoning model type | Details | Model examples | Parameters |
|---|---|---|---|
| Hybrid reasoning | Supports both fast, instant replies and deeper reasoning when needed. | Claude models like databricks-claude-sonnet-4-6, databricks-claude-sonnet-4-5, databricks-claude-sonnet-4, databricks-claude-opus-4-8, databricks-claude-opus-4-7, databricks-claude-opus-4-6, databricks-claude-opus-4-5, and databricks-claude-opus-4-1. | Include the following parameters to use hybrid reasoning: thinking and budget_tokens. |
| Reasoning only | These models always use internal reasoning in their responses. | GPT OSS models like databricks-gpt-oss-120b and databricks-gpt-oss-20b. | Use the following parameter in your request: reasoning_effort. |
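The two approaches in the table above can be sketched as keyword arguments for client.chat.completions.create. The helper names are hypothetical. The thinking and reasoning_effort values mirror the query examples in this article, and the check that budget_tokens stays below max_tokens is an assumption based on the values those examples use.

```python
def hybrid_reasoning_kwargs(max_tokens: int, budget_tokens: int) -> dict:
    """Arguments that enable reasoning on a hybrid model."""
    # Assumption: the reasoning budget must leave room for the final answer.
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "max_tokens": max_tokens,
        "extra_body": {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}},
    }


def reasoning_only_kwargs(max_tokens: int, reasoning_effort: str = "medium") -> dict:
    """Arguments that set the effort level on a reasoning-only model."""
    if reasoning_effort not in ("low", "medium", "high"):
        raise ValueError("reasoning_effort must be 'low', 'medium', or 'high'")
    return {"max_tokens": max_tokens, "reasoning_effort": reasoning_effort}


print(hybrid_reasoning_kwargs(max_tokens=20480, budget_tokens=10240))
print(reasoning_only_kwargs(max_tokens=4096, reasoning_effort="high"))
```

Either dictionary can be unpacked into the call, for example client.chat.completions.create(model=..., messages=..., **kwargs).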
Query examples
All reasoning models are accessed through the chat completions endpoint.
Claude model example
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get('YOUR_DATABRICKS_TOKEN'),
base_url=os.environ.get('YOUR_DATABRICKS_BASE_URL')
)
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[{"role": "user", "content": "Why is the sky blue?"}],
max_tokens=20480,
extra_body={
"thinking": {
"type": "enabled",
"budget_tokens": 10240
}
}
)
msg = response.choices[0].message
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]
print("Reasoning:", reasoning)
print("Answer:", answer)
GPT-5.1
The reasoning_effort parameter for GPT-5.1 is set to none by default, but can be overridden in requests. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gpt-5-1",
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"max_tokens": 4096,
"reasoning_effort": "none"
}'
GPT OSS model example
The reasoning_effort parameter accepts "low", "medium" (default), or "high" values. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"max_tokens": 4096,
"reasoning_effort": "high"
}'
Gemini model example
This example uses system.ai.gemini-3-1-pro. The reasoning_effort parameter is set to "low" by default, but can be overridden in requests as seen in the following example.
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gemini-3-1-pro",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"max_tokens": 2000,
"stream": true,
"reasoning_effort": "high"
}'
The API response includes both reasoning and text content blocks:
ChatCompletionMessage(
role="assistant",
content=[
{
"type": "reasoning",
"summary": [
{
"type": "summary_text",
"text": ("The question is asking about the scientific explanation for why the sky appears blue... "),
"signature": ("EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...")
}
]
},
{
"type": "text",
"text": (
"# Why the Sky Is Blue\n\n"
"The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
)
}
],
refusal=None,
annotations=None,
audio=None,
function_call=None,
tool_calls=None
)
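A message with this structure can be separated into its reasoning summary and final answer. The following sketch assumes the content is either a plain string or a list of dictionaries shaped like the blocks above; the helper name split_reasoning and the shortened sample text are hypothetical.

```python
def split_reasoning(content) -> tuple:
    """Return (reasoning, answer) from a message's content."""
    # Responses without reasoning may carry the answer as a plain string.
    if isinstance(content, str):
        return "", content
    reasoning, answer = [], []
    for block in content:
        if block.get("type") == "reasoning":
            # Each reasoning block holds a list of summary entries.
            reasoning.extend(entry.get("text", "") for entry in block.get("summary", []))
        elif block.get("type") == "text":
            answer.append(block.get("text", ""))
    return "\n".join(reasoning), "".join(answer)


# Shortened sample modeled on the message above.
sample_content = [
    {"type": "reasoning", "summary": [{"type": "summary_text", "text": "The question asks why the sky appears blue."}]},
    {"type": "text", "text": "The sky appears blue because of Rayleigh scattering."},
]
reasoning, answer = split_reasoning(sample_content)
print("Reasoning:", reasoning)
print("Answer:", answer)
```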
Manage reasoning across multiple turns
This section is specific to the databricks-claude-sonnet-4-5 model.
In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.
If you don't want to pass reasoning tokens back to the model (for example, you don't need it to reason over its prior steps), you can omit the reasoning block entirely. For example:
# text_content is the answer text from the previous turn, for example msg.content[1]["text"]
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{"role": "user", "content": "Why is the sky blue?"},
{"role": "assistant", "content": text_content},
{"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"}
],
max_tokens=20480,
extra_body={
"thinking": {
"type": "enabled",
"budget_tokens": 10240
}
}
)
answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
However, if you do need the model to reason over its previous reasoning process - for instance, if you're building experiences that surface its intermediate reasoning - you must include the full, unmodified assistant message, including the reasoning block from the previous turn. Here's how to continue a thread with the full assistant message:
assistant_message = response.choices[0].message
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{"role": "user", "content": "Why is the sky blue?"},
{"role": "assistant", "content": text_content},
{"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
assistant_message,
{"role": "user", "content": "Can you simplify the previous answer?"}
],
max_tokens=20480,
extra_body={
"thinking": {
"type": "enabled",
"budget_tokens": 10240
}
}
)
answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
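The two options described in this section differ only in what is sent back for the previous assistant turn. As a rough sketch, the choice could be wrapped in a helper like the following. The name followup_messages and its parameters are hypothetical, and the sample uses plain dictionaries in place of real response objects.

```python
def followup_messages(history: list, assistant_message, text_content: str,
                      next_prompt: str, keep_reasoning: bool) -> list:
    """Build the messages list for the next turn of a conversation."""
    if keep_reasoning:
        # Pass the previous assistant message through unmodified, reasoning block included.
        previous = assistant_message
    else:
        # Send only the answer text so the reasoning block is omitted.
        previous = {"role": "assistant", "content": text_content}
    return [*history, previous, {"role": "user", "content": next_prompt}]


history = [{"role": "user", "content": "Why is the sky blue?"}]
# Stand-in for response.choices[0].message from the previous turn.
full_message = {
    "role": "assistant",
    "content": [
        {"type": "reasoning", "summary": [{"type": "summary_text", "text": "Think about scattering."}]},
        {"type": "text", "text": "Because of Rayleigh scattering."},
    ],
}
messages = followup_messages(
    history, full_message, "Because of Rayleigh scattering.",
    "Can you simplify the previous answer?", keep_reasoning=False,
)
print(messages)
```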
How does a reasoning model work?
Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, like databricks-claude-sonnet-4-5, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.
Supported models
See Discover foundation models for the available foundation models and the interaction types each supports, including chat, vision, audio and video, and reasoning.