Query foundation models by type

Important

This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Azure Databricks previews.

In this article, you learn how to write query requests for Databricks-hosted foundation models served by model services in Unity AI Gateway, organized by model type: chat, vision, audio and video, and reasoning.

Requirements

Note

The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with an endpoint name. See Discover foundation models for a list of available foundation models and their model service and endpoint names.

Chat

Foundation models that are optimized for chat and general purpose tasks.

The examples in this section show how to query a model service using the different client options.

OpenAI Chat Completions

To use the OpenAI client, specify the model service name as the model input. The following example assumes you have a Databricks API token and openai installed on your compute. You also need your Databricks workspace instance to connect the OpenAI client to Databricks.


import os
import openai
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('DATABRICKS_TOKEN'),
    base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_tokens=256
)

As an example, the following is the expected request format for a chat model when using the REST API.

{
  "messages": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the REST API:

{
  "model": "databricks-claude-sonnet-4-5",
  "choices": [
    {
      "message": {},
      "index": 0,
      "finish_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 74,
    "total_tokens": 81
  },
  "object": "chat.completion",
  "id": null,
  "created": 1698824353
}

OpenAI Responses

Important

The Responses API is only compatible with OpenAI models.

To use the OpenAI Responses API, specify the model service name as the model input. The following example assumes you have an Azure Databricks API token and openai installed on your compute. You also need your Azure Databricks workspace instance to connect the OpenAI client to Azure Databricks.


import os
import openai
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('DATABRICKS_TOKEN'),
    base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)

response = client.responses.create(
    model="system.ai.gpt-5",
    input=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_output_tokens=256
)

As an example, the following is the expected request format when using the OpenAI Responses API. The URL path for this API is /serving-endpoints/responses.

{
  "model": "databricks-gpt-5",
  "input": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_output_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the Responses API:

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1698824353,
  "model": "databricks-gpt-5",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": []
    }
  ],
  "usage": {
    "input_tokens": 7,
    "output_tokens": 74,
    "total_tokens": 81
  }
}

REST API

curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "system.ai.claude-sonnet-4-5",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": " What is a mixture of experts model?"
    }
  ]
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions

As an example, the following is the expected request format for a chat model when using the REST API.

{
  "messages": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the REST API:

{
  "model": "databricks-claude-sonnet-4-5",
  "choices": [
    {
      "message": {},
      "index": 0,
      "finish_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 74,
    "total_tokens": 81
  },
  "object": "chat.completion",
  "id": null,
  "created": 1698824353
}

Databricks Python SDK

This code must be run in a notebook in your workspace. See Use the Databricks SDK for Python from an Azure Databricks notebook.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()
response = w.serving_endpoints.query(
    name="system.ai.claude-sonnet-4-5",
    messages=[
        ChatMessage(
            role=ChatMessageRole.SYSTEM, content="You are a helpful assistant."
        ),
        ChatMessage(
            role=ChatMessageRole.USER, content="What is a mixture of experts model?"
        ),
    ],
    max_tokens=128,
)
print(f"RESPONSE:\n{response.choices[0].message.content}")

As an example, the following is the expected request format for a chat model when using the REST API.

{
  "messages": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the REST API:

{
  "model": "databricks-claude-sonnet-4-5",
  "choices": [
    {
      "message": {},
      "index": 0,
      "finish_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 74,
    "total_tokens": 81
  },
  "object": "chat.completion",
  "id": null,
  "created": 1698824353
}

Vision

Query Databricks-hosted vision models through model services in Unity AI Gateway to understand and analyze images with a unified API.

OpenAI client

To use the OpenAI client, specify the model service name as the model input.


from openai import OpenAI
import base64
import requests

# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

client = OpenAI(
    api_key=API_TOKEN,
    base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)

# Download and encode image
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
resp = requests.get(image_url)
resp.raise_for_status()
image_data = base64.b64encode(resp.content).decode("utf-8")

# OpenAI request
completion = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)

The Chat Completions API supports multiple image inputs, allowing the model to analyze each image and synthesize information from all inputs to generate a response to the prompt.


from openai import OpenAI
import base64
import requests

# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

client = OpenAI(
    api_key=API_TOKEN,
    base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)

# Download and encode multiple images
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp1 = requests.get(image1_url)
resp1.raise_for_status()
image1_data = base64.b64encode(resp1.content).decode("utf-8")

image2_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp2 = requests.get(image2_url)
resp2.raise_for_status()
image2_data = base64.b64encode(resp2.content).decode("utf-8")

# OpenAI request
completion = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are in these images? Is there any difference between them?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image1_data}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image2_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Input image requirements

Model(s) Supported formats Multiple images per request Image size limitations Image resizing recommendations Image quality considerations
databricks-gpt-5
  • JPEG
  • PNG
  • WebP
  • GIF (Non-animated GIF)
Up to 500 individual image inputs per request File size limit: Up to 10 MB total payload size per request N/A
  • No watermarks or logos
  • Clear enough for a human to understand
databricks-gpt-5-mini
  • JPEG
  • PNG
  • WebP
  • GIF (Non-animated GIF)
Up to 500 individual image inputs per request File size limit: Up to 10 MB total payload size per request N/A
  • No watermarks or logos
  • Clear enough for a human to understand
databricks-gpt-5-nano
  • JPEG
  • PNG
  • WebP
  • GIF (Non-animated GIF)
Up to 500 individual image inputs per request File size limit: Up to 10 MB total payload size per request N/A
  • No watermarks or logos
  • Clear enough for a human to understand
databricks-gemma-3-12b
  • JPEG
  • PNG
  • WebP
  • GIF
Up to 5 images for API requests
  • All provided images are processed in a request.
File size limit: 10 MB total across all images per API request N/A N/A
databricks-llama-4-maverick
  • JPEG
  • PNG
  • WebP
  • GIF
Up to 5 images for API requests
  • All provided images are processed in a request.
File size limit: 10 MB total across all images per API request N/A N/A
  • databricks-claude-sonnet-4-6
  • databricks-claude-sonnet-4-5
  • databricks-claude-haiku-4-5
  • databricks-claude-opus-4-8
  • databricks-claude-opus-4-7
  • databricks-claude-opus-4-6
  • databricks-claude-opus-4-5
  • databricks-claude-opus-4-1
  • databricks-claude-sonnet-4
  • JPEG
  • PNG
  • GIF
  • WebP
  • Up to 20 images for Claude.ai
  • Up to 100 images for API requests
  • All provided images are processed in a request, which is useful for comparing or contrasting them.
  • Images larger than 8000x8000 px are rejected.
  • If more than 20 images are submitted in one API request, the maximum allowed size per image is 2000 x 2000 px.
For optimal performance, resize images before uploading if they are too large.
  • If an image's long edge exceeds 1568 pixels or its size exceeds ~1,600 tokens, it is automatically scaled down while preserving aspect ratio.
  • Very small images (under 200 pixels on any edge) may degrade performance.
  • To reduce latency, keep images within 1.15 megapixels and at most 1568 pixels in both dimensions.
  • Clarity: Avoid blurry or pixelated images.
  • Text in images:
    • Ensure text is legible and not too smal.
    • Avoid cropping out key visual context just to enlarge the text.

Image to token conversion

Each image in a request to a foundation model adds to your token usage. See the pricing calculator to estimate image pricing based on the token usage and model you are using.

Limitations of image understanding

The following are image understanding limitations for the supported Databricks-hosted foundation models:

Model Limitations
The following Claude models are supported:
  • databricks-claude-sonnet-4-6
  • databricks-claude-sonnet-4-5
  • databricks-claude-opus-4-1
  • databricks-claude-sonnet-4
The following are the limits for Claude models on Databricks:
  • Avoid using Claude for tasks requiring perfect precision or sensitive analysis without human oversight.
  • People identification: Cannot identify or name people in images.
  • Accuracy: May misinterpret low-quality, rotated, or very small images (200 px).
  • Spatial reasoning: Struggles with precise layouts, such as reading analog clocks or chess positions.
  • Counting: Provides approximate counts, but may be inaccurate for many small objects.
  • AI-generated images: Cannot reliably detect synthetic or fake images.
  • Inappropriate content: Blocks explicit or policy-violating images.
  • Healthcare: Not suited for complex medical scans (for example, CTs and MRIs). It's not a diagnostic tool.

Audio and video

Send audio and video inputs to Gemini foundation models served by Unity AI Gateway on Azure Databricks. You can provide media as a URL or as base64-encoded inline data using the Chat Completions API or the Google Gemini API.

You can provide audio and video inputs using two methods:

  • URL: Pass a publicly accessible URL to the media file. For video, YouTube URLs are also supported.
  • Base64 inline data: Encode the file as a base64 string and pass it as a data URI (for example, data:video/mp4;base64,<encoded_data>).

Chat Completions API

The chat completions API allows you to pass video and audio input. Use the video_url and audio_url content types in the messages array to pass media inputs. Each content item includes a url field that accepts either a web URL or a base64 data URI.

The following examples show video and audio input using the Chat Completions API.

Python

import os
import base64
from openai import OpenAI

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="system.ai.gemini-3-1-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what happens in these videos."},
            {
                "type": "video_url",
                "video_url": {"url": "https://example.com/sample-video.mp4"}
            },
            {
                "type": "video_url",
                "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}
            },
        ]
    }],
    max_tokens=1024
)

print(response.choices[0].message.content)
import os
import base64
from openai import OpenAI

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="system.ai.gemini-3-1-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio and summarize the key points."},
            {
                "type": "audio_url",
                "audio_url": {"url": "https://example.com/sample-audio.mp3"}
            },
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}"}
            },
        ]
    }],
    max_tokens=1024
)

print(response.choices[0].message.content)

REST API

curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "system.ai.gemini-3-1-pro",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Summarize what happens in these videos."},
      {
        "type": "video_url",
        "video_url": {"url": "https://example.com/sample-video.mp4"}
      },
      {
        "type": "video_url",
        "video_url": {"url": "data:video/mp4;base64,<base64_encoded_data>"}
      }
    ]
  }],
  "max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "system.ai.gemini-3-1-pro",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Transcribe this audio and summarize the key points."},
      {
        "type": "audio_url",
        "audio_url": {"url": "https://example.com/sample-audio.mp3"}
      },
      {
        "type": "audio_url",
        "audio_url": {"url": "data:audio/mp3;base64,<base64_encoded_data>"}
      }
    ]
  }],
  "max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions

Google Gemini API

Use the Google Gemini API to pass media as inlineData (base64-encoded) or fileData (URL reference) within the parts array.

The following examples show video and audio input using the Google Gemini API.

Python

from google import genai
from google.genai import types
import base64
import os

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')

client = genai.Client(
    api_key="databricks",
    http_options=types.HttpOptions(
        base_url="https://<workspace-url>/ai-gateway/gemini",
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        },
    ),
)

# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.models.generate_content(
    model="system.ai.gemini-3-1-pro",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Summarize what happens in these videos."),
                types.Part(
                    file_data=types.FileData(
                        mime_type="video/mp4",
                        file_uri="https://example.com/sample-video.mp4",
                    )
                ),
                types.Part(
                    inline_data=types.Blob(
                        mime_type="video/mp4",
                        data=video_b64,
                    )
                ),
            ],
        ),
    ],
    config=types.GenerateContentConfig(
        max_output_tokens=1024,
    ),
)

print(response.text)
from google import genai
from google.genai import types
import base64
import os

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')

client = genai.Client(
    api_key="databricks",
    http_options=types.HttpOptions(
        base_url="https://<workspace-url>/ai-gateway/gemini",
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        },
    ),
)

# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.models.generate_content(
    model="system.ai.gemini-3-1-pro",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Transcribe this audio and summarize the key points."),
                types.Part(
                    file_data=types.FileData(
                        mime_type="audio/mp3",
                        file_uri="https://example.com/sample-audio.mp3",
                    )
                ),
                types.Part(
                    inline_data=types.Blob(
                        mime_type="audio/mp3",
                        data=audio_b64,
                    )
                ),
            ],
        ),
    ],
    config=types.GenerateContentConfig(
        max_output_tokens=1024,
    ),
)

print(response.text)

REST API

curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "contents": [{
    "role": "user",
    "parts": [
      {"text": "Summarize what happens in these videos."},
      {
        "fileData": {
          "mimeType": "video/mp4",
          "fileUri": "https://example.com/sample-video.mp4"
        }
      },
      {
        "inlineData": {
          "mimeType": "video/mp4",
          "data": "<base64_encoded_data>"
        }
      }
    ]
  }]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "contents": [{
    "role": "user",
    "parts": [
      {"text": "Transcribe this audio and summarize the key points."},
      {
        "fileData": {
          "mimeType": "audio/mp3",
          "fileUri": "https://example.com/sample-audio.mp3"
        }
      },
      {
        "inlineData": {
          "mimeType": "audio/mp3",
          "data": "<base64_encoded_data>"
        }
      }
    ]
  }]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent

Limitations

  • Multiple audio or video inputs can be included in a single request, but large files increase latency and token usage.

Reasoning

Foundation models optimized for reasoning tasks. Databricks Foundation Model API provides a unified API to interact with all Foundation Models, including reasoning models. Reasoning gives foundation models enhanced capabilities to tackle complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.

Types of reasoning models

There are two types of models, reasoning-only and hybrid. The following table describes how different models use different approaches to control reasoning:

Reasoning model type Details Model examples Parameters
Hybrid reasoning Supports both fast, instant replies and deeper reasoning when needed. Claude models like databricks-claude-sonnet-4-6, databricks-claude-sonnet-4-5, databricks-claude-sonnet-4, databricks-claude-opus-4-8, databricks-claude-opus-4-7, databricks-claude-opus-4-6, databricks-claude-opus-4-5, and databricks-claude-opus-4-1. Include the following parameters to use hybrid reasoning:
  • thinking
  • budget_tokens: controls how many tokens the model can use for internal thought. Higher budgets can improve quality for complex tasks, but usage above 32K may vary. budget_tokens must be less than max_tokens.
Reasoning only These models always use internal reasoning in their responses. GPT OSS models like databricks-gpt-oss-120b and databricks-gpt-oss-20b. Use the following parameter in your request:
  • reasoning_effort: accepts values of "low", "medium" (default), or "high". Higher reasoning effort may result in more thoughtful and accurate responses but may increase latency and token usage. This parameter is only accepted by a limited set of models, including databricks-gpt-oss-120b and databricks-gpt-oss-20b.

Query examples

All reasoning models are accessed through the chat completions endpoint.

Claude model example

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.environ.get('YOUR_DATABRICKS_TOKEN'),
  base_url=os.environ.get('YOUR_DATABRICKS_BASE_URL')
  )

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

msg = response.choices[0].message
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]

print("Reasoning:", reasoning)
print("Answer:", answer)

GPT-5.1

The reasoning_effort parameter for GPT-5.1 is set to none by default, but can be overridden in requests. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.

curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gpt-5-1",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "none"
  }'

GPT OSS model example

The reasoning_effort parameter accepts "low", "medium" (default), or "high" values. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.

curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "high"
  }'

Gemini model example

This example uses system.ai.gemini-3-1-pro. The reasoning_effort parameter is set to "low" by default, but can be overridden in requests as seen in the following example.

curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gemini-3-1-pro",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 2000,
    "stream": true,
    "reasoning_effort": "high"
  }'

The API response includes both thinking and text content blocks:

ChatCompletionMessage(
    role="assistant",
    content=[
        {
            "type": "reasoning",
            "summary": [
                {
                    "type": "summary_text",
                    "text": ("The question is asking about the scientific explanation for why the sky appears blue... "),
                    "signature": ("EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...")
                }
            ]
        },
        {
            "type": "text",
            "text": (
                "# Why the Sky Is Blue\n\n"
                "The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
            )
        }
    ],
    refusal=None,
    annotations=None,
    audio=None,
    function_call=None,
    tool_calls=None
)

Manage reasoning across multiple turns

This section is specific to the databricks-claude-sonnet-4-5 model.

In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.

If you don't want to pass reasoning tokens back to the model (for example, you don't need it to reason over its prior steps), you can omit the reasoning block entirely. For example:

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)

However, if you do need the model to reason over its previous reasoning process - for instance, if you're building experiences that surface its intermediate reasoning - you must include the full, unmodified assistant message, including the reasoning block from the previous turn. Here's how to continue a thread with the full assistant message:

assistant_message = response.choices[0].message

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
        assistant_message,
        {"role": "user", "content": "Can you simplify the previous answer?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)

How does a reasoning model work?

Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, like databricks-claude-sonnet-4-5, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.

Supported models

See Discover foundation models for the available foundation models and the interaction types each supports, including chat, vision, audio and video, and reasoning.

Additional resources