How do I evaluate reasoning models (o3-mini) using Azure AI Foundry evaluation or Prompt Flow?

Kavishka Gamage 0 Reputation points
2025-03-20T09:03:30.6633333+00:00

I have tried to evaluate the o3-mini model against an existing dataset using Prompt Flow in an Azure Machine Learning workspace, as well as through the Azure AI Foundry Evaluation, Azure OpenAI Evaluation, and Prompt Flow options. Every attempt failed because of parameter differences between GPT models and reasoning models.

What are alternative ways to evaluate the o3-mini model to benchmark it against an existing dataset?

 Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'type': 'invalid_request_error', 'param': 'max_tokens', 'code': 'unsupported_parameter'}}

1 answer

  1. Amira Bedhiafi 31,391 Reputation points
    2025-03-20T11:04:15.4166667+00:00

    Hello Kavishka!

    Thank you for posting on Microsoft Learn.

    Unlike standard GPT models, reasoning models such as o3-mini:

    • require max_completion_tokens instead of max_tokens
    • do not support sampling parameters such as temperature and top_p (sending them triggers the same unsupported_parameter error you saw for max_tokens)
    • optionally accept reasoning_effort ("low", "medium", or "high") to control how much reasoning the model performs before answering

    If you are using Prompt Flow within Azure AI Foundry (formerly Azure AI Studio), your YAML payload should set max_completion_tokens and omit the sampling parameters:

    parameters:
      model: "o3-mini"
      max_completion_tokens: 512
      prompt: "Evaluate the following reasoning dataset..."
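    If the built-in LLM node still sends max_tokens under the hood, a practical workaround is a custom Python node that calls your deployment directly. Below is a minimal sketch, assuming the openai Python package (v1+) and the promptflow @tool decorator; the endpoint, key, API version, and deployment name are placeholders to replace with your own:

    # Custom Prompt Flow node calling an o3-mini deployment directly,
    # bypassing the LLM node's unsupported max_tokens parameter.
    from promptflow.core import tool
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-api-key>",                                    # placeholder
        api_version="2024-12-01-preview",                            # assumed API version
    )

    @tool
    def evaluate_with_o3_mini(prompt: str) -> str:
        response = client.chat.completions.create(
            model="o3-mini",  # your deployment name
            messages=[{"role": "user", "content": prompt}],
            max_completion_tokens=512,  # reasoning models reject max_tokens
        )
        return response.choices[0].message.content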

    For Azure AI Foundry Evaluation, if you are running evaluations via the API or SDK, update your payload the same way:

    {
      "model": "o3-mini",
      "max_completion_tokens": 512,
      "input": "Evaluate the reasoning ability of this dataset..."
    }
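    If the hosted evaluation UIs keep rejecting the request, you can also benchmark o3-mini against your dataset with a short script and score the outputs yourself. A minimal sketch, assuming a JSONL file with hypothetical "question" and "expected_answer" fields (adapt these to your schema) and the same AzureOpenAI client setup as above:

    # Naive benchmark loop over a JSONL dataset; field names are hypothetical.
    import json
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-api-key>",                                    # placeholder
        api_version="2024-12-01-preview",                            # assumed API version
    )

    with open("dataset.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]

    matches = 0
    for row in rows:
        response = client.chat.completions.create(
            model="o3-mini",  # your deployment name
            messages=[{"role": "user", "content": row["question"]}],
            max_completion_tokens=512,  # not max_tokens
        )
        answer = (response.choices[0].message.content or "").strip().lower()
        # Exact-substring scoring as a stand-in; swap in your own metric.
        if row["expected_answer"].strip().lower() in answer:
            matches += 1

    print(f"Accuracy: {matches / len(rows):.2%}")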
    
    