GPT Tool Output Ignored - Model Uses Training Data Prices Instead of Function Call Results

Tarandeep Singh Khurana 20 Reputation points
2026-03-20T16:56:19.65+00:00

Issue: When using function/tool calling with GPT models via the Azure OpenAI API, the model ignores exact numerical values from tool output and substitutes values from its training data.

Environment:

  • Azure OpenAI API
  • Models: GPT
  • Temperature: 0 (also tested with higher values)
  • Using standard tool/function calling

Reproduction Steps:

  1. User asks: "Apple stock price"
  2. Tool global_market_search returns: "$248.35" (from live Yahoo Finance data)
  3. Model outputs: "$148.35" (from training data, Apple was ~$150 in 2024)

Pattern Observed:

  • Apple: Tool says $248.35 → Model outputs $148.35 (drops leading "2")
  • Tesla: Tool says $387.27 → Model outputs $187.27 (drops leading "3")
  • The model keeps decimals but replaces leading digits with parametric memory

Evidence:

  • Full tool response logged shows correct price ($248.35)
  • Tool response is correctly appended to messages as ToolMessage
  • Model's reasoning (thinking tokens) sometimes shows correct price
  • Final generated output has wrong price

This does NOT happen with Indian stock prices (RELIANCE, TCS, NIFTY), presumably because the model has no strong parametric memory for those prices.

Questions:

  1. Is this a known issue with knowledge conflict in tool calling?
  2. What is the recommended approach to force the model to use exact tool output values?
  3. Does setting temperature=0 not guarantee verbatim usage of tool results?
  4. Should we use structured output/JSON mode for numerical data?
  5. We've tested with temperature=0 and the issue persists. Notably, ChatGPT's consumer product does NOT exhibit this behavior; it renders tool results in separate UI components alongside the text.

Impact: Critical for financial applications where price accuracy is essential.

Related Research: "Adaptive Chameleon or Stubborn Sloth" (arXiv:2305.13300, ICLR 2024) discusses this exact "knowledge conflict" behavior in LLMs.

Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


2 answers

  1. Sina Salam 28,361 Reputation points Volunteer Moderator
    2026-03-31T15:49:51.5+00:00

    Hello Tarandeep Singh Khurana,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your GPT model is ignoring tool output and using training-data prices instead of the function call results.

    Following these steps will help resolve the issue:

    1. Use structured outputs so the model must emit the exact tool values: pin both the string and numeric price via enum (or const, if supported) and set strict to true. See Structured outputs, and the JSON-schema notes on supported keywords in the SDKs (e.g., JsonSchemaFormat), for limits and strict-validation behavior: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/structured-outputs
         "response_format": {"type": "json_schema", "json_schema": {
           "name": "price_payload",
           "strict": true,
           "schema": {
             "type": "object",
             "properties": {
               "ticker": {"type": "string"},
               "price_text": {"type": "string", "enum": ["$248.35"]},
               "price_value": {"type": "number", "enum": [248.35]},
               "as_of": {"type": "string"}
             },
             "required": ["ticker", "price_text", "price_value", "as_of"],
             "additionalProperties": false
           }
         }}
      
    2. Fetch live data via function/tool calling, then have the model write prose with placeholders and let your server inject the numbers before rendering; this mirrors Microsoft's recommended function-calling orchestration. See: the function-calling how-to and, if using agents, Agents + function tools.
         template = "AAPL trades at {PRICE} as of {AS_OF}."
         rendered = template.replace("{PRICE}", price_text).replace("{AS_OF}", as_of)
      
    3. Add a post-generation guardrail (block drift before users see it) by running the final text against Azure AI Content Safety – Groundedness, using the tool payload as context; regenerate or block if ungrounded, especially for finance. See: Groundedness (concepts) and the Groundedness quickstart.
         grounded = groundedness.check(response_text, context=tool_context)
         if not grounded: regenerate_or_block()
    4. Keep temperature low, but don't rely on it for correctness; the schema wins. Set a low temperature (e.g., 0–0.2) and a seed, if your stack supports it, to steady style; accuracy still comes from schema + binding, not sampling knobs. See: temperature semantics in the .NET inference SDK, and prefer structured outputs over plain JSON mode for strict adherence.
         {"model":"gpt-4.1","temperature":0.1,"seed":42,"response_format":{ "...": "json_schema as above" }}
      
    5. (Optional) Fine-tune for tool-use habits, not for truth. If you want fewer validation failures, fine-tune on traces where the assistant calls the tool and returns schema-shaped results, but keep the schema/binding gates; those are what enforce truth. See: Fine-tuning for tool calling.
         {"messages":[
           {"role":"user","content":"Apple price?"},
           {"role":"assistant","tool_calls":[{"type":"function","function":{"name":"get_price","arguments":"{\"ticker\":\"AAPL\"}"}}]}
         ],
         "tools":[{"type":"function","function":{"name":"get_price","parameters":{"type":"object","properties":{"ticker":{"type":"string"}},"required":["ticker"]}}}]}
      
    6. Validate, retry, and fail closed, with tool data as the source of truth. On 400s from strict schema checks, retry once with a simplified schema; if it still fails, render the tool values directly and log for triage. Use the Responses/Chat APIs as the transport and keep schemas within the supported subsets. See: the Responses API guide and the schema-strictness notes in the structured outputs how-to.
         try:
             result = client.responses.create(payload)
         except HTTPError as e:
             if e.response.status_code == 400:
                 show(tool_price); log(e)
      
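    The steps above can be tied together in a minimal sketch. This assumes the tool has already returned the live price as a float; the field names (`price_text`, `price_value`) and the fallback rendering are illustrative, not a fixed Azure contract:

```python
import json

def build_price_schema(ticker: str, price: float) -> dict:
    """Build a per-request strict json_schema response_format that pins the
    live tool price via enum, so the model cannot emit a different number."""
    price_text = f"${price:.2f}"
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "price_payload",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "ticker": {"type": "string", "enum": [ticker]},
                    "price_text": {"type": "string", "enum": [price_text]},
                    "price_value": {"type": "number", "enum": [price]},
                    "as_of": {"type": "string"},
                },
                "required": ["ticker", "price_text", "price_value", "as_of"],
                "additionalProperties": False,
            },
        },
    }

def render_price(model_output: str, ticker: str, tool_price: float) -> str:
    """Fail closed: if the model's JSON drifts from the tool price,
    render the tool value directly instead of the model's number."""
    try:
        payload = json.loads(model_output)
        if payload.get("price_value") == tool_price:
            return f"{payload['ticker']} trades at {payload['price_text']}."
    except (json.JSONDecodeError, KeyError):
        pass
    # Tool data is the source of truth; render it verbatim.
    return f"{ticker} trades at ${tool_price:.2f}."

# A drifted output ($148.35 from parametric memory) is rejected:
drifted = '{"ticker":"AAPL","price_text":"$148.35","price_value":148.35,"as_of":"now"}'
print(render_price(drifted, "AAPL", 248.35))  # AAPL trades at $248.35.
```

    Because the schema is built per request, the enum always pins the current live value rather than a stale constant.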

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close up the thread here by upvoting and accepting it as an answer if it is helpful.


  2. Anshika Varshney 9,740 Reputation points Microsoft External Staff Moderator
    2026-03-20T19:25:05.14+00:00

    Hi Tarandeep Singh Khurana,

    Thank you for reaching out on the Microsoft Q&A.

    What you are seeing can happen because the model is still a text generator at its core. Even when you provide fresh tool results, the model may mix them with what it already knows from training, especially for popular and frequently mentioned facts like US stock prices. Your examples show the tool returning one value, but the final message using a different value that looks like an older, remembered price pattern.

    Here are practical ways to make the model rely on the tool value more reliably.

    First, make the tool result the only source of truth in the conversation. With function calling, the recommended pattern is that the model decides to call a function, your app runs the tool, and then you call the model again including the tool response so the final answer is produced from that tool output. This is the standard flow described in the function calling guidance.
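    That two-call flow can be sketched as the message sequence below. The tool name and payload echo the question's example; the call id and timestamp are made up for illustration:

```python
import json

# Hypothetical tool call produced by the model on the first request.
tool_call_id = "call_001"

messages = [
    {"role": "system",
     "content": "Use tool output values exactly; never answer prices from memory."},
    {"role": "user", "content": "Apple stock price"},
    # First model response: a function call, no final text yet.
    {"role": "assistant", "tool_calls": [{
        "id": tool_call_id, "type": "function",
        "function": {"name": "global_market_search",
                     "arguments": json.dumps({"ticker": "AAPL"})},
    }]},
    # The app runs the tool and appends its result, keyed to the call id.
    {"role": "tool", "tool_call_id": tool_call_id,
     "content": json.dumps({"price": 248.35, "as_of": "2026-03-20T16:45Z"})},
]

# Second call, e.g. client.chat.completions.create(model=..., messages=messages),
# produces the user-facing answer only after the tool output is in context.
```

    The key point is that the final answer is generated in a request whose context already contains the tool message, not in the same turn that decided to call the tool.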

    Second, return structured tool output and constrain the model response format. When numbers must be exact, free text responses are where mistakes happen. Microsoft recommends using structured outputs to make the model follow a schema, and structured outputs are recommended for function calling and multi step workflows. This reduces the chance the model will rewrite or “correct” the numeric value.

    Third, make your instructions explicit and simple. In your system message, say that the assistant must use the tool output values exactly and must not guess or use remembered values. Microsoft’s prompt engineering guidance explains that models generally produce the most likely continuation from training, so direct instructions help reduce drift when the model is tempted to answer from memory.
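    As a hedged example of the kind of explicit instruction meant here (the exact wording is illustrative, not Microsoft guidance verbatim):

```python
# System message that forbids answering prices from parametric memory.
SYSTEM_PROMPT = (
    "You are a market data assistant. Report prices ONLY from tool output. "
    "Copy every number character-for-character from the tool message. "
    "If the tool returns no price, say so; never use remembered prices."
)
```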

    Fourth, if you still see conflicts in important production scenarios, consider fine tuning with tool calling examples. Microsoft documents that fine tuning with tool calling examples can improve accuracy and consistency of tool calling outputs. This is usually a later step after you have tried clearer tool schemas and stronger instructions.

    Troubleshooting checklist you can quickly validate in your repro:

    • Confirm the tool output is actually being passed back to the model in the second call as a tool message, and that the model generates the final user-facing answer only after it sees that tool output. This matches the function-calling flow in the documentation.

    • Try returning the price as a structured object with a dedicated numeric field, and use structured outputs so the final response must place that exact number in the right field; your app can then display it.

    • Keep temperature low for stability, but remember that temperature does not guarantee exact copying, so the main fix is grounding and structure rather than randomness settings. Your own testing already shows the issue can occur even at temperature zero.
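    A cheap local drift check can also catch the exact failure described in the question before the answer is shown. This is a sketch, not the Azure groundedness service; it simply verifies every dollar amount in the final text appears in the tool payload:

```python
import re

def numbers_grounded(answer: str, tool_payload: str) -> bool:
    """Return True only if every dollar amount in the answer
    also appears verbatim in the tool payload."""
    amounts = re.findall(r"\$\d+(?:\.\d+)?", answer)
    return all(a in tool_payload for a in amounts)

tool_payload = '{"ticker": "AAPL", "price": "$248.35"}'
print(numbers_grounded("AAPL trades at $248.35 today.", tool_payload))  # True
print(numbers_grounded("AAPL trades at $148.35 today.", tool_payload))  # False
```

    On a False result, regenerate the answer or render the tool value directly, mirroring the leading-digit drift pattern ($248.35 → $148.35) reported above.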

    References that should help

    • How to use function calling with Azure OpenAI in Microsoft Foundry Models

    • How to use structured outputs with Azure OpenAI in Microsoft Foundry Models

    • Prompt engineering techniques (explains why models can fall back to prior knowledge)

    • Fine-tuning function calls with Azure OpenAI in Microsoft Foundry Models (for better consistency)

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

