Retrieving token usage in Azure OpenAI response when streaming is enabled

Question

chaymr 186

I have an Azure OpenAI deployment shared by multiple internal users, and we charge back based on the token usage reported in the "usage" field of the API response. However, users who stream the response with "stream=True" do not receive the "usage" field in the Azure OpenAI response. Is there any way to retrieve the token count even with "stream=True"?

2 answers

Answer 1

Ramr-msft 17,826

Thanks for the question. Here is a sample for counting tokens when streaming is enabled: Jupyter notebooks that calculate token usage with Tiktoken, for scenarios with and without token streaming. https://github.com/LazaUK/AOAI-Streaming-TokenUsage/tree/main

Hessel Wellema 256 Reputation points

2024-03-05T16:46:26.64+00:00

This only estimates the token usage. What we all need is the usage object that we get when we don't use streaming. Why is it so hard to return that with the final block(s)?

WanisElabbar-4383 205 Reputation points

2024-07-16T09:55:54.41+00:00

This doesn't answer the question. There should be an API response similar to what OpenAI offers today.

Luis Rueda 0 Reputation points

2024-09-02T19:57:10.7466667+00:00

That is not an appropriate response; calculating tokens is inexact. Many models on Azure currently fail to return the token usage.

Here are some models I've tested on Azure that do not return the appropriate token usage count when streaming responses.

ChatGPT 4 turbo
ChatGPT 4o
All Llama models including 405b

Hessel Wellema 256 Reputation points

2024-09-03T07:48:40.46+00:00

I noticed that the stream of GPT-4o mini includes a usage event object when you add the right parameters to the API call. For some reason GPT-4o does not.

Answer 2

Posting the answer here in case it helps others.

The stream_options: {"include_usage": True} option must be passed through the model_extras keyword argument of the Azure client's complete() call:


import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Read both the endpoint and the key from environment variables.
client = ChatCompletionsClient(
    endpoint=os.getenv("<YOUR ENDPOINT ENV VAR>"),
    credential=AzureKeyCredential(os.getenv("<YOUR AZURE KEY ENV VAR>"))
)

for chunk in client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="I am going to Paris, what should I see?")
    ],
    stream=True,
    # Ask the service to include token usage in the stream.
    model_extras={"stream_options": {"include_usage": True}}
):
    # Most chunks carry no usage; print it only when it is present.
    if hasattr(chunk, "usage") and chunk.usage is not None:
        print(chunk.usage)

For the last chunk received, this prints:

{'completion_tokens': 561, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens': 28, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}, 'total_tokens': 589}
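
For chargeback you usually want the usage object and the assembled response text together. Below is a minimal sketch, assuming each chunk exposes choices[i].delta.content and an optional usage attribute as in the loop above. The helper name collect_stream and the stand-in chunks (with made-up token counts) are hypothetical and exist only so the snippet runs without an Azure endpoint:

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Consume a streamed chat completion and return (text, usage).

    usage is None if the service never sent a usage object.
    """
    parts = []
    usage = None
    for chunk in chunks:
        # The chunk that carries usage may have an empty choices list.
        for choice in getattr(chunk, "choices", None) or []:
            delta = getattr(choice, "delta", None)
            content = getattr(delta, "content", None)
            if content:
                parts.append(content)
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage is not None:
            usage = chunk_usage
    return "".join(parts), usage


# Hypothetical stand-ins for streamed chunks; the token counts are invented.
fake_stream = [
    SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content="Visit "))],
        usage=None,
    ),
    SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content="the Louvre."))],
        usage=None,
    ),
    SimpleNamespace(
        choices=[],
        usage={"prompt_tokens": 28, "completion_tokens": 4, "total_tokens": 32},
    ),
]

text, usage = collect_stream(fake_stream)
print(text)
print(usage)
```

With a real client you would pass the iterator returned by client.complete(...) instead of fake_stream. If usage comes back as None, the deployment did not send a usage object, and you would have to fall back to an estimate.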
