How can I accurately count tokens used in OpenAI services?

Simone Gallo 0 Reputation points
2023-11-13T15:35:54.71+00:00

Hello,

I am having trouble understanding how tokens are actually counted when using the "function calling" feature in OpenAI services.

In the response returned by the OpenAI service, the total token count (prompt plus completion tokens) is reported under the "usage" property. However, when I check the tokens used (and the corresponding cost) on the Azure monitoring page, a much higher number is displayed.

For example, the API response JSON reports 5k tokens, while the monitoring page shows about 21k tokens.
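
For reference, here is a minimal sketch of where that 5k figure is read from, assuming the pre-1.0 openai Python package configured for Azure (the endpoint, key, and deployment name below are placeholders):

    import openai

    openai.api_type = "azure"
    openai.api_base = "https://<your-resource>.openai.azure.com/"
    openai.api_version = "2023-07-01-preview"
    openai.api_key = "<your-key>"

    response = openai.ChatCompletion.create(
        engine="<your-deployment>",
        messages=[{"role": "user", "content": "Hello"}],
    )

    # The usage block reports this single call's counts only
    usage = response["usage"]
    print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])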

It seems that this discrepancy only occurs when the model decides to call functions. Is there a way to get real-time tracking of the tokens actually used?

Thank you.

Azure OpenAI Service
An Azure service that provides access to OpenAI's GPT-3 models with enterprise capabilities.

2 answers

  1. Pramod Valavala 20,656 Reputation points Microsoft Employee Moderator
    2023-11-13T18:47:41.89+00:00

@Simone Gallo With function calling specifically, there is an intermediate step on the service side that suggests which functions to call, so more tokens are processed than just those in your input prompt.

Unfortunately, this is not documented at the moment, since the intermediate prompt is part of the service itself. There is an open discussion about this on the OpenAI forums as well, which includes some third-party libraries that have approximated these extra tokens through multiple trials.
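
    For illustration, those third-party approximations generally serialize the function definitions and encode them with tiktoken plus a fixed overhead. A rough sketch along those lines (the JSON serialization and the overhead constant are assumptions, not documented values):

    import json
    import tiktoken

    def approx_function_definition_tokens(functions, model="gpt-3.5-turbo-0613"):
        """Rough estimate of the hidden tokens consumed by function definitions.

        The real count comes from an undocumented service-side prompt, so this
        is an approximation calibrated by trial, not an exact value.
        """
        try:
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            encoding = tiktoken.get_encoding("cl100k_base")

        num_tokens = 0
        for function in functions:
            # Encode each definition as JSON; the service's actual rendering differs
            num_tokens += len(encoding.encode(json.dumps(function)))

        return num_tokens + 12  # assumed fixed overhead for the injected text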

As for the discrepancy itself, it is not something I have observed myself; the reported metrics have matched the exact numbers in the API response. It would be best to make sure nobody else is making calls against the same instance, and to compare the exact metric that looks wrong (Processed Prompt Tokens -> prompt_tokens; Generated Completion Tokens -> completion_tokens; Processed Inference Tokens -> total_tokens).
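
    If it helps with that comparison, the same metrics can be pulled programmatically. A sketch using the azure-monitor-query package (the metric IDs behind the display names above are my assumption, so list the definitions first to confirm them):

    from datetime import timedelta
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient

    client = MetricsQueryClient(DefaultAzureCredential())

    # Resource ID of the Azure OpenAI account (placeholder values)
    resource_id = (
        "/subscriptions/<sub>/resourceGroups/<rg>"
        "/providers/Microsoft.CognitiveServices/accounts/<account>"
    )

    # Confirm the exact metric names exposed by the resource
    for definition in client.list_metric_definitions(resource_id):
        print(definition.name)

    # Query the token metrics over the window being compared
    result = client.query_resource(
        resource_id,
        metric_names=["ProcessedPromptTokens", "GeneratedTokens"],  # assumed IDs
        timespan=timedelta(days=1),
        aggregations=["Total"],
    )
    for metric in result.metrics:
        for series in metric.timeseries:
            print(metric.name, sum(point.total or 0 for point in series.data))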

If you consistently see incorrect values and you are the only caller on the resource, it would be best to open a support ticket to investigate further.


  2. Simone Gallo 0 Reputation points
    2023-11-15T19:02:14.8466667+00:00

@Pramod Valavala Thank you for your response; it was helpful, but it doesn't address my question. Unfortunately, I realized I had omitted a crucial detail: the model is used in multi-turn conversations, so the tokens from previous calls need to be added to the tokens of each new call (as explained in more detail in this post: https://community.openai.com/t/how-can-we-count-the-used-tokens-in-a-conversation/213389).
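
    To see why the numbers compound, with made-up figures: if turn 1 sends a 1,000-token prompt, turn 2 resends those 1,000 tokens plus the first reply plus the new user message, and so on, so after a handful of turns the billed prompt tokens are several times what any single response's usage field shows. The simplest real-time check is to sum the usage block of every call as the conversation runs, for example (assuming response is the raw Chat Completions response of each call):

    running = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}

    def track_usage(response):
        """Add one call's usage block to the running conversation total."""
        for key in running:
            running[key] += response["usage"][key]
        return running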

    I resolved the issue by implementing a couple of Python functions that use tiktoken.

    These are my results from some tests:

    Prompt tokens: 31.18k (Azure) // 31,199 (script)

    Completion tokens: 1.12k (Azure) // 1,112 (script)

    Total tokens: 32.30k (Azure) // 32.31k (script)

    import tiktoken
    def tokens_count_for_message(message, encoding):
        """Return the number of tokens used by a single message."""
        tokens_per_message = 3  # fixed per-message overhead

        num_tokens = tokens_per_message
        for key, value in message.items():
            if key == "function_call":
                # Function calls are counted by their encoded name and arguments
                num_tokens += len(encoding.encode(value["name"]))
                num_tokens += len(encoding.encode(value["arguments"]))
            elif key in ("content", "name") and value:
                # Guard against None content, as in function-call responses
                num_tokens += len(encoding.encode(value))

        return num_tokens
    
    def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
        """Return the number of tokens used by a list of messages for both user and assistant."""
        try:
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            print("Warning: model not found. Using cl100k_base encoding.")
            encoding = tiktoken.get_encoding("cl100k_base")
    
        user_tokens = 0
        assistant_tokens = 0
        for i, message in enumerate(messages):
            # Check if the current message involves a service call
            is_service_call = message["role"] == "assistant"
    
            # Include tokens from previous messages only when a service call is made
            if is_service_call:
                assistant_tokens += tokens_count_for_message(message, encoding)
                for j in range(i):
                    user_tokens += tokens_count_for_message(messages[j], encoding)
    
            # Count tokens for the current message
            user_tokens += tokens_count_for_message(message, encoding)
    
        assistant_tokens += 3  # every reply is primed with assistant
        
        return user_tokens, assistant_tokens, user_tokens+assistant_tokens
    
    # Usage: num_tokens_from_messages(<messages list>, <model name>)
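
    For anyone reusing the snippet, a call could look like this (the conversation below is made up, including a function_call turn in the shape the counting code expects):

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather in Milan?"},
        {"role": "assistant", "content": None,
         "function_call": {"name": "get_weather",
                           "arguments": '{"city": "Milan"}'}},
    ]

    user_tokens, assistant_tokens, total_tokens = num_tokens_from_messages(messages)
    print(user_tokens, assistant_tokens, total_tokens)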
    
