Cost Optimization

1. Effective Utilization of PTUs

The Azure OpenAI Sizing tool helps enterprises plan their Azure OpenAI (AOAI) capacity based on their requirements. Enterprises can get more predictable performance from AOAI by procuring Provisioned Throughput Units (PTUs), which require advance payment and reservation of AOAI quota. However, if this reserved capacity remains underutilized, it leads to inefficient resource allocation and financial overhead.

To mitigate this inefficiency, the following approaches can be used:

Using a spillover strategy to control costs: Implementing a spillover strategy allows enterprises to utilize their pre-purchased PTUs first and route only the excess traffic to Pay-As-You-Go (PAYG) endpoints. With this approach, the procured PTU capacity can be lower than the required peak capacity. This technique is elaborated here.
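The spillover flow can be sketched as follows. This is a minimal illustration, assuming a `send` callable that posts the request to a given endpoint URL and returns a `(status_code, body)` pair; the endpoint names are placeholders, not real deployments.

```python
def complete_with_spillover(send, payload, ptu_url, paygo_url):
    """Try the provisioned (PTU) deployment first; on HTTP 429
    (PTU capacity exhausted), spill over to the PAYG deployment."""
    status, body = send(ptu_url, payload)
    if status == 429:
        status, body = send(paygo_url, payload)
    return status, body
```

In practice this routing usually lives in the GenAI gateway (for example, as a policy), so that individual consumers need not know which backend served their request.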

Effective consumption of PTUs around the clock: By separating consumers into real-time and batch (scheduled/on-demand) categories and applying the monitoring approaches discussed above, PTU utilization can be orchestrated so that batch consumers consume PTUs only when the PTU endpoint is underutilized.
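The orchestration above amounts to an admission gate for batch traffic. A minimal sketch, assuming the current utilization percentage is obtained elsewhere (for example, from the provisioned-utilization metric exposed by Azure Monitor) and that the threshold value is tuned per workload:

```python
# Percentage below which batch consumers may use the PTU endpoint;
# an assumed value for illustration, to be tuned per workload.
BATCH_UTILIZATION_THRESHOLD = 70

def batch_may_proceed(current_utilization: float) -> bool:
    """Admit batch traffic to the PTU endpoint only while real-time
    consumers leave headroom below the configured threshold."""
    return current_utilization < BATCH_UTILIZATION_THRESHOLD
```

Batch jobs that are denied admission can either wait and re-poll the metric or fall back to PAYG endpoints, depending on their latency tolerance.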

2. Tracking Resource Consumption at Consumer Level

In a large enterprise setup, operational costs are shared among different business units through a charge-back model. For GenAI resources, this tracking involves the following actions:

  • Measuring consumption per consumer for both the PTU (reserved capacity) and TPM (Pay-As-You-Go) quotas
  • Enabling Business Units (BUs) with transparent cost reporting, allocated-versus-consumed quota reporting, and cost attribution functionalities.

In the realm of AOAI, the approach to consumption tracking depends on the mode of interaction with the AOAI services.

Batch Processing Mode:

The batch processing mode involves the following steps:

  • Sending a set of inputs all at once
  • Receiving the outputs after the model has processed the entire batch

In this mode, the usage information returned as part of the response body contains the total number of tokens consumed in processing that request.

An example of the usage payload from Azure OpenAI completions endpoint:

"usage": {
  "prompt_tokens": 14,
  "completion_tokens": 436,
  "total_tokens": 450
}

Using the techniques discussed in the Monitoring section, the GenAI gateway can be configured to parse and record this payload at a consumer level. Consumer-level payload information can then be aggregated to build a view of token consumption over a specific time interval for each consumer.
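The aggregation step can be sketched as below. This assumes the gateway has already recorded each request as a `(consumer_id, usage)` pair, where `usage` is the payload shown above; the record shape is an assumption for illustration.

```python
from collections import defaultdict

def aggregate_usage(records):
    """Sum the usage payloads per consumer to support
    charge-back reporting over a chosen time interval."""
    totals = defaultdict(lambda: {
        "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0,
    })
    for consumer_id, usage in records:
        for key in totals[consumer_id]:
            totals[consumer_id][key] += usage.get(key, 0)
    return dict(totals)
```

The resulting per-consumer totals can feed the cost reporting and quota-versus-consumed views mentioned earlier.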

Streaming Mode:

In streaming mode, AOAI does not return the usage statistics as part of the response. If the tokens need to be counted, the following approach can be applied.

  • Measure the prompt tokens: The number of prompt tokens has to be calculated from the request using a library such as tiktoken.

  • Measure the completion tokens: The number of events in the stream approximates the number of tokens in the response, so count the events while iterating over the streamed response.

The total token count is the sum of the prompt and completion tokens. It will still be an approximation, because there is no guarantee that each chunk of the response contains exactly one token.
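The two measurements can be combined as sketched below. Here `encode` stands in for a tokenizer function such as `tiktoken.get_encoding("cl100k_base").encode`, and `stream_chunks` for the iterable of streamed events; both are injected so the sketch stays library-agnostic.

```python
def count_stream_tokens(prompt, stream_chunks, encode):
    """Approximate usage accounting for a streamed AOAI response:
    prompt tokens are computed locally with a tokenizer, while
    completion tokens are estimated as one token per stream event."""
    prompt_tokens = len(encode(prompt))
    completion_tokens = sum(1 for _ in stream_chunks)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        # Approximate: a chunk is not guaranteed to hold exactly one token.
        "total_tokens": prompt_tokens + completion_tokens,
    }
```

The gateway can record this synthesized usage object in the same per-consumer store used for batch-mode responses, keeping the charge-back view uniform across both modes.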