This article provides a Databricks-recommended notebook example for benchmarking an LLM endpoint. It also includes a brief introduction to how Databricks performs LLM inference and calculates latency and throughput as endpoint performance metrics.
LLM inference on Databricks is measured in tokens per second for the provisioned throughput mode of Foundation Model APIs. See What do tokens per second ranges in provisioned throughput mean?
You can import the following notebook into your Databricks environment and specify the name of your LLM endpoint to run a load test.
LLMs perform inference in a two-step process:
- Prefill, where the tokens in the input prompt are processed in parallel.
- Decoding, where text is generated one token at a time in an autoregressive manner, as sketched below. Each generated token is appended to the input and fed back into the model to generate the next token. Generation stops when the LLM outputs a special stop token or when a user-defined condition is met.
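To make the two steps concrete, here is a minimal sketch of an autoregressive generation loop. The `model_forward` callable and the stop-token ID are hypothetical placeholders rather than any Databricks or model-specific API; the sketch only shows where prefill ends and decoding begins.

```python
# Minimal sketch of the prefill + decode loop. `model_forward` is a
# hypothetical stand-in for a real model call; it is not a Databricks API.

from typing import Callable, List

EOS_TOKEN_ID = 0  # hypothetical stop token


def generate(
    model_forward: Callable[[List[int]], int],
    prompt_token_ids: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Autoregressively generate tokens from a prompt."""
    # Prefill: the prompt tokens are processed in a single (parallel) pass,
    # producing the first output token. This step drives time to first token.
    tokens = list(prompt_token_ids)
    next_token = model_forward(tokens)

    output: List[int] = []
    # Decode: generate one token at a time, appending each new token and
    # feeding it back in. Each iteration's cost drives time per output token.
    while next_token != EOS_TOKEN_ID and len(output) < max_new_tokens:
        output.append(next_token)
        tokens.append(next_token)
        next_token = model_forward(tokens)
    return output
```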
Most production applications have a latency budget, and Databricks recommends that you maximize throughput within that latency budget.
Databricks divides LLM inference into the following sub-metrics:
- Time to first token (TTFT): how quickly users start seeing the model's output after submitting their query. This metric is driven by the time required to process the prompt (prefill) and generate the first output token. Low TTFT is essential for real-time, interactive applications.
- Time per output token (TPOT): the time to generate each subsequent output token for a request. This metric corresponds to how fast each user perceives the model to be.
Based on these metrics, total latency and throughput can be defined as follows (a worked example follows the definitions):
- Latency = TTFT + TPOT × (number of output tokens to be generated)
- Throughput = total number of output tokens per second across all concurrent requests
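As a worked example of these definitions, the following sketch plugs illustrative placeholder values into the two formulas; the numbers are not measurements from any endpoint.

```python
# Worked example of the latency and throughput definitions above.
# All numbers are illustrative placeholders, not measured values.

ttft_s = 0.5             # time to first token, in seconds
tpot_s = 0.05            # time per output token, in seconds
output_tokens = 200      # number of output tokens generated per request
concurrent_requests = 4  # number of requests processed at the same time

# Latency = TTFT + TPOT * (number of output tokens)
latency_s = ttft_s + tpot_s * output_tokens

# Throughput = output tokens per second across all concurrent requests.
# Here we assume all requests complete in roughly the same time.
throughput_tok_per_s = (output_tokens * concurrent_requests) / latency_s

print(f"Per-request latency: {latency_s:.2f} s")          # 10.50 s
print(f"Aggregate throughput: {throughput_tok_per_s:.1f} tokens/s")
```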
On Databricks, LLM serving endpoints can scale to match the load sent by clients issuing multiple concurrent requests. There is a trade-off between latency and throughput because concurrent requests are processed at the same time on the endpoint. At low concurrent request loads, latency is as low as possible. If you increase the request load, latency might go up, but throughput likely goes up as well, because two requests' worth of tokens can be processed in less than twice the time of one request.
Therefore, controlling the number of parallel requests into your system is core to balancing latency against throughput. If you have a low-latency use case, send fewer concurrent requests to the endpoint to keep latency low. If you have a high-throughput use case, saturate the endpoint with many concurrent requests, since higher throughput is worth it even at the expense of latency.
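As a rough illustration of controlling parallelism (not the Databricks benchmarking harness itself), the following Python sketch caps in-flight requests with a thread pool and reports mean latency and aggregate throughput. The endpoint URL, token, payload, and the `query_once` and `run_load_test` helpers are placeholders to adapt to your serving endpoint; the sketch also assumes the response includes an OpenAI-style `usage.completion_tokens` count, which may differ for your endpoint.

```python
# Sketch of a concurrency-capped load test. This is not the Databricks
# benchmarking notebook; the URL, token, and payload below are placeholders.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT_URL = "https://<workspace-url>/serving-endpoints/<endpoint-name>/invocations"
API_TOKEN = "<databricks-personal-access-token>"

PAYLOAD = {
    "messages": [{"role": "user", "content": "Summarize the benefits of unit tests."}],
    "max_tokens": 256,
}


def query_once(_: int) -> tuple[float, int]:
    """Send one request and return (latency_seconds, output_token_count)."""
    start = time.perf_counter()
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=PAYLOAD,
        timeout=300,
    )
    latency = time.perf_counter() - start
    response.raise_for_status()
    # Assumes an OpenAI-style usage block; adjust for your endpoint's schema.
    return latency, response.json()["usage"]["completion_tokens"]


def run_load_test(concurrency: int, num_requests: int = 16) -> tuple[float, float]:
    """Return (mean_latency_seconds, output_tokens_per_second) at a concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(query_once, range(num_requests)))
    wall_clock = time.perf_counter() - start

    mean_latency = sum(latency for latency, _ in results) / len(results)
    throughput = sum(tokens for _, tokens in results) / wall_clock
    return mean_latency, throughput


# Low-latency use case: keep concurrency small.
print(run_load_test(concurrency=2))
# High-throughput use case: saturate the endpoint with more parallel requests.
print(run_load_test(concurrency=16))
```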
The previously shared benchmarking example notebook is Databricks' benchmarking harness. It displays the latency and throughput metrics and plots the throughput-versus-latency curve across different numbers of parallel requests. Databricks endpoint autoscaling uses a strategy that balances latency and throughput. In the notebook, you can observe that as more concurrent users query the endpoint at the same time, both latency and throughput increase.
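For a rough analogue of the notebook's throughput-versus-latency plot, the sketch below sweeps several concurrency levels and plots the resulting curve with matplotlib. It assumes the hypothetical `run_load_test` helper from the previous sketch (or any function returning mean latency and output tokens per second for a given concurrency level); the concurrency levels shown are arbitrary examples.

```python
# Sketch of plotting the throughput-versus-latency curve. Builds on the
# hypothetical run_load_test helper from the previous sketch, which returns
# (mean_latency_seconds, output_tokens_per_second) for a concurrency level.

import matplotlib.pyplot as plt

concurrency_levels = [1, 2, 4, 8, 16, 32]  # example sweep; adjust for your test

latencies, throughputs = [], []
for concurrency in concurrency_levels:
    mean_latency, throughput = run_load_test(concurrency)
    latencies.append(mean_latency)
    throughputs.append(throughput)

fig, ax = plt.subplots()
ax.plot(latencies, throughputs, marker="o")
for concurrency, x, y in zip(concurrency_levels, latencies, throughputs):
    ax.annotate(f"{concurrency} parallel", (x, y))  # label each point with its concurrency
ax.set_xlabel("Mean latency (s)")
ax.set_ylabel("Throughput (output tokens/s)")
ax.set_title("Throughput vs. latency across parallel request counts")
plt.show()
```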
More details on the Databricks philosophy about LLM performance benchmarking are described in the blog post LLM Inference Performance Engineering: Best Practices.