I am not an expert in this matter, but this is what I found so far:
- You can try sending periodic requests to keep the model "warm" and reduce cold start times, so it stays ready for actual user queries (see the first sketch after this list).
- If possible, reduce the input length, since longer inputs generally lead to longer processing times. You can summarize or chunk long inputs (see the chunking sketch after this list).
- You can also lower the max_tokens value if you don't need 1000 tokens in the response. If you're making multiple requests, consider batching them together, which can be more efficient than sending them individually (see the last sketch after this list).
- If your use case allows, consider using a smaller model, which will usually have faster inference times. Evaluate whether a less complex model could still meet your needs.
- If you're using a serverless deployment, switching to a dedicated deployment might offer more consistent and potentially faster response times.
- If available, try increasing the compute resources allocated to your deployment. More powerful hardware can potentially speed up processing.
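For the keep-warm idea, here is a minimal sketch in Python. I'm assuming an OpenAI-style completions endpoint reached over HTTP; the URL, headers, and payload fields are placeholders you would replace with your provider's actual API.

```python
import time
import requests

# Hypothetical endpoint and payload shape; adjust to your provider's actual API.
ENDPOINT = "https://your-deployment.example.com/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}


def keep_warm(interval_seconds: int = 300) -> None:
    """Send a tiny request on a fixed interval so the model stays loaded."""
    while True:
        try:
            # A one-word prompt with max_tokens=1 keeps each warm-up call cheap.
            requests.post(
                ENDPOINT,
                headers=HEADERS,
                json={"prompt": "ping", "max_tokens": 1},
                timeout=30,
            )
        except requests.RequestException as exc:
            print(f"warm-up request failed: {exc}")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    keep_warm()
```

The interval depends on how quickly your deployment scales down when idle, so you may need to experiment with it.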
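For chunking long inputs, something like this word-based splitter is a simple starting point. It's plain Python with no API assumptions; you would call the model once per chunk and then combine the partial results.

```python
def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split a long input into word-based chunks that fit a smaller budget."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Example: process each chunk separately instead of one very long prompt.
long_input = open("report.txt").read()  # placeholder for your actual document
for chunk in chunk_text(long_input, max_words=500):
    ...  # call the model on this chunk, then merge the partial answers
```

Splitting on sentence or paragraph boundaries instead of raw word counts usually gives better results, but the idea is the same.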
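For lowering max_tokens and batching, the sketch below assumes the endpoint accepts a list of prompts in a single request; whether that's actually supported, and what the payload looks like, depends entirely on your provider, so treat the fields here as assumptions.

```python
import requests

# Hypothetical endpoint; confirm your provider supports batched prompts.
ENDPOINT = "https://your-deployment.example.com/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

prompts = ["Summarize report A", "Summarize report B", "Summarize report C"]

response = requests.post(
    ENDPOINT,
    headers=HEADERS,
    json={
        "prompt": prompts,   # one batched call instead of three round trips
        "max_tokens": 256,   # lowered from 1000 if shorter replies are enough
    },
    timeout=60,
)
print(response.json())
```

If your endpoint doesn't support batched prompts, sending the individual requests concurrently (for example with a thread pool) can still cut overall wall-clock time, though it won't reduce per-request latency.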