How to improve response time of Phi-3-medium-128k serverless API?

Rithika Chowta 0 Reputation points
2024-07-16T07:38:52.1633333+00:00

I have deployed the Phi-3-medium-128k model using Azure AI Studio (serverless deployment). I am using the v1/chat/completions API to get chat completions, and I am streaming the response. The time to first token is quite high, around 15 seconds for an average input length of ~3,000 tokens. These are some of the config parameters I am using:

 "temperature": 0.0,
 "max_tokens": 1000,
 "top_p": 1.0

Is the latency supposed to be this high? How can I improve it?
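
For reference, here is a minimal sketch of the kind of streaming call I am timing. The endpoint URL and key are placeholders for my deployment's values, and the first streamed chunk is used as a rough proxy for the first token:

```python
import time

import requests

# Placeholders - substitute the serverless deployment's actual URL and key.
ENDPOINT = "https://<deployment>.<region>.models.ai.azure.com/v1/chat/completions"
API_KEY = "<api-key>"

payload = {
    "messages": [{"role": "user", "content": "<~3000-token prompt>"}],
    "temperature": 0.0,
    "max_tokens": 1000,
    "top_p": 1.0,
    "stream": True,
}

start = time.monotonic()
with requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    # Treat the first non-empty server-sent-events line as a rough
    # proxy for time to first token.
    for line in resp.iter_lines():
        if line:
            print(f"Time to first token: {time.monotonic() - start:.2f}s")
            break
```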

Azure AI services

1 answer

  1. Amira Bedhiafi 19,706 Reputation points
    2024-07-16T08:29:35.9433333+00:00

    I am not an expert in this matter, but this is what I found so far:

    • You can try sending periodic requests to keep the model "warm" and reduce cold-start latency. This helps maintain readiness for actual user queries (see the keep-warm sketch after this list).
    • If possible, reduce the input length, since longer prompts generally take longer to process. Consider summarizing or chunking long inputs (a rough chunking helper also follows this list).
    • You can also lower the max_tokens value if you don't need 1000 tokens in the response.
    • If you're making multiple requests, consider batching them together, which can be more efficient than sending them individually.
    • If your use case allows, consider a smaller model (for example, Phi-3-mini), which may have faster inference times. Evaluate whether a less complex model still meets your needs.
    • Switching from your serverless deployment to a dedicated (managed compute) deployment might offer more consistent, and potentially faster, response times.
    • If available, try increasing the compute resources allocated to your deployment; more powerful hardware can speed up processing.
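
    To make the first point concrete, here is a minimal keep-warm sketch. The endpoint URL, key, and five-minute interval are assumptions to adapt, and each ping consumes a small number of billed tokens on a pay-as-you-go deployment:

    ```python
    import time

    import requests

    # Placeholders - substitute your serverless deployment's URL and key.
    ENDPOINT = "https://<deployment>.<region>.models.ai.azure.com/v1/chat/completions"
    API_KEY = "<api-key>"

    def keep_warm(interval_seconds: float = 300.0) -> None:
        """Send a tiny request on a fixed interval so the deployment stays responsive."""
        payload = {
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1,  # generate as little as possible per ping
        }
        while True:
            try:
                requests.post(
                    ENDPOINT,
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    json=payload,
                    timeout=30,
                )
            except requests.RequestException:
                pass  # a failed ping is not fatal; retry on the next interval
            time.sleep(interval_seconds)
    ```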
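
    And for the second point, a crude chunking helper. The four-characters-per-token ratio is a rough heuristic rather than the model's real tokenizer, so treat the chunk sizes as approximate:

    ```python
    def chunk_text(text: str, max_tokens: int = 1500, chars_per_token: int = 4) -> list[str]:
        """Split text into pieces of roughly max_tokens each, using a crude
        characters-per-token estimate (a real tokenizer would be more accurate)."""
        limit = max_tokens * chars_per_token
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    ```

    Each chunk can then be sent as a separate, shorter request, or summarized first with the summaries combined into the final prompt.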