Hi Harinath J
Hope the pointers from Gowtham CP on reducing the computational overhead of Azure OpenAI helped.
I wanted to add a few points on top of them.
Azure OpenAI
- We can opt for a Global deployment (higher availability compared to a Standard deployment) or Provisioned Throughput Units (dedicated capacity with lower latency) - see Deployment types.
- Use structured outputs, or explicitly instruct the model in the system prompt to stay under a certain word limit (the smaller the output, the faster the response).
- Use asynchronous Azure OpenAI operations along with streaming (see the sketch after this list).
- Send simpler queries instead of lengthy, complex ones, so you can build context gradually and generate the answer with minimal time spent.
- We can also tune temperature along with max_tokens (be cautious about hallucinations in the answer, though).
- Opt for a multi-zone deployment and load balance across the deployments to make them resilient to regional outages and slowness.
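To illustrate the async + streaming point above, here is a minimal sketch using the openai Python SDK's AsyncAzureOpenAI client. The endpoint, key, API version, deployment name, and the word limit in the system prompt are placeholders you would adjust for your own resource:

```python
import asyncio
from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    api_version="2024-02-01",                                   # placeholder
)

async def ask(question: str) -> str:
    # stream=True yields tokens as they are generated, so the caller sees
    # the first words long before the full answer is finished.
    stream = await client.chat.completions.create(
        model="<your-deployment-name>",  # placeholder deployment name
        messages=[
            {"role": "system", "content": "Answer in at most 100 words."},
            {"role": "user", "content": question},
        ],
        max_tokens=200,    # cap the output size: smaller output, faster response
        temperature=0.2,   # lower temperature keeps answers focused and short
        stream=True,
    )
    parts = []
    async for chunk in stream:
        # some streamed chunks carry no content, so guard before reading
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

print(asyncio.run(ask("Summarize our return policy.")))
```

Streaming does not make the complete answer arrive sooner, but it sharply cuts the time to first token, which is what users usually perceive as latency.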
Azure AI Search
- We can also optimize on the Azure AI Search side by changing the chunk size used for indexing.
- We can also upgrade the AI Search tier for higher availability and higher index limits.
- Optimize your indexing operation - for example, incremental indexing or AI enrichment skillsets - and choose whichever fits your workload.
- Reduce Top_K and Top_N to return fewer results (a small example follows this list).
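For the Top_K/Top_N point, here is a minimal sketch with the azure-search-documents SDK; the endpoint, index name, key, and field names are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",  # placeholder
    index_name="<your-index>",                                    # placeholder
    credential=AzureKeyCredential("<your-query-key>"),            # placeholder
)

# top=3 returns only the 3 best-scoring documents instead of the service
# default of 50, which shrinks both the search latency and the amount of
# context you later send to the model.
results = search_client.search(
    search_text="how do I reset my password",
    top=3,
    select=["id", "content"],  # fetch only the fields you actually need
)
for doc in results:
    print(doc["id"], doc["content"][:100])
```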
Python tools
- Normalize the outputs (for example, by rounding) and reduce redundant computation.
- Enable caching, or purge memory that piles up after passing the values to the next node (sketched below).
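As a small illustration of the caching and normalization points - the function and buffer sizes here are made up for the example:

```python
import functools

@functools.lru_cache(maxsize=1024)
def score(query: str) -> float:
    # ...imagine an expensive computation here...
    raw = len(query) * 0.12345
    return round(raw, 4)  # normalize: round before passing downstream

score("hello")  # computed once
score("hello")  # served from the cache, no recomputation

# Free large intermediates once the next node has received the values.
big_buffer = [0] * 10_000_000
# ...pass results downstream...
del big_buffer
```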
Storage side
- Please use the hot tier for faster data fetching (see the snippet after this list).
- Enable a CDN to reduce latency when serving the training data.
- GRS or GZRS storage provides more resiliency against disasters and outages.
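If you want to set the tier programmatically, here is a minimal sketch with the azure-storage-blob SDK; the connection string, container, and blob names are placeholders:

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
blob = service.get_blob_client(container="training-data", blob="dataset.parquet")

# Move the blob to the Hot tier so reads are served with the lowest latency.
blob.set_standard_blob_tier("Hot")
print(blob.get_blob_properties().blob_tier)  # should report "Hot"
```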
Overall, we should try to reduce the computational overhead while improving the performance of the operation and the underlying resources.
Hope it helps.
Please don’t forget to Accept Answer and mark Yes for "Was this answer helpful" wherever the information provided helps you; this can benefit other community members.
Thank You.