Optimizing Azure on your data for Japanese text

shota sato 20 Reputation points

I have setup RAG with Azure on my data for a Japanese Q&A data,

but the response time is slow, taking about 15 seconds.


  • opeai model: gpt-4-32k
  • opeai embedding model: text-embedding-ada-002
  • Use vector search: Yes
  • Data source: Azure Blob Storage

Converted .csv with Q&A to .txt and stored in the same blob.

Only the file format is converted. The contents were not edited at all.

All content is in Japanese.

  • csv A: 510 lines
  • csv B: 55 lines
  • csv C: 205 lines

Header of csv.

  • menu category1 category2 category3 category4 category5 question answer page_url


  • How to improve accuracy in Azure on your data
  • Other way than Azure on your data

The data is in Japanese.

Translated with DeepL.com (free version)

Azure OpenAI Service
Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.
2,470 questions
{count} votes

Accepted answer
  1. navba-MSFT 18,905 Reputation points Microsoft Employee

    @shota sato Welcome to Microsoft Q&A Forum, Thank you for posting your query here!


    If you are using GPT4 model then latency is expected considering that gpt-4 has more capacity than the gpt-3.5 version.


    As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service. Instead, the SLA primarily focuses on availability, which is maintained at a 99.9% level. This means that the emphasis is on ensuring the service is accessible and operational rather than guaranteeing specific performance metrics.


    Action Plan for latency (optimize response time):

    This article talks about Azure OpenAI service about improving the latency performance. Here are some of the best practices to lower latency:

    • Model latency: If model latency is important to you we recommend trying out our latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that even in cases where the total number of tokens generated is similar the request with the higher value set for the max token parameter will have more latency.
    • Lower total tokens generated: The fewer tokens generated the faster the overall response will be. Remember this is like having a for loop with n tokens = n iterations. Lower the number of tokens generated and overall response time will improve accordingly.
    • Streaming: Enabling streaming can be useful in managing user expectations in certain situations by allowing the user to see the model response as it is being generated rather than having to wait until the last token is ready. User's image
    • Content Filtering improves safety, but it also impacts latency. Evaluate if any of your workloads would benefit from modified content filtering policies.




    Action Plan (for optimizing the response accuracy):
    This article can be followed to fine-tune the Azure Open AI response:


    • You should be able to clearly articulate a specific use case for fine-tuning and identify the model you hope to fine-tune.
    • Good use cases for fine-tuning include steering the model to output content in a specific and customized style, tone, or format, or scenarios where the information needed to steer the model is too long or complex to fit into the prompt window.



    Please let me know if you have any follow-up questions. I would be happy to answer it. Awaiting your reply.

    0 comments No comments

0 additional answers

Sort by: Most helpful