Hello @Ola Ingvarsson,
It sounds like you're encountering unusually high latency when using the Responses API with the GPT-4.1 model in Sweden Central.
I attempted to reproduce the issue in my environment using the GPT-4.1 model in the Sweden Central region, and it's working as expected without any noticeable latency or delays.
This kind of performance degradation can be influenced by several factors, so here are a few things worth checking:
Ensure you're operating within the allowed Requests Per Minute (RPM) and Tokens Per Minute (TPM) quotas for your GPT-4.1 deployment. For the default tier, GPT-4.1 supports up to 1,000 RPM and 1 million TPM. Exceeding these quotas can lead to throttling (HTTP 429 responses) or delays.
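If you are being throttled, retrying with exponential backoff usually smooths things out. Here's a minimal sketch using the openai Python package; the endpoint, key, deployment name, and API version below are placeholders you'd replace with your own values:

```python
import time
from openai import AzureOpenAI, RateLimitError

# Placeholder values -- substitute your own endpoint, key, and an
# api_version that supports the Responses API in your region.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-03-01-preview",
)

def create_with_backoff(prompt: str, max_retries: int = 5):
    """Retry on 429 throttling with exponential backoff."""
    for attempt in range(max_retries):
        try:
            # On Azure, "model" is your deployment name.
            return client.responses.create(model="gpt-4.1", input=prompt)
        except RateLimitError:
            # Sleep 1s, 2s, 4s, ... before retrying.
            time.sleep(2 ** attempt)
    raise RuntimeError("Still throttled after retries; check your RPM/TPM quota.")
```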
If you're frequently making similar requests, implementing caching can reduce repeated calls to the service and improve response times. You can control caching behavior using the Cache-Control header, for example:

Cache-Control: max-age=30

This sets the cache validity to 30 seconds. Use directives like no-cache or no-store to bypass or disable caching as needed.
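Whether an HTTP-level Cache-Control header is honored depends on what sits in front of the service (e.g., an API Management gateway), so a simple in-process cache keyed on the prompt is often the easier win. A rough sketch, reusing the client from the snippet above:

```python
import functools

@functools.lru_cache(maxsize=256)
def cached_response(prompt: str) -> str:
    # Identical prompts within this process are served from memory
    # instead of triggering another API call.
    result = client.responses.create(model="gpt-4.1", input=prompt)
    return result.output_text
```

Note that lru_cache has no expiry, so it doesn't map exactly to max-age=30; add a timestamp check if entries need to go stale.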
Large payloads, especially prompts with high token counts, can significantly impact response time. Try reducing the input size or limiting max_tokens (max_output_tokens in the Responses API) in your request.
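For example (again with the client from the first snippet; the cap of 256 is arbitrary):

```python
response = client.responses.create(
    model="gpt-4.1",
    input="Summarize the incident report in three bullet points.",
    max_output_tokens=256,  # Cap generation length; shorter outputs return faster.
)
print(response.output_text)
```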
If latency remains high, consider deploying your model to another region temporarily (e.g., West Europe or North Europe) to compare performance. This helps determine whether the issue is regional or related to your specific deployment.
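One quick way to compare is to time the same request against both deployments. A rough sketch, where both endpoints and the key are placeholders:

```python
import time
from openai import AzureOpenAI

# Placeholder endpoints for the two deployments you want to compare.
endpoints = {
    "swedencentral": "https://<sweden-resource>.openai.azure.com",
    "westeurope": "https://<west-europe-resource>.openai.azure.com",
}

for region, endpoint in endpoints.items():
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_key="<your-api-key>",
        api_version="2025-03-01-preview",
    )
    start = time.perf_counter()
    client.responses.create(model="gpt-4.1", input="ping")
    elapsed = time.perf_counter() - start
    print(f"{region}: {elapsed:.2f}s")  # Run several times and compare medians.
```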
Leverage tools like Azure Monitor or Application Insights to analyze request latency, identify spikes, and establish a performance baseline. This can help you determine whether the issue is systemic or workload-specific.
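If you have diagnostic logs flowing to a Log Analytics workspace, you can also pull latency percentiles programmatically. A sketch using the azure-monitor-query package; the workspace ID is a placeholder, and the table/column names depend on your diagnostic settings, so verify them against your own workspace:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# AzureDiagnostics/DurationMs is a common landing spot for Azure OpenAI
# request logs; adjust to match what your diagnostic settings emit.
query = """
AzureDiagnostics
| where TimeGenerated > ago(24h)
| summarize p50=percentile(DurationMs, 50), p95=percentile(DurationMs, 95)
          by bin(TimeGenerated, 1h)
"""

response = client.query_workspace(
    workspace_id="<your-workspace-id>",
    query=query,
    timespan=timedelta(hours=24),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```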
Also check the Azure Status Page to verify whether there are any known outages or performance issues affecting the Sweden Central region; latency can often be caused by regional service disruptions or maintenance.
I hope this helps. Do let me know if you have further queries.
Thank you!