Random high response time (30sec+) on Azure App Service
We were experiencing randomly high request times throughout the day (not especially during high-traffic periods).
Some requests took more than 1 minute.
We host 4 app services on the same plan (single instance / Standard S3), and the problem seems related to a specific one (the REST API), which is the most resource-consuming.
The app runs on the latest .NET 7 framework version, using Azure SQL and Azure Storage.
We investigated the types of requests involved:
We didn't find any particular type of request that causes the performance issue; it varies. Even when the server is idle, a few types of requests are affected (up to 30 requests are slowed down).
If I look into the details of a specific query, I get this log:
In my application code, I'm schematically just doing this (nothing is done before calling the repository, which performs the DB requests):
If I look into the information logs, most of the time is spent between the "Executing Action" log and the first SQL request:
I see some CPU spikes during the latency periods (when the metric aggregation is set to Max).
Consequently, we investigated the CPU and configured the CPU profiler on the Application Insights instance. When we look at the blocking request, we get this message:
We did some CPU profiling
99.78% of this request was spent in waiting.
Wait for a resource to be available.
We do not have WebJobs or Functions running on the same App Service plan
We investigated our reverse proxy (Cloudflare) and disabled it; we still had the slow requests
We updated all the NuGet packages of the .NET app
Memory usage seems fine (around 50-55%)
Outbound connections average between 50 and 100
@Antoine BIDAULT Thanks for reaching out here!
As per the investigation, slow dependency execution calls were detected.
The app is making TCP calls to the [*.tr1672.francecentral1-a.worker.database.windows.net] endpoint(s), and we detected latency in the remote server's responses to these TCP calls. This may indicate that the remote service is taking longer to respond, or that there is a network issue between the app and the remote server. Delays encountered in calling the remote endpoint can increase the overall execution time of the current app.
It's suggested to investigate whether the remote service is responding properly.
Also, your web app is currently configured to run on only one instance.
Since you have only one instance, you can expect downtime: when the App Service platform is upgraded, the instance on which your web app is running will be upgraded, so your web app process will be restarted and will experience downtime.
It's recommended to distribute your web app across multiple instances: scale out and add additional instances. These instances are in different upgrade domains and therefore will not be upgraded at the same time. While one worker instance is being upgraded, the other is still active to serve web requests.
Refer: Scale instance count manually or automatically
Also, your web app is currently not utilizing the Health Check feature. Your site is running on 1 worker. You should leverage this feature to avoid downtime.
The Health Check feature automatically removes a faulty instance from rotation, thus improving availability. It pings the specified health check path on all instances of your web app every minute. If an instance does not respond, or responds with a failure, for 10 minutes by default (the WEBSITE_HEALTHCHECK_MAXPINGFAILURES app setting defines the number of pings), the instance is considered unhealthy and the service stops routing requests to it. It is highly recommended for production apps to use this feature to minimize any potential downtime caused by a faulty instance.
The number of pings and how long an unhealthy instance remains out of the load balancer are configurable. It is important that the endpoint you configure for the health check probe implements logic to check the health of any dependencies the application relies on. Failing to do so will cause the health check to report healthy even when the application is failing due to a dependency failure, which has an impact on your SLA.
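As an illustration only (the endpoint path and connection-string name below are placeholders, not from your app), a minimal ASP.NET Core (.NET 7) health check that also probes the SQL dependency could look like this, assuming the AspNetCore.HealthChecks.SqlServer NuGet package:

```csharp
var builder = WebApplication.CreateBuilder(args);

// Register health checks; AddSqlServer comes from the
// AspNetCore.HealthChecks.SqlServer package and verifies the DB dependency
// by opening a connection and running a trivial query.
builder.Services.AddHealthChecks()
    .AddSqlServer(builder.Configuration.GetConnectionString("Default")!);

var app = builder.Build();

// Expose the path you then configure as the Health Check path in the portal.
app.MapHealthChecks("/healthz");

app.Run();
```

With this in place, an instance whose database connectivity breaks will fail the probe and be pulled from rotation instead of serving failing requests.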
For more information, check the documentation link below:
Let us know.
Thanks a lot for your response.
Could you please clarify the resource behind "*.tr1672.francecentral1-a.worker.database.windows.net"? Is it our primary Azure SQL database node? We investigated performance issues on the database, but we didn't see any request taking a long time. Even though it is called many times during requests, the database seems very healthy.
We just moved our REST API resource (alliage-app-api) from Windows to Linux with a similar configuration. We'll wait and see tomorrow, and we'll keep you posted if the problem persists.
It is hard for us to see the main benefit of moving to multiple instances, because average CPU, network and RAM seem OK. We'll try it next, but it requires some adaptation (moving the in-memory cache to a distributed cache to prevent cache-freshness issues).
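For context, the adaptation we have in mind is replacing `IMemoryCache` with a Redis-backed `IDistributedCache`. A rough sketch (the Microsoft.Extensions.Caching.StackExchangeRedis package, the connection-string name, and the key prefix are placeholders on our side):

```csharp
builder.Services.AddStackExchangeRedisCache(options =>
{
    // Shared Redis instance, so every web app instance
    // sees the same cache entries and freshness stays consistent.
    options.Configuration = builder.Configuration.GetConnectionString("Redis");
    options.InstanceName = "api-cache:"; // key prefix (placeholder)
});
```

Services then take an `IDistributedCache` dependency instead of `IMemoryCache`, which is what makes scaling out to multiple instances safe.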
@Antoine BIDAULT When I read through your question, the first thing that popped into my head was some sort of problem connecting to SQL Database. For example, perhaps at the point where your code sends the DELETE to Azure SQL Database, there is some issue establishing the connection that causes an extra 30+ second delay; then, when the connection succeeds, the actual DELETE in SQL executes quickly as expected.
I doubt this is the case, but to illustrate: imagine you were using Azure SQL Serverless and the server had auto-paused. The next time your code attempted to connect, there would be a delay while the server auto-resumed.
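If transient connection problems do turn out to be involved, one cheap mitigation to try is EF Core's built-in retrying execution strategy. A sketch, assuming EF Core with the SQL Server provider (the context name and parameter values are illustrative, not from your app):

```csharp
builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseSqlServer(connectionString, sqlOptions =>
        // Retry transient SQL errors (e.g. slow connection establishment
        // or an auto-resuming serverless database) with exponential back-off.
        sqlOptions.EnableRetryOnFailure(
            maxRetryCount: 5,
            maxRetryDelay: TimeSpan.FromSeconds(10),
            errorNumbersToAdd: null)));
```

Note this masks the symptom rather than explaining it, so it's worth pairing with the audit-log correlation suggested below.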
One suggestion might be to enable audit logs on SQL and look to see if you can correlate entries on the SQL side with the delay entries on the app side.
This is very surprising, but switching to a Linux instance solved the problem. Today, no customer has complained about the random latency, and Application Insights did not track any request with a response time above 1 second. We no longer have CPU spikes: CPU usage stays between 5% and 30%, and memory between 40% and 50%. Everything is now running fine.
We've set up exactly the same instance configuration (Standard S3 / single instance). .NET 7 seems to run better on a Linux stack.
@TP We'll keep monitoring the SQL DB connection-establishment time; maybe the problem on Windows was related to this, because the latency often happened right before an Entity Framework SQL request. On the other hand, we looked at other requests, and it sometimes happened before hitting the Redis cache. It randomly blocked request processing: it was totally random, and every client experienced long response times at exactly the same moment (roughly once every 2 hours).
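To monitor the connection-establishment time specifically, we plan to do something along these lines around connection opening (a sketch using Microsoft.Data.SqlClient; the helper name and the 1-second threshold are arbitrary choices on our side):

```csharp
using System.Diagnostics;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Logging;

static SqlConnection OpenTimed(string connectionString, ILogger logger)
{
    var connection = new SqlConnection(connectionString);

    var sw = Stopwatch.StartNew();
    connection.Open(); // a stalled TCP/TLS handshake would show up here
    sw.Stop();

    // Surface abnormally slow connection establishment in the logs.
    if (sw.ElapsedMilliseconds > 1000)
        logger.LogWarning("SqlConnection.Open took {ElapsedMs} ms",
            sw.ElapsedMilliseconds);

    return connection;
}
```

Correlating these warnings with the slow-request timestamps in Application Insights should tell us whether connection setup was the culprit on Windows.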