Events
Mar 17, 9 PM - Mar 21, 10 AM
Join the meetup series to build scalable AI solutions based on real-world use cases with fellow developers and experts.
Register nowThis browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
APPLIES TO:
NoSQL
In Azure Cosmos DB, you might notice slow requests. Delays can happen for multiple reasons, such as request throttling or the way your application is designed. This article explains the different root causes for this problem.
Request throttling is the most common reason for slow requests. Azure Cosmos DB throttles requests if they exceed the allocated request units for the database or container. The SDK has built-in logic to retry these requests. The request rate too large troubleshooting article explains how to check if the requests are being throttled. The article also discusses how to scale your account to avoid these problems in the future.
When you design your application, follow the .NET SDK best practices for the best performance. If your application doesn't follow the SDK best practices, you might get slow or failed requests.
Consider the following when developing your application:
If you need to verify that a database or container exists, don't do so by calling Create...IfNotExistsAsync
or Read...Async
before doing an item operation. The validation should only be done on application startup when it's necessary, if you expect them to be deleted. These metadata operations generate extra latency, have no service-level agreement (SLA), and have their own separate limitations. They don't scale like data operations.
Bulk mode is a throughput optimized mode meant for high data volume operations, not a latency optimized mode; it's meant to saturate the available throughput. If you're experiencing slow requests when using bulk mode make sure that:
async/await
pattern to process all concurrent Tasks and not blocking any async operation.All the responses in the SDK, including CosmosException
, have a Diagnostics
property. This property records all the information related to the single request, including if there were retries or any transient failures.
The diagnostics are returned as a string. The string changes with each version, as it's improved for troubleshooting different scenarios. With each version of the SDK, the string will have breaking changes to the formatting. Don't parse the string to avoid breaking changes. The following code sample shows how to read diagnostic logs by using the .NET SDK:
try
{
ItemResponse<Book> response = await this.Container.CreateItemAsync<Book>(item: testItem);
if (response.Diagnostics.GetClientElapsedTime() > ConfigurableSlowRequestTimeSpan)
{
// Log the response.Diagnostics.ToString() and add any additional info necessary to correlate to other logs
}
}
catch (CosmosException cosmosException)
{
// Log the full exception including the stack trace with: cosmosException.ToString()
// The Diagnostics can be logged separately if required with: cosmosException.Diagnostics.ToString()
}
// When using Stream APIs
ResponseMessage response = await this.Container.CreateItemStreamAsync(partitionKey, stream);
if (response.Diagnostics.GetClientElapsedTime() > ConfigurableSlowRequestTimeSpan || !response.IsSuccessStatusCode)
{
// Log the diagnostics and add any additional info necessary to correlate to other logs with: response.Diagnostics.ToString()
}
The JSON structure has breaking changes with each version of the SDK. This makes it unsafe to be parsed. The JSON represents a tree structure of the request going through the SDK. The following sections cover a few key things to look at.
High CPU utilization is the most common cause for slow requests. For optimal latency, CPU usage should be roughly 40 percent. Use 10 seconds as the interval to monitor maximum (not average) CPU utilization. CPU spikes are more common with cross-partition queries, where the requests might do multiple connections for a single query.
The timeouts include diagnostics, which contain the following, for example:
"systemHistory": [
{
"dateUtc": "2021-11-17T23:38:28.3115496Z",
"cpu": 16.731,
"memory": 9024120.000,
"threadInfo": {
"isThreadStarving": "False",
....
}
},
{
"dateUtc": "2021-11-17T23:38:38.3115496Z",
"cpu": 16.731,
"memory": 9024120.000,
"threadInfo": {
"isThreadStarving": "False",
....
}
},
...
]
cpu
values are over 70 percent, the timeout is likely to be caused by CPU exhaustion. In this case, the solution is to investigate the source of the high CPU utilization and reduce it, or scale the machine to a larger resource size.threadInfo/isThreadStarving
nodes have True
values, the cause is thread starvation. In this case, the solution is to investigate the source or sources of the thread starvation (potentially locked threads), or scale the machine or machines to a larger resource size.dateUtc
time between measurements isn't approximately 10 seconds, it also indicates contention on the thread pool. CPU is measured as an independent task that is enqueued in the thread pool every 10 seconds. If the time between measurements is longer, it indicates that the async tasks aren't able to be processed in a timely fashion. The most common scenario is when your application code is blocking calls over async code.The client application that uses the SDK should be scaled up or out.
HttpResponseStats
are requests that go to the gateway. Even in direct mode, the SDK gets all the metadata information from the gateway.
If the request is slow, first verify that none of the previous suggestions yield the desired results. If it's still slow, different patterns point to different problems. The following table provides more details.
Number of requests | Scenario | Description |
---|---|---|
Single to all | Request timeout or HttpRequestExceptions |
Points to SNAT port exhaustion, or a lack of resources on the machine to process the request in time. |
Single or small percentage (SLA isn't violated) | All | A single or small percentage of slow requests can be caused by several different transient problems, and should be expected. |
All | All | Points to a problem with the infrastructure or networking. |
SLA violated | No changes to application, and SLA dropped. | Points to a problem with the Azure Cosmos DB service. |
"HttpResponseStats": [
{
"StartTimeUTC": "2021-06-15T13:53:09.7961124Z",
"EndTimeUTC": "2021-06-15T13:53:09.7961127Z",
"RequestUri": "https://127.0.0.1:8081/dbs/347a8e44-a550-493e-88ee-29a19c070ecc/colls/4f72e752-fa91-455a-82c1-bf253a5a3c4e",
"ResourceType": "Collection",
"HttpMethod": "GET",
"ActivityId": "e16e98ec-f2e3-430c-b9e9-7d99e58a4f72",
"StatusCode": "OK"
}
]
StoreResult
represents a single request to Azure Cosmos DB, by using direct mode with the TCP protocol.
If it's still slow, different patterns point to different problems. The following table provides more details.
Number of requests | Scenario | Description |
---|---|---|
Single to all | StoreResult contains TransportException |
Points to SNAT port exhaustion, or a lack of resources on the machine to process the request in time. |
Single or small percentage (SLA isn't violated) | All | A single or small percentage of slow requests can be caused by several different transient problems, and should be expected. |
All | All | A problem with the infrastructure or networking. |
SLA violated | Requests contain multiple failure error codes, like 410 |
Points to a problem with the Azure Cosmos DB service or the client machine. |
SLA violated | StorePhysicalAddress are the same, with no failure status code. |
Likely a problem with Azure Cosmos DB. |
SLA violated | StorePhysicalAddress have the same partition ID, but different replica IDs, with no failure status code. |
Likely a problem with Azure Cosmos DB. |
SLA violated | StorePhysicalAddress is random, with no failure status code. |
Points to a problem with the machine. |
For multiple store results for a single request, be aware of the following:
StoreResult
. The SDK retries automatically on multiple different transient failures. The SDK is constantly improved to cover more scenarios.Show the time for the different stages of sending and receiving a request in the transport layer.
ChannelAcquisitionStarted
: The time to get or create a new connection. Connections can be created for numerous reasons such as: The previous connection was closed due to inactivity using CosmosClientOptions.IdleTcpConnectionTimeout, the volume of concurrent requests exceeds the CosmosClientOptions.MaxRequestsPerTcpConnection, the connection was closed due to a network error, or the application is not following the Singleton pattern and new instances are constantly created. Once a connection is established, it is reused for subsequent requests, so this should not impact P99 latency unless the previously mentioned issues are happening.Pipelined
: The time spent writing the request into the TCP socket. Request can only be written on a TCP socket one at a time, a large value indicates a bottleneck on the TCP socket which is commonly associated with locked threads by the application code or large requests size.Transit time
: The time spent on the network after the request was written on the TCP socket. Compare this number to the BELatencyInMs
. If BELatencyInMs
is small, then the time was spent on the network, and not on the Azure Cosmos DB service. If the request failed with a timeout, it indicates how long the client waited without response, and the source is network latency.Received
: The time between the response was received and processed by the SDK. A large value is normally caused by a thread starvation or locked threads.Information about a particular backend server. The SDK can open multiple connections to a single backend server depending upon the number of pending requests and the MaxConcurrentRequestsPerConnection.
inflightRequests
The number of pending requests to a backend server (maybe from different partitions). A high number may lead to more traffic and higher latencies.openConnections
is the total Number of connections open to a single backend server. This can be useful to show SNAT port exhaustion if this number is very high.Information about the particular connection (new or old) the request get's assigned to.
waitforConnectionInit
: The current request was waiting for new connection initialization to complete. This will lead to higher latencies.callsPendingReceive
: Number of calls that was pending receive before this call was sent. A high number can show us that there were many calls before this call and it may lead to higher latencies. If this number is high it points to a head of line blocking issue possibly caused by another request like query or feed operation that is taking a long time to process. Try lowering the CosmosClientOptions.MaxRequestsPerTcpConnection to increase the number of channels.LastSentTime
: Time of last request that was sent to this server. This along with LastReceivedTime can be used to see connectivity or endpoint issues. For example if there are many receive timeouts, Sent time will be much larger than the Receive time.lastReceive
: Time of last request that was received from this serverlastSendAttempt
: Time of the last send attemptrequestSizeInBytes
: The total size of the request sent to Azure Cosmos DBresponseMetadataSizeInBytes
: The size of headers returned from Azure Cosmos DBresponseBodySizeInBytes
: The size of content returned from Azure Cosmos DB"StoreResult": {
"ActivityId": "bab6ade1-b8de-407f-b89d-fa2138a91284",
"StatusCode": "Ok",
"SubStatusCode": "Unknown",
"LSN": 453362,
"PartitionKeyRangeId": "1",
"GlobalCommittedLSN": 0,
"ItemLSN": 453358,
"UsingLocalLSN": true,
"QuorumAckedLSN": -1,
"SessionToken": "-1#453362",
"CurrentWriteQuorum": -1,
"CurrentReplicaSetSize": -1,
"NumberOfReadRegions": 0,
"IsValid": true,
"StorePhysicalAddress": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer92/partitions/a4cb49a8-38c8-11e6-8106-8cdcd42c33be/replicas/1s/",
"RequestCharge": 1,
"RetryAfterInMs": null,
"BELatencyInMs": "0.304",
"transportRequestTimeline": {
"requestTimeline": [
{
"event": "Created",
"startTimeUtc": "2022-05-25T12:03:36.3081190Z",
"durationInMs": 0.0024
},
{
"event": "ChannelAcquisitionStarted",
"startTimeUtc": "2022-05-25T12:03:36.3081214Z",
"durationInMs": 0.0132
},
{
"event": "Pipelined",
"startTimeUtc": "2022-05-25T12:03:36.3081346Z",
"durationInMs": 0.0865
},
{
"event": "Transit Time",
"startTimeUtc": "2022-05-25T12:03:36.3082211Z",
"durationInMs": 1.3324
},
{
"event": "Received",
"startTimeUtc": "2022-05-25T12:03:36.3095535Z",
"durationInMs": 12.6128
},
{
"event": "Completed",
"startTimeUtc": "2022-05-25T12:03:36.8621663Z",
"durationInMs": 0
}
],
"serviceEndpointStats": {
"inflightRequests": 1,
"openConnections": 1
},
"connectionStats": {
"waitforConnectionInit": "False",
"callsPendingReceive": 0,
"lastSendAttempt": "2022-05-25T12:03:34.0222760Z",
"lastSend": "2022-05-25T12:03:34.0223280Z",
"lastReceive": "2022-05-25T12:03:34.0257728Z"
},
"requestSizeInBytes": 447,
"responseMetadataSizeInBytes": 438,
"responseBodySizeInBytes": 604
},
"TransportException": null
}
Contact Azure support.
Events
Mar 17, 9 PM - Mar 21, 10 AM
Join the meetup series to build scalable AI solutions based on real-world use cases with fellow developers and experts.
Register nowTraining
Module
Monitor responses and events in Azure Cosmos DB for NoSQL - Training
We will learn to use a rich set of REST response codes returned by Azure Cosmos DB request to help you analyze potential issues.
Certification
Microsoft Certified: Azure Cosmos DB Developer Specialty - Certifications
Write efficient queries, create indexing policies, manage, and provision resources in the SQL API and SDK with Microsoft Azure Cosmos DB.
Documentation
Diagnose and troubleshoot issues when using Azure Cosmos DB .NET SDK
Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues when using .NET SDK.
Troubleshoot Azure Cosmos DB HTTP 408 or request timeout issues with the .NET SDK
Learn how to diagnose and fix .NET SDK request timeout exceptions.
Azure Cosmos DB performance tips for .NET SDK v2
Learn client configuration options to improve Azure Cosmos DB .NET v2 SDK performance.