Designing resilient applications with Azure Cosmos DB SDKs

2024-08-14

APPLIES TO: NoSQL

When authoring client applications that interact with Azure Cosmos DB through any of the SDKs, it's important to understand a few key fundamentals. This article is a design guide to help you understand these fundamentals and design resilient applications.

Overview

For a video overview of the concepts discussed in this article, see:

Connectivity modes

Azure Cosmos DB SDKs can connect to the service in two connectivity modes. The .NET and Java SDKs can connect to the service in both Gateway and Direct mode, while the others can only connect in Gateway mode. Gateway mode uses the HTTP protocol and Direct mode typically uses the TCP protocol.

Gateway mode is always used to fetch metadata such as the account, container, and routing information regardless of which mode SDK is configured to use. This information is cached in memory and is used to connect to the service replicas.

In summary, for SDKs in Gateway mode, you can expect HTTP traffic, while for SDKs in Direct mode, you can expect a combination of HTTP and TCP traffic under different circumstances (like initialization, or fetching metadata, or routing information).

Client instances and connections

Regardless of the connectivity mode, it's critical to maintain a Singleton instance of the SDK client per account per application. Connections, both HTTP, and TCP, are scoped to the client instance. Most compute environments have limitations in terms of the number of connections that can be open at the same time. When these limits are reached, connectivity is affected.

Distributed applications and networks

When you design distributed applications, there are three key components:

Your application and the environment it runs on.
The network, which includes any component between your application and the Azure Cosmos DB service endpoint.
The Azure Cosmos DB service endpoint.

When failures occur, they often fall into one of these three areas, and it's important to understand that due to the distributed nature of the system, it's impractical to expect 100% availability for any of these components.

Azure Cosmos DB has a comprehensive set of availability SLAs, but none of them is 100%. The network components that connect your application to the service endpoint can have transient hardware issues and lose packets. Even the compute environment where your application runs could have a CPU spike affecting operations. These failure conditions can affect the operations of the SDKs and normally surface as errors with particular codes.

Your application should be resilient to a certain degree of potential failures across these components by implementing retry policies over the responses provided by the SDKs.

Should my application retry on errors?

The short answer is yes. But not all errors make sense to retry on, some of the error or status codes aren't transient. The table below describes them in detail:

Status Code	Should add retry	SDKs retry	Description
400	No	No	Bad request
401	No	No	Not authorized
403	Optional	No	Forbidden
404	No	No	Resource is not found
408	Yes	Yes	Request timed out
409	No	No	Conflict failure is when the ID and partition key provided for a resource on a write operation has been taken by an existing resource or when a unique key constraint has been violated.
410	Yes	Yes	Gone exceptions (transient failure that shouldn't violate SLA)
412	No	No	Precondition failure is where the operation specified an eTag that is different from the version available at the server. It's an optimistic concurrency error. Retry the request after reading the latest version of the resource and updating the eTag on the request.
413	No	No	Request Entity Too Large
429	Yes	Yes	It's safe to retry on a 429. Review the guide to troubleshoot HTTP 429.
449	Yes	Yes	Transient error that only occurs on write operations, and is safe to retry. This can point to a design issue where too many concurrent operations are trying to update the same object in Azure Cosmos DB.
500	No	No	The operation failed due to an unexpected service error. Contact support by filing an Azure support issue.
503	Yes	Yes	Service unavailable

In the table above, all the status codes marked with Yes on the second column should have some degree of retry coverage in your application.

HTTP 403

The Azure Cosmos DB SDKs don't retry on HTTP 403 failures in general, but there are certain errors associated with HTTP 403 that your application might decide to react to. For example, if you receive an error indicating that a Partition Key is full, you might decide to alter the partition key of the document you're trying to write based on some business rule.

HTTP 429

The Azure Cosmos DB SDKs retry on HTTP 429 errors by default following the client configuration and honoring the service's response x-ms-retry-after-ms header, by waiting the indicated time and retrying after.

When the SDK retries are exceeded, the error is returned to your application. Ideally inspecting the x-ms-retry-after-ms header in the response can be used as a hint to decide how long to wait before retrying the request. Another alternative would be an exponential back-off algorithm or configuring the client to extend the retries on HTTP 429.

HTTP 449

The Azure Cosmos DB SDKs retry on HTTP 449 with an incremental back-off during a fixed period of time to accommodate most scenarios.

When the automatic SDK retries are exceeded, the error is returned to your application. HTTP 449 errors can be safely retried. Because of the highly concurrent nature of write operations, it's better to have a random back-off algorithm to avoid repeating the same degree of concurrency after a fixed interval.

Network timeouts and connectivity failures are among the most common errors. The Azure Cosmos DB SDKs are themselves resilient and will retry timeouts and connectivity issues across the HTTP and TCP protocols if the retry is feasible:

For read operations, the SDKs retry any timeout or connectivity related error.
For write operations, the SDKs do not retry because these operations are not idempotent. When a timeout occurs waiting for the response, it's not possible to know if the request reached the service.

If the account has multiple regions available, the SDKs also attempt a cross-region retry.

Because of the nature of timeouts and connectivity failures, these might not appear in your account metrics, as they only cover failures happening on the service side.

It's recommended for applications to have their own retry policy for these scenarios and take into consideration how to resolve write timeouts. For example, retrying on a Create timeout can yield an HTTP 409 (Conflict) if the previous request did reach the service, but it would succeed if it didn't.

Language specific implementation details

For further implementation details regarding a language see:

Do retries affect my latency?

From the client perspective, any retries affect the end to end latency of an operation. When your application P99 latency is being affected, understanding the retries that are happening and how to address them is important.

Azure Cosmos DB SDKs provide detailed information in their logs and diagnostics that can help identify which retries are taking place. For more information, see how to collect .NET SDK diagnostics and how to collect Java SDK diagnostics.

How can I mitigate retry latency?

Depending on the circumstances, in most cases the SDK route requests to either the local region, the write region (in a single-region write scenario), or the first region in the preferred regions list. This prioritization minimizes latency in healthy scenarios by primarily connecting to the nearest or most optimal data center.

However, this prioritization also means that requests which are going to result in failure will always be tried in one specific region first for a given error scenario. If failover to another region is preferred in that scenario, this is typically handled at the infrastructure (traffic manager) layer rather than at the SDK level. Proper setup and configuration of your infrastructure can ensure that traffic is rerouted efficiently during regional outages, thereby mitigating the latency that can come with cross-region retries in an outage scenario. For more detailed information on setting up infrastructure-level failover, you can refer to Azure Traffic Manager documentation. Some SDKs support implementing similar failover strategies directly at the SDK level. For example, see high availability for Java SDK and .NET SDK.

What about regional outages?

The Azure Cosmos DB SDKs cover regional availability and can perform retries on another account regions. Refer to the multiregional environments retry scenarios and configurations article to understand which scenarios involve other regions.

When to contact customer support

Before contacting customer support, go through these steps:

What is the impact measured in volume of operations affected compared to the operations succeeding? Is it within the service SLAs?
Is the P99 latency affected?
Are the failures related to error codes that my application should retry on and does the application cover such retries?
Are the failures affecting all your application instances or only a subset? When the issue is reduced to a subset of instances, it's commonly a problem related to those instances.
Have you gone through the related troubleshooting documents in the above table to rule out a problem on the application environment?

If all the application instances are affected, or the percentage of affected operations is outside service SLAs, or affecting your own application SLAs and P99s, contact customer support.

Next steps

Learn about multiregional environments retry scenarios and configurations
Review the Availability SLAs
Use the latest .NET SDK
Use the latest Java SDK
Use the latest Python SDK
Use the latest Node SDK

Share via