Troubleshoot issues when you use Azure Cosmos DB Java SDK v4 with API for NoSQL accounts

APPLIES TO: NoSQL

Important

This article covers troubleshooting for Azure Cosmos DB Java SDK v4 only. Please see the Azure Cosmos DB Java SDK v4 Release notes, Maven repository, and performance tips for more information. If you're currently using an older version than v4, see the Migrate to Azure Cosmos DB Java SDK v4 guide for help upgrading to v4.

This article covers common issues, workarounds, diagnostic steps, and tools when you use Azure Cosmos DB Java SDK v4 with Azure Cosmos DB for NoSQL accounts. Azure Cosmos DB Java SDK v4 provides client-side logical representation to access the Azure Cosmos DB for NoSQL. This article describes tools and approaches to help you if you run into any issues.

Start with this list:

  • Take a look at the Common issues and workarounds section in this article.
  • Look at the Java SDK in the Azure Cosmos DB central repo, which is available open source on GitHub. It has an issues section that's actively monitored. Check to see if any similar issue with a workaround is already filed. One helpful tip is to filter issues by the *cosmos:v4-item* tag.
  • Review the performance tips for Azure Cosmos DB Java SDK v4, and follow the suggested practices.
  • Read the rest of this article, if you didn't find a solution. Then file a GitHub issue. If there's an option to add tags to your GitHub issue, add a *cosmos:v4-item* tag.

Capture the diagnostics

Database, container, item, and query responses in the Java V4 SDK have a Diagnostics property. This property records all the information related to the single request, including if there were retries or any transient failures.

The Diagnostics are returned as a string. The string changes with each version as it is improved to better troubleshooting different scenarios. With each version of the SDK, the string might break its format. Don't parse the string to avoid breaking changes.

The following code sample shows how to read diagnostic logs using the Java V4 SDK:

Important

We recommend validating the minimum recommended version of the Java V4 SDK and ensure you're using this version or higher. You can check recommended version here.

Database Operations

CosmosDatabaseResponse databaseResponse = client.createDatabaseIfNotExists(databaseName);
CosmosDiagnostics diagnostics = databaseResponse.getDiagnostics();
logger.info("Create database diagnostics : {}", diagnostics); 

Container Operations

CosmosContainerResponse containerResponse = database.createContainerIfNotExists(containerProperties,
                  throughputProperties);
CosmosDiagnostics diagnostics = containerResponse.getDiagnostics();
logger.info("Create container diagnostics : {}", diagnostics);

Item Operations

// Write Item
CosmosItemResponse<Family> item = container.createItem(family, new PartitionKey(family.getLastName()),
                    new CosmosItemRequestOptions());
        
CosmosDiagnostics diagnostics = item.getDiagnostics();
logger.info("Create item diagnostics : {}", diagnostics);
        
// Read Item
CosmosItemResponse<Family> familyCosmosItemResponse = container.readItem(documentId,
                    new PartitionKey(documentLastName), Family.class);
        
CosmosDiagnostics diagnostics = familyCosmosItemResponse.getDiagnostics();
logger.info("Read item diagnostics : {}", diagnostics);

Query Operations

String sql = "SELECT * FROM c WHERE c.lastName = 'Witherspoon'";
        
CosmosPagedIterable<Family> filteredFamilies = container.queryItems(sql, new CosmosQueryRequestOptions(),
                    Family.class);
        
//  Add handler to capture diagnostics
filteredFamilies = filteredFamilies.handle(familyFeedResponse -> {
    logger.info("Query Item diagnostics through handle : {}", 
    familyFeedResponse.getCosmosDiagnostics());
});
        
//  Or capture diagnostics through iterableByPage() APIs.
filteredFamilies.iterableByPage().forEach(familyFeedResponse -> {
    logger.info("Query item diagnostics through iterableByPage : {}",
    familyFeedResponse.getCosmosDiagnostics());
});

Azure Cosmos DB Exceptions

try {
  CosmosItemResponse<Family> familyCosmosItemResponse = container.readItem(documentId,
                    new PartitionKey(documentLastName), Family.class);
} catch (CosmosException ex) {
  CosmosDiagnostics diagnostics = ex.getDiagnostics();
  logger.error("Read item failure diagnostics : {}", diagnostics);
}

Logging the diagnostics

Java V4 SDK versions v4.43.0 and above support automatic logging of Cosmos Diagnostics for all requests or errors if they meet certain criteria. Application developers can define thresholds for latency (for point (create, read, replace, upsert, patch) or non-point operations (query, change feed, bulk and batch)), request charge and payload size. If the requests exceed these defined thresholds, the cosmos diagnostics for those requests will be emitted automatically.

By default, the Java v4 SDK logs these diagnostics automatically in a specific format. However, this can be changed by implementing CosmosDiagnosticsHandler interface and providing your own custom Diagnostics Handler.

These CosmosDiagnosticsThresholds and CosmosDiagnosticsHandler can then be used in CosmosClientTelemetryConfig object, which should be passed into CosmosClientBuilder while creating sync or async client.

NOTE: These diagnostics thresholds are applied across different types of diagnostics including logging, tracing and client telemetry.

The following code samples show how to define diagnostics thresholds, custom diagnostics logger and use them through client telemetry config:

Defining custom Diagnostics Thresholds

//  Create diagnostics threshold
CosmosDiagnosticsThresholds cosmosDiagnosticsThresholds = new CosmosDiagnosticsThresholds();
//  These thresholds are for demo purposes
//  NOTE: Do not use the same thresholds for production
cosmosDiagnosticsThresholds.setPayloadSizeThreshold(100_00);
cosmosDiagnosticsThresholds.setPointOperationLatencyThreshold(Duration.ofSeconds(1));
cosmosDiagnosticsThresholds.setNonPointOperationLatencyThreshold(Duration.ofSeconds(5));
cosmosDiagnosticsThresholds.setRequestChargeThreshold(100f);

Defining custom Diagnostics Handler

//  By default, DEFAULT_LOGGING_HANDLER can be used
CosmosDiagnosticsHandler cosmosDiagnosticsHandler = CosmosDiagnosticsHandler.DEFAULT_LOGGING_HANDLER;

//  App developers can also define their own diagnostics handler
cosmosDiagnosticsHandler = new CosmosDiagnosticsHandler() {
    @Override
    public void handleDiagnostics(CosmosDiagnosticsContext diagnosticsContext, Context traceContext) {
        logger.info("This is custom diagnostics handler: {}", diagnosticsContext.toJson());
    }
};

Defining CosmosClientTelemetryConfig

//  Create Client Telemetry Config
CosmosClientTelemetryConfig cosmosClientTelemetryConfig =
    new CosmosClientTelemetryConfig();
cosmosClientTelemetryConfig.diagnosticsHandler(cosmosDiagnosticsHandler);
cosmosClientTelemetryConfig.diagnosticsThresholds(cosmosDiagnosticsThresholds);

//  Create sync client
CosmosClient client = new CosmosClientBuilder()
    .endpoint(AccountSettings.HOST)
    .key(AccountSettings.MASTER_KEY)
    .clientTelemetryConfig(cosmosClientTelemetryConfig)
    .buildClient();

Retry design

See our guide to designing resilient applications with Azure Cosmos DB SDKs for guidance on how to design resilient applications and learn which are the retry semantics of the SDK.

Common issues and workarounds

Check the portal metrics

Checking the portal metrics will help determine if it's a client-side issue or if there's an issue with the service. For example, if the metrics contain a high rate of rate-limited requests (HTTP status code 429) which means the request is getting throttled then check the Request rate too large section.

Network issues, Netty read timeout failure, low throughput, high latency

General suggestions

For best performance:

  • Make sure the app is running on the same region as your Azure Cosmos DB account.
  • Check the CPU usage on the host where the app is running. If CPU usage is 50 percent or more, run your app on a host with a higher configuration. Or you can distribute the load on more machines.

Connection throttling

Connection throttling can happen because of either a connection limit on a host machine or Azure SNAT (PAT) port exhaustion.

Connection limit on a host machine

Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections, too. Run the following command.

ulimit -a

The number of max allowed open files, which are identified as "nofile," needs to be at least double your connection pool size. For more information, see the Azure Cosmos DB Java SDK v4 performance tips.

Azure SNAT (PAT) port exhaustion

If your app is deployed on Azure Virtual Machines without a public IP address, by default Azure SNAT ports establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the Azure SNAT configuration.

Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid Azure SNAT limitation:

  • Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see Azure Virtual Network service endpoints.

    When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using Virtual Network ACLs.

  • Assign a public IP to your Azure VM.

Can't reach the Service - firewall

ConnectTimeoutException indicates that the SDK can't reach the service. You may get a failure similar to the following when using the direct mode:

GoneException{error=null, resourceAddress='https://cdb-ms-prod-westus-fd4.documents.azure.com:14940/apps/e41242a5-2d71-5acb-2e00-5e5f744b12de/services/d8aa21a5-340b-21d4-b1a2-4a5333e7ed8a/partitions/ed028254-b613-4c2a-bf3c-14bd5eb64500/replicas/131298754052060051p//', statusCode=410, message=Message: The requested resource is no longer available at the server., getCauseInfo=[class: class io.netty.channel.ConnectTimeoutException, message: connection timed out: cdb-ms-prod-westus-fd4.documents.azure.com/101.13.12.5:14940]

If you have a firewall running on your app machine, open port range 10,000 to 20,000, which are used by the direct mode. Also follow the Connection limit on a host machine.

UnknownHostException

UnknownHostException means that the Java framework can't resolve the DNS entry for the Azure Cosmos DB endpoint in the affected machine. You should verify that the machine can resolve the DNS entry or if you have any custom DNS resolution software (such as VPN or Proxy, or a custom solution), make sure it contains the right configuration for the DNS endpoint that the error is claiming can't be resolved. If the error is constant, you can verify the machine's DNS resolution through a curl command to the endpoint described in the error.

HTTP proxy

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK ConnectionPolicy. Otherwise, you face connection issues.

Invalid coding pattern: Blocking Netty IO thread

The SDK uses the Netty IO library to communicate with Azure Cosmos DB. The SDK has an Async API and uses non-blocking IO APIs of Netty. The SDK's IO work is performed on IO Netty threads. The number of IO Netty threads is configured to be the same as the number of CPU cores of the app machine.

The Netty IO threads are meant to be used only for non-blocking Netty IO work. The SDK returns the API invocation result on one of the Netty IO threads to the app's code. If the app performs a long-lasting operation after it receives results on the Netty thread, the SDK might not have enough IO threads to perform its internal IO work. Such app coding might result in low throughput, high latency, and io.netty.handler.timeout.ReadTimeoutException failures. The workaround is to switch the thread when you know the operation takes time.

For example, take a look at the following code snippet, which adds items to a container (look here for guidance on setting up the database and container.) You might perform long-lasting work that takes more than a few milliseconds on the Netty thread. If so, you eventually can get into a state where no Netty IO thread is present to process IO work. As a result, you get a ReadTimeoutException failure.

Java SDK V4 (Maven com.azure::azure-cosmos) Async API


//Bad code with read timeout exception

int requestTimeoutInSeconds = 10;

/* ... */

AtomicInteger failureCount = new AtomicInteger();
// Max number of concurrent item inserts is # CPU cores + 1
Flux<Family> familyPub =
        Flux.just(Families.getAndersenFamilyItem(), Families.getAndersenFamilyItem(), Families.getJohnsonFamilyItem());
familyPub.flatMap(family -> {
    return container.createItem(family);
}).flatMap(r -> {
    try {
        // Time-consuming work is, for example,
        // writing to a file, computationally heavy work, or just sleep.
        // Basically, it's anything that takes more than a few milliseconds.
        // Doing such operations on the IO Netty thread
        // without a proper scheduler will cause problems.
        // The subscriber will get a ReadTimeoutException failure.
        TimeUnit.SECONDS.sleep(2 * requestTimeoutInSeconds);
    } catch (Exception e) {
    }
    return Mono.empty();
}).doOnError(Exception.class, exception -> {
    failureCount.incrementAndGet();
}).blockLast();
assert(failureCount.get() > 0);

The workaround is to change the thread on which you perform work that takes time. Define a singleton instance of the scheduler for your app.

Java SDK V4 (Maven com.azure::azure-cosmos) Async API

// Have a singleton instance of an executor and a scheduler.
ExecutorService ex  = Executors.newFixedThreadPool(30);
Scheduler customScheduler = Schedulers.fromExecutor(ex);

You might need to do work that takes time, for example, computationally heavy work or blocking IO. In this case, switch the thread to a worker provided by your customScheduler by using the .publishOn(customScheduler) API.

Java SDK V4 (Maven com.azure::azure-cosmos) Async API

container.createItem(family)
        .publishOn(customScheduler) // Switches the thread.
        .subscribe(
                // ...
        );

By using publishOn(customScheduler), you release the Netty IO thread and switch to your own custom thread provided by the custom scheduler. This modification solves the problem. You won't get a io.netty.handler.timeout.ReadTimeoutException failure anymore.

Request rate too large

This failure is a server-side failure. It indicates that you consumed your provisioned throughput. Retry later. If you get this failure often, consider an increase in the collection throughput.

  • Implement backoff at getRetryAfterInMilliseconds intervals

    During performance testing, you should increase load until a small rate of requests get throttled. If throttled, the client application should backoff for the server-specified retry interval. Respecting the backoff ensures that you spend minimal amount of time waiting between retries.

Error handling from Java SDK Reactive Chain

Error handling from Azure Cosmos DB Java SDK is important when it comes to client's application logic. There are different error handling mechanisms provided by reactor-core framework which can be used in different scenarios. We recommend customers to understand these error handling operators in detail and use the ones which fit their retry logic scenarios the best.

Important

We do not recommend using onErrorContinue() operator, as it is not supported in all scenarios. Note that onErrorContinue() is a specialist operator that can make the behaviour of your reactive chain unclear. It operates on upstream, not downstream operators, it requires specific operator support to work, and the scope can easily propagate upstream into library code that didn't anticipate it (resulting in unintended behaviour.). Please refer to documentation of onErrorContinue() for more details on this special operator.

Failure connecting to Azure Cosmos DB Emulator

The Azure Cosmos DB Emulator HTTPS certificate is self-signed. For the SDK to work with the emulator, import the emulator certificate to a Java TrustStore. For more information, see Export Azure Cosmos DB Emulator certificates.

Dependency Conflict Issues

The Azure Cosmos DB Java SDK pulls in many dependencies; generally speaking, if your project dependency tree includes an older version of an artifact that Azure Cosmos DB Java SDK depends on, this may result in unexpected errors being generated when you run your application. If you're debugging why your application unexpectedly throws an exception, it's a good idea to double-check that your dependency tree is not accidentally pulling in an older version of one or more of the Azure Cosmos DB Java SDK dependencies.

The workaround for such an issue is to identify which of your project dependencies brings in the old version and exclude the transitive dependency on that older version, and allow Azure Cosmos DB Java SDK to bring in the newer version.

To identify which of your project dependencies brings in an older version of something that Azure Cosmos DB Java SDK depends on, run the following command against your project pom.xml file:

mvn dependency:tree

For more information, see the maven dependency tree guide.

Once you know which dependency of your project depends on an older version, you can modify the dependency on that lib in your pom file and exclude the transitive dependency, following the example below (which assumes that reactor-core is the outdated dependency):

<dependency>
  <groupId>${groupid-of-lib-which-brings-in-reactor}</groupId>
  <artifactId>${artifactId-of-lib-which-brings-in-reactor}</artifactId>
  <version>${version-of-lib-which-brings-in-reactor}</version>
  <exclusions>
    <exclusion>
      <groupId>io.projectreactor</groupId>
      <artifactId>reactor-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

For more information, see the exclude transitive dependency guide.

Enable client SDK logging

Azure Cosmos DB Java SDK v4 uses SLF4j as the logging facade that supports logging into popular logging frameworks such as log4j and logback.

For example, if you want to use log4j as the logging framework, add the following libs in your Java classpath.

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>${log4j.version}</version>
</dependency>

Also add a log4j config.

# this is a sample log4j configuration

# Set root logger level to INFO and its only appender to A1.
log4j.rootLogger=INFO, A1

log4j.category.com.azure.cosmos=INFO
#log4j.category.io.netty=OFF
#log4j.category.io.projectreactor=OFF
# A1 is set to be a ConsoleAppender.
log4j.appender.A1=org.apache.log4j.ConsoleAppender

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d %5X{pid} [%t] %-5p %c - %m%n

For more information, see the sfl4j logging manual.

OS network statistics

Run the netstat command to get a sense of how many connections are in states such as ESTABLISHED and CLOSE_WAIT.

On Linux, you can run the following command.

netstat -nap

On Windows, you can run the same command with different argument flags:

netstat -abn

Filter the result to only connections to the Azure Cosmos DB endpoint.

The number of connections to the Azure Cosmos DB endpoint in the ESTABLISHED state can't be greater than your configured connection pool size.

Many connections to the Azure Cosmos DB endpoint might be in the CLOSE_WAIT state. There might be more than 1,000. A number that high indicates that connections are established and torn down quickly. This situation potentially causes problems. For more information, see the Common issues and workarounds section.

Common query issues

The query metrics will help determine where the query is spending most of the time. From the query metrics, you can see how much of it's being spent on the back-end vs the client. Learn more on the query performance guide.

Next steps