High Partition Utilization & Potential Hot Partition Issue in Serverless CosmosDB (High Normalized RU Consumption AVG alert)

Nizar 0 Reputation points
2025-02-19T14:49:33.3966667+00:00

We are experiencing hot partition issues in our Azure CosmosDB (serverless mode). Some partitions are consistently at 100% utilization, while others remain at 0%. This is leading to potential performance bottlenecks and repeated alerts.

Key Details:

  • CosmosDB Mode: Serverless
  • Impact:
    • High utilization on specific partitions (100%).
    • Triggered the alert multiple times due to exceeding the 80% threshold, reaching values like 82.57% and higher.

  • How can we mitigate hot partitions in serverless CosmosDB?
  • How can we fix this issue?

Azure Cosmos DB
An Azure NoSQL database service for app development.

1 answer

  1. Mallaiah Sangi 1,145 Reputation points Microsoft External Staff Moderator
    2025-02-19T21:42:43.7433333+00:00

    Hi @Nizar,

    Thanks for the question and for using the Microsoft Q&A platform.

    As I understand it, you are seeing performance issues caused by a potential hot partition in serverless Cosmos DB.

    In Azure Cosmos DB, high partition utilization and hot partitions typically occur when a few partition key values receive far more requests than the rest, so the physical partitions that host them carry most of the load. This is especially visible in serverless accounts: there is no provisioned throughput setting to raise, and each physical partition has an upper limit on the RU/s it can serve. One partition can therefore sit at 100% utilization while others stay near 0%, which matches what you describe.

    The challenge of hot partitions

    When one partition in a distributed system receives many more read and write requests than others, it’s called a hot partition. This imbalance can create a performance bottleneck, causing latency issues and impacting system performance.

    To avoid this issue, consider a composite (synthetic) partition key, for example one that combines MetricType and EventTime. Because the key value changes with both the metric type and the time bucket, writes for one busy metric type are spread over many logical partitions instead of accumulating in a single one. This helps keep the workload balanced across the system.
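    As a rough illustration of the idea, the sketch below builds such a key in application code before the document is written. The helper name `synthetic_partition_key`, the `partitionKey` property, and the hourly bucket size are assumptions made for this example, not something taken from your schema; a naive (timezone-less) datetime would be interpreted as local time.

```python
from datetime import datetime, timezone


def synthetic_partition_key(metric_type: str, event_time: datetime) -> str:
    """Combine the metric type with an hourly UTC bucket (hypothetical scheme)."""
    bucket = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return f"{metric_type}_{bucket}"


# Example document; the field names are illustrative only.
doc = {
    "id": "reading-001",
    "MetricType": "cpu",
    "EventTime": "2025-02-19T14:49:33+00:00",
}
doc["partitionKey"] = synthetic_partition_key(
    doc["MetricType"], datetime.fromisoformat(doc["EventTime"])
)
print(doc["partitionKey"])  # cpu_2025-02-19T14
```

    A smaller bucket spreads writes over more logical partitions, but queries that span a long time range then have to touch more key values, so the bucket size is a trade-off to tune against your query patterns.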

    Managing data distribution in Cosmos DB

    Inefficient data distribution increases costs, as hot partitions consume excessive RUs. Reduced availability is a risk if you don’t adequately distribute data and replicate it across regions, impacting resilience during regional outages. Workload skew and underutilization can also lead to wasted resources.

    To avoid these issues, you should try to ensure even data distribution, avoid hot partitions, and optimize resource utilization in Cosmos DB. These strategies allow you to maintain optimal performance, scalability, availability, and cost-efficiency within your database.

    Practical solutions and best practices for effective partitioning

    1. Reevaluate Partition Key Strategy

    • Skewed Distribution: If you're seeing a high consumption of RUs on certain partitions, your partition key may be improperly distributing traffic. Consider whether your current partition key is causing traffic to concentrate on a few partitions rather than distributing it evenly.
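    • Adjust Partition Key: If possible, choose a partition key that distributes the load more evenly. A good partition key has high cardinality and spreads both storage and request volume uniformly across partitions. For example, avoid a coarse timestamp (such as the current date) as the key, because all writes at a given moment then land on the same logical partition. High-cardinality values such as GUIDs generally distribute writes well, although queries that do not filter on the key then fan out across partitions.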
    • Composite Partition Keys: If the traffic is related to multiple dimensions (e.g., region and customer ID), using a composite partition key might help distribute load more evenly.
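    One way to check for skew before changing anything is to take a sample of recent documents or requests and count how often each partition key value appears. The sketch below is a minimal, SDK-independent version of that check; `key_skew` and the sample values are hypothetical. It measures skew across logical key values only, since the mapping of keys to physical partitions is handled by the service.

```python
from collections import Counter


def key_skew(partition_key_values):
    """Summarize how concentrated a sample of partition key values is."""
    counts = Counter(partition_key_values)
    if not counts:
        raise ValueError("sample is empty")
    total = sum(counts.values())
    top_key, top_count = counts.most_common(1)[0]
    return {
        "distinct_keys": len(counts),
        "top_key": top_key,
        "top_share": top_count / total,  # fraction of the sample on the busiest key
    }


# Illustrative sample: 8 of 10 requests hit the same key value.
sample = ["tenant-a"] * 8 + ["tenant-b", "tenant-c"]
print(key_skew(sample))
# {'distinct_keys': 3, 'top_key': 'tenant-a', 'top_share': 0.8}
```

    A small number of distinct keys or a high `top_share` suggests that the key itself, rather than overall traffic volume, is what concentrates the load.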

    2. Scale the Throughput (if applicable)

    • In serverless mode there is no throughput setting to raise: you are billed for the RUs you consume, and each container and partition can only serve up to a service-defined limit. It is still worth watching for spikes in RU consumption. Keep an eye on:
      • Throughput bottlenecks: If traffic to a partition outgrows what serverless can serve, you may need to move to a provisioned throughput model for better control over RUs.
      • For provisioned throughput accounts, consider raising the RU/s setting; serverless accounts have no equivalent setting to adjust.

    3. Leverage Data Migration (if needed)

    • Data Redistribution: If you're finding that the current partition key doesn't work well due to uneven load, and you're unable to adjust traffic patterns, you may need to migrate data to a new partition key. While this can be disruptive, in some cases it’s necessary to ensure better scaling and reduce hot partition issues.
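    If a migration is needed, one commonly described option is to append a calculated suffix to the existing key so that a single hot value is split across several logical partitions. The sketch below only shows the re-keying step on plain dictionaries; reading from the old container and writing to the new one are left out. The function name, the `partitionKey` target field, and the bucket count of 10 are assumptions for illustration.

```python
import hashlib


def with_suffixed_key(doc, source_field, buckets=10, target_field="partitionKey"):
    """Return a copy of doc whose new key is '<old value>-<bucket>'.

    The bucket is derived from a hash of the document id, so the same
    document always maps to the same suffix (hypothetical scheme).
    """
    digest = hashlib.sha256(doc["id"].encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % buckets
    rekeyed = dict(doc)  # leave the original document untouched
    rekeyed[target_field] = f"{doc[source_field]}-{suffix}"
    return rekeyed


old_doc = {"id": "order-123", "tenantId": "tenant-a"}
new_doc = with_suffixed_key(old_doc, "tenantId")
print(new_doc["partitionKey"])  # "tenant-a-" followed by a bucket number 0..9
```

    The trade-off is on the read side: a point read can recompute the suffix from the id, but a query for everything under `tenant-a` now has to cover all buckets.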

    4. Use Cosmos DB’s "Hot Partition" Alerting

    • You can create Azure Monitor alerts on the Cosmos DB metrics related to RU consumption and partition utilization. Splitting the Normalized RU Consumption metric by PartitionKeyRangeId shows which partition is hot. Set thresholds on that metric or on request latency to receive notifications when the system is approaching or exceeding its normal operating range.
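    Once the metric has been exported per partition key range (from the portal, a log query, or any other source), flagging the hot ranges is a simple filter. The sketch below assumes the values are already available as a dictionary of range id to percentage; the function name and the sample numbers are illustrative, with the 80% threshold taken from your alert.

```python
def hot_partitions(normalized_ru_by_range, threshold=80.0):
    """Return range ids above the threshold, busiest first."""
    return sorted(
        (rid for rid, pct in normalized_ru_by_range.items() if pct > threshold),
        key=lambda rid: normalized_ru_by_range[rid],
        reverse=True,
    )


# Illustrative values: percentage of normalized RU consumption per range id.
sample = {"0": 100.0, "1": 82.57, "2": 0.0, "3": 12.4}
print(hot_partitions(sample))  # ['0', '1']
```

    If the same one or two ranges show up every time, the cause is more likely the partition key than a general traffic spike.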

    5. Serverless Limitations and Best Practices

    • Bursting and Throttling: Cosmos DB in serverless mode is optimized for smaller-scale workloads and can face issues when sudden bursts of traffic occur. If your usage is consistently high, consider switching to a provisioned throughput model, which gives you more control over RU allocations.
    • Use Throughput Scaling and Autoscale: If you're on provisioned throughput, consider using Autoscale to automatically adjust your RU settings based on traffic fluctuations.

    6. Consider Using Multi-Region Replication (for Global Distribution)

    • If your workload is spread across regions, multi-region writes and additional read regions can serve traffic closer to users and take some read load off a busy region. Two caveats apply: serverless accounts currently run in a single region, so this option requires provisioned throughput, and every region uses the same partitioning, so replication by itself does not rebalance a skewed partition key.

    Key Recommendations:

    • Partition Key Choice: Re-evaluate your partition key. Ensure it’s a well-distributed value that can balance traffic.
    • Query Design: Optimize queries to avoid unnecessary scans and inefficient access patterns.
    • Monitor Traffic: Use Cosmos DB metrics and logs to track partition usage and consumption.
    • Review Serverless Model: If traffic spikes are causing issues, consider moving to provisioned throughput for better control over performance.

    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful". If you have any further questions, do let us know.

     

           
