Relationship between indexes and partitions

alex palmer 1 Reputation point
2021-08-10T14:04:49.097+00:00

Hi

I'm a SQL DBA just looking at Cosmos DB for a particular solution. I'm trying to get my head around the impact of partitioning on indexing.

It's been suggested that we use a unique Id for the partitioning key, effectively creating logical partitions with a single document each. Some of the partitioning documents I've read suggest that if you then query on a property that isn't the partitioning key, all partitions will need to be queried. We anticipate 50 million documents in the system and so, if true, that would be a lot of RUs.

Would the above only be true if that property wasn't indexed? If the "index every property" field is ticked do I not have to worry about it? Have I completely got the wrong end of the stick?

Cheers

Alex

Azure Cosmos DB

2 answers

  1. Michael Taylor 55,481 Reputation points
    2021-08-10T14:45:04.563+00:00

    There is an excellent training course from MS on Cosmos DB partitioning and indexing and how to optimize it here. Here is a summary, as I understand it.

    All data stored in Cosmos DB is stored in a partition. Partitions divide the data into buckets; I like to think of them as "pages of data". A logical partition is limited to 20 GB of data. Queries within a partition are generally fast and have low RU costs (which is what you pay for), so in a perfect world your queries will only ever pull data from a single partition. It is therefore important to choose a partition key that groups enough related data to satisfy a typical query but will not grow beyond the partition size limit. Once you set a partition key you cannot change it without rebuilding the container, so you need to think about it early. Examples of good partition keys for typical scenarios would be the customer for orders (assuming a customer's orders stay under 20 GB) or the order date. Note that partitions are not relational tables, so an order document would likely include the items in the order as well. So partitions are how the data is broken up.
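To make the single-partition versus cross-partition distinction concrete, here is a toy Python model. It is not the Cosmos DB SDK: the fixed count of four physical partitions and the SHA-256 based hash are assumptions invented for illustration (the real service chooses its own hashing and partition count), and `customerId` and `status` are hypothetical property names.

```python
import hashlib
from collections import defaultdict

PHYSICAL_PARTITIONS = 4  # assumed for the sketch; the service manages this itself


def physical_partition(key_value: str) -> int:
    """Map a partition key value to a physical partition (toy hash, not the real one)."""
    digest = hashlib.sha256(key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % PHYSICAL_PARTITIONS


def load(docs, partition_key):
    """Place each document on a physical partition according to its key value."""
    store = defaultdict(list)
    for doc in docs:
        store[physical_partition(doc[partition_key])].append(doc)
    return store


def run_query(store, partition_key, prop, value):
    """Return (matching docs, number of physical partitions contacted)."""
    if prop == partition_key:
        targets = [physical_partition(value)]        # routed to one partition
    else:
        targets = list(range(PHYSICAL_PARTITIONS))   # fan-out to every partition
    hits = [d for p in targets for d in store.get(p, []) if d[prop] == value]
    return hits, len(targets)


# 100 hypothetical orders spread over 10 customers.
orders = [
    {"id": f"o{i}", "customerId": f"c{i % 10}", "status": "open" if i % 3 else "shipped"}
    for i in range(100)
]
store = load(orders, "customerId")

# Filter on the partition key: one partition is contacted.
by_key, touched_by_key = run_query(store, "customerId", "customerId", "c3")

# Filter on another property: every partition is contacted.
by_status, touched_by_status = run_query(store, "customerId", "status", "shipped")
```

In this model the fan-out grows with the number of physical partitions, not with the number of distinct key values. As far as I understand the service, the same holds there, so 50 million single-document logical partitions would not by itself mean 50 million lookups.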

    An index works much as it does in a relational database. Properties may be indexed to speed up queries that filter on them. You should have indexes on the properties that you will commonly filter against. Note that by default Cosmos DB indexes every property automatically, so in practice tuning means adjusting the indexing policy to include or exclude paths. Indexes can be adjusted over the life of the database. Indexes speed up queries that filter on them but slow down writes, so the more indexes you define the faster your filtered queries will run but the longer (and more expensive in RUs) your writes will be. Therefore you should index only the properties that you expect to include in filters. Examples might include customer IDs and item IDs from orders.
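As a sketch of what "index only what you filter on" could look like, the snippet below builds an indexing policy document in Python. The JSON shape (`indexingMode`, `includedPaths`, `excludedPaths`, and the `/?`, `/[]`, `/*` path suffixes) follows the Cosmos DB indexing policy format as I understand it; the helper `selective_policy` and the property paths are hypothetical.

```python
import json


def selective_policy(indexed_paths):
    """Build a policy that indexes only the given paths and excludes the rest."""
    return {
        "indexingMode": "consistent",
        "automatic": True,
        # Paths you expect to filter on.
        "includedPaths": [{"path": p} for p in indexed_paths],
        # Everything else is left unindexed to keep writes cheaper.
        "excludedPaths": [{"path": "/*"}],
    }


# Hypothetical order document: customerId at the root, itemId inside an items array.
policy = selective_policy(["/customerId/?", "/items/[]/itemId/?"])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON is the kind of document you would paste into the container's indexing policy setting. The default policy is the opposite: `/*` included, with exclusions added as needed.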

    As mentioned earlier, you pay for the RUs you use. Queries that have to scan more data (because of missing or poorly chosen indexes) take more RUs and therefore cost more. Queries that have to pull data across partitions take more RUs as well. In a perfect world none of your queries will filter on unindexed properties, nor will they pull data across partitions. In reality this is unlikely, so you should aim for your most commonly used queries to return results from a single partition (for example, querying by customer) and to filter only on indexed properties.
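In application code the difference shows up in how the query is issued. The sketch below assumes a container object with a `query_items` method that accepts `query`, `parameters`, `partition_key`, and `enable_cross_partition_query`, which is how I recall the azure-cosmos Python SDK (v4) `ContainerProxy`; check the SDK reference before relying on it. The function names and the `customerId` / `itemId` properties are hypothetical.

```python
from typing import Any


def orders_for_customer(container: Any, customer_id: str) -> list:
    """Single-partition query: the partition key value is supplied,
    so the request can be routed to one partition."""
    items = container.query_items(
        query="SELECT * FROM c WHERE c.customerId = @cid",
        parameters=[{"name": "@cid", "value": customer_id}],
        partition_key=customer_id,
    )
    return list(items)


def orders_with_item(container: Any, item_id: str) -> list:
    """Cross-partition query: no partition key value is known,
    so the request has to fan out and typically costs more RUs."""
    items = container.query_items(
        query="SELECT * FROM c WHERE c.itemId = @iid",
        parameters=[{"name": "@iid", "value": item_id}],
        enable_cross_partition_query=True,
    )
    return list(items)
```

Both functions use parameterized queries, and only the first passes a partition key. Comparing the RU charge of the two against real data is the most reliable way to see what the fan-out costs for a given workload.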

    As a final note if you're using a partition based upon order ID then an index on order ID isn't needed because the partition is already "filtered" by it. So there is some overlap in filtering but they ultimately serve slightly different purposes.


  2. alex palmer 1 Reputation point
    2021-08-10T14:59:06.94+00:00

    Thanks for that, that is kind of my understanding.

    Presumably if a property is indexed then the partitionId\itemID is returned so the database engine can go straight to the correct partition, in the same way that a SQL index would return a primary key or RID?

