Sharding models
APPLIES TO: Azure Cosmos DB for PostgreSQL (powered by the Citus database extension to PostgreSQL)
Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It breaks up a large database or dataset into smaller, more manageable parts called shards. Each shard contains a subset of the data, and together the shards form the complete dataset.
Azure Cosmos DB for PostgreSQL offers two types of data sharding: row-based and schema-based. Each option comes with its own tradeoffs, so you can choose the approach that best aligns with your application's requirements.
Row-based sharding
The traditional way in which Azure Cosmos DB for PostgreSQL shards tables is the single database, shared schema model, also known as row-based sharding. In this model, tenants coexist as rows within the same table. The tenant is determined by a distribution column, which allows the table to be split horizontally.
Row-based sharding is the most hardware-efficient way of sharding. Tenants are densely packed and distributed among the nodes in the cluster. However, this approach requires that all tables in the schema have the distribution column and that all queries in the application filter by it. Row-based sharding shines in IoT workloads and when you want to get the most out of your hardware.
Benefits:
- Best performance
- Best tenant density per node
Drawbacks:
- Requires schema modifications
- Requires application query modifications
- All tenants must share the same schema
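As a minimal sketch of the row-based model described above, the following SQL distributes two tables by a tenant ID column and colocates them. The table and column names (`tenants`, `orders`, `tenant_id`) are hypothetical and chosen only for illustration; `create_distributed_table` is the Citus function that distributes a table.

```sql
-- Hypothetical multitenant tables: every table carries the
-- distribution column (tenant_id), including in its primary key.
CREATE TABLE tenants (
    tenant_id bigint PRIMARY KEY,
    name      text NOT NULL
);

CREATE TABLE orders (
    tenant_id bigint NOT NULL,
    order_id  bigint NOT NULL,
    total     numeric(12, 2),
    PRIMARY KEY (tenant_id, order_id)
);

-- Distribute both tables by tenant_id and colocate them, so that
-- all rows for a given tenant are stored on the same node.
SELECT create_distributed_table('tenants', 'tenant_id');
SELECT create_distributed_table('orders', 'tenant_id', colocate_with => 'tenants');

-- Foreign keys between colocated tables must include the distribution column.
ALTER TABLE orders
    ADD CONSTRAINT orders_tenant_fk
    FOREIGN KEY (tenant_id) REFERENCES tenants (tenant_id);

-- Application queries filter by the distribution column, so each
-- query can be routed to the single shard that holds the tenant.
SELECT order_id, total
FROM orders
WHERE tenant_id = 42;
```

The schema and query changes listed under the drawbacks correspond to the `tenant_id` columns and the `WHERE tenant_id = ...` filter in this sketch.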
Schema-based sharding
Available with Citus 12.0 in Azure Cosmos DB for PostgreSQL, schema-based sharding is the shared database, separate schema model. The schema becomes the logical shard within the database. Multitenant apps can use a schema per tenant to easily shard along the tenant dimension. Query changes aren't required, and the application needs only a small modification to set the proper search_path when switching tenants. Schema-based sharding is an ideal solution for microservices, and for ISVs deploying applications that can't undergo the changes required to onboard row-based sharding.
Benefits:
- Tenants can have heterogeneous schemas
- No schema modifications required
- No application query modifications required
- Better SQL compatibility than row-based sharding
Drawbacks:
- Fewer tenants per node compared to row-based sharding
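The schema-per-tenant workflow described above might look like the following sketch. The schema and table names (`tenant_a`, `tenant_b`, `orders`) are hypothetical; `citus.enable_schema_based_sharding` is the Citus 12.0+ setting that makes newly created schemas distributed.

```sql
-- Make schemas created from now on distributed schemas (Citus 12.0+).
SET citus.enable_schema_based_sharding TO on;

-- One schema per tenant; each schema becomes a logical shard.
CREATE SCHEMA tenant_a;
CREATE SCHEMA tenant_b;

-- Tables need no tenant ID column, and tenants can have
-- different table definitions (tenant_b adds a coupon column).
CREATE TABLE tenant_a.orders (
    order_id bigint PRIMARY KEY,
    total    numeric(12, 2)
);

CREATE TABLE tenant_b.orders (
    order_id bigint PRIMARY KEY,
    total    numeric(12, 2),
    coupon   text
);

-- The application switches tenants by setting search_path;
-- the query itself is unchanged from single-node PostgreSQL.
SET search_path TO tenant_a;
SELECT order_id, total FROM orders;
```

In this sketch, the only tenant-aware code in the application is the `SET search_path` statement issued when switching tenants.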
Sharding tradeoffs
Consideration | Schema-based sharding | Row-based sharding |
---|---|---|
Multi-tenancy model | Separate schema per tenant | Shared tables with tenant ID columns |
Citus version | 12.0+ | All versions |
Extra steps compared to vanilla PostgreSQL | None, only a config change | Use create_distributed_table on each table to distribute & colocate tables by tenant ID |
Number of tenants | 1-10k | 1-1M+ |
Data modeling requirement | No foreign keys across distributed schemas | Need to include a tenant ID column (a distribution column, also known as a sharding key) in each table, and in primary keys, foreign keys |
SQL requirement for single node queries | Use a single distributed schema per query | Joins and WHERE clauses should include tenant_id column |
Parallel cross-tenant queries | No | Yes |
Custom table definitions per tenant | Yes | No |
Access control | Schema permissions | Schema permissions |
Data sharing across tenants | Yes, using reference tables (in a separate schema) | Yes, using reference tables |
Tenant to shard isolation | Every tenant has its own shard group by definition | Can give specific tenant IDs their own shard group via isolate_tenant_to_new_shard |
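The last two rows of the table, as they apply to row-based sharding, can be illustrated with a short sketch. The table names (`plans`, `invoices`) and the tenant ID value are hypothetical; `create_reference_table` and `isolate_tenant_to_new_shard` are Citus functions.

```sql
-- Data shared across tenants: a reference table is replicated
-- to every node, so any tenant's queries can join against it locally.
CREATE TABLE plans (
    plan_id int PRIMARY KEY,
    name    text NOT NULL
);
SELECT create_reference_table('plans');

-- A hypothetical row-based table distributed by tenant_id.
CREATE TABLE invoices (
    tenant_id  bigint NOT NULL,
    invoice_id bigint NOT NULL,
    plan_id    int REFERENCES plans (plan_id),
    PRIMARY KEY (tenant_id, invoice_id)
);
SELECT create_distributed_table('invoices', 'tenant_id');

-- Give tenant 42 its own shard. CASCADE also isolates the tenant's
-- rows in any tables colocated with invoices.
SELECT isolate_tenant_to_new_shard('invoices', 42, 'CASCADE');
```

With schema-based sharding, the isolation step isn't needed, because every tenant schema already has its own shard group.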