In a mission-critical architecture, state must be stored outside of compute as much as possible. The choice of technology is based on these key architectural characteristics:
| Characteristics | Considerations |
|---|---|
| Performance | How much compute is required? |
| Latency | What impact will the distance between the user and the data store have on latency? What is the desired level of consistency, given its tradeoff with latency? |
| Responsiveness | Is the data store required to be always available? |
| Scalability | What's the partitioning scheme? |
| Durability | Is the data expected to be long lasting? What is the retention policy? |
| Resiliency | In case of a failure, is the data store able to fail over automatically? What measures are in place to reduce the failover time? |
| Security | Is the data encrypted? Can the data store be reached over a public network? |
In this architecture, there are two data stores:
Database
Stores state related to the workload. It's recommended that all state be stored globally in a database separate from the regional stamps. Build redundancy by deploying the database across regions. For mission-critical workloads, synchronizing data across regions should be the primary concern. Also, in case of a failure, write requests to the database should still be functional.
Data replication in an active-active configuration is highly recommended. The application should be able to instantly connect with another region. All instances should be able to handle read and write requests.
Message broker
The only stateful service in the regional stamp is the message broker, which stores requests for a short period. The broker serves the need for buffering and reliable messaging. The processed requests are persisted in the global database.
For other data considerations, see Mission-critical guidance in Well-Architected Framework: Data platform considerations.
This architecture uses Azure Cosmos DB for NoSQL. This option is chosen because it provides the most features needed in this design:
Multi-region write
Multi-region write is enabled with replicas deployed to every region in which a stamp is deployed. Each stamp can write locally and Azure Cosmos DB handles data replication and synchronization between the stamps. This capability significantly lowers latency for geographically distributed end-users of the application. The Azure Mission-Critical reference implementation uses multi-master technology to provide maximum resiliency and availability.
Zone redundancy is also enabled within each replicated region.
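The multi-region write and zone redundancy settings described above can be sketched in Terraform using the `azurerm` provider. This is a minimal illustration, not the reference implementation's actual configuration: the account name, resource group, and regions are placeholders, and attribute names follow recent `azurerm` provider versions.

```terraform
resource "azurerm_cosmosdb_account" "stamp" {
  name                = "mc-cosmos-account" # illustrative name
  location            = "eastus2"           # illustrative region
  resource_group_name = "mc-global-rg"      # illustrative resource group
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  # Allow every replicated region to accept writes (multi-master).
  multiple_write_locations_enabled = true

  # One geo_location block per stamp region, each zone redundant.
  geo_location {
    location          = "eastus2"
    failover_priority = 0
    zone_redundant    = true
  }

  geo_location {
    location          = "westeurope"
    failover_priority = 1
    zone_redundant    = true
  }

  consistency_policy {
    consistency_level = "Session"
  }
}
```

Each `geo_location` block deploys a replica to the corresponding stamp region; `zone_redundant = true` spreads that replica across availability zones within the region.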
For details on multi-region writes, see Configure multi-region writes in your applications that use Azure Cosmos DB.
Conflict management
With the ability to perform writes across multiple regions comes the necessity to adopt a conflict management model, because simultaneous writes can introduce record conflicts. Last Writer Wins is the default model and is used for the mission-critical design. The last writer, as defined by the associated timestamps of the records, wins the conflict. Azure Cosmos DB for NoSQL also allows a custom conflict-resolution property to be defined.
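The Last Writer Wins behavior can be sketched in plain Python. The `_ts` property name mirrors the internal timestamp Cosmos DB maintains on each item; the function itself is an illustration of the resolution rule, not part of any SDK.

```python
def last_writer_wins(local: dict, remote: dict, path: str = "_ts") -> dict:
    """Resolve a write conflict by keeping the record with the highest
    value at the conflict-resolution path (default: the Cosmos DB
    internal timestamp `_ts`)."""
    return remote if remote.get(path, 0) > local.get(path, 0) else local

# Two conflicting writes to the same item from different regions:
a = {"id": "item-1", "value": "from eastus2", "_ts": 1700000100}
b = {"id": "item-1", "value": "from westeurope", "_ts": 1700000200}

# The later write wins the conflict.
winner = last_writer_wins(a, b)
```

With a custom conflict-resolution property (for example, a `/version` path), the comparison would run against that property instead of `_ts`.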
Query optimization
A general query efficiency recommendation for read-heavy containers with a high number of partitions is to add an equality filter on the item ID. For an in-depth review of query optimization recommendations, see Troubleshoot query issues when using Azure Cosmos DB.
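As a sketch, a parameterized query with an equality filter on `id` might be built as follows. The query text and the parameter dictionary shape mirror what the Azure Cosmos DB SDKs accept; the helper function and the item ID are illustrative.

```python
def point_query_spec(item_id: str) -> tuple[str, list[dict]]:
    """Build a parameterized query with an equality filter on `id`,
    which lets Cosmos DB route the query to a single partition
    instead of fanning out across all partitions."""
    query = "SELECT * FROM c WHERE c.id = @id"
    parameters = [{"name": "@id", "value": item_id}]
    return query, parameters

# Illustrative usage with a hypothetical item ID:
query, params = point_query_spec("a1b2c3")
```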
Backup feature
It's recommended that you use the native backup feature of Azure Cosmos DB for data protection. Azure Cosmos DB backup feature supports online backups and on-demand data restore.
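As a sketch, the native backup feature can be configured on the Cosmos DB account in Terraform via the `backup` block of `azurerm_cosmosdb_account`. The interval and retention values below are illustrative, not recommendations:

```terraform
  backup {
    type                = "Periodic"
    interval_in_minutes = 240 # take a backup every 4 hours
    retention_in_hours  = 8   # keep the two most recent backups
  }
```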
Note
Most workloads aren't purely OLTP. There's an increasing demand for real-time reporting, such as running reports against the operational system. This is also referred to as HTAP (Hybrid Transactional and Analytical Processing). Azure Cosmos DB supports this capability via Azure Synapse Link for Azure Cosmos DB.
The data model should be designed such that features offered by traditional relational databases aren't required, such as foreign keys, strict row/column schemas, and views.
The workload has these data access characteristics: queries that return multiple items use a `FeedIterator` with a limit on the number of results.

Azure Cosmos DB is configured as follows:
Consistency level is set to the default Session consistency because it's the most widely used level for single-region and globally distributed applications. A weaker consistency level with higher throughput isn't needed: write processing is asynchronous by nature and doesn't require low latency on database writes.
Note
The Session consistency level offers a reasonable tradeoff between latency, availability, and consistency guarantees for this specific application. It's important to understand that the Strong consistency level isn't available for databases configured for multi-region (multi-master) writes.
Partition key is set to /id for all collections. This decision is based on the usage pattern, which is mostly "writing new documents with a GUID as the ID" and "reading a wide range of documents by IDs". Provided the application code maintains ID uniqueness, new data is evenly distributed into partitions by Azure Cosmos DB, enabling infinite scale.
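The even distribution of GUID-keyed writes can be illustrated with a small simulation. The bucket count and the hash function are stand-ins; Cosmos DB uses its own hash over the partition key value to map items to physical partitions.

```python
import uuid
from collections import Counter

# Simulate how GUID ids spread across logical partitions when the
# partition key is /id. Buckets stand in for physical partitions.
NUM_BUCKETS = 10
ids = [str(uuid.uuid4()) for _ in range(10_000)]
buckets = Counter(hash(i) % NUM_BUCKETS for i in ids)

# Every bucket receives a comparable share of the writes, so no
# single partition becomes a hot spot.
```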
Indexing policy is configured on collections to optimize queries. To optimize RU cost and performance, a custom indexing policy is used. This policy only indexes the properties that are used in query predicates. For example, the application doesn't use the comment text field as a filter in queries, so it was excluded from the custom indexing policy.
Here's an example from the implementation that shows indexing policy settings using Terraform:
```terraform
indexing_policy {

  excluded_path {
    path = "/description/?"
  }

  excluded_path {
    path = "/comments/text/?"
  }

  included_path {
    path = "/*"
  }
}
```
For information about how the application connects to Azure Cosmos DB in this architecture, see Application platform considerations for mission-critical workloads.
Mission-critical systems often use messaging services for message or event processing. These services promote loose coupling and act as a buffer that insulates the system against unexpected spikes in demand.

The following are design considerations and recommendations for Azure Service Bus premium and Azure Event Hubs in a mission-critical architecture.
The messaging system must be able to handle the required throughput (in MB per second). Consider the following:
Azure Service Bus premium tier is the recommended solution for high-value messages for which processing must be guaranteed. The following are details regarding this requirement when using Azure Service Bus premium:
To ensure that messages are properly transferred to and accepted by the broker, message producers should use one of the supported Service Bus API clients. Supported APIs will only return successfully from a send operation if the message was persisted on the queue/topic.
To ensure messages on the bus are processed, use the PeekLock receive mode, which enables at-least-once processing. The process works as follows:

1. The broker delivers the message to a consumer and locks it so that other consumers can't receive it.
2. The consumer processes the message and, on success, completes it, which permanently removes it from the broker.
3. If the consumer fails to complete the message, abandons it, or lets the lock expire, the message becomes available for redelivery.
Because messages can potentially be processed more than one time, message handlers should be made idempotent.
In RFC 7231, the Hypertext Transfer Protocol states, "A ... method is considered idempotent if the intended effect on the server of multiple identical requests with that method is the same as the effect for a single such request."
One common technique for making message handling idempotent is to check a persistent store, like a database, to see whether the message has already been processed. If it has, you skip the processing logic.
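A minimal sketch of this technique, with an in-memory set standing in for the persistent store (a real implementation would check the global database):

```python
class IdempotentProcessor:
    """Processes messages at-least-once; duplicate deliveries are
    detected via a processed-id store and skipped."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a database table
        self.side_effects = []      # observable effect of processing

    def handle(self, message_id: str, body: str) -> bool:
        """Return True if the message was processed, False if it was
        recognized as a duplicate and skipped."""
        if message_id in self.processed_ids:
            return False                      # already handled: skip
        self.side_effects.append(body)        # the actual work
        self.processed_ids.add(message_id)    # record completion
        return True

processor = IdempotentProcessor()
processor.handle("msg-1", "create order 42")
# Redelivery of the same message (for example, after a lost lock)
# performs no work the second time:
processor.handle("msg-1", "create order 42")
```

After either path, the handler would still complete the message so the broker removes it from the queue.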
The message broker must be available for producers to send messages and consumers to receive them. The following are details regarding this requirement:
Note
Azure Service Bus Geo-disaster recovery only replicates metadata across regions. This feature doesn't replicate messages.
The messaging system acts as a buffer between message producers and consumers. In a mission-critical system, you should monitor key indicator types of the messaging system that provide valuable insights:
The health of the messaging system must be considered in the health checks for a mission-critical application. Consider the following factors:
Deploy the reference implementation to get a full understanding of the resources and their configuration used in this architecture.