Data Consistency Primer
Cloud applications typically use data that is dispersed across data stores. Managing and maintaining data consistency in this environment can become a critical aspect of the system, particularly in terms of the concurrency and availability issues that can arise. You frequently need to trade strong consistency for availability. This means that you may need to design some aspects of your solutions around the notion of eventual consistency and accept that the data that your applications use might not be completely consistent all of the time.
Managing Data Consistency
Every web application and service uses data. This data is frequently required to help users and organizations make business decisions, and therefore it may be important that this data accurately represents the most current information available and that it is consistent. Data consistency implies that all instances of an application are presented with the same set of data values all of the time. This approach is sometimes referred to as strong data consistency.
In the world of relational databases, consistency is often enforced by transactional models that use locks to prevent concurrent application instances from modifying the same data at the same time. In a strongly consistent system, the locks also block concurrent requests to query data, but many relational databases enable an application to relax this rule and provide access to a copy of the data that reflects the state it was in before the update started. Many applications that store data in non-relational databases, flat files, or other structures follow a similar strategy, known as pessimistic locking. An application instance locks data while it is being modified, and then releases the lock when the update is complete.
In a modern cloud application, the data is likely to be partitioned across data stores hosted at different sites, some of which could be dispersed over a wide geography. This can occur for a variety of reasons: to improve scalability by balancing the load across multiple computers, to improve response time by co-locating data close to the users and services that access it, or to improve availability by replicating data across different sites.
Maintaining data consistency across distributed data stores can be a significant challenge. The issue is that strategies such as serialization and locking only work well if all application instances share the same data store, and the application is designed to ensure that the locks are very short-lived. However, if data is partitioned or replicated across different data stores, locking and serializing data access to maintain consistency can become an expensive overhead that impacts the throughput, response time, and scalability of a system. Therefore, most modern distributed applications do not lock the data that they modify, and they take a rather more relaxed approach to consistency, known as eventual consistency.
For information about distributing data across remote locations, co-locating data, and replicating and synchronizing data, see Data Partitioning Guidance and Data Replication and Synchronization Guidance.
The following sections provide more information on strong consistency and eventual consistency, and the issues that surround these different approaches to maintaining data consistency in a distributed environment such as the cloud.
In the strong consistency model, all changes are atomic. If a transaction updates multiple data items, the transaction is not allowed to complete until either all of the changes have been made successfully, or (in the event of a failure) they have all been undone. In the time between a transaction starting and completing, other concurrent transactions may not be able access any of the data that has been modified; they will be blocked. If data is being replicated, a transaction that implements strong consistency may not be allowed to complete until every copy of each item that has changed has been successfully updated.
The aim of the strong consistency model is to minimize the chance that an application instance might be presented with an inconsistent view of the data. The cost of implementing this model is the impact it has on the availability, performance, and scalability of the resulting solution. In a distributed environment, if the data stores holding the data affected by a transaction are geographically remote from each other, network latency could adversely impact the performance of such transactions and result in concurrent access to data being blocked for an extended period. If a network failure renders one or more of the data stores inaccessible during a transaction, an application updating data in a system that implements strong consistency may be blocked until every data store becomes accessible again.
Additionally, in a distributed environment such as the cloud, implementing strong consistency is not tolerant of the types of failure that may occur. For example, it may not be possible to roll back a transaction and release the resources that it holds if a component participating in the transaction has stopped responding due to a long-lasting network outage. In this case, it will be necessary to resolve the situation through other means, such as manually reconciling the data.
Many forms of data storage used by cloud-based applications do not support strong consistency across different data stores. For example, when using Microsoft Azure Storage it is not possible to implement transactions that span multiple blob or table stores.
In a cloud application, you should implement strong consistency only where it is absolutely necessary. For example, if an application updates multiple items that are located within the same data store, the benefits of strong consistency may outweigh the disadvantages because data is likely to be locked only for a very short period. However, if the items to be updated are dispersed across a network, it may be more appropriate to relax the requirement for strong consistency.
In a system that implements strong consistency but also replicates data to remote locations, it may be appropriate to propagate changes to replicas outside the scope of a strongly consistent transaction. Some level of transient inconsistency is almost inevitable while replicas are updated—but the data will eventually become consistent after the synchronization between replicas has completed. For more information, see the Data Replication and Synchronization Guidance.
An alternative approach for maintaining strong consistency across replicated data that is frequently implemented by highly scalable NoSQL databases is to use read and write quorums and versioning. This approach avoids locking data, at the expense of some additional complexity in the processes that read and write data. For more information see the section “Improving Consistency” in Chapter 1, “Data Storage for Modern High-Performance Business Applications” of the guide Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence on MSDN.
Not all data in an application has to be treated in the same way with respect to consistency. An application may implement different strategies for handling consistency across different data sets. The decision over which strategy and consistency model to use for any given dataset should be based on the business requirements of the application.
Eventual consistency is a rather more pragmatic approach to data consistency. In many cases, strong consistency is not actually required as long all the work performed by a transaction is completed or rolled back at some point, and no updates are lost. In the eventual consistency model, data update operations that span multiple sites can ripple through the various data stores in their own time, without blocking concurrent application instances that access the same data.
One of the drives for eventual consistency is that distributed data stores are subject to the CAP Theorem. This theorem states that a distributed system can implement only two of the three features (Consistency, Availability, and Partition Tolerance) at any one time. In practice, this means that you can either:
- Provide a consistent view of distributed (partitioned) data at the cost of blocking access to that data while any inconsistencies are resolved. This may take an indeterminate time, especially in systems that exhibit a high degree of latency or if a network failure causes loss of connectivity to one or more partitions.
- Provide immediate access to the data at the risk of it being inconsistent across sites. Traditional database management systems focus on providing strong consistency, whereas cloud-based solutions that utilize partitioned data stores are typically motivated by ensuring higher availability, and are therefore more oriented towards eventual consistency.
Eventual consistency is unlikely to be specified as an explicit requirement of a distributed system. Instead it is often a result of implementing a system that must exhibit scalability and high availability, which precludes most common strategies for providing strong consistency.
An application instance may see a view of a data item affected by an operation in the state it is in while the operation is in flight, and this view may be temporarily inconsistent. Depending on the requirements of the system, the developer might need to design applications to detect and handle such inconsistencies, and then take steps to resolve them if necessary.
Eventual consistency also impacts data consistency when using caching. If the data in the remote data store changes, all copies cached by applications will be most likely be out of date. Configuring a cache expiration policy that prevents cached data from becoming too stale, and implementing techniques such as the Cache Aside pattern, can help to reduce the chances of inconsistencies occurring. However, these approaches are unlikely to completely eliminate inconsistencies in cached data, and it is important that applications that use caching as an optimization strategy can handle these inconsistencies.
It is worth bearing in mind that an application may not actually require data to be consistent all of the time. For example, in a typical ecommerce web application that enables a user to browse and purchase goods, any stock levels presented to a user are likely to be static values determined when the details for a stock item are queried. If another concurrent user purchases the same item, the stock level in the system will decrease but this change will probably not need to be reflected in the data displayed to the first user. If the stock level drops to zero and the first user attempts to purchase the item, the system could either alert the user that the item is now out of stock, or place the item on back order and inform the user that the delivery time may be extended.
Considerations for Implementing Eventual Consistency
Eventual consistency is often the preferred model for managing distributed data in a cloud environment, but there are many issues that you must consider if you follow this model. These issues are best summarized by using an example. Figure 1 shows a simple ecommerce application that could benefit from following the eventual consistency approach.
Figure 1 - A distributed transaction spanning three heterogeneous data sources
When a customer places an order, the application instance performs the following operations across a collection of heterogeneous data stores held in various locations:
- Update the stock level of the item ordered.
- Record the details of the order.
- Verify payment details for the order.
In some cases a data store can be an external service, such as the payment system shown in Figure 1.
Although these operations comprise a logical transaction, attempting to implement strong transactional consistency in this scenario is likely to be impractical. Instead, implementing the order process as an eventually consistent**series of steps, where each step in the process is essentially an autonomous operation, is a much more scalable solution. While these steps are progressing, the state of the overall system is inconsistent. For example, after the stock level has been updated but before the details of the order have been recorded the system has temporarily lost some stock. However, when all the steps have been completed, the system returns to a consistent state and all stock items can be accounted for.
Despite the fact that implementing eventual consistency in this example appears to be conceptually quite simple, the developer must ensure that the system does eventually become consistent. In other words, the application is responsible for guaranteeing either that all three steps in the order process complete, or determining the actions to take if any of the steps fail. How you resolve this situation in any given system is inevitably application specific.
For an example showing how you can implement an eventually consistent system spanning different data stores, see Chapter 8, “Building a Polyglot Solution,” in the guide “Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence” on MSDN. The following sections of this guidance also provide some suggestions.
Retrying Failing Steps
In a distributed environment, the inability to complete an operation is often due to some type of temporary error (communication failure is always a possibility.) If such a failure occurs, an application might assume that the situation is transient and simply attempt to repeat the step that failed. Less transient exceptions, such as database or virtual machine failure, may also occur and the remedy might be similar—wait for the system to be recovered and then try the failing operation again. This approach could result in the same step actually being run twice, possibly resulting in multiple updates. It is very difficult to design a solution to prevent this repetition from occurring, but the application should attempt to render such repetition harmless.
One strategy is to design each step in an operation to be idempotent. This means that a step that had previously succeeded can be repeated without actually changing the state of the system. The steps that comprise a business operation are naturally heavily dependent on the business logic of your system, and the way in which you implement them will be heavily influenced by the structure of the data. Defining idempotent steps requires a deep, domain-specific understanding of your system.
Some steps might be naturally idempotent. For example, a step that sets a particular item to a specific value (such as “ZipCode = 11111”) can be repeated many times and the result will always be the same. However, natural idempotency is not always possible. In a system that incorporates services, such as the payment system shown in the ecommerce example, it may be possible to implement some form of artificial idempotency. A common technique is to associate the message sent to the service with a unique identifier. The service can store the identifier for each message it receives locally, and only process a message if the identifier does not match that of a message it received earlier. This technique is known as de-duping (the removal of duplicate messages). This strategy, exemplified by the Idempotent Receiver pattern, depends on the service being able to store message identifiers successfully.
For more information about idempotency, see Idempotency Patterns on Jonathan Oliver’s blog.
Partitioning Data and Using Idempotent Commands
Multiple instances of an application competing to modify the same data at the same time are another common cause of failure to achieve eventual consistency. If possible, you should design your system to minimize these situations. You should try and partition your system to ensure that concurrent instances of an application attempting to perform the same operations simultaneously do not conflict with each other.
Rather than thinking in terms of simple CRUD (create, retrieve, update, and delete) operations, you can structure your system around atomic commands that perform business tasks in an idempotent style. For more information, see the Command and Query Responsibility Segregation pattern. The commands in a CQRS solution are frequently implemented by following the Event Sourcing pattern. Event sourcing performs operations on data by driving tasks from a sequence of events, each of which is recorded in an append-only store.
For detailed information on using CQRS and Event Sourcing to implement eventual consistency, see Reference 4: A CQRS and ES Deep Dive on MSDN.
Implementing Compensating Logic
There may ultimately be situations where the logic in the application determines that an operation cannot or should not be allowed to complete (this could be for a variety of business-specific reasons). In these cases you can implement compensating logic that undoes the work performed by the operation, as described by the Compensating Transaction pattern.
In the ecommerce example shown in Figure 1, as the application performs each step of the order process it can record the tasks necessary to undo this step. If the order process fails, the application can apply the “undo” steps for each step that had previously completed to restore the system to a consistent state. This technique may be complicated by the fact that undoing a step may not be as simple as performing the exact opposite of the original step, and there may be additional business rules that the application must apply. For example, undoing the step that records the details of an order in the document database may not be as straightforward as removing the document. For auditing purposes, it may be necessary to leave the original order document in place but change the status of the order in this document to “cancelled.”
Compensating transactions can be complicated to implement and expensive to perform. You should only use them where they are absolutely necessary.
Related Patterns and Guidance
The following patterns and guidance may also be relevant when managing consistency in a cloud application:
- Compensating Transaction Pattern. This pattern describes how to undo the work performed by a series of steps, which together define an eventually consistent operation, if one or more of the operations fail.
- Command and Query Responsibility Segregation Pattern. This pattern describes how you can segregate the operations that read data from the operations that update data. This pattern can use different models of the same data, and it is necessary to ensure that the information in these models can become consistent.
- Event Sourcing Pattern. This pattern is frequently used with the Command and Query Responsibility Segregation pattern. It can simplify tasks in complex domains; improve performance, scalability, and responsiveness; provide consistency for transactional data; and maintain full audit trails and history that may enable compensating actions.
- Data Partitioning Guidance. In many large-scale applications, data is divided into separate partitions that can be managed and accessed separately. It may be necessary to ensure that the data is consistent across partitions.
- Data Replication and Synchronization Guidance. Data replication and synchronization can help to maximize availability and performance, ensure consistency, and minimize data transfer costs between locations.
- Caching Guidance. Data in application caches can become inconsistent across instances and with the data store that acts as the original source of the data. This guidance describes how caches may support expiration policies that can help to reduce the period during which cached data is inconsistent.
- Cache-Aside Pattern. This pattern describes how to fetch data into a cache on demand, as the data is required by an application. It can be used to good effect to reduce the overhead associated with repeatedly accessing the same data.
- The guide Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence on MSDN.
- The Idempotent Receiver pattern by Gregor Hohpe and Bobby Woolf on the Enterprise Integration Patterns website.
- The article Idempotency Patterns on Jonathan Oliver’s blog.
- Reference 4: A CQRS and ES Deep Dive on MSDN.
- The article Eventually Consistent on the ACM website.
- CAP Theorem on Wikipedia.