8 - Building a Polyglot Solution
Throughout this book, we have emphasized that the purpose of a NoSQL database is to enable an application to store and retrieve data in a scalable, available, and efficient manner. Most NoSQL databases are geared towards specific forms of data access, and you design the structures that you store in a NoSQL database to optimize the queries that your applications perform. Solutions that are based on a relational database tend to store all of their data in a single repository and depend on the relational database management system to perform the optimization for them. However, the advent of NoSQL databases has meant that many application developers, designers, and architects are now looking to apply the most appropriate means of data storage to each specific aspect of their systems, and this may involve implementing multiple types of database and integrating them into a single solution. The result is a polyglot solution.
Designing and implementing a polyglot system is not a straightforward task and there are a number of questions that you will face, including:
- How can you implement a uniform data access strategy that is independent of the different databases? The business logic of an application should not be dependent on the physical structure of the data that it processes, as this can introduce dependencies between the business logic and the databases. This issue is especially important if your data is spread across different types of database, where a single business method might retrieve data from a variety of data stores.
- How can you make the best use of different databases to match the business requirements of your applications? This is the driving force behind building polyglot solutions. For example, if your applications need to store and retrieve structured data very quickly you might consider using a document database, but if you need to perform more complex analyses on the same data then a column-family database might be more appropriate. On the other hand, if you need to track and manage complex relationships between business objects, then a graph database could provide the features that you require. In some situations, you might need to support all of these requirements in tandem, in which case you may need to partition the data across different types of database.
- How can you maintain consistency across different databases? For example, an e-commerce application might store customer information in one database and information about orders in another. It is important for the information in these two databases to be consistent; an application should not be able to create and store orders for nonexistent customers.
The developers at Adventure Works implemented the Repository pattern to address the first bullet point in the preceding list. The MvcWebApi web service defines repository classes for each of the different data stores that the controllers use, and these repository classes abstract the details of the underlying database from the controllers. This strategy has been discussed extensively throughout this guide.
The purpose of this chapter is to address the questions raised by the second and third points above, and describe possible solutions that you can implement in your own applications.
Partitioning Data Across Different Types of Database
Up until relatively recently, a traditional in-house data-processing solution would utilize a single database technology, most often a relational database, to provide the repository for the information that it stored and processed. As described in Chapter 1, "Data Storage for Modern High-Performance Applications," and Chapter 3, "Implementing a Relational Database," relational systems have undoubted strengths and flexibility, but they also have their weaknesses and limitations. These weaknesses have been the driving force behind the rise of NoSQL databases.
Throughout this guide, it has been pointed out that different NoSQL databases are best suited to different scenarios and use cases, and that the requirements of a typical business application often transcend these use cases. The increasing abundance of highly scalable, highly available, and relatively cheap NoSQL database systems makes building solutions that combine multiple database technologies together a viable and cost effective option. The challenge is in deciding which functionality is most appropriate to a specific type of NoSQL database, and then integrating the resulting conglomeration of technologies into a functional, maintainable application.
There are no hard and fast rules for mapping functionality to specific types of NoSQL databases, but the following list provides some general guidelines that you can consider as a starting point:
- If your application simply stores and retrieves opaque data items and blobs that it identifies by using a key, then use a key/value store.
- If your application is more selective and needs to filter records based on non-key fields, or update individual fields in records, then use a document database.
- If your application needs to store records with hundreds or thousands of fields, but only retrieves a subset of these fields in most of the queries that it performs, then use a column-family database.
- If your application needs to store and process information about complex relationships between entities, then use a graph database.
In many cases, you will encounter situations that cut across this rather simplified set of guidelines. If you need to store and modify highly-structured information, but also need to maintain information about dynamic and possibly complex relationships between records, then should you use a document database or a graph database? Document databases typically don't handle dynamic, complex relationships well, and graph databases tend not to support the rich data-structuring capabilities of a document database. One answer is to use both and take advantage of the synergy that such a solution can provide if it is designed carefully.
Chapter 5, "Implementing a Document Database," and Chapter 7, "Implementing a Graph Database," presented different aspects of similar examples that could benefit from this approach. In the example in Chapter 5, an application stores information about employees and their managers. To recap, one possible way of structuring this information as a collection of documents is shown in Figure 1.
Figure 1 - Storing employee details as a collection of documents
These documents are optimized for simple queries pertaining to the details of a single employee. However, if an application needs to generate an organization chart that lists all the employees that report directly or indirectly to a manager, the application may need to read every document in the entire collection, and perform a significant amount of nontrivial processing to generate the hierarchical structure of the chart. One way of simplifying this processing is to store the information about the managers that an employee reports directly and indirectly to as materialized paths (also described in Chapter 5), but if managers regularly change (as they do in some organizations), then maintaining these materialized paths for a large number of employees could be an expensive and time-consuming process.
The obvious answer to this problem is to store the data in a graph database, as described in Chapter 7. Figure 2 shows the data from Figure 1, but held as a graph of employee nodes.
Figure 2 - Storing employee details as a graph
A graph database such as this is ideal for performing queries that walk down hierarchical relationships. However, if the application needs to store highly structured and complex information about each employee (such as the details of their most recent review), then a graph database might not provide the capabilities necessary to define these structures.
The solution is to store the detailed information about each employee as a collection of documents in a document database, and maintain the information about the managers that they report to as a graph in a graph database. In this way, the data for each employee can be as complicated as the document database will allow, and the graph database only needs to hold the essential details of each employee needed to perform the various graph-oriented queries required by the application. Figure 3 shows this hybrid combination.
Figure 3 - Storing employee details in a document database, and organizational details in a graph database
The employee documents and the employee nodes share a common set of employee IDs. The application can use the employee ID from a node in the graph database to find the details of that employee in the document database. Additionally, the graph database includes copies of the properties required to generate the organization chart. These copied properties help to optimize the application code that creates the organization chart as it does not need to retrieve this information from the document database, and it consists of information that does not change frequently (more dynamic data, such as salary and role information, is best left in the document database to avoid the overhead of maintaining it in two places).
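For example, the code that builds the organization chart might walk the graph to collect the relevant employee IDs, and then use those shared IDs to fetch the full documents. The following is a minimal sketch of this pattern; the EmployeeDocument type and the IOrganizationGraph and IEmployeeDocuments abstractions are hypothetical, not taken from any sample solution (assume the System.Collections.Generic namespace is imported):
public class EmployeeDocument
{
    public int EmployeeId { get; set; }
    public string Name { get; set; }
    // ... other structured employee details held in the document database.
}

public interface IOrganizationGraph
{
    // Returns the IDs of all employees who report, directly or
    // indirectly, to the specified manager.
    IEnumerable<int> GetReportingChain(int managerId);
}

public interface IEmployeeDocuments
{
    // Retrieves the full employee document for a single employee ID.
    EmployeeDocument GetEmployee(int employeeId);
}

public class OrganizationChartBuilder
{
    private readonly IOrganizationGraph graph;
    private readonly IEmployeeDocuments documents;

    public OrganizationChartBuilder(
        IOrganizationGraph graph, IEmployeeDocuments documents)
    {
        this.graph = graph;
        this.documents = documents;
    }

    public IEnumerable<EmployeeDocument> GetChartDetails(int managerId)
    {
        // The graph database answers the relationship query cheaply...
        foreach (var employeeId in this.graph.GetReportingChain(managerId))
        {
            // ...and the shared employee ID is the key into the
            // document database.
            yield return this.documents.GetEmployee(employeeId);
        }
    }
}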
This strategy is extremely flexible, enabling you to mix and match databases to meet the data storage and processing requirements of the application. For example, if you needed to store a complete salary and appraisal history for each member of staff, you could incorporate a column-family database. However, the cost of this approach is the additional logic that the application must implement when it needs to add, modify, or delete information, because it must ensure that the data is consistent across all databases. This issue is discussed in the next section.
Managing Cross-Database Consistency
The primary reason for splitting information across different types of databases is to optimize the way in which applications store and retrieve data. However, each database is an independent repository and in most cases a database has no knowledge of any other databases used by the system. Maintaining consistency across databases is therefore an application issue, and it is the responsibility of the data access logic in the application to ensure that inserts, updates, and deletes that affect the data in multiple databases are actually propagated to those databases.
In the relational world, an application typically handles cross-database consistency by implementing distributed transactions that implement the Two-Phase Commit (2PC) protocol. However, as described in the section "Performance and Scalability" in Chapter 1, "Data Storage for Modern High-Performance Applications," the 2PC protocol can cause bottlenecks in an update-intensive system due to data being locked while the databases that participate in the transaction coordinate with each other, and is not ideally suited to running on slow or unreliable wide-area networks such as the Internet.
Jana says:
If your application performs OLTP operations, then a relational database is likely to be a better solution than a NoSQL database. However, you should be prepared to ask whether your application actually needs to implement operations as transactions, or whether you can trade immediate consistency for performance.
One of the main tradeoffs that NoSQL databases make is that of consistency against performance; they prefer to be quick rather than immediately accurate, as long as the system becomes consistent at some point. As a result, although most NoSQL databases support the notion of atomic writes to a single aggregate (a value in a key/value store, a document in a document database, or a collection of columns in a column family), most do not implement transactions that span multiple records (graph databases are an exception, due to the requirement that relationships must not reference nonexistent nodes, as described in Chapter 7, "Implementing a Graph Database"). In many cases, the immediate consistency of ACID transactions is not necessary and it may be perfectly acceptable for there to be a degree of lag while different databases are updated. In these situations, a solution can opt to implement eventual consistency, but the important point is that any changes must ultimately be reflected across every database, and the details of any change should never be lost.
Jana says:
A few NoSQL databases provide limited transactional capabilities. Additionally, some NoSQL databases support integration with third-party libraries that enable an application to lock records during a series of operations to ensure that these records are not changed by another concurrent application. However, you should beware of possible performance problems and blocking that could occur if an application fails to release locks in a timely manner (or forgets to release them altogether). It is usually safer to design your solutions around the eventual consistency model and eliminate any need for locking.
You can implement eventual consistency in a variety of ways, and the following sections describe some possible strategies.
Handling Consistency in the Business Logic of a Solution
The section "Integrating NoSQL Databases into a Polyglot Solution" in Chapter 1, "Data Storage for Modern High-Performance Applications" described an approach for hiding the data access logic for different NoSQL databases behind a web service façade. The façade provides an operational interface that exposes the business methods required by an application, and the implementation ensures that the requests that the business logic an application performs are directed towards the appropriate database(s). This is the approach that the designers at Adventure Works took when they built the MvcWebApi web service for the Shopping application.
In this architecture, the web service implementing the façade acts as the point of integration for the various NoSQL databases. The logic in the methods that the web service exposes handles the various database interactions, storing and retrieving data from one or more databases. Furthermore, these web methods take on the responsibility for detecting any failures when saving or fetching data, handling these failures, and if necessary implementing any compensation logic that undoes the work performed by an operation that cannot be completed successfully. How a method handles failures ultimately depends on the business scenario and the tolerance that an application has for these failures. In some cases, an error when saving data can be ignored if the data is transient or short-lived, while in other cases the failure may be more critical and require correction. For example, in an application that monitors the second-by-second changes in stock prices, the failure to save a single change to a price may not be critical because the price may change again very quickly (this scenario assumes that the data is being recorded for analytical purposes rather than as part of an online share trading system). On the other hand, in a banking system that transfers money between accounts, it is vital that every change is safely and accurately recorded.
Another key issue is idempotency. If possible, you should seek to eliminate any dependencies on the order in which the different repositories in a polyglot solution are updated. This strategy can help to increase throughput and alleviate some of the complex failure logic that you might otherwise have to implement. For example, if a business operation modifies information in several databases, it may be possible to perform the updates in parallel by using multiple threads. In a widely distributed environment performing a large number of concurrent operations, it is possible that an individual update may fail due to transient networking issues or timeouts. If the various updates performed by the business operation are idempotent, then a failing update against a specific repository can simply be retried without affecting any of the other updates. Only if an update fails irretrievably might it be necessary to undo the work performed by the other updates that comprise the operation.
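A minimal sketch of this approach is shown below; the update delegates and the retry policy are illustrative assumptions rather than code from the sample solution (assume the System, System.Linq, and System.Threading.Tasks namespaces):
public static class ParallelUpdater
{
    public static Task ApplyToAllAsync(params Func<Task>[] idempotentUpdates)
    {
        // Because the updates are idempotent and order-independent, they can
        // run concurrently and each one can be retried in isolation.
        return Task.WhenAll(
            idempotentUpdates.Select(update => RetryAsync(update, maxAttempts: 3)));
    }

    private static async Task RetryAsync(Func<Task> update, int maxAttempts)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await update();
                return;
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                {
                    throw; // Give up; the caller must compensate or alert.
                }

                // The failure may be transient: back off briefly and replay
                // the same update. Idempotency makes the replay safe.
                await Task.Delay(TimeSpan.FromSeconds(attempt));
            }
        }
    }
}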
How the Shopping Application Creates an Order
In the MvcWebApi web service used by the Shopping application, the data for an order spans three different databases: a key/value store, a SQL Server database, and a document database. The following steps describe the operations that affect an order:
- When a customer browses the product catalog, the customer can add items to their shopping cart. This shopping cart is persisted in a key/value store and is preserved between sessions. The customer can log out, and when the customer returns the shopping cart is restored.
- When the customer places an order, the items in the shopping cart are used to construct an order. The details of the order are passed to the orders service, which records the order in the order history database with the status set to Pending and then stores the order information in the database used by the warehousing and dispatch systems inside Adventure Works (these systems are responsible for fulfilling the order and dispatching it to the customer). When the order is processed, the system writes another record to the order history database with the status set to Completed to indicate that the order is complete.
- As the order is processed, any changes to the order are also recorded in the order history database.
The Post method in the OrdersController is responsible for creating an order from the information in a customer's shopping cart. It communicates with the order service, which saves a record of the order in the order history database and then passes the details of the order to the warehousing and dispatching systems inside Adventure Works. The order service implements the IOrderService interface. This interface exposes a method called ProcessOrder that takes an Order object as its parameter. The following code shows the relevant parts of the Post method in the OrdersController that invokes the order service:
public class OrdersController : ApiController
{
    private readonly IOrderHistoryRepository orderHistoryRepository;
    private readonly IShoppingCartRepository shoppingCartItemRepository;
    private readonly IPersonRepository personRepository;
    private readonly IProductRepository productRepository;
    private readonly IOrderService orderService;
    ...

    public HttpResponseMessage Post(OrderInfo order)
    {
        ...
        // Find the shopping cart for the user
        var shoppingCart = this.shoppingCartItemRepository.GetShoppingCart(
            order.CartId.ToString());
        ...
        // Create and populate a new Order object (called newOrder)
        // with the list of products in the shopping cart (most details omitted)
        var newOrder = new Order()
        {
            ...
            Status = OrderStatus.Pending
        };
        ...
        try
        {
            this.orderService.ProcessOrder(newOrder);
        }
        catch (Exception ex)
        {
            // This save failed, so we cannot continue processing the order.
            return Request.CreateErrorResponse(
                HttpStatusCode.InternalServerError, ex.Message);
        }

        // Since we have captured the order,
        // we can delete the user's shopping cart now.
        this.shoppingCartItemRepository.DeleteShoppingCart(
            shoppingCart.UserCartId);
        ...
    }
    ...
}
The first part of the Post method retrieves the shopping cart from the ShoppingCartRepository, and uses this information to create a new Order object (most of this code is omitted in the example above). The status of this order is initially marked as Pending. The code in the try … catch block invokes the ProcessOrder method of the order service to save the order, and if this step is successful the shopping cart is deleted. Removing the shopping cart helps prevent the same order from being placed again accidentally.
Note
If the shopping cart is not deleted correctly, the logic in the various controllers that respond to requests from the user interface is responsible for detecting this failure and handling it appropriately. The section "How the Shopping Application Prevents Duplicate Orders" describes this scenario in more detail.
The sample application implements the order service in the OrderService class. You can find this class in the DataAccess.Domain.Services project in the Domain folder of the solution. The ProcessOrder method in this class saves the order details (with a status of Pending) to the order history repository and then submits the order for processing. Handling the order may take some time, so the ProcessOrder method sends it to a queue. A separate service can take the order and process it in its own way. The following code example shows the relevant parts of the ProcessOrder method in the OrderService class:
public sealed class OrderService : ...
{
    private readonly IOrderHistoryRepository orderHistoryRepository;
    private readonly IOrderQueue orderQueue;
    ...

    public void ProcessOrder(Order newOrder)
    {
        ...
        if (newOrder.Status != OrderStatus.Pending)
        {
            // No need to do anything if the order is not Pending.
            return;
        }

        // Save the order to history as Pending to make sure
        // we capture the order.
        this.orderHistoryRepository.SaveOrderHistory(
            new OrderHistory(newOrder));

        // At this point you have everything you need from the requestor.
        // What you can do next is to send a message with the trackingId
        // to a queue.
        this.orderQueue.Send(newOrder.TrackingId);
    }
    ...
}
Once the order has been sent to the queue, the OrderService class considers that the order has been placed and will be fulfilled. It is the job of the service that retrieves the order from the queue and handles it to ensure that this is actually the case.
How the OrderProcessor Service Handles an Order
The sample solution mimics a reliable queue by using the SimulatedQueue class. The Send method in this class simulates the process of reading the message that contains the order and passes it to an instance of the OrderProcessor class for handling. The OrderProcessor could fail for a variety of reasons. The cause of the error might be transient, and in the real world, the message containing the order should not be removed from the queue. Rather, it should remain on the queue so that it can be read and handled again.
Note
In theory, the OrderProcessor could continually fail to handle an order, and rather than repeatedly retrying the operation, after a certain number of attempts the system should raise an alert and allow an operator to intervene and correct any problems. The DeliverMessage method in the SimulatedQueue class shows how you can structure your code to achieve this.
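In outline, such a delivery loop might look like the following sketch (this is not the actual SimulatedQueue code; the Process method and RaiseOperatorAlert are placeholders for the real handling and alerting logic):
private void DeliverMessage(Guid trackingId)
{
    const int maxAttempts = 5;
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            // Process is a placeholder for the work that handles the order.
            this.orderProcessor.Process(trackingId);
            return; // Success: the message can be removed from the queue.
        }
        catch (Exception)
        {
            // The failure may be transient, so leave the message on the
            // queue and attempt to deliver it again.
        }
    }

    // Repeated failures: stop retrying and alert an operator instead.
    RaiseOperatorAlert(trackingId);
}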
The OrderProcessor class in the sample solution provides the SaveOrderAndSaveOrderHistory method that writes the details of the order to the orders database and adds a second record for the order in the order history database with the status set to Completed. The method has to save the order information to both databases for the operation to be successful. However, either or both of these actions could fail, causing the method to abort and leaving the system in an inconsistent state. This eventuality should cause the processing in the SimulatedQueue object to fail, and in turn cause the message containing the order to remain in the queue, as described earlier. The SimulatedQueue object can read the message again and pass it to the SaveOrderAndSaveOrderHistory method for another attempt at saving the order. However, one or possibly both of the writes performed by this method might actually have succeeded in the earlier attempt (the failure could be due to some other factor). The system should not allow the order to be duplicated, so the SaveOrderAndSaveOrderHistory method performs its actions as an idempotent process. Before each write, the SaveOrderAndSaveOrderHistory method reads the appropriate database to determine whether the order details have already been saved in that database. If they have, then there is no need to write this information again.
Eventually, the system will either become consistent, or an administrator will have been alerted to take the necessary corrective actions.
The following code example shows the code in the SaveOrderAndSaveOrderHistory method of the OrderProcessor class that saves the order details to the orders database and the order history database:
private void SaveOrderAndSaveOrderHistory(Guid trackingId)
{
    // This method writes to two different databases.
    // Both writes need to succeed to achieve consistency.
    // This method is idempotent.
    bool isOrderSavedInOrderRepo =
        this.orderRepository.IsOrderSaved(trackingId);
    bool isOrderSavedInOrderHistoryRepoAsCompleted =
        this.orderHistoryRepository.IsOrderCompleted(trackingId);

    if (!isOrderSavedInOrderRepo || !isOrderSavedInOrderHistoryRepoAsCompleted)
    {
        ...
        if (...)
        {
            if (!isOrderSavedInOrderRepo)
            {
                // Once the following orderRepository.SaveOrder method has
                // completed successfully, the flag isOrderSavedInOrderRepo
                // should return true the next time the
                // SaveOrderAndSaveOrderHistory method gets called for the
                // same trackingId.
                this.orderRepository.SaveOrder(newOrder);
            }

            if (!isOrderSavedInOrderHistoryRepoAsCompleted)
            {
                newOrder.Status = OrderStatus.Completed;

                // Once the following SaveOrderHistory method has completed
                // successfully, the flag
                // isOrderSavedInOrderHistoryRepoAsCompleted should return
                // true the next time the SaveOrderAndSaveOrderHistory
                // method gets called for the same trackingId.
                this.orderHistoryRepository.SaveOrderHistory(
                    new OrderHistory(newOrder));
            }
        }
    }
}
It is also possible that the order history contains orders that are marked as Pending but that never get processed for some reason. You can implement a separate process that periodically sweeps through the order history database to find such orders and add them to the queue for processing by the OrderProcessor class.
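Such a sweep might look like the following sketch; the GetStalePendingOrders query and the one-hour staleness window are illustrative assumptions rather than members of the sample solution's repositories:
public class PendingOrderSweeper
{
    private readonly IOrderHistoryRepository orderHistoryRepository;
    private readonly IOrderQueue orderQueue;

    public PendingOrderSweeper(
        IOrderHistoryRepository orderHistoryRepository, IOrderQueue orderQueue)
    {
        this.orderHistoryRepository = orderHistoryRepository;
        this.orderQueue = orderQueue;
    }

    public void Sweep()
    {
        // Find orders that are still marked as Pending after a reasonable
        // processing window (a hypothetical query).
        var stalePendingOrders =
            this.orderHistoryRepository.GetStalePendingOrders(TimeSpan.FromHours(1));

        foreach (var pendingOrder in stalePendingOrders)
        {
            // Re-queuing is safe because SaveOrderAndSaveOrderHistory is
            // idempotent: an order that was completed in the meantime is
            // simply skipped by the OrderProcessor.
            this.orderQueue.Send(pendingOrder.TrackingId);
        }
    }
}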
How the Shopping Application Prevents Duplicate Orders
It is important to understand that if the shopping cart is not deleted when an order is placed, the order is still processed. However, it is still necessary to remove the shopping cart at some point to prevent the same order from being repeated. In the Shopping application, failure to delete the shopping cart is considered to be a user-interface issue rather than a business problem, and is handled outside of the Post method. For example, the MvcWebApi web service detects whether a stray shopping cart exists for a customer when the customer logs in and, if so, deletes it:
public class AccountController : ApiController
{
    private IPersonRepository personRepository;
    private IShoppingCartRepository shoppingCartRepository;
    private ISalesOrderRepository orderRepository;
    ...

    [HttpPost]
    public HttpResponseMessage Login(LoginInfo loginInfo)
    {
        ...
        var person = this.personRepository.GetPersonByEmail(loginInfo.UserName);
        ...
        // Check for an orphaned shopping cart
        var shoppingCart = this.shoppingCartRepository.GetShoppingCart(
            person.PersonGuid.ToString());
        if (this.orderRepository.IsOrderSubmitted(shoppingCart.TrackingId))
        {
            // The order was already submitted, so delete the shopping cart.
            this.shoppingCartRepository.DeleteShoppingCart(
                shoppingCart.UserCartId);
        }
        ...
    }
    ...
}
Each order is identified by a unique tracking ID by the internal systems inside Adventure Works. When a customer first places an item in a new shopping cart, the constructor for the ShoppingCart domain entity class generates this ID and stores it in the TrackingId field, as shown below:
public class ShoppingCart
{
    ...
    public ShoppingCart(string userCartId)
    {
        this.UserCartId = userCartId;
        this.TrackingId = Guid.NewGuid();
    }

    public string UserCartId { get; set; }
    ...
    public Guid TrackingId { get; set; }
    ...
}
When a customer who has a shopping cart saved from a previous session logs in, the Login method queries the orders database to find out whether any existing order has the same tracking ID as the one recorded in the shopping cart. If so, the cart was not deleted correctly when the order was placed, so the method removes it (this attempt may fail again, of course).
This check assumes that after a customer has placed an order they are likely to log off rather than place another order (this is the most common pattern of use in many merchant web systems). In situations where this is not the case, the check performed by the Login method is not sufficient on its own to prevent duplicate orders from being submitted. In this case, it is necessary for the order processing logic to detect the duplicate order. In the sample solution, the SaveOrderAndSaveOrderHistory method in the OrderProcessor class performs just such a check, and only attempts to save the order if it has not already been processed (it must have the status Pending rather than Completed); otherwise, the order is silently discarded:
private void SaveOrderAndSaveOrderHistory(Guid trackingId)
{
    ...
    if (!isOrderSavedInOrderRepo || !isOrderSavedInOrderHistoryRepoAsCompleted)
    {
        var newOrder = this.orderHistoryRepository.
            GetPendingOrderByTrackingId(trackingId);
        if (newOrder != null)
        {
            ...
        }
    }
}
Synchronizing Databases by Using a Batch Process
Depending on the business scenario, you can implement eventual consistency between databases by synchronizing them periodically. In the organization example described previously, you can arrange to run a script at regular intervals that detects the changes that have occurred to employee data in the document database since the last synchronization, and use this information to update the graph database.
Note
The organization example depicts one-way synchronization and assumes that the document database is the authoritative source of information. If you need to implement bidirectional or even multi-way synchronization between a set of databases, then this solution might not be appropriate, and a better strategy could be to propagate updates as they occur by using events, as described in the next section.
This strategy requires that you can easily identify the modifications (including creates and deletes) that have been made recently. You may be able to extract this information from the database log file or journal. Many NoSQL databases are open source, and provide some information on the structure of the log files that you can use to develop your own applications that need to read this data.
Note
The database log may be truncated or removed if the server is shut down cleanly. Many NoSQL databases use the presence of a log file to detect whether the database was not shut down properly previously, and then use the information in the log file to recover any missing data in the database before deleting the log. Additionally, many NoSQL databases perform periodic checkpoints and truncate the database log after ensuring that all the changes it contains have been safely written to the database.
Alternatively, you can implement a collection of Log documents in the database and add the details of any changes to employee documents to this collection. The synchronization process can iterate through this collection to determine the changes that it needs to make, and then clear out all documents for records that have been synchronized when the process completes. Figure 4 illustrates this approach.
Figure 4 - Using a collection of Log documents to synchronize databases manually
In this example, employee 502 has left the company and the Status field in the employee database is recorded as "No longer with company." The code that makes this change writes a corresponding document to the Log collection. A new employee is added (507), and the Manager field for employee 506 is changed to 507. The Log collection also contains a record of these changes. When the graph database is synchronized with the document database, the synchronization process reads the changes from the Log collection and makes the appropriate amendments to the graph before removing the records from the Log collection.
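A sketch of such a synchronization process follows. The ILogCollection and IOrganizationGraphUpdater abstractions, the LogEntry shape, and the change types are all illustrative assumptions rather than parts of a real sample; the point is that each logged change is replayed against the graph, and the log is cleared only at the end (assume the System.Collections.Generic namespace):
public enum ChangeType
{
    EmployeeAdded,
    ManagerChanged,
    EmployeeLeft
}

public class LogEntry
{
    public int EmployeeId { get; set; }
    public ChangeType Change { get; set; }
    public int NewManagerId { get; set; }
}

public interface ILogCollection
{
    IEnumerable<LogEntry> GetPendingEntries();
    void RemoveProcessedEntries();
}

public interface IOrganizationGraphUpdater
{
    void AddEmployeeNode(int employeeId);
    void SetManager(int employeeId, int managerId);
    void MarkAsLeft(int employeeId);
}

public class GraphSynchronizer
{
    private readonly ILogCollection logCollection;
    private readonly IOrganizationGraphUpdater graph;

    public GraphSynchronizer(
        ILogCollection logCollection, IOrganizationGraphUpdater graph)
    {
        this.logCollection = logCollection;
        this.graph = graph;
    }

    public void Synchronize()
    {
        // Replay the logged changes in the order they were recorded. Each
        // graph operation must be idempotent so that a failed run can be
        // repeated safely from the start.
        foreach (var entry in this.logCollection.GetPendingEntries())
        {
            switch (entry.Change)
            {
                case ChangeType.EmployeeAdded:
                    this.graph.AddEmployeeNode(entry.EmployeeId);
                    break;
                case ChangeType.ManagerChanged:
                    this.graph.SetManager(entry.EmployeeId, entry.NewManagerId);
                    break;
                case ChangeType.EmployeeLeft:
                    this.graph.MarkAsLeft(entry.EmployeeId);
                    break;
            }
        }

        // Only clear the log once every change has been applied.
        this.logCollection.RemoveProcessedEntries();
    }
}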
Note
It is important to implement the synchronization process in an idempotent manner. If the synchronization process fails or the Log collection is not flushed correctly, then it should be possible to run the synchronization process again without worrying about which changes it had previously applied.
However, while this strategy is relatively simple, it imposes an overhead on the document database and requires additional logic in the application code that maintains the data in this database (it must record the details of changes in two places—in the Employees collection and in the Log collection). You should also remember that in most document databases, writes to multiple documents (such as an employee and a log document) are not atomic so there is a small possibility that an employee document could be updated but the change not recorded in the Log collection.
This strategy is only suitable if your application can tolerate a significant amount of latency (possibly several hours, or even days) in the consistency between databases. In scenarios such as the organization example where the data is not time critical, this latency would probably be perfectly acceptable. You may be able to automate the synchronization process to allow it to run as a timed job every few minutes, but if you require lower latency then this may not be the most suitable approach.
Note
You can also use a pattern such as Event Sourcing to tackle this problem; make the Log collection the authoritative source of information and synchronize the Employees collection and the graph database with the data in the Log collection. For more information, read the section "Introducing Event Sourcing" in the guide "CQRS Journey," available on the MSDN website.
How the Shopping Application Verifies Product Pricing and Availability
In the Shopping application, the data that describes products is held in two places: the SQL Server database used by the warehousing system inside Adventure Works, and the product catalog implemented by using MongoDB. The MongoDB database is replicated to spread the load across a collection of sites and servers, as described in the section "Implementing Scalability and Availability for the Product Catalog" in Chapter 5.
As far as the Shopping application is concerned, the product catalog is a read-only document database. However, outside of the Shopping application, it is updated periodically by a batch process with data from the warehousing database (it is not critical that the data in the product catalog is completely up to date). The approach taken is very straightforward: a script that uses a series of SQL statements to retrieve the most recent details of every product in the SQL Server database runs at monthly intervals. This script then connects to the master node in each MongoDB replication cluster, removes the existing category, subcategory, and product document collections, and then recreates them with the new data before rebuilding the indexes over these documents. The whole process takes no longer than one or two minutes to perform, and while it is ongoing, users are still able to query the product catalog through the subordinate nodes in each replication cluster.
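The following fragment sketches the shape of such a refresh for the product collection, assuming the MongoDB C# driver of the time; the connection string, collection and field names, and the FetchProductsFromSqlServer helper are all placeholders rather than code from the sample solution:
// Connect to the master node of a replication cluster.
var client = new MongoClient(connectionString);
var database = client.GetServer().GetDatabase("catalog");

// Remove the existing collection and recreate it with the latest
// warehousing data.
var products = database.GetCollection<BsonDocument>("products");
products.Drop();
products.InsertBatch(FetchProductsFromSqlServer()); // Placeholder SQL helper.

// Rebuild the indexes over the new documents.
products.EnsureIndex(IndexKeys.Ascending("ProductSubcategoryId"));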
Note
The developers adopted the same strategy for the graph database that holds the details of product recommendations (like the product catalog, it is not critical that this information is totally up to date). They implemented a monthly batch process that analyzes the recent orders in the SQL Server database used by the warehousing system, and extracts the combinations of products that are most frequently bought together. The batch process then uses this data to update the graph database running at each Windows Azure datacenter.
It is possible that the price or availability of a product may change in the warehousing database, but this change will not be reflected until the next time the database is refreshed. The warehousing database is considered to be the authoritative source for product availability and pricing information. To counter possible inconsistencies between the two databases, when a customer places an order, the Post method in the OrdersController examines the items in the customer's shopping cart and cross-checks them against the warehousing database by calling the InventoryAndPriceCheck method of the InventoryService object that simulates the warehousing system inside Adventure Works (the section "Retrieving Data from the SQL Server Database" in Chapter 3 provides information on how the InventoryAndPriceCheck method works). If any item is discontinued or the price has changed since the product catalog was last updated, the OrdersController returns without placing the order and passes a response back to the user interface that enables it to alert the customer of the discrepancy. The following code example shows the relevant parts of the Post method:
public class OrdersController : ApiController
{
    ...
    private readonly IShoppingCartRepository shoppingCartItemRepository;
    private readonly IPersonRepository personRepository;
    private readonly IProductRepository productRepository;
    ...
    private readonly IInventoryService inventoryService;
    ...

    public HttpResponseMessage Post(OrderInfo order)
    {
        ...
        var shoppingCart = this.shoppingCartItemRepository.GetShoppingCart(
            order.CartId.ToString());
        ...
        if (this.inventoryService.InventoryAndPriceCheck(shoppingCart))
        {
            // There was an inventory or price discrepancy between the
            // shopping cart and the product database,
            // so we save the shopping cart and return an OK since
            // there is nothing RESTfully wrong with the message.
            // The UI will be responsible for handling this by
            // redirecting to an appropriate page.
            this.shoppingCartItemRepository.SaveShoppingCart(shoppingCart);
            var cartItems = new List<CartItem>();
            Mapper.Map(shoppingCart.ShoppingCartItems, cartItems);
            return Request.CreateResponse(HttpStatusCode.OK, cartItems);
        }
        ...
    }
    ...
}
The section "Placing an Order" in Chapter 2, "The Adventure Works Scenario" describes how the Shopping application handles this situation. The new details from the warehousing system are displayed and the user can either discard the contents of the shopping cart or proceed and place the order with the updated information.
Synchronizing Databases by Using Events
Event-based synchronization provides an approach that has much lower latency than batch synchronization, but is also architecturally more complex. Each database requires a separate process that maintains the data in the database in response to events.
In this strategy, as an application modifies the data in a database, it also triggers events. The processes that manage each database listen for these events and take the appropriate action. Therefore, this mechanism requires an infrastructure that can propagate these events in a distributed environment, such as a service bus. There are a number of commercial service bus implementations currently available including Windows Azure Service Bus. A variety of open source solutions are also available.
Many service bus implementations, including Windows Azure Service Bus, are based on reliable message queuing. In the case of Service Bus, an administrator creates a message queue in the cloud, and applications can connect to the message queue to send and receive messages through well-defined endpoints.
Service Bus supports transactional semantics when sending messages to a queue. An application can initiate a local transaction, post a series of messages, and then complete the transaction. If the application detects an error during the transaction it can abort, and the messages that it has posted will be removed. If the application completes the transaction, all the messages will be sent.
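For example, a sender might batch the messages that describe a single logical update into one local transaction, along these lines (a sketch that assumes the Windows Azure Service Bus .NET client, the System.Transactions namespace, and a previously created queueClient):
using (var scope = new TransactionScope())
{
    foreach (var update in updatesForThisOperation)
    {
        // Messages posted inside the scope are not visible on the queue
        // until the transaction completes.
        queueClient.Send(new BrokeredMessage(update));
    }

    // Completing the scope commits the sends; disposing the scope without
    // completing it discards all of the posted messages.
    scope.Complete();
}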
Note
Service Bus transactions only affect operations that send or receive messages on a queue; they cannot enlist other transactional stores such as a relational database. For example, if you start a Service Bus transaction, send several messages to a queue and update a database, and then abort the transaction, the messages will be removed from the queue but the database update will not be rolled back.
On the receiving end, an application can connect to the queue, read messages, and process them. When an application reads a message, it is removed from the queue. When the application has finished with the message it can indicate that processing was completed successfully, but if an error occurs it can abandon the message, which causes it to be returned to the queue ready to be read again. Message queues can also have an associated timeout; if the application that reads a message does not indicate that it has finished with the message before this timeout expires, the service bus assumes that the application has failed and returns the message to the queue.
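The receiving side of this pattern might look like the following sketch (again assuming the Windows Azure Service Bus client; ProcessUpdate is a placeholder for the receiver's own logic):
var message = queueClient.Receive();
if (message != null)
{
    try
    {
        ProcessUpdate(message.GetBody<string>());
        message.Complete();   // Success: remove the message from the queue.
    }
    catch (Exception)
    {
        message.Abandon();    // Failure: return the message to the queue.
    }
}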
Once a message has been read and processed, it is no longer available to any other applications that may need to read the same message. In a NoSQL polyglot solution, it is possible that the details of a single update might need to be passed to several databases. In this case, you need a publish/subscribe model, where a single message can be sent to all interested parties, and not just the first one that happens to read it. Many service bus implementations provide just such a model. In the case of Windows Azure Service Bus, you can use topics and subscriptions. A topic is very similar to the sending end of a message queue in that an application can post messages to it (topics also provide the same transactional capabilities as Windows Azure Service Bus message queues). Different applications can then subscribe to this topic, and in effect they get their own private message queue from which they can read messages. Messages posted to the topic are copied to each subscription, and an application retrieves messages from a subscription in the same way that it fetches messages from a queue. Subscriptions also provide the same complete/abandon mechanism, with timeouts, that is available with message queues.
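In outline, publishing to a topic and reading from a subscription might look like this sketch (the topic and subscription names, and updateDescription, are placeholders):
// Publisher: post one message describing the update to the topic.
var topicClient = TopicClient.CreateFromConnectionString(
    connectionString, "databaseupdates");
topicClient.Send(new BrokeredMessage(updateDescription));

// Subscriber: each database maintainer reads its own copy of the message
// from its subscription; completing or abandoning a message affects only
// that subscription's copy.
var subscriptionClient = SubscriptionClient.CreateFromConnectionString(
    connectionString, "databaseupdates", "GraphDatabaseMaintainer");
var message = subscriptionClient.Receive();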
By using the topic/subscriber model, an application that needs to update several databases can create a message that describes this update and post it to a topic. Subscribers that handle each of the databases can read the message and perform whatever processing is appropriate at their end. Figure 5 shows this approach:
Figure 5 - Using a service bus to synchronize databases as changes occur
The reliable messaging capabilities of the topic, coupled with the guaranteed delivery semantics of each subscription, ensure that updates are not lost, even if one or more of the subscribing processes should fail. In this case, when the subscriber restarts, the message describing the update will have been returned to the queue and can be read again. In this way, the service bus becomes the primary point of integration between databases, and you do not have to include complex failure detection and retry logic in the business logic of your application code. However, undoing an operation can be a challenge. In a typical SQL transaction, if the business logic detects that a transaction should not be allowed to complete for whatever reason, it can simply abort the transaction and all work performed will be rolled back. In a distributed environment based on messaging, if a subscriber determines that the operation should be undone, it may need to post one or more messages to the service bus topic that cause the subscribers to reverse the effects of the work that they have performed.
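For example, a subscriber that cannot apply a change might publish a compensating message, as in the following sketch; the exception type and the CompensationRequest message shape are illustrative assumptions:
try
{
    ApplyUpdateToLocalDatabase(updateMessage);
}
catch (NonTransientUpdateException)
{
    // The other subscribers may already have applied this change, so post
    // a message asking them to reverse their work rather than silently
    // dropping the update.
    topicClient.Send(
        new BrokeredMessage(new CompensationRequest(updateMessage.UpdateId)));
}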
Summary
This chapter has examined some of the important issues that arise when you design and implement a polyglot solution. In particular, it has focused on strategies that you can employ to partition your data across different types of NoSQL databases, and techniques that you can use for implementing the application logic that needs to ensure consistency across different databases.
Different NoSQL databases are often designed and optimized to support specific data access scenarios. The advantage of using a collection of different NoSQL databases is that you can exploit their strengths and match them to the requirements of the various business scenarios that your applications need to support. The cost of this approach is the additional complexity and infrastructure required to combine the different technologies into a seamless solution. Using the Repository pattern can help to isolate the data access code from the business logic of your applications, but maintaining consistency across a collection of nonhomogeneous databases can be a significant challenge. If your data must be immediately and absolutely consistent all of the time, then you may need to revert to a transactional solution based on a single relational database, although this approach will inevitably limit the scalability of your solution. However, in many cases, immediate consistency is not a business requirement, and you can implement eventual consistency by synchronizing data periodically or by using events to propagate changes as they occur, depending on the latency that your business is willing to tolerate.
More Information
All links in this book are accessible from the book's online bibliography on MSDN at: https://msdn.microsoft.com/en-us/library/dn320459.aspx.
- The Repository pattern is described in detail in "Patterns of Enterprise Application Architecture" on Martin Fowler's website.
- The article "The Repository Pattern" on MSDN describes an implementation of the Repository pattern by using SharePoint, but you can apply the general principles to almost any data store.
- You can find more information about Windows Azure Service Bus on the "Messaging" page of the Windows Azure website.
- The section "Introducing Event Sourcing" in the guide "CQRS Journey", available on the MSDN website contains detailed information about using the Event Sourcing pattern to record the history of the changes made to the data held by an application and using this information to generate a view of the system state.