August 2014
Volume 29 Number 8
Cutting Edge : Documents, Databases and Eventual Consistency
Dino Esposito | August 2014
Let’s say you’ve spent the last couple of years stuck in a cave with no Internet connectivity. Now that you’re back in the real world, you’re assigned to a brand-new project. At the first technical meeting, you listen to an animated discussion between team members with strong opinions about document databases.
What the heck is a document database? Didn’t we used to store data in a plain old relational SQL Server instance? What benefits can document databases bring to the table? You diligently listen to the various opinions while trying to make sense of the whole thing. You wonder how to realistically answer whether you should consider using document databases instead of relational storage for your next project.
This article doesn’t intend to take a definitive stance regarding document databases and NoSQL stores in general. However, I hope to illustrate a skeptical perspective of someone who’s always ready to explore better ways of doing things and always looking for concrete and measurable benefits, but not at all costs.
Beyond SQL
The relational model and Structured Query Language (SQL) have been around for more than 40 years. That’s an incredible amount of time for this rapidly changing industry. Over the past few decades, the relational model has fended off attacks from object-oriented databases that seemed ready to overtake the old-fashioned relational model on the wave of object-oriented languages and modeling patterns.
That never happened, though. What did happen was the emergence of object-relational mapping (ORM) tools. In the .NET space, NHibernate initially arose as a de facto standard and was more recently equaled by the Entity Framework. There have also been other similar commercial and open source frameworks from a variety of vendors and contributors. So the first point is that object modeling is not a strong enough reason to push the relational model to the corner.
Yet more companies are using NoSQL stores. A good question to ask is, “Where are NoSQL stores being used?” This is a much better question than, “How can I take advantage of NoSQL stores?” Examining realistic technology use cases to see if they match your own is preferable over blindly trying to find reasons to use a given technology. If you explore the context of using NoSQL stores, you recognize the following common aspects:
- Large—often, unpredictably large—volumes of data and possibly millions of users
- Thousands of queries per second
- Presence of unstructured/semi-structured data that might come in different forms, but still need the same treatment (polymorphic data)
- Cloud computing and virtual hardware for extreme scalability
If your project doesn’t match any of these conditions, you can hardly expect to find NoSQL particularly rewarding. Using NoSQL outside of such conditions may just end up being a different way of doing the same old things.
Structural Differences
The need to build highly interactive applications that push a huge number of writes and reads to the database server is well-suited for NoSQL. In a relational model, relationships between tables work great as long as all the tables involved have fixed schemas. All involved tables must also provide a faithful and realistic representation of the domain model.
If your data—regardless of size and access frequency—fits nicely into “structured tables,” then you can be reasonably certain that good old SQL Server will work effectively. Classic SQL technology is far from dead, and improves over time. The column store feature in SQL Server 2014 helps you handle hundreds of columns. Having hundreds of columns may affect query performance. A huge number of columns is also sometimes the result of attempting to map semi-structured data to a relational model.
A column store is a plain table with the usual set of rows and columns, except the content is physically laid out and stored by column. As a result, all data in any given column is stored on the same SQL physical page. If you need to query all data from a selected large number of columns, this new form of storage can deliver a significant performance boost without re-architecting the persistence layer.
A growing number of today’s applications deal with more intricate and complex data models that hardly fit in structured tables. Subsequently, application architects wonder if workarounds and compromises are really necessary for storing and querying such partially structured data within the boundaries of rigid SQL schemas. If you can successfully model your data to a relational schema, then you likely incur in JOIN statements for a large number of common queries. So when tables grow beyond imagination, query performance inevitably decreases. As this will happen when a lot of customers are using the application, delayed response can affect the business behind the application.
You have two other options from which to choose—SQL and NoSQL. Each sounds like it’s replacing the other. SQL and NoSQL aren’t different flavors of the same technology. Both deal with persistence of data, but with a number of structural differences.
NoSQL databases don’t use fixed data schemas and relationships. As such, they don’t dictate a model. NoSQL databases free you from modeling the domain space. You can persist resulting objects in logical groups. Design a domain model, work with a graph of objects and the NoSQL store just handles object serialization to disk.
Working with Documents?
In this case, the term “document” is roughly equivalent to the term “record.” Collections of documents may recall tables of records. The key difference is that each object in a collection may have a different schema than all other objects in the same collection. Yet all the documents are logically related.
This distinction is more subtle than it might first appear. Suppose you’re managing the bios of book authors. You might not know the same amount of data for each author. Some data that describes Author1 may be different from data that describes Author2. If you intend the bio to be a collection of attributes, you actually have a fixed schema with a few optional columns. If you intend the bio to be a collection of files like a Word-based CV, an XML stream with the list of published books and comments, links to YouTube interviews, and a few attributes such as vital statistics, then the schema is much less defined and hardly fits within the rigid boundaries of a SQL store.
The level of polymorphism in your data is an excellent metric by which to measure how much NoSQL can help. Suppose you’re building a system in accordance with event-sourcing architecture. In an event-sourcing architecture, the persistence model of the application is the plain history of events. Every user action originates one or more events on the server side. Recording the event is all you need to keep track of the application state.
In an e-commerce scenario, for example, when the user submits an order, it starts a workflow. This might generate a number of domain events—order-submitted, order-validated, order-denied, order-created, order-being-processed, order-shipping, order-shipped, order-returned, order-updated and so on. These events all relate to the same order. Processing the sequence of events lets you build and rebuild the current state of the order.
Dealing with orders in a realistic e-commerce system isn’t just a matter of maintaining an Orders table with an updated Status column. It means tracking all actions performed on the data that represents an order. These actions may be quite different and involve different data. For example, the order-returned event might have nearly no associated data. The order-submitted event likely carries the entire set of information coming from the user. The event order-updated likely contains only the variation made to the existing order.
Your order will have more of a document feel, as it’s fully described through a list of highly heterogeneous events. Each event is just an object to store. Most e-commerce systems probably use a relational store to address these problems. A NoSQL approach to storing this list of events could be quite interesting and promising. To learn more about event-sourcing architecture, download the free e-book, “Exploring CQRS and Event Sourcing,” from the Microsoft Download Center at bit.ly/1lesmzm.
Is Eventual Consistency an Issue?
Eventual consistency is another relevant structural difference between SQL and NoSQL. Even when you can easily recognize documents in your data model, the impact of eventual consistency on deployed applications is the true discriminant to the final decision.
Eventual consistency is when reads and writes aren’t aligned to the same data. Most NoSQL systems are eventually consistent in the sense that they guarantee if no updates are made to a given object for a sufficient period of time, then a query returns what the last command has written.
In most cases, eventual consistency isn’t an issue at all. You generally need to be as consistent as possible within the Bounded Context, but you don’t really need any level of consistency across Bounded Contexts. As the system grows, despite the technology, you can’t rely on consistency.
There’s a simple test to see whether eventual consistency is an issue. How would you consider a scenario in which a command writes some data, but a successive read returns stale data? If it’s absolutely crucial you’re constantly able to read back what has just been written, then you have two options:
- Avoid NoSQL databases
- Configure the NoSQL database to be consistent
Consider the following code snippet that assumes the use of RavenDB—a popular .NET NoSQL database:
DocumentSession
.Store(yourObject);
DocumentSession
.Query<YourObjectType>()
.Where(t => t.Id == id)
The first line stores an object to the RavenDB archive. The second line attempts to read the object back, making a query on the same store session for the value of some Id property on the saved object. With the default database configuration, what you’ll read isn’t the same as what you’ve just written.
As far as RavenDB is concerned, writing on the store and updating indexes used by the query engine are distinct operations. Index updates occur as scheduled operations. That misalignment doesn’t last more than a few seconds if in the meantime there are no other updates to the same object. Here’s a way to force actual consistency:
_instance.Conventions.DefaultQueryingConsistency =
ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;
When you do so, though, that read doesn’t return until the index has been updated. A trivial read, therefore, might take a few seconds to complete. From this, you see that NoSQL stores fill a niche in the industry. They do pose challenges, though. NoSQL addresses some architecture issues while neglecting others.
There are better ways to “wait” for indexes. Those generally depend on the scenario you’re facing. Thus, it’s better to decide the type of consistency a query needs at query time, as in the following example:
using( var session = store.OpenSession() )
{
var query = session.Query<Person>()
.Customize(c=> c.WaitForNonStaleResultsAsOfLastWrite() )
.Where( p => /* condition */ );
}
Here, the WaitForNonStaleResultsAsOfLastWrite query customization is instructing the server to wait for the relevant index to have indexed the last document written and ignore eventual documents arriving at the server after the query has been issued.
This helps in certain scenarios with a high write ratio, where indexes are always stale. This is by design. There are many WaitForNonStaleResultsXxxx methods that have different behaviors and solve slightly different scenarios. Another possibility is to fully embrace eventual consistency and simply ask the server if the returned results are stale. Then behave accordingly:
using( var session = store.OpenSession() )
{
RavenQueryStatistics stats;
var query = session.Query<Person>()
.Statistics( out stats )
.Where( p => /* condition */ );
}
In this example, you’re not waiting for indexes. You’re asking the server to also output the query statistics that let you know if the returned results are stale. Given the scenario you’re trying to resolve, you should have enough information to take the best possible decision.
Polyglot Persistence
At the end of the day, the point isn’t to call relational stores dead and replace them with the NoSQL product of choice. The point is to understand the mechanics of the system and characteristics of the data and working out the best possible architecture. The most foreseeable concrete application for NoSQL stores is as an event store in the context of an event-sourcing architecture.
For systems where there isn’t a simple yes or no answer to whether you should use relational stores, the best you can do is to consider polyglot persistence. Instead of forcing a choice between NoSQL and relational databases, you could look into a storage layer that combines the strengths of NoSQL and relational databases. This type of storage system tackles different problems in the most appropriate way.
Within the context of an enterprise, you should use different storage technologies to store different types of data. This is especially true in a service-oriented architecture. In this case, each service may have its own storage layer. There would be no reason to unify storage under a single technology or product. Polyglot persistence does require that you learn different storage technologies and products. As a training cost, though, it’s a reasonable investment.
Dino Esposito is the co-author of “Microsoft .NET: Architecting Applications for the Enterprise” (Microsoft Press, 2014) and “Programming ASP.NET MVC 5” (Microsoft Press, 2014). A technical evangelist for the .NET Framework and Android platforms at JetBrains and frequent speaker at industry events worldwide, Esposito shares his vision of software at software2cents.wordpress.com and on Twitter at twitter.com/despos.
Thanks to the following technical expert for reviewing this article: Mauro Servienti (Managed Designs)