Volume 31 Number 5
Building an Historical CRUD
By Dino Esposito | May 2016
Relational databases have been around since the 1970s and a few generations of developers started and ended their careers without learning, or just mildly considering, an alternate approach to data storage. Recently, large social networks provided strong evidence that relational databases couldn’t serve all possible business scenarios. When a (really) huge amount of schemaless data comes your way, relational databases might sometimes be a bottleneck rather than a pipe.
Can you imagine what it might take to instantaneously count the likes on a comment friends made to a post in an all-encompassing relational database with a few billion records total? Not to mention that restricting the definition of a post to a rigid schema is at least challenging. For a mere matter of business survival, social networks at some point evolved and moved their storage focus to a mix of relational and non-relational data stores, thus officially opening the business of polyglot data.
The fundamental lesson learned from the software architecture of social networks is that sometimes plain storage of the data you have is not, business-wise, the ideal approach. Instead of just storing any data you’ve got, it’s preferable to store details about an event occuring and the data involved with that particular event.
In this article, I’ll first dig out the business foundation of event sourcing—using logged events as the primary data source of applications—and then discuss how to refresh existing Create, Read, Update, Delete (CRUD) skills in light of events. To make it clear from the very beginning, the question is not whether you need event sourcing. The question is when you need it and how you code it.
Toward Dynamic Data Models
Using polyglot data is a hot topic today: relational databases for structured data, NoSQL data stores for less-structured data, key-value dictionaries for preferences and logs, graph databases for relationships and correlations. Introducing different storage models running side-by-side is a step in the right direction, but it seems to me more like an effective remedy to a visible symptom than the cure to the real and underlying disease.
The relational model owes its decade-long effectiveness to a balanced set of benefits it provides in reading and writing data. A relational model is easy to query and update even though it shows limits under some (very) extreme conditions. Overall performance comprehensibly decreases on tables with a few million records and a few hundred columns. In addition, the schema of data is fixed and knowledge of the database structure is required to create ad hoc and fast queries. In other words, in the world you code in today, a comprehensive model, like the big up-front relational model, is a huge constraint that first ends up limiting your expressivity and, later, your programming power. In the end, a model is just a model and it’s not what you observe directly in the real world. In the real world, you don’t observe any model. Instead, you use a model to encapsulate some well-understood and repeatable behavior. Ultimately, in the real world, you observe events but expect to store event-related information in a constrained (relational) model. When this proves hard to do, look into alternative storage models that just release some schema and indexing constraints.
An Event-Based Storage Model
For decades, it was helpful and powerful to save just the current state of entities. When you save a given state, you overwrite the existing state, thus losing any previous information. This behavior doesn’t deserve praise or blame, per se. For years, it proved to be an effective approach and gained wide acceptance. Only the business domain and customers can really say whether losing any past state is acceptable. Facts say that for many years and for most businesses, it was acceptable. This trend is changing as more and more business applications require tracking the full history of business entities. What was called CRUD for years—plain Create, Read, Update, Delete operations—and modeled on top of plain relational tables is now evolving in what can be generically referred to as historical CRUD. Historical CRUD is simply a CRUD code base where the implementation manages to track down the entire list of changes.
The real world is full of line-of-business (LoB) systems that in some way track events as they take place in the domain. Such classes of application existed for decades—some even written in COBOL or Visual Basic 6. No doubt, for example, that an accounting application tracks all changes that might occur to invoices such as a change of date or address, the issuing of a credit note and the like. In some business scenarios, tracking events has been a requested feature since the early days of software, often falling under the broader umbrella of the auditing functionality.
Therefore, auditing business events isn’t a new concept in software. For decades, development teams solved the same problem over and over, reworking and rehashing known techniques the best way they could possibly find. Today, the good old practice of auditing business events goes under the more engaging name of Event Sourcing.
Coding Your Way to Business Events
So let’s say you have a conceptually simple application like one that lets users book a shared resource, say a meeting room. When the user reviews the state of a given booking, she might be presented with not just the current state of the booking, but the entire list of updates since creation. Figure 1 sketches a possible timeline-based UI for the view.
Figure 1 A Timeline-Based View for the Entire History of a Booking
How would you design a booking data model that behaves as an historical CRUD rather than as a plain state-based CRUD? Adding more columns to the table definition isn’t enough. The key difference between a CRUD and an historical CRUD is that in the latter scenario you want to store multiple copies of the same entity, one per each business event that it went through at a given time. Figure 2 shows a possible new structure for the booking table of the relational database.
Figure 2 A Possible Relational Data Model for a Historical CRUD Application
The table depicted in Figure 2 has the expected set of columns that fully represent the state of a business entity, plus a few other ones. At the very minimum you want to have a primary key column to uniquely identify a row in the table. Next, you want to have a timestamp column that indicates either the time of the operation on the database or just any timestamp that makes sense for the business. More in general, the column serves the purpose of associating a safe date to the status of the entity. Finally, you want to have a column that describes the event that was logged.
It’s still a relational table and it still manages the list of bookings the application needs. No new technology was added, but conceptually the table schematized in Figure 2 is a quantum leap from a classic CRUD. Adding records to the new table is easy. You just fill out records and append them as you get notification that something has happened within the system that must be tracked down. So much for the C in CRUD; but what about other operations?
Updates and Deletes in an Historical CRUD
Once the classic relational table has been turned into an historical, event-based table, role and relevance of updates and deletions change significantly. First off, updates disappear. Any updates to the logical state of the entity are now implemented as a new record appended that tracks the new state.
Deletions are a trickier point and the final word on how to code it resides in the business logic of the domain. In an ideal event-based world, there are no deletions. Data just adds up, therefore a deletion is simply the addition of a new event that informs you that the entity doesn’t logically exist anymore. However, physical removal of data from a table isn’t prohibited by law and can still happen. Note, though, that in an event-based scenario, the entity to remove is not made of a single record, but consists of a collection of records, as represented in Figure 2. If you decide to delete an entity, then you must remove all events (and records) that relate to it.
Reading the State of an Entity
The biggest benefit you gain from logging business events in your application is that you never miss anything. You can potentially track the state of the system at any given time, figure out the exact sequence of actions that led to a given state and undo—in total or in part—those events. This lays the groundwork for self-made business intelligence and what-if scenarios in business analysis. To be more precise, you won’t automatically get these features coming out of the box with your application, but you already have any data you need to develop such extensions on top of the existing application.
The hardest part of an historical CRUD is reading data. You’re now tracking down all relevant business events in the sample booking system, but there’s no place where you can easily get the full list of standing bookings. There’s no quick-and-easy way to know, say, how many bookings you have next week. This is where projections fit in. Figure 3 summarizes the overall architecture of a system that evolves from a plain CRUD to an historical CRUD.
Figure 3 Architecture of an Historical CRUD System
An event-based system is inevitably geared toward implementing a neat separation between the command and query stack. From the presentation layer, the user triggers a task that proceeds through the application and domain layer involving all the business logic components on the way. A command is the trigger of a business task that alters the current state of the system, which means that something must be committed that logically modifies the existing state. As mentioned, in an event-based system—even when the system is a plain, simple CRUD system—altering the state means adding a record that indicates that the users created or updated a particular booking. The block in Figure 3 labeled “Event Repository” represents any layer of code responsible for persisting the event. In terms of concrete technologies, the Event Repository block can be an Entity Framework-based repository class, as well as a wrapper around a document database (Azure DocumentDB, RavenDB or MongoDB). Even more interestingly, it can be a wrapper class using the API of an event store such as EventStore or NEventStore.
In an event-based architecture, the state of a given entity is algorithmically calculated upon request. This process goes under the name of event replay and consists of querying all events that relate to the specified entity and applying all of them to a fresh new instance of the entity class. At the end of the loop, the entity instance is up-to-date because it has the state of a new instance that went through all the recorded events.
More in general, processing the log of events builds a projection of data and extracts a dynamic data model out of a lower-level amount of data. This is what is referred to as the Read Model in Figure 3. On top of the same log of events you can build all data models that serve the various front ends. To use a SQL analogy, building a projection of data from logged events is the same as building a view out of a relational table.
Replaying events to determine the current state of an entity for query purposes is generally a viable option but it becomes less and less effective as the number of events or the frequency of requests grows over time. You don’t want to go through a few thousand records every time just to figure out the current balance of a bank account. Likewise, you don’t want to walk through hundreds of events to present the list of pending bookings. To work around these issues, the read model often takes the form of a classic relational table that is programmatically kept in sync with the table of logged events.
Most applications can still be grossly catalogued as CRUD apps. The same way Facebook can be presented in some way as a CRUD, perhaps just a little bit larger than average. Seriously, for most users the last known good state is still enough, but the number of customers for which this view is insufficient is growing. The next might just be your best customer. This article scratched just the surface of an historical CRUD. Next month, I’ll present a concrete example. Stay tuned!
Dino Esposito is the author of “Microsoft .NET: Architecting Applications for the Enterprise” (Microsoft Press, 2014) and “Modern Web Applications with ASP.NET” (Microsoft Press, 2016). A technical evangelist for the .NET and Android platforms at JetBrains, and frequent speaker at industry events worldwide, Esposito shares his vision of software at software2cents.wordpress.com and on Twitter: @despos.
Thanks to the following Microsoft technical expert for reviewing this article: Jon Arne Saeteras