Volume 27 Number 08
The Working Programmer - Cassandra NoSQL Database: Getting Started
By Ted Neward | August 2012
The ancient Greeks told the story of Cassandra, the daughter of King Priam and Queen Hecuba of Troy. She was one of the most beautiful women of her generation. When offered the gifts of a prophetess by the Greek god Apollo, she quickly accepted, but when she later spurned his amorous advances, Apollo cursed her to always know the truth and never be believed by any to whom she spoke it. Thanks to her gift of prophecy, Cassandra foresaw the trap presented by the Trojan horse, but thanks to her curse of disbelief, no one in Troy would listen to her warnings. They brought the horse within the city walls, and unwittingly invited the Greek soldiers hidden therein into the city, which led to Troy’s fall. Cassandra was taken as a war prize back to Greece by Agamemnon, where she again foresaw the future (his death, and hers) but was again disbelieved. Sure enough, both he and she were killed.
Modern computer science geeks tell the story of Cassandra a little differently, as Apache Cassandra, another of the “NoSQL” databases—and a popular one at that—in use at a variety of well-known Internet-based companies (YouTube, Netflix and others), and presumably one whose reports are actually taken at face value. (Rumor has it that Cassandra is a pun on another famous prophetess, the Oracle of Delphi.)
To the developer, Cassandra the software can be just as confusing as Cassandra the Trojan. It’s “an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable” (source: “Cassandra: The Definitive Guide,” O’Reilly Media, 2010, p. 14).
Sometimes I think the Greek myths make more sense than my industry does.
Breaking all that down, we see that:
- Cassandra is built to store lots and lots and lots of data (hundreds of terabytes seem to be a commonly cited example) across a variety of machines arranged in a ring. That is scaling horizontally, as opposed to the trend within relational database thinking that says “buy a bigger box” (scaling vertically).
- Cassandra has a data model that looks like the relational database’s data model on the surface, sounds kind of like it with its discussions of columns, column families and named values, but acts nothing like it in practice.
More relevant to this discussion, Cassandra has been gaining momentum within the developer community as a worthwhile tool to have in the toolbox, so it seemed like a good idea to turn our collective columnar gaze upon a column-oriented database. (Pun intended.)
Cassandra is not a relational data store, despite its use of the term “column-oriented.” In fact, it doesn’t really look anything at all like a relational database. Instead of storing a schema, for example, that guarantees the various rows of data in the table are all alike, Cassandra stores “column families” in “keyspaces.” A keyspace is really just an administrative isolation barrier, in much the same way that relational database instances are separated from one another on the same server, but a column family is a completely different beast. Each column family is made up of “rows” identified by a key, but within a row, any number of name/value pairs (columns) can be present, and each row can contain entirely different data elements from the other rows within the column family.
In practical terms, let’s suppose we’re using Cassandra to store a collection of people. Within the keyspace “Earth,” we’ll have a column family called “People,” which in turn has rows that look like this:
RowKey: tedneward
  ColumnName:"FirstName", ColumnValue:"Ted"
  ColumnName:"LastName", ColumnValue:"Neward"
  ColumnName:"Age", ColumnValue:41
  ColumnName:"Title", ColumnValue:"Architect"
RowKey: rickgaribay
  ColumnName:"FirstName", ColumnValue:"Rick"
  ColumnName:"LastName", ColumnValue:"Garibay"
RowKey: theartistformerlyknownasprince
  ColumnName:"Identifier", ColumnValue: <image>
  ColumnName:"Title", ColumnValue:"Rock Star"
As you can see, each row contains conceptually similar data, but not all rows will have the same data (though if the variance grows too large, it might get confusing for developers to use). Storing pets in here, for example, would likely create too much chaos. This is why any nontrivial application will likely use dozens or hundreds of different column families.
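If it helps to see that shape in a general-purpose language, here is a rough Python sketch of the same “People” column family as nested dictionaries. This is only a mental model, not how Cassandra stores or exposes data; the helper name columns_of and the placeholder bytes standing in for the image are my own assumptions for illustration.

```python
# Mental model only: keyspace -> column family -> row key -> {column name: value}.
# Plain dicts stand in for Cassandra's storage; nothing here talks to a server.
earth = {
    "People": {
        "tedneward": {
            "FirstName": "Ted",
            "LastName": "Neward",
            "Age": 41,
            "Title": "Architect",
        },
        "rickgaribay": {
            "FirstName": "Rick",
            "LastName": "Garibay",
        },
        "theartistformerlyknownasprince": {
            "Identifier": b"<image bytes>",  # placeholder for the binary image
            "Title": "Rock Star",
        },
    }
}


def columns_of(keyspace, column_family, row_key):
    """Hypothetical helper: list the column names present in one row."""
    return sorted(keyspace[column_family][row_key])


people = earth["People"]

# Column names that every single row has in common.
shared = set.intersection(*(set(row) for row in people.values()))

if __name__ == "__main__":
    for key in people:
        print(key, columns_of(earth, "People", key))
    print("columns shared by all rows:", shared)
```

Notice that no column name is shared by all three rows, which is exactly the point: nothing forces the rows of a column family to look alike.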
By the way, I’m lying (slightly) to you when I say that a row is made up of name/value pairs; it’s actually made up of name/value/timestamp triplets, but the Cassandra docs make it pretty clear that the timestamp part of the triplet is only for conflict detection and is never to be used as part of your application logic. Most Cassandra articles essentially tell new Cassandra developers to ignore it.
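For the curious, here is a small Python sketch of what “conflict detection” by timestamp might look like. It assumes the simplest rule, that the version with the higher timestamp survives; the Column tuple, the resolve function, the sample values and the tie-breaking choice are all my own illustration, not Cassandra’s actual code.

```python
from collections import namedtuple

# A column as a name/value/timestamp triplet (illustrative type, not a real API).
Column = namedtuple("Column", ["name", "value", "timestamp"])


def resolve(a, b):
    """Return the surviving version of a column: the higher timestamp wins.

    Assumption: ties go to the first argument. A real system needs a
    deterministic tie-breaker; this sketch does not try to model one.
    """
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    return a if a.timestamp >= b.timestamp else b


if __name__ == "__main__":
    # Hypothetical values and timestamps, chosen only for the demonstration.
    older = Column("Title", "Architect", 1000)
    newer = Column("Title", "Consultant", 2000)
    print(resolve(older, newer))
```

The order in which the two versions arrive doesn’t matter, only their timestamps, which is why the timestamp belongs to the database and not to your application logic.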
This all makes more sense once you see it in action, so let’s get Cassandra running.
Before you can do anything with Cassandra, you have to get it installed, and therein lies the first hurdle: Cassandra is, as advertised, an open source project, and like many open source projects, it’s not written in a Microsoft .NET Framework language. Instead, Cassandra is written in Java, and as such requires a relatively modern Java runtime to be installed on your machine in order to execute. Cassandra runs fine with Java 6 (and, in fact, most of the blog posts on the subject suggest it), but should run just as well if not a touch faster with the most recently released Java 7.
(If you’ve never installed Java on your machine before, just plug “Java Runtime Environment 6 (or 7) download” into your search engine of choice and pull down the desired installer for either 32- or 64-bit Windows, depending on your target OS. About the only other thing you’ll need to do is set an environment variable called JAVA_HOME to point to the Java Runtime Environment (JRE) install directory—under a default installation, this will be in C:\Program Files\Java\jre6—and put the JRE’s “bin” subdirectory on the PATH if it’s not already.)
Next, pull down the Cassandra binaries from the Cassandra homepage. Unfortunately for us Windows folks, it’s only available as a .tar.gz file, which, out of the box, Windows isn’t sure what to do with. Dozens of tools are available to unarchive a .tar.gz file, including the command-line “gunzip” and “tar” utilities in Cygwin, if you want to start practicing some Unix-Fu on a Windows box. Dump the contents of the Cassandra download into a convenient directory, such as C:\Prg\apache-cassandra-1.1.0 (which is the latest version, as I write this). Then, as is common with Java projects, you need to create an environment variable that points to the root of the Cassandra install directory, so create a CASSANDRA_HOME environment variable that points to C:\Prg\apache-cassandra-1.1.0 (in my case).
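Because a mistyped environment variable is the most likely thing to go wrong at this point, a quick sanity check can save some head-scratching. The following Python sketch (Python being my choice here, nothing Cassandra requires) simply verifies that JAVA_HOME and CASSANDRA_HOME are set and point at directories that exist; the function name find_problems is hypothetical.

```python
import os

# The two variables this article asks you to define.
REQUIRED_VARS = ("JAVA_HOME", "CASSANDRA_HOME")


def find_problems(env=os.environ, required=REQUIRED_VARS):
    """Return {variable name: reason} for each variable that looks wrong."""
    problems = {}
    for name in required:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = "not a directory: " + value
    return problems


if __name__ == "__main__":
    problems = find_problems()
    for name, reason in problems.items():
        print(name, "->", reason)
    if not problems:
        print("JAVA_HOME and CASSANDRA_HOME both point at existing directories.")
```

It only checks that the directories exist, not that they actually contain a working JRE or Cassandra install.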
If you’re a little aghast at the primitive conditions here, remember that Java projects like to work on multiple platforms (which means we have to use mechanisms that are common to all platforms, and yeah, environment variables are everywhere, even on Android). The positive side of this is that if you ever work with Cassandra on a non-Windows platform, you’ll be doing the same setup steps: get Java, get Cassandra, unarchive and set environment variables. Unfortunately, it means that our tooling isn’t quite as fancy and GUI-based as we might otherwise be used to.
Speak to Us, O Prophetess!
Speaking of which, firing up Cassandra means hopping on over to the Cassandra install directory and kicking off the batch file “cassandra.bat” found in the “bin” subdirectory. Launch that as “cassandra -f” (the “-f” causes it to run in the foreground), and you should see something like Figure 1.
Figure 1 Installing Cassandra with the Cassandra.bat File
By default, Cassandra is configured to dump data and commit logs into the “var” directory off the root of your filesystem, which Java interprets as C:\. This is more Unix-ism, and is easily configured differently in the “conf/cassandra.yaml” configuration file.
(Convenience note: A company called DataStax Inc. offers an all-in-one installer containing both the Cassandra server and JRE, as well as an HTML-based operation center product, available as a free download. If you’re having difficulties getting it all set up, you might try that instead.)
A running Cassandra server is expecting incoming connections on port 9160 and uses port 7199 for its Java Management Extensions monitoring, which is Java’s rough equivalent to Windows Management Instrumentation. Both ports will, eventually, want to be accessible to client applications and Cassandra monitoring utilities, respectively.
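If you want to confirm the server really is listening, a plain TCP connection attempt is enough. Here is a minimal Python sketch, assuming Cassandra is running on the local machine with the default ports mentioned above; port_is_open is a hypothetical helper, and a successful connection only proves that something is listening, not that it is Cassandra.

```python
import socket

# Default ports named in the article; adjust if you changed the configuration.
CASSANDRA_PORTS = {"client connections": 9160, "JMX monitoring": 7199}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for purpose, port in CASSANDRA_PORTS.items():
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"{purpose}: port {port} is {state}")
```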
Once Cassandra is up and running on your box, we can connect to the running instance using the Cassandra command-line interface, launched by running “cassandra-cli.bat,” again from the Cassandra “bin” directory (see Figure 2).
Figure 2 Connecting to a Running Cassandra Instance
To create a keyspace, use “create keyspace TestKS” (the keyspace name must be unique). To create a column family within that keyspace, first type “use <keyspace>,” then “create column family <name>.” No other schema definition is required; remember, the column family is just a collection of name/value pairs from then on.
To insert data into the column family, use the “set” command, which requires the name of the column family into which you insert (“TestCF”), the key to use for this row (“TestKey”), the column within the column family to use as the name for this value (“column”) and the value to store there (“value”). However, because Cassandra stores data as binary values, you have to tell Cassandra to interpret the row key, column name and column value as ASCII values using the built-in “ascii” function. This means the whole “set” looks like this:
    set TestCF[ascii('TestKey')][ascii('column')] = ascii('value');
Retrieving that data is basically the same exercise using the “get” command, like this:
    get TestCF[ascii('TestKey')];
This will return with something like this:
(column=636f6c756d6e, value=76616c7565, timestamp=1338798419726000)
This demonstrates that Cassandra does, indeed, speak gibberish (at least, to us humans—if you look carefully, those binary values are the ASCII values of “column” and “value,” respectively).
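You don’t have to take my word for it. A few lines of Python will turn those hexadecimal strings back into text. The sketch below also converts the timestamp on the assumption that it counts microseconds since the Unix epoch, which the magnitude of the number suggests; the two helper names are mine.

```python
from datetime import datetime, timezone


def decode_ascii_hex(hex_text):
    """Turn a hex string such as '636f6c756d6e' back into ASCII text."""
    return bytes.fromhex(hex_text).decode("ascii")


def timestamp_to_utc(micros):
    """Interpret a value as microseconds since the Unix epoch (an assumption)."""
    seconds, remainder = divmod(micros, 1_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=remainder)


if __name__ == "__main__":
    print(decode_ascii_hex("636f6c756d6e"))    # the column name
    print(decode_ascii_hex("76616c7565"))      # the column value
    print(timestamp_to_utc(1338798419726000))  # when the write happened
```

The first two calls give back “column” and “value,” and the timestamp lands in early June 2012, which is about when I ran the command.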
The Hardest Part Is Done
We’re out of time, and Cassandra has only been installed. Specifically, a single-node Cassandra cluster is up and running, and nothing has been done to program against it yet. Fortunately, the hardest part of getting started with Cassandra has been completed. In the next installment, I’ll start using .NET libraries to talk to Cassandra, get it to store some data from the .NET applications, pull it back, and then show how to set up a three-node cluster and get it up and running.
For now, though, happy coding!
Ted Neward is an architectural consultant with Neudesic LLC. He has written more than 100 articles and authored or coauthored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He is an F# MVP and noted Java expert, and speaks at both Java and .NET conferences around the world. He consults and mentors regularly—reach him at email@example.com if you’re interested in having him come work with your team. He blogs at blogs.tedneward.com and can be followed on Twitter at Twitter.com/tedneward.
Thanks to the following technical expert for reviewing this article: Kelly Sommers