April 2013
Volume 28 Number 04
Azure Insider - NoSQL Data in the Cloud with Microsoft Azure Tables
By Bruno Terkaly, Ricardo Villalobos | April 2013
The price of storing data on disk has dropped so dramatically it seems like science fiction, opening the floodgates for companies to store massive amounts of data. But being able to store lots of data economically solves only half the problem. The data has become so large and complex that traditional database management tools and data processing applications are vastly inadequate. With so much data on disk, new issues have arisen, such as ingesting the data, performing searches, sharing the data, analyzing it and, ultimately, visualizing it.
The power of cloud computing has stepped up to fill this need. The ability to run massively parallel software solutions—running on tens, hundreds or even thousands of servers—is the silver bullet that enables organizations to deal with all of that stored data.
Microsoft recognized this important trend several years ago. Microsoft Azure Storage (MAS) was launched in November 2008 and dramatically improved the ability of businesses to get value from the massive amounts of data being stored.
In the words of Brad Calder, a distinguished engineer at Microsoft and the shepherd who guided the construction of the MAS system, “Azure Storage is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time that is highly available and durable. When using Azure Storage, you have access to your data from anywhere, at any time, and only pay for what you use and store.”
MAS is used inside Microsoft for applications such as social networking search; serving video, music and game content; and managing medical records. It’s also used by the Bing search engine to provide almost-immediate, publicly searchable content from Facebook and Twitter posts and status updates. This Facebook and Twitter data amounts to roughly 350TB. During ingestion, transaction throughput peaks at around 40,000 transactions per second and totals between 2 billion and 3 billion transactions per day.
This month we’ll explore one facet of MAS—Azure Tables—both how it works and how developers can get it up and running quickly.
The Landscape
The modern data scientist is faced with many choices when selecting a data platform, each with its own strengths and weaknesses. For example, many big data solutions are based on the concept of NoSQL, which means that a relational database management system (RDBMS) model isn’t used—there are no tables and no SQL statements. Instead, the data structures are typically a massive collection of key/value pairs or associative arrays. The popular choices today are MongoDB, Cassandra, HBase, CouchDB, Neo4j and Azure Tables. This article will focus on Azure Tables.
Despite their major differences, SQL and NoSQL databases have one thing in common: both are offered as services in the cloud, freeing developers from having to manually provision and de-provision data servers. For example, Azure Tables is offered as a service, so a developer never has to think in terms of separate physical servers.
In this month’s column, we’ll start with a brief discussion of some of the features and capabilities of Azure Tables. Next, we’ll provide some code to demonstrate how you might work with Azure Tables in terms of inserting and querying data. And, finally, we’ll take a look at some of the design goals and the high-level implementation details of MAS.
Some Basics
One of the great features of Azure Tables is that storage is offered across three geographically distributed regions, including the United States, Europe and Asia. Every Microsoft datacenter complies with the International Organization for Standardization (ISO) 27001, SSAE 16 ISAE 3402, EU Model Clauses and Health Insurance Portability and Accountability Act (HIPAA) business associate agreement (BAA) standards. Another important feature is geo-redundant storage, which allows you to replicate your data in another datacenter within the same region, adding yet another level of disaster recovery.
MAS performance and capacity targets are defined at the level of the storage account. An individual storage account can hold up to 200TB of data. Azure Tables have been optimized to provide extremely fast query performance even under write-heavy workloads. You can read more at bit.ly/cMAWsZ.
Figure 1 shows the scalability targets for a single storage account created after June 7, 2012.
Figure 1 Scalability Targets for a Single Storage Account
Individual Storage Account
- Capacity: Up to 200TB
- Transactions: Up to 20,000 entities/messages/blobs per second

Bandwidth for a Geo-Redundant Storage Account
- Ingress: Up to 5Gbps
- Egress: Up to 10Gbps

Bandwidth for a Locally Redundant Storage Account
- Ingress: Up to 10Gbps
- Egress: Up to 15Gbps
MAS analytics are also available, allowing developers to trace storage requests, analyze usage trends and optimize data-access patterns in a storage account. Read more at bit.ly/XGLtGt.
Be aware that the MAS system includes other abstractions, such as blobs and queues. We’ll focus here on Azure Tables, which are used to store non-relational structured and semi-structured data. The most succinct way to express the value of Azure Tables is that they support NoSQL key-value lookups at scale and under write-heavy workloads. From a developer’s point of view, Azure Tables are for storing large collections of non-uniform objects or for serving pages on a high-traffic Web site.
Azure Tables can be accessed from almost anywhere. The entire storage system is Representational State Transfer (REST)-enabled, which means that any client capable of HTTP can communicate with the MAS system. Obvious clients include iOS, Android, Windows 8 and the various Linux distros. The REST API supports inserts, upserts, updates, deletes and queries.
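For example, once the e-mail table built later in this article exists, a point lookup reduces to an HTTP GET of roughly the following shape (the account name myaccount is a placeholder, and the request must still be authenticated, typically by signing it with the storage account key or by appending a shared access signature):
GET https://myaccount.table.core.windows.net/EmailAddressTable(PartitionKey='microsoft.com',RowKey='bterkaly')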
When working with Azure Tables, a key starting point is understanding how to control the data partitioning scheme. For any given Azure Table, the data architect must define (up front) a PartitionKey and a RowKey. This is perhaps the most important decision you’ll make when using Azure Tables. PartitionKeys and RowKeys determine how your data is automatically partitioned by the storage service and how your queries will perform. It’s recommended that you understand how your data will be queried before finalizing your decisions on PartitionKey and RowKey. Later, we’ll delve into the mechanics of transactional consistency and its relationship to PartitionKeys. For now, let’s walk through a simple example of how the MAS system partitions table data.
A Quick Tutorial
Imagine you wish to store and retrieve e-mail addresses from various domains, such as the following: bterkaly@microsoft.com, ricardo.villalobos@microsoft.com, brunoterkaly@hotmail.com and ricardovillalobos@hotmail.com. In these e-mail addresses, the domain names are microsoft.com and hotmail.com, while the e-mail names include bterkaly and ricardo.villalobos. Typical queries search first by domain name, then by e-mail name.
In this simple example, the choice of PartitionKey and RowKey is fairly straightforward: we’ll map the domain name to the PartitionKey and the e-mail name to the RowKey.
The code in Figure 2 should make things a bit clearer. It illustrates four simple capabilities:
- Defining an entity (EmailAddressEntity)
- Defining the table that will store the entities (EmailAddressTable)
- Inserting the entity into the table (insert EmailAddressEntity into EmailAddressTable)
- Querying the table to search for a specific entity (search for bterkaly@microsoft.com)
Figure 2 The Entity EmailAddressEntity
// Requires: using Microsoft.WindowsAzure.Storage.Table;
// Our entity derives from TableEntity
public class EmailAddressEntity : TableEntity
{
    // Basic information that makes up our entity
    public string EMailAddress { get; set; }
    public string PhoneNumber { get; set; }

    // A parameterless constructor is required for deserialization
    public EmailAddressEntity()
    {
    }

    // A two-parameter constructor
    public EmailAddressEntity(string email, string phone)
    {
        EMailAddress = email;
        PhoneNumber = phone;
        SetKeys(email);
    }

    // A method that initializes the partition key and row key
    public void SetKeys(string email)
    {
        int startIndex = email.IndexOf("@");
        // Extract the mail name from the e-mail address
        string mailname = email.Substring(0, startIndex);
        // Extract the domain from the e-mail address
        string domain = email.Substring(startIndex + 1);
        // Perform the mandatory assignments to the partition key and row key
        PartitionKey = domain;
        RowKey = mailname;
    }
}
First, we define the entity structure itself, EmailAddressEntity, as shown in Figure 2. The actual table (a container for entities) will be defined later, when we insert EmailAddressEntity into the table. An entity can be thought of as an individual object; it’s the smallest unit of data that can be stored in an Azure Table. As mentioned previously, an entity is a collection of typed name-value pairs, often referred to as properties. Tables are collections of entities, and each entity belongs to a table just as a row does in a relational database table. But tables in Azure Table Storage don’t have a fixed schema. There’s no requirement that all entities in a table be structurally identical, as is the case for a relational database table.
There are four main pieces of information in Figure 2. The first two, EMailAddress and PhoneNumber, are simply the strings we want to store. The other two are the PartitionKey and RowKey properties discussed previously, which EmailAddressEntity inherits from TableEntity. Every entity also carries a third required property, Timestamp, which is used internally by the system to facilitate optimistic concurrency.
The Timestamp property differs from PartitionKey and RowKey in that it’s populated automatically by the MAS system; developers, in contrast, are required to assign values to the PartitionKey and RowKey properties themselves.
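In the 2.0 client library, that optimistic concurrency surfaces through the entity’s ETag, which is captured whenever an entity is read. As a minimal sketch (assuming specificEntity is an EmailAddressEntity previously retrieved from emailAddressTable, as shown later in Figure 5), a conditional update looks roughly like this:
// Sketch only: conditional update; the ETag captured at read time travels with the entity
specificEntity.PhoneNumber = "555-555-0000";
TableOperation update = TableOperation.Replace(specificEntity);
// The service rejects the update (HTTP 412) if the entity changed since it was read
emailAddressTable.Execute(update);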
To summarize, the importance of PartitionKey and RowKey is mostly about query performance and transactional consistency. Query performance, as explained earlier, depends largely on the way the data is partitioned across storage nodes. PartitionKeys also allow you to make changes to multiple entities as part of a single atomic operation, so the whole set of changes is rolled back should any individual operation fail. The requirement is that the entities belong to the same entity group, which simply means they share the same PartitionKey; transactions are supported only within a single PartitionKey, as sketched below.
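As a minimal sketch of such an entity group transaction (reusing the EmailAddressEntity and emailAddressTable from the surrounding listings, and keeping in mind that a single batch is limited to 100 entities), atomically inserting two addresses in the microsoft.com partition might look like this:
// Sketch only: an entity group transaction; every entity must share the same PartitionKey
TableBatchOperation batch = new TableBatchOperation();
batch.Insert(new EmailAddressEntity("bterkaly@microsoft.com", "555-555-5555"));
batch.Insert(new EmailAddressEntity("ricardo.villalobos@microsoft.com", "555-555-5556"));
// Both inserts succeed or fail together
emailAddressTable.ExecuteBatch(batch);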
The code in Figure 3 illustrates instantiating an entity of type EmailAddressEntity (from Figure 2) and then inserting that entity into EmailAddressTable. Note that we’re using the local storage emulator. This lets us run and test our code and data locally without connecting to a datacenter.
Figure 3 Inserting an EmailAddressEntity
try
{
    // Use the local storage emulator
    var storageAccount = CloudStorageAccount.DevelopmentStorageAccount;
    // Create a cloud table client object
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    // Create an e-mail address table object
    CloudTable emailAddressTable =
        tableClient.GetTableReference("EmailAddressTable");
    // Create the table if it does not exist
    // Only insert a new record once for this demo
    if (emailAddressTable.CreateIfNotExists())
    {
        // Create a new EmailAddressEntity entity
        EmailAddressEntity emailaddress =
            new EmailAddressEntity("bterkaly@microsoft.com", "555-555-5555");
        // Create an operation to add the new e-mail and phone number to
        // the emailAddressTable
        TableOperation insertEmail = TableOperation.Insert(emailaddress);
        // Submit the operation to the table service
        emailAddressTable.Execute(insertEmail);
    }
}
catch (Exception ex)
{
    // Put the message in the Web page title (for testing purposes)
    // Real error messages should go to a proper log file
    this.Title = ex.Message;
    throw;
}
You can view your data in the Server Explorer pane in Visual Studio 2012, as shown in Figure 4, which makes the process of writing and testing code much easier. You can also attach Server Explorer to a real instance of your Azure Tables in a datacenter.
Figure 4 Server Explorer
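Pointing the code at a real storage account instead of the emulator is a one-line change to Figure 3: parse a connection string rather than using DevelopmentStorageAccount (the account name and key below are placeholders):
// Sketch only: connect to a real storage account (placeholder credentials)
var storageAccount = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=<your-account>;AccountKey=<your-key>");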
The code in Figure 5 illustrates how to query the data.
Figure 5 Querying Azure Tables
// Use the local storage emulator
var storageAccount = CloudStorageAccount.DevelopmentStorageAccount;
try
{
    // Create the table client
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable emailAddressTable =
        tableClient.GetTableReference("EmailAddressTable");
    // Retrieve the entity with partition key of "microsoft.com"
    // and row key of "bterkaly"
    TableOperation retrieveBrunoEmail =
        TableOperation.Retrieve<EmailAddressEntity>(
            "microsoft.com", "bterkaly");
    // Execute the retrieve operation and cast the result back to our entity type
    TableResult result = emailAddressTable.Execute(retrieveBrunoEmail);
    EmailAddressEntity specificEntity = (EmailAddressEntity)result.Result;
    // Pull out the data you searched for
    // Do something with emailAddress and phoneNumber
    string emailAddress = specificEntity.EMailAddress;
    string phoneNumber = specificEntity.PhoneNumber;
}
catch (Exception ex)
{
    // Put the message in the Web page title (for testing purposes)
    // Real error messages should go to a proper log file
    this.Title = ex.Message;
    throw;
}
The code performs a simple point lookup using the PartitionKey and RowKey, which is the fastest kind of query Azure Tables supports. For broader queries, you can construct fairly complex filters on PartitionKey, RowKey and other properties, join them together in an ad hoc fashion and build a query object from the combined filter, as sketched below. The final step is simply to execute the query and do whatever is needed with the resulting EmailAddressEntity. The MAS Client Library greatly simplifies both the Create/Read/Update/Delete (CRUD) operations and the needed queries.
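As a rough sketch of such a combined-filter query (again reusing the EmailAddressEntity and emailAddressTable from the earlier listings), retrieving every entity in the microsoft.com partition whose RowKey sorts before "c" might look like this:
// Sketch only: a filter-based range query built from a combined filter
TableQuery<EmailAddressEntity> rangeQuery =
    new TableQuery<EmailAddressEntity>().Where(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, "microsoft.com"),
            TableOperators.And,
            TableQuery.GenerateFilterCondition(
                "RowKey", QueryComparisons.LessThan, "c")));
foreach (EmailAddressEntity entity in emailAddressTable.ExecuteQuery(rangeQuery))
{
    // Do something with entity.EMailAddress and entity.PhoneNumber
}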
What’s Inside
We thought it might be helpful to take a slightly deeper look at the internal architecture of the MAS system, shown in Figure 6. Much of the following narrative is based on Brad Calder’s paper referenced later in this article.
Figure 6 Azure Storage Internals
MAS is composed of a series of storage stamps across its eight datacenters. A storage stamp is a cluster of about 10 to 20 racks of storage nodes, where each rack sits in a separate fault domain and comes with redundant networking and power. Each storage stamp holds approximately 30PB of raw storage.
To keep costs low, it’s important to keep these storage stamps running above 70 percent utilization, which is measured in terms of capacity, transactions and bandwidth. Going above 90 percent is considered too high, though, as it leaves little headroom in the event of rack failures, when the system needs to do more with less.
Storage Location Service
The developer has no direct control over the Storage Location Service (SLS). At the account level, not only does the SLS map account namespaces across all stamps, it’s also responsible for disaster recovery, storage account allocation and load balancing. The SLS greatly simplifies the ability to add new storage in a datacenter. It can allocate new storage accounts to the new stamps for customers as well as load balance existing storage accounts from older stamps to the new stamps. All of these operations by the SLS are done automatically.
Let’s look a little closer at the three layers that make up a storage stamp—stream, partition and front end (FE)—starting from the bottom.
The stream layer provides an internal interface the partition layer uses to read and write large files, and is responsible for core replication functionality. The stream layer also handles opening, closing, deleting, renaming, reading, appending to and concatenating these large files. It doesn’t concern itself with the semantics of objects that are in the stream of data.
The partition layer provides the data model for the different types of objects stored (tables, blobs, queues); the logic and semantics to process the different types of objects; a massively scalable namespace for the objects; load balancing to access objects across the available partition servers; transaction ordering and strong consistency for access to objects; and the geo-replication of data objects from the primary to the secondary region.
The partition layer also encapsulates an important internal data structure called an Object Table. There are several versions of the Object Table, including the Entity Table, which stores all entity rows for all accounts in the stamp. It’s used to publicly expose the Azure Table data abstraction. The Object Table also interacts with the partition layer to ensure data consistency by ordering transactions across blobs, tables and queues.
The FE layer is composed of a set of stateless servers that take incoming requests. Upon receiving a request, an FE looks up the AccountName, authenticates and authorizes the request, then routes the request to the appropriate partition server in the partition layer (based on the PartitionName). To enhance performance, the FE maintains and caches a Partition Map, so that routing to the appropriate partition server is expedited on frequently accessed data.
Wrapping Up
In this article, we’ve provided some high-level, actionable guidelines as well as some of the architectural details on how the MAS system is designed, and in particular how Azure Tables can help you manage your data. We’d like to thank Brad Calder for some of his insights shared in “Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency,” a recently published paper for the 23rd ACM Symposium on Operating Systems Principles (SOSP). You can download his paper at https://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf.
Azure Storage Client Library 2.0
Back in late October 2012, Microsoft released a new client-side storage library—the Microsoft Azure Storage (MAS) Client Library 2.0—which dramatically improves usability, extensibility and performance when interacting with Azure Tables. You can install the MAS Client Library 2.0 with NuGet from within Visual Studio 2012 (bit.ly/YFeHuw). For a detailed look at some of the great new features, visit bit.ly/VQSaUv.
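In the Visual Studio Package Manager Console, the install typically comes down to a single command (the package ID shown is the one used for the 2.0 release):
Install-Package WindowsAzure.Storage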
The library takes several new approaches to improving usability, extensibility and performance. One nice feature saves you the hassle of writing serialization and deserialization logic when working with Plain Old C# Objects (POCOs). Another is the EntityResolver, which lets you perform client-side projections so you can create objects on the fly based on only the information you’re interested in. In short, you can convert directly from table entity data to a client object type without a separate table entity class that deserializes every property individually. Another powerful feature is support for the IQueryable interface, which gives you an expressive way to define complex LINQ queries.
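As a minimal sketch of an EntityResolver-based projection (reusing the e-mail table from the earlier listings), pulling back just the phone numbers in the microsoft.com partition without defining any entity class might look like this:
// Sketch only: client-side projection with EntityResolver; no EmailAddressEntity type needed
EntityResolver<string> phoneOnly =
    (partitionKey, rowKey, timestamp, properties, etag) =>
        properties["PhoneNumber"].StringValue;
TableQuery query = new TableQuery().Where(
    TableQuery.GenerateFilterCondition(
        "PartitionKey", QueryComparisons.Equal, "microsoft.com"));
foreach (string phone in emailAddressTable.ExecuteQuery(query, phoneOnly))
{
    // Each result is just the projected string, not a full entity
}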
Bruno Terkaly is a developer evangelist for Microsoft. His depth of knowledge comes from years of experience in the field, writing code using a multitude of platforms, languages, frameworks, SDKs, libraries and APIs. He spends time writing code, blogging and giving live presentations on building cloud-based applications, specifically using the Azure platform. Terkaly is also the author of two Windows Store apps, Teach Kids Car Colors and Teach Kids Music. You can read his blog at blogs.msdn.com/brunoterkaly.
Ricardo Villalobos is a seasoned software architect with more than 15 years of experience designing and creating applications for companies in the supply chain management industry. Holding different technical certifications, as well as a master’s degree in business administration from the University of Dallas, he works as a cloud architect in the Azure CSV incubation group for Microsoft.
Terkaly and Villalobos jointly present at large industry conferences. They encourage readers to contact them for availability. Terkaly can be reached at bterkaly@microsoft.com and Villalobos can be reached at Ricardo.Villalobos@microsoft.com.
Thanks to the following technical experts for reviewing this article: Brad Calder (Microsoft) and Jai Haridas (Microsoft)