Summary

  • Apache HBase is an open-source version of Google's BigTable distributed storage system. Both systems are distributed, scalable, high-performance, versioned databases.
  • HBase organizes data into tables. Each row of a table is indexed by a key, and its data are stored in columns. Multiple columns can be grouped together into column families.
  • HBase columns do not need to be typed; columns can be added on demand. HBase stores data on disk grouped by column family, making it a column-oriented database.
  • Data stored in HBase rows are versioned; by default, operations apply to the newest version of the data.
  • HBase supports the Get, Put, Scan, and Delete operations.
  • HBase is organized as a cluster of nodes; one node is designated the master, and the others are known as regionservers. ZooKeeper is used to manage the cluster. HDFS is typically used to store data in HBase.
  • Data in a table are partitioned into regions, which are assigned to individual regionservers; the HBase master keeps a record of the region assignments.
  • HBase is not fully ACID compliant, particularly with operations that span rows. In such cases, subsequent requests for data from HBase can return stale data.
  • HBase is best suited for big-data storage and supports fast access to multiple rows for aggregations. Multiple interfaces allow for connectivity to MapReduce and to Web applications.
  • HBase does not support joins, and its consistency model must be considered while designing applications.
  • MongoDB is a document store that stores documents in collections.
  • MongoDB stores data internally using the Binary JSON (BSON) format.
  • MongoDB can be scaled out across multiple nodes using replication and sharding.
  • MongoDB is popular for applications that need to scale out, require fast bulk writes, or work with data that needs geospatial indices.
  • Apache Cassandra is a fully distributed, structured key-value storage system that borrows design aspects from both BigTable and Dynamo.
  • A table in Cassandra is known as a column family, with each record indexed by a key and consisting of columns; multiple columns can be grouped together into a super column.
  • A row in Cassandra is returned as a sequence of nested key-value pairs.
  • Typical operations in Cassandra include Gets, Inserts, and Deletes. They can be performed on individual rows, on groups of rows (a range), and on a group of columns (a slice).
  • Cassandra runs on a cluster of nodes using decentralized, peer-to-peer architecture. Cassandra uses the local storage on each node to store data.
  • Cassandra nodes are arranged into a token ring, and Cassandra automatically distributes rows among the various nodes in the cluster using the hash value of the key of each row through consistent hashing. Every node is aware of all other nodes in the token ring and automatically forwards client requests to the correct node.
  • Cassandra replicates rows across nodes on the basis of a user-defined replication factor.
  • Every operation in Cassandra is performed with a user-defined consistency level. The consistency level provides a tunable tradeoff of consistency versus performance for every operation.
  • Specialized algorithms in Cassandra are utilized to handle failure detection; Cassandra nodes can temporarily keep track of write requests to offline nodes and forward them when the node comes back online (hinted handoff).
  • Cassandra's ACID properties are configurable on a per-operation basis.
  • Cassandra is popular because of its feature set, and its tunable consistency model provides a great deal of flexibility in designing applications.
  • OpenStack Swift is an open-source object storage system that can be deployed in public or private clouds.
  • Swift provides an S3-like REST interface to access objects.
  • The Ceph Object Gateway is an access layer over the RADOS distributed object store. It offers both S3-compatible and Swift-compatible interfaces to RADOS.
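
The Cassandra points above on the token ring, the replication factor, and tunable consistency can be made concrete with a small sketch. This is a toy model, not Cassandra's implementation: the class `TokenRing` and the helpers `token`, `quorum`, and `is_strongly_consistent` are hypothetical names, MD5 stands in for the real partitioner, each node gets a single token, and replicas are simply placed on the next nodes around the ring.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is only an illustrative hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Toy token ring: one token per node, replicas on successor nodes."""

    def __init__(self, nodes, replication_factor=3):
        if not nodes:
            raise ValueError("ring needs at least one node")
        # A row cannot have more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by token so a binary search can find a key's owner.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, row_key: str):
        """Return the nodes that store row_key.

        The owner is the first node whose token is >= the key's token,
        wrapping around to the start of the ring; the remaining replicas
        are the nodes that follow the owner on the ring.
        """
        size = len(self._ring)
        start = bisect.bisect_left(self._tokens, token(row_key)) % size
        return [self._ring[(start + i) % size][1]
                for i in range(self.replication_factor)]


def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_cl: int, write_cl: int,
                           replication_factor: int) -> bool:
    """A read is guaranteed to overlap the latest write when R + W > N."""
    return read_cl + write_cl > replication_factor


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c", "node-d"],
                     replication_factor=3)
    print(ring.replicas("user:42"))
    print(is_strongly_consistent(quorum(3), quorum(3), 3))  # quorum reads and writes
    print(is_strongly_consistent(1, 1, 3))                  # single-replica reads and writes
```

Because a key's owner depends only on the tokens around it, removing a node that does not own the key leaves that key's owner unchanged, which is the property that makes consistent hashing attractive for clusters whose membership changes. The last two calls illustrate the consistency-versus-performance tradeoff: quorum reads and writes overlap on at least one replica, whereas single-replica reads and writes are cheaper but may return stale data.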