What is big data?


From: Developing big data solutions on Microsoft Azure HDInsight

Do you know what visitors to your website really think about your carefully crafted content? Or, if you run a business, can you tell what your customers actually think about your products or services? Did you realize that your latest promotional campaign had the biggest effect on people aged between 40 and 50 living in Wisconsin, USA (and, more importantly, why)?

Being able to get answers to these kinds of questions is increasingly vital in today's competitive environment, but the source data that can provide these answers is often hidden away, and when you can find it, it is often very difficult to analyze. It might be distributed across many different databases or files, be in a format that is hard to process, or may even have been discarded because it didn’t seem useful at the time.

To resolve these issues, data analysts and business managers are rapidly adopting techniques that were once at the core of data processing but were sidelined in the rush to modern relational database systems and structured data storage. The new buzzword is “big data”, and the associated solutions encompass a range of technologies and techniques that allow you to extract real, useful, and previously hidden information from the often very large quantities of data that might otherwise have been left dormant and, ultimately, thrown away because storage was too costly.

The term “big data” is being used to describe an increasing range of technologies and techniques. In essence, big data is data that is valuable but that, traditionally, was not practical to store or analyze because of cost or the absence of suitable mechanisms. Big data typically refers to collections of datasets that, due to their size and complexity, are difficult to store, query, and manage using existing data management tools or data processing applications.

You can also think of big data as data, often produced at “fire hose” rate, that you don't know how to analyze at the moment but that may provide valuable information in the future. Big data solutions aim to provide data storage and querying functionality for situations such as this. They offer a mechanism for organizations to extract meaningful, useful, and often vital information from the vast stores of data they are collecting.

Big data solutions are often described as an answer to the “three V's problem”:

  • Volume: Big data solutions typically store and query hundreds of terabytes of data, and the total volume is probably growing tenfold every five years (equivalent to compound growth of roughly 58 percent per year). Storage must be able to manage this volume, be easily expandable, and work efficiently across distributed systems. Processing systems must be scalable to handle increasing volumes of data, typically by scaling out across multiple machines.
  • Variety: It's not uncommon for new data not to match any existing data schema. It may also be semi-structured or unstructured. This means that applying schemas to the data before or during storage is no longer a practical proposition.
  • Velocity: Data is being collected at an increasing rate from many new types of devices, from a fast-growing number of users, and from an increasing number of devices and applications per user. The design and implementation of storage must be able to manage this efficiently, and processing systems must be able to return results within an acceptable timeframe.
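
The Variety point above implies that a schema is applied when the data is read rather than when it is stored. The following is a minimal Python sketch of that idea, not how HDInsight or any particular product implements it. The sample lines in `RAW_LINES`, the function name `read_with_schema`, and all field names are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw data, stored exactly as it arrived (one item per line).
# No schema was enforced at write time, so the lines do not share a shape.
RAW_LINES = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/pricing"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
    'not valid json at all',
]


def read_with_schema(raw_lines, required, optional=None):
    """Apply a schema while reading, not while storing.

    required: dict of field name -> converter; lines missing any field are skipped.
    optional: dict of field name -> (converter, default value).
    """
    optional = optional or {}
    for line in raw_lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line: it stays in storage, this query ignores it
        if not isinstance(doc, dict) or not all(k in doc for k in required):
            continue  # the line does not fit the schema this query asks for
        try:
            record = {k: conv(doc[k]) for k, conv in required.items()}
            for k, (conv, default) in optional.items():
                record[k] = conv(doc[k]) if k in doc else default
        except (TypeError, ValueError):
            continue  # a value could not be converted to the requested type
        yield record


# Two different "queries" project two different schemas onto the same raw store.
page_views = list(
    read_with_schema(RAW_LINES, {"user": str, "page": str}, {"ms": (int, None)})
)
sensor_readings = list(
    read_with_schema(RAW_LINES, {"device": str, "temp_c": float})
)
```

Because nothing is rejected at write time, data that fits no current query is still retained and can be read later with a schema that did not exist when the data was collected.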

The quintessential aspect of big data is not the data itself; it’s the ability to discover useful information hidden in the data. Big data is not just Hadoop. Solutions may also use traditional data management systems such as relational databases, as well as other types of data store. It’s really all about the analytics that a big data solution can enable.

This section of the guide explores some of the basic features of big data solutions. It is intended for readers who are not yet familiar with the concepts of big data, when it is useful, and how it works.
