Hadoop on Azure: Introduction
Introduction
I am in complete awe of how this technology resonates with today’s developers. If I invite developers to an evening event, Big Data is always a sellout.
This particular post is about getting everyone up to speed about what Hadoop is at a high level.
Big data refers to technologies that manage voluminous amounts of unstructured and semi-structured data.
Relational databases fall short where data is very large or where the data doesn't map perfectly into a relational structure.
Big data sets generally run into the petabytes and exabytes.
- However, it is not just about the total size of the data (volume)
- It is also about velocity (how rapidly the data is arriving)
- And variety: what is the structure of the data, and how much does it vary?
Sources of Big Data
Source | Examples
---|---
Science | Scientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.
Sensors | Data sets grow in size in part because they are increasingly gathered by ubiquitous information-sensing mobile devices, aerial sensing technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
Social networks | I am thinking of Facebook, LinkedIn, Yahoo, and Google.
Social influencers | Blog comments, Yelp likes, Twitter, Facebook likes, Apple's App Store, Amazon, ZDNet, etc.
Log files | Computer and mobile device log files, web site tracking information, application logs, and sensor data. There are also sensors in vehicles, video games, cable boxes and, soon, household appliances.
Public data stores | Microsoft Azure Marketplace/DataMarket, The World Bank, SEC/EDGAR, Wikipedia, IMDb.
Data warehouse appliances | Teradata, IBM Netezza, EMC Greenplum; these hold internal, transactional data that is already prepared for analysis.
Network and in-stream monitoring technologies | TCP/IP packets, email, etc.
Legacy documents | Archives of statements, insurance forms, medical records, and customer correspondence.
Two problems to solve
Problem | Description
---|---
Storage | How do I store a petabyte of data reliably? After all, a petabyte is more than 333 three-TB drives.
Money | A petabyte of storage is expensive. Even 70 TB can run over $100,000 (one eBay ad listed a Dell/EMC CLARiiON CX3-40 SAN with 70 TB of 4G 15K storage at $112,000).
Two seminal papers
Paper | Summary | Link
---|---|---
The Google File System | A scalable distributed file system for large, distributed, data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. | https://research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters | MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. | https://research.google.com/archive/mapreduce.html
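The shape of the model from the paper can be sketched with a word-count instance in plain Java (an illustrative sketch only; the class and method names below are made up for this post and are not Hadoop's actual API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A plain-Java sketch of the two functions from the MapReduce paper,
// specialized to word counting. Names are illustrative, not Hadoop's API.
public class MapReduceShape {

    // map: takes one input record (offset, line of text) and
    // emits one intermediate (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(word -> Map.entry(word, 1))
                     .collect(Collectors.toList());
    }

    // reduce: takes one intermediate key and all values emitted for it,
    // and merges them into a single result (here, their sum).
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be"));  // six (word, 1) pairs
        System.out.println(reduce("to", List.of(1, 1)));    // prints 2
    }
}
```

The framework's job is everything in between: it partitions the input among mappers, groups the intermediate pairs by key, and calls reduce once per distinct key.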
Hadoop : What is it?
- Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is written in Java.
- I met its creator, Doug Cutting, who was working at Yahoo at the time. Hadoop is named after his son's toy elephant. I was hosting a booth at the time, and I remember Doug was curious about finding some cool stuff to bring home from the booth to give to his son. Another great idea, Doug!
- One of the goals of Hadoop is to run applications on large clusters of commodity hardware. The cluster is composed of a single master and multiple worker nodes.
- Hadoop leverages the map/reduce programming model. It is optimized for processing large data sets.
- MapReduce is typically used for distributed computing on clusters of computers. A cluster has many “nodes,” where each node is a computer in the cluster.
- The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various slave or worker nodes in the cluster, and process the data in parallel. Hadoop leverages a distributed file system to store the data on the various nodes.
It is about two functions
Hadoop comes down to two functions. As long as you can write the map() and reduce() functions, your data type is supported, whether we are talking about (1) text files, (2) XML files, (3) JSON files, or (4) even graphics, sound, or video files.
The core is map() and reduce(), and understanding these two methods is the key to mastering Hadoop. Here is the classic word-count example, written against Hadoop's original mapred API:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // The mapper tokenizes each line and emits (word, 1) for every token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // The reducer sums all the 1s collected for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```
The “map” in MapReduce
- There is a master node and many slave nodes.
- The master node takes the input, divides it into smaller sub-problems, and distributes them to worker or slave nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
- Each worker/slave node processes its smaller problem and passes the answer back to its master node.
- Because each mapping operation is independent of the others, all maps can be performed in parallel.
The “reduce” in MapReduce
- The master node then collects the answers from the worker or slave nodes. It then aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve.
- Reducers can also perform the reduction phase in parallel. That is how the system can process petabytes in a matter of hours.
There are three key phases
Phase | Role
---|---
map() | Generates a list of key/value pairs from the input data.
shuffle() | Groups the values for each key together for the reduce() phase.
reduce() | Takes each key with its list of values and merges them into the final result.
- The Hello World sample for Hadoop is a word count example.
- Let's assume our quote is this:
- It is time for all good men to come to the aid of their country.
map() emits one (word, 1) pair per token; note that "to" shows up twice:

(It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (men, 1) (to, 1) (come, 1) (to, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)

shuffle() groups the values by key, so "to" gets both 1s:

(It, [1]) (is, [1]) (time, [1]) (for, [1]) (all, [1]) (good, [1]) (men, [1]) (to, [1, 1]) (come, [1]) (the, [1]) (aid, [1]) (of, [1]) (their, [1]) (country, [1])

reduce() sums each key's values:

(It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (men, 1) (to, 2) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)
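The word-count walk-through can be reproduced in plain Java with no Hadoop at all, which makes the three phases easy to see (a toy single-machine simulation; a real job runs each phase across many nodes):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountPhases {

    public static Map<String, Integer> count(String text) {
        // map(): emit one (word, 1) pair per token
        List<Map.Entry<String, Integer>> pairs =
                Arrays.stream(text.split("\\s+"))
                      .map(word -> Map.entry(word, 1))
                      .collect(Collectors.toList());

        // shuffle(): group the values by key, e.g. "to" -> [1, 1]
        Map<String, List<Integer>> grouped = pairs.stream().collect(
                Collectors.groupingBy(Map.Entry::getKey, LinkedHashMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // reduce(): sum each key's list, e.g. "to" -> 2
        Map<String, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        String quote = "It is time for all good men to come to the aid of their country";
        System.out.println(count(quote));  // "to" maps to 2, every other word to 1
    }
}
```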
High-level Architecture
- There are two main layers in both the master node and the slave nodes – the MapReduce layer and the distributed file system layer. The master node is responsible for mapping the data to slave or worker nodes.
Hadoop is a platform
Module | Description
---|---
Hadoop Common | The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS) | A distributed file system that provides high-throughput access to application data.
Hadoop YARN | A framework for job scheduling and cluster resource management.
Hadoop MapReduce | A YARN-based system for parallel processing of large data sets.
There are also related modules that are commonly associated with Hadoop.
Module | Description
---|---
Apache Pig | A platform for analyzing large data sets. It includes a high-level language for expressing data analysis programs; a key point is that Pig programs support substantial parallelization. Pig consists of a compiler that produces sequences of MapReduce programs, and its language layer is currently a textual language called Pig Latin.
Hive | A data warehouse system for Hadoop. It provides a SQL-like language and helps with data summarization and ad-hoc queries. I am not sure yet whether Hive is required with Hadoop on Azure; if not, setting it up is just a few command-line tasks.
I signed up
I recently signed up for the Windows Azure HDInsight Service here https://www.hadooponazure.com/.
Logging in
After logging in, you will be presented with this screen:
Next post : Calculate PI with Hadoop
- We will create a job named “Pi Example.” This very simple sample will calculate pi using a cluster of computers.
- This is not necessarily the best example of big data; it is more of a compute problem.
- The final command line will look like this:
- `hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000`
- More details on this sample coming soon.
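The idea behind the pi sample is Monte Carlo estimation: throw random points at the unit square and count how many land inside the inscribed quarter-circle; that fraction approaches pi/4. Here is a single-machine sketch of that arithmetic in plain Java (this shows only the core idea; the actual Hadoop sample splits the sampling across mappers and tallies the hits in a reducer):

```java
import java.util.Random;

public class PiEstimate {

    // Sample n random points in the unit square; the fraction landing
    // inside the quarter-circle x^2 + y^2 <= 1 approaches pi/4.
    public static double estimate(long n, long seed) {
        Random rnd = new Random(seed);  // fixed seed keeps the run repeatable
        long inside = 0;
        for (long i = 0; i < n; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) inside++;  // a "hit" a mapper would count
        }
        return 4.0 * inside / n;                 // the aggregation a reducer would do
    }

    public static void main(String[] args) {
        System.out.println(estimate(10_000_000L, 42L));  // close to 3.1416
    }
}
```

In the command line above, `16` is the number of map tasks and `10000000` is the number of samples each map computes; more samples means a better estimate, at the cost of more compute.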