Hadoop on Azure: Introduction
Introduction
I am in complete awe of how this technology resonates with today’s developers. If I invite developers to an evening event, Big Data is always a sellout.
This particular post is about getting everyone up to speed about what Hadoop is at a high level.
Big data refers to technologies that manage voluminous amounts of unstructured and semi-structured data.
Relational databases fall short where data is very large or where the data doesn't map perfectly into a relational structure.
Big data sets generally run into the petabytes and exabytes.
- However, it is not just about the total size of the data (volume)
- It is also about velocity (how rapidly the data is arriving)
- And variety: what is the structure of the data, and how much does it vary?
Sources of Big Data
Source | Examples
---|---
Science | Scientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.
Sensors | Data sets grow in size in part because they are increasingly gathered by ubiquitous information-sensing mobile devices, aerial sensing technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
Social networks | I am thinking of Facebook, LinkedIn, Yahoo, and Google.
Social influencers | Blog comments, Yelp likes, Twitter, Facebook likes, Apple's App Store, Amazon, ZDNet, etc.
Log files | Computer and mobile device log files, web site tracking information, application logs, and sensor data. There are also sensors in vehicles, video games, cable boxes and, soon, household appliances.
Public data stores | Microsoft Azure Marketplace/DataMarket, The World Bank, SEC/EDGAR, Wikipedia, IMDb.
Data warehouse appliances | Teradata, IBM Netezza, EMC Greenplum; these hold internal, transactional data that is already prepared for analysis.
Network and in-stream monitoring technologies | TCP/IP packets, email, etc.
Legacy documents | Archives of statements, insurance forms, medical records, and customer correspondence.
Two problems to solve
Problem | Description
---|---
Storage | How do I store a petabyte of data reliably? After all, a petabyte is more than 333 three-TB drives.
Money | A petabyte of storage is expensive. Even 70 TB can run over $100,000 (one eBay ad listed a Dell/EMC CLARiiON CX3-40 SAN with 70 TB of 4G 15K storage at $112,000).
Two seminal papers
Paper | Summary | Link
---|---|---
The Google File System | A scalable distributed file system for large, distributed, data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. | https://research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters | MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. | https://research.google.com/archive/mapreduce.html
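The shape of the model from the paper can be sketched with a word-count instance in plain Java (an illustrative sketch only; the class and method names below are made up for this post and are not Hadoop's actual API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A plain-Java sketch of the two functions from the MapReduce paper,
// specialized to word counting. Names are illustrative, not Hadoop's API.
public class MapReduceShape {

    // map: takes one input record (offset, line of text) and
    // emits one intermediate (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(word -> Map.entry(word, 1))
                     .collect(Collectors.toList());
    }

    // reduce: takes one intermediate key and all values emitted for it,
    // and merges them into a single result (here, their sum).
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be"));  // six (word, 1) pairs
        System.out.println(reduce("to", List.of(1, 1)));    // prints 2
    }
}
```

The framework's job is everything in between: it partitions the input among mappers, groups the intermediate pairs by key, and calls reduce once per distinct key.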
Hadoop : What is it?
- Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is written in Java.
- I met its creator, Doug Cutting, who was working at Yahoo at the time. Hadoop is named after his son's toy elephant. I was hosting a booth at the time, and I remember Doug was curious about finding some cool stuff to bring home from the booth to give to his son. Another great idea, Doug!
- One of the goals of Hadoop is to run applications on large clusters of commodity hardware. The cluster is composed of a single master and multiple worker nodes.
- Hadoop leverages the map/reduce programming model. It is optimized for processing large data sets.
- MapReduce is typically used for distributed computing on clusters of computers. A cluster has many “nodes,” where each node is a computer in the cluster.
- The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various slave or worker nodes in the cluster, and process the data in parallel. Hadoop leverages a distributed file system to store the data on the various nodes.
It is about two functions
Hadoop comes down to two functions. As long as you can write the map() and reduce() functions, your data type is supported, whether we are talking about (1) text files, (2) XML files, (3) JSON files, or (4) even graphics, sound, or video files.
The core is map() and reduce(), and understanding these two methods is the key to mastering Hadoop. Here is the classic word-count example, written against Hadoop's original mapred API:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // The mapper tokenizes each line and emits (word, 1) for every token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // The reducer sums all the 1s collected for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```
The “map” in MapReduce
- There is a master node and many slave nodes.
- The master node takes the input, divides it into smaller sub-problems, and distributes them to worker or slave nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
- Each worker/slave node processes its smaller problem and passes the answer back to its master node.
- Because each mapping operation is independent of the others, all maps can be performed in parallel.
The “reduce” in MapReduce
- The master node then collects the answers from the worker or slave nodes. It then aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve.
- Reducers can also perform the reduction phase in parallel. That is how the system can process petabytes in a matter of hours.
There are three key phases
Phase | Role
---|---
map() | Generates a list of key/value pairs from the input data.
shuffle() | Groups the values for each key together for the reduce() phase.
reduce() | Takes each key with its list of values and merges them into the final result.
- The Hello World sample for Hadoop is a word count example.
- Let's assume our quote is this:
- It is time for all good men to come to the aid of their country.
map() emits one (word, 1) pair per token; note that "to" shows up twice:

(It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (men, 1) (to, 1) (come, 1) (to, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)

shuffle() groups the values by key, so "to" gets both 1s:

(It, [1]) (is, [1]) (time, [1]) (for, [1]) (all, [1]) (good, [1]) (men, [1]) (to, [1, 1]) (come, [1]) (the, [1]) (aid, [1]) (of, [1]) (their, [1]) (country, [1])

reduce() sums each key's values:

(It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (men, 1) (to, 2) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)
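The word-count walk-through can be reproduced in plain Java with no Hadoop at all, which makes the three phases easy to see (a toy single-machine simulation; a real job runs each phase across many nodes):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountPhases {

    public static Map<String, Integer> count(String text) {
        // map(): emit one (word, 1) pair per token
        List<Map.Entry<String, Integer>> pairs =
                Arrays.stream(text.split("\\s+"))
                      .map(word -> Map.entry(word, 1))
                      .collect(Collectors.toList());

        // shuffle(): group the values by key, e.g. "to" -> [1, 1]
        Map<String, List<Integer>> grouped = pairs.stream().collect(
                Collectors.groupingBy(Map.Entry::getKey, LinkedHashMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // reduce(): sum each key's list, e.g. "to" -> 2
        Map<String, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        String quote = "It is time for all good men to come to the aid of their country";
        System.out.println(count(quote));  // "to" maps to 2, every other word to 1
    }
}
```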
High-level Architecture
- There are two main layers in both the master node and the slave nodes – the MapReduce layer and the distributed file system layer. The master node is responsible for mapping the data to slave or worker nodes.
Hadoop is a platform
Module | Description
---|---
Hadoop Common | The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS) | A distributed file system that provides high-throughput access to application data.
Hadoop YARN | A framework for job scheduling and cluster resource management.
Hadoop MapReduce | A YARN-based system for parallel processing of large data sets.
There are also related modules that are commonly associated with Hadoop.
Module | Description
---|---
Apache Pig | A platform for analyzing large data sets. It includes a high-level language for expressing data analysis programs; a key point is that Pig programs support substantial parallelization. Pig consists of a compiler that produces sequences of MapReduce programs, and its language layer is currently a textual language called Pig Latin.
Hive | A data warehouse system for Hadoop. It provides a SQL-like language and helps with data summarization and ad-hoc queries. I am not sure yet whether Hive is required with Hadoop on Azure; if not, setting it up is just a few command-line tasks.
I signed up
I recently signed up for the Windows Azure HDInsight Service here https://www.hadooponazure.com/.
Logging in
After logging in, you will be presented with this screen:
Next post : Calculate PI with Hadoop
- We will create a job named “Pi Example.” This very simple sample will calculate pi using a cluster of computers.
- This is not necessarily the best example of big data; it is more of a compute problem.
- The final command line will look like this:
- `hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000`
- More details on this sample coming soon.
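The idea behind the pi sample is Monte Carlo estimation: throw random points at the unit square and count how many land inside the inscribed quarter-circle; that fraction approaches pi/4. Here is a single-machine sketch of that arithmetic in plain Java (this shows only the core idea; the actual Hadoop sample splits the sampling across mappers and tallies the hits in a reducer):

```java
import java.util.Random;

public class PiEstimate {

    // Sample n random points in the unit square; the fraction landing
    // inside the quarter-circle x^2 + y^2 <= 1 approaches pi/4.
    public static double estimate(long n, long seed) {
        Random rnd = new Random(seed);  // fixed seed keeps the run repeatable
        long inside = 0;
        for (long i = 0; i < n; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) inside++;  // a "hit" a mapper would count
        }
        return 4.0 * inside / n;                 // the aggregation a reducer would do
    }

    public static void main(String[] args) {
        System.out.println(estimate(10_000_000L, 42L));  // close to 3.1416
    }
}
```

In the command line above, `16` is the number of map tasks and `10000000` is the number of samples each map computes; more samples means a better estimate, at the cost of more compute.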