Processing data with Pig
From: Developing big data solutions on Microsoft Azure HDInsight
Pig Latin syntax has some similarities to LINQ, and encapsulates many functions and expressions that make it easy to create a sequence of complex data transformations with just a few lines of simple code. Pig Latin is a good choice for creating relations and manipulating sets, and for working with unstructured source data. You can always create a Hive table over the results of a Pig Latin query if you want table format output. However, the syntax of Pig Latin can be complex for non-programmers to master. Pig Latin is not as familiar or as easy to use as HiveQL, but Pig can achieve some tasks that are difficult, or even impossible, when using Hive.
You can run Pig Latin statements interactively in the Hadoop command line window or in a command line Pig shell named Grunt. You can also combine a sequence of Pig Latin statements into a script that can be executed as a single job, and use user-defined functions that you previously uploaded to HDInsight. The Pig interpreter uses the Pig Latin statements to plan jobs, but the jobs are not actually generated and executed until you call either a DUMP statement (which displays a relation in the console, and is useful when interactively testing and debugging Pig Latin code) or a STORE statement (which stores a relation as a file in a specified folder).
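As an illustration only, an interactive Grunt session for testing a transformation might look similar to the following; the file path, schema, and filter condition used here are hypothetical.

grunt> A = LOAD '/mydata/sourcedata.txt' USING PigStorage('\t') AS (col1, col2:long);
grunt> B = FILTER A BY col2 > 2;
grunt> DUMP B;

No job is executed for the LOAD and FILTER statements; the DUMP statement triggers execution and writes the contents of relation B to the console.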
Pig scripts generally save their results as text files in storage, where they can easily be viewed on demand, perhaps by using the Hadoop command line window. However, the results can be difficult to consume or process in client applications unless you copy the output files and import them into client tools such as Excel.
Executing a Pig script
As an example of using Pig, suppose you have a tab-delimited text file containing source data similar to the following.
Value1 1
Value2 3
Value3 2
Value1 4
Value3 6
Value1 2
Value2 8
Value2 5
You could process the data in the source file with the following simple Pig Latin script.
A = LOAD '/mydata/sourcedata.txt' USING PigStorage('\t') AS (col1, col2:long);
B = GROUP A BY col1;
C = FOREACH B GENERATE group, SUM(A.col2) AS total;
D = ORDER C BY total;
STORE D INTO '/mydata/results';
This script loads the tab-delimited data into a relation named A, imposing a schema that consists of two columns: col1, which uses the default bytearray data type, and col2, which is a long integer. The script then creates a relation named B in which the rows in A are grouped by col1, and then creates a relation named C in which the col2 values are summed for each group in B.
After the data has been aggregated, the script creates a relation named D in which the data is sorted in ascending order of the total generated for each group. The relation D is then stored as a file in the /mydata/results folder, which contains the following text.
Value1 7
Value3 8
Value2 16
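If you want to carry out further processing on these results in Pig, you can load the stored output back into a new relation. The following lines are a minimal sketch that assumes the /mydata/results folder created by the STORE statement above; the output files it contains are tab-delimited by default, and the column names used here are illustrative.

-- Load the stored output from the results folder and display it in the console.
E = LOAD '/mydata/results' USING PigStorage('\t') AS (col1:chararray, total:long);
DUMP E;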
Note
For more information about Pig Latin syntax, see Pig Latin Reference Manual 2 on the Apache Pig website. For a more detailed description of using Pig, see Pig Tutorial.