Phase 1: Initial data exploration

From: Developing big data solutions on Microsoft Azure HDInsight

In the first phase of the analysis, the team at Blue Yonder Airlines decided to explore the data speculatively by forming a hypothesis about what information the data might reveal, and then using HDInsight to process the data and generate results that validate (or invalidate) that hypothesis. The goal for this phase is open-ended; the exploration might identify a specific avenue of investigation that merits further refinement of a particular data processing solution, or it might simply prove (or disprove) an assumption about the data.

Note

An alternative to the technique described here is to use the more recent capabilities of Hive to estimate frequency distributions. For more details see Statistics and Data Mining in Hive.
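For example, the Hive sentences and ngrams functions can estimate the most frequent terms directly from the tweet text. The following sketch is only an assumption about how this might be applied to the Tweets table created in the next section; it is not part of the team's solution. It returns an estimate of the 100 most frequent single words.

SELECT explode(ngrams(sentences(lower(Tweet)), 1, 100)) AS word_estimate
FROM Tweets;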

Using Hive to explore the volume of tweets

In most cases the simplest way to start exploring data with HDInsight is to create a Hive table, and then query it with HiveQL statements. The analysts at Blue Yonder Airlines created a Hive table based on the source data obtained from Twitter. The following HiveQL code was used to define the table and load the source data into it.

CREATE EXTERNAL TABLE Tweets (PubDate DATE, TweetID STRING, Author STRING, Tweet STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/tweets';

LOAD DATA INPATH '/tweets' INTO TABLE Tweets;

An external table was used so that the table can be dropped and recreated as the analysis continues, without deleting the underlying data files.
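For example, because dropping an external table removes only the table metadata and leaves the files in /twitterdata/tweets intact, the table definition can be dropped and recreated at any point without reloading the data.

DROP TABLE Tweets;

-- Recreate the table over the existing files; no LOAD is required because the
-- data is still in the folder specified by the LOCATION clause.
CREATE EXTERNAL TABLE Tweets (PubDate DATE, TweetID STRING, Author STRING, Tweet STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/tweets';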

The analysts hypothesized that the use of Twitter to communicate with the company is significant, and that the volume of tweets that mention the company is growing. They therefore used the following query to determine the daily volume and trend of tweets.

SELECT PubDate, COUNT(*) TweetCount FROM Tweets GROUP BY PubDate SORT BY PubDate;

The results of this query are shown in the following table.

PubDate       TweetCount
4/16/2013     1964
4/17/2013     2009
4/18/2013     2058
4/19/2013     2107
4/20/2013     2160
4/21/2013     2215
4/22/2013     2274

These results seem to validate the hypothesis that the volume of tweets is growing. It may be worth refining this query to include a larger set of source data that spans a longer time period, and potentially include other aggregations in the results such as the number of distinct authors that tweeted each day. However, while this analytical approach might reveal some information about the importance of Twitter as a channel for customer communication, it doesn’t provide any information about the specific topics that concern customers. To determine what’s important to the airline’s customers, the analysts must look more closely at the actual contents of the tweets.
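As an illustration of such a refinement (a sketch only, not part of the exploration the team actually performed), the query could be extended to return the number of distinct authors per day alongside the tweet count.

SELECT PubDate, COUNT(*) AS TweetCount, COUNT(DISTINCT Author) AS AuthorCount
FROM Tweets
GROUP BY PubDate
SORT BY PubDate;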

Using map/reduce code to identify common terms

The analysts at Blue Yonder Airlines hypothesized that analysis of the individual words used in tweets addressed to the airline’s account might reveal topics that are important to customers. The team decided to examine the individual words in the data, count the number of occurrences of each word, and from this determine the main topics of interest.

Parsing the unstructured tweet text and identifying discrete words can be accomplished by implementing a custom map/reduce solution. However, HDInsight includes a sample map/reduce solution called WordCount that counts the number of words in a text source. The sample is provided in various forms including Java, JavaScript, and C#.

Note

There are many ways to create and execute map/reduce functions, though often you can use Hive or Pig directly instead of resorting to writing map/reduce code.

The code consists of a map function that runs on each cluster node in parallel and parses the text input, emitting a key/value pair with a value of 1 for every word it finds. These key/value pairs are passed to the reduce function, which counts the total number of instances of each key. The result is therefore the number of times each word was mentioned in the source data. An extract of the Java source code is shown below.

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
  {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) 
                    throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) { 
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
                       throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      result.set(sum);
      context.write(key, result);
    }
  }
  ...
}
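The extract omits the driver code that configures and submits the job. A typical main method for this kind of WordCount class is shown below as an illustrative sketch of how the mapper and reducer are wired together, rather than a copy of the sample's exact code.

// Illustrative driver only; requires the usual Hadoop imports such as
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.mapreduce.Job, and the FileInputFormat/FileOutputFormat
// classes from the mapreduce.lib.input and mapreduce.lib.output packages.
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);   // combine map output locally before the shuffle
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));    // input folder
  FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output folder
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}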

Note

For information about writing Java map/reduce code for HDInsight see Develop Java MapReduce programs for HDInsight. For more details of the Java classes used when creating map/reduce functions see Understanding MapReduce.

The Java code, compiled to a .jar file, can be executed using the following PowerShell script.

$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$jarFile = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/example/jars/hadoop-mapreduce-examples.jar"
$input = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/twitterdata/tweets"
$output = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/twitterdata/words"

$jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile `
          -ClassName "wordcount" -Arguments $input, $output

$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef

Write-Host "Map/Reduce job submitted..."

Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError

Note

For more information about running map/reduce jobs in HDInsight see Building custom clients in the topic Processing, querying, and transforming data using HDInsight.

The job generates a file named part-r-00000 containing the total number of instances of each word in the source data. An extract from the results is shown here.

http://twitter.com/<user_name>/statuses/12347297    1
http://twitter.com/<user_name>/statuses/12347149    1
in      1408
in-flight       1057
incredible      541
is      704
it      352
it!     352
job     1056
journey 1057
just    352
later   352
lost    704
lots    352
lousy   515
love    1408
lugage? 352
luggage 352
made    1056
...
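If required, this output file can be downloaded from the cluster's default storage container for local inspection. The following sketch reuses the cmdlets and variable names from the scripts in this topic; the local destination folder is an assumption.

# Illustrative only: download the WordCount output for local inspection.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
Get-AzureStorageBlobContent -Container $containerName -Blob "twitterdata/words/part-r-00000" `
                            -Context $blobContext -Destination "D:\Data\Output"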

Unfortunately, these results are not particularly useful in trying to identify the most common topics discussed in the tweets because the words are not ordered by frequency, and the list includes words derived from Twitter names and other fields that are not actually a part of the tweeted messages.

Using Pig to group and summarize word counts

It would be possible to modify the Java code to overcome these limitations. However, it may be simpler (and quicker) to use a higher-level abstraction such as Pig to filter, count, and sort the words. Pig provides a workflow-based approach to data processing that is ideal for restructuring and summarizing data, and a Pig Latin script that performs the aggregation is much simpler to create than the equivalent custom map/reduce code.

Note

Pig makes it easy to write procedural workflow-style code that builds a result by repeated operations on a dataset. It also makes it easier to debug queries and transformations because you can dump the intermediate results of each operation in the script to a file.
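For example, while developing the script shown below, an intermediate relation could be inspected with a couple of extra statements such as the following sketch (the debug output path is a hypothetical example, not part of the team's script).

-- Illustrative debugging step only: display an intermediate relation in the
-- console, or persist it to a hypothetical debug folder for later review.
DUMP CleanWords;
STORE CleanWords INTO '/twitterdata/debug/cleanwords';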

The Pig Latin code created by the analysts is shown below. It reads the source data files in the /twitterdata/tweets folder, counts the number of occurrences of each word, sorts the results in descending order of occurrences, and stores the first 100 results in the /twitterdata/wordcounts folder.

-- load tweets.
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);

-- split tweet into words.
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation.
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) as word;

-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';

-- group by word.
GroupedWords = GROUP FilteredWords BY (word);

-- count mentions per group.
CountedWords = FOREACH GroupedWords GENERATE group, COUNT(FilteredWords) as count;

-- sort by count.
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100.
Top100Words = LIMIT SortedWords 100;

-- store the results as a file.
STORE Top100Words INTO '/twitterdata/wordcounts'; 

This script is saved as WordCount.pig, uploaded to Azure storage, and executed in HDInsight using the following Windows PowerShell script.

$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$localfolder = "D:\Data\Scripts"
$destfolder = "twitterdata/scripts"
$scriptFile = "WordCount.pig"
$outputFolder = "twitterdata/wordcounts"
$outputFile = "part-r-00000"

# Upload Pig Latin script to Azure. 
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$scriptFile"
$filename = "$localfolder\$scriptFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
write-host "$scriptFile uploaded to $containerName!"

# Run the Pig Latin script.
$jobDef = New-AzureHDInsightPigJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/$destfolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Pig job submitted..."
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError

# Get the job output.
$remoteBlob = "$outputFolder/$outputFile"
write-host "Downloading $remoteBlob..."
Get-AzureStorageBlobContent -Container $containerName -Blob $remoteBlob -Context $blobContext -Destination $localfolder
cat "$localfolder\$outputFolder\$outputFile"

When the script has completed successfully, the results are stored in a file named part-r-00000 in the /twitterdata/wordcounts folder. The final section of the script downloads this file and displays it using the PowerShell cat command (an alias for Get-Content). The following is an extract of the results.

my            3437
delayed       2749
flight        2749
to            2407
entertainment 2064
the           2063
a             2061
delay         1720
of            1719
bags          1718

These results show that the word count approach has the potential to reveal useful insights. For example, the high number of occurrences of delayed and delay is likely to be relevant in determining common customer concerns. However, the output still contains many noise words such as my, to, and the, so the solution needs to be modified to restrict the results to significant words. The analysts therefore decided to refine it so that it produces accurate and meaningful insights into the most common words used by customers when communicating with the airline through Twitter. This is described in Phase 2: Refining the solution.
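One possible way to restrict the output to significant words (an illustrative sketch only; the approach the team actually adopted is described in Phase 2) would be to load a list of noise words and exclude them from the FilteredWords relation before grouping and counting.

-- Hypothetical stop-word filter; the /twitterdata/stopwords file and the
-- relation names are assumptions for illustration.
StopWords = LOAD '/twitterdata/stopwords' AS (stopword:chararray);
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, StopWords BY stopword;
SignificantWords = FILTER JoinedWords BY StopWords::stopword IS NULL;
CleanSignificantWords = FOREACH SignificantWords GENERATE FilteredWords::word AS word;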
