Scenario 1: Iterative exploration

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

A common big data analysis scenario is to explore data iteratively, refining the processing techniques used until you discover something interesting or find the answers you seek. In this example of the iterative exploration use case and model, HDInsight is used to perform some basic analysis of data from Twitter.

Social media analysis is a common big data use case, and this example demonstrates how to extract information from semi-structured data in the form of tweets. However, the goal of this scenario is not to provide a comprehensive guide to analyzing data from Twitter, but rather to show an example of iterative data exploration with HDInsight that demonstrates some of the techniques discussed in this guide.

Finding insights in data

Previously in this guide you saw that one of the typical uses of a big data solution such as HDInsight is to explore data you already have, or data you collect speculatively, to see whether it can provide insights that are useful to your organization. The decision flow shown in Figure 1 is an example of how you might start with a guess based on intuition and progress towards a repeatable solution that you can incorporate into your existing BI systems. Alternatively, you may discover that there is no interesting information in the data, but the cost of discovering this is minimized by using a “pay for what you use” mechanism that you can set up and tear down again quickly and easily.

Figure 1 - The iterative exploration cycle for finding insights in data

Introduction to Blue Yonder Airlines

This example is based on a fictitious company named Blue Yonder Airlines. The company is an airline serving passengers in the USA, and operates flights from its home hub at JFK airport in New York to Sea-Tac airport in Seattle and LAX in Los Angeles. The company has a customer loyalty program named Blue Yonder Points.

Some months ago, the CEO of Blue Yonder Airlines was talking to a colleague who mentioned that he had seen tweets that were critical of the company’s seating policy. The CEO decided that the company should investigate this possible source of valuable customer feedback, and instructed her customer service department to start using Twitter as a means of communicating with its customers.

Customers send tweets to @BlueYonderAirlines and use the standard Twitter convention of including hashtags to denote key terms in their messages. In order to provide a basis for analyzing these tweets, the CEO also asked the BI team to start collecting any that mention @BlueYonderAirlines.

Analytical goals and data sources

Initially the plan is simply to collect enough data to begin exploring the information it contains. To determine if the results are both useful and valid, the team must collect enough data to generate a statistically valid result. However, they do not want to invest significant resources and time at this point, and so use a manual interactive process for collecting the data from Twitter using the public API.

Note

Even though they don’t know exactly what they are looking for at this stage, the managers still have an analytical goal: to explore the data and discover whether, as they have been led to believe, it really does contain useful information that could benefit their business.

The Twitter data that has been captured consists of tab-delimited text files in the following format:

4/16/2013 http://twitter.com/CameronWhite/statuses/123456789 CameronWhite (Cameron White) terrible trip @blueyonderairlines - missed my connection because of a delay :( #SEATAC
4/16/2013 http://twitter.com/AmelieWilkins/statuses/123456790 AmelieWilkins (Amelie Wilkins) terrific journey @blueyonderairlines - favorite movie on in-flight entertainment! #JFK_Airport
4/16/2013 http://twitter.com/EllaWilliamson/statuses/123456791 EllaWilliamson (Ella Williamson) lousy experience @blueyonderairlines - 2 hour delay! #SEATAC_Airport
4/16/2013 http://twitter.com/BarbaraWilson/statuses/123456792 BarbaraWilson (Barbara Wilson) fantastic time @blueyonderairlines - great film on my seat back screen :) #blueyonderpoints
4/16/2013 http://twitter.com/WilliamWoodward/statuses/123456793 WilliamWoodward (William Woodward) dreadful voyage @blueyonderairlines - entertainment system and onboard wifi not working! #IHateFlying
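Each record in this format can be split on tab characters into four fields (date, status URL, user, and tweet text), after which hashtags are straightforward to pick out. The following Python sketch illustrates the idea; the sample line and the `parse_tweet` helper are hypothetical, and in practice the files would be read line by line rather than from a string.

```python
import re

# One record in the tab-delimited format shown above (hypothetical sample):
# date <TAB> status URL <TAB> user (display name) <TAB> tweet text
sample = ("4/16/2013\thttp://twitter.com/CameronWhite/statuses/123456789\t"
          "CameronWhite (Cameron White)\t"
          "terrible trip @blueyonderairlines - missed my connection "
          "because of a delay :( #SEATAC")

def parse_tweet(line):
    """Split one tab-delimited record into its four fields."""
    date, url, user, text = line.rstrip("\n").split("\t", 3)
    return {"date": date, "url": url, "user": user, "text": text}

record = parse_tweet(sample)
hashtags = re.findall(r"#\w+", record["text"])
print(record["user"])  # CameronWhite (Cameron White)
print(hashtags)        # ['#SEATAC']
```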

Although there is no specific business decision under consideration, the customer services managers believe that some analysis of the tweets sent by customers may reveal important information about how customers perceive the airline and which issues matter to them. The kinds of questions the team expects to answer are:

  • Are people talking about Blue Yonder Airlines on Twitter?
  • If so, are there any specific topics that regularly arise?
  • Of these topics, if any, is it possible to get a realistic view of which are the most important?
  • Does the process provide valid and useful information? If not, can it be refined to produce more accurate and useful results?
  • If the results are valid and useful, can the process be made repeatable?
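As a first step towards the second and third questions, counting hashtag frequencies gives a rough view of which topics arise most often. The following Python sketch shows the idea on a few hypothetical tweet texts modeled on the sample data (the extra #SEATAC tweet is invented to illustrate a repeated topic); in the actual scenario this kind of aggregation would run at scale on the HDInsight cluster.

```python
import re
from collections import Counter

# Hypothetical tweet texts in the style of the sample data.
tweets = [
    "terrible trip @blueyonderairlines - missed my connection because of a delay :( #SEATAC",
    "terrific journey @blueyonderairlines - favorite movie on in-flight entertainment! #JFK_Airport",
    "lousy experience @blueyonderairlines - 2 hour delay! #SEATAC_Airport",
    "delayed again, stuck at the gate for an hour #SEATAC",  # invented example
]

# Count hashtags case-insensitively so #SEATAC and #seatac count as one topic.
counts = Counter(tag.lower()
                 for text in tweets
                 for tag in re.findall(r"#\w+", text))

print(counts.most_common(2))  # [('#seatac', 2), ...]
```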

Collecting and uploading the source data

The data analysts at Blue Yonder Airlines have been collecting data from Twitter and uploading it to Azure blob storage ready to begin the investigation. They have gathered sufficient data over a period of weeks to ensure a suitable sample for analysis. To upload the source data files to Azure storage, the data analysts used the following Windows PowerShell script:

$storageAccountName = "storage-account-name"
$containerName = "container-name"

$localFolder = "D:\Data\Tweets"
$destfolder = "tweets"

$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

$files = Get-ChildItem $localFolder
foreach($file in $files){
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"

The source data and results will be retained in Azure blob storage for visualization and further exploration in Excel after the investigation is complete.

Note

If you are just experimenting with data to see if it is useful, you probably won’t want to spend an inordinate amount of time and resources building a complex or automated data ingestion mechanism. It is often easier and quicker to use a simple PowerShell script. For details of other options for ingesting data, see Collecting and loading data into HDInsight.

The HDInsight infrastructure

Blue Yonder Airlines does not have the capital expenditure budget to purchase new servers for the project, especially as the hardware may only be required for the duration of the project. Therefore, the most appropriate infrastructure approach is to provision an HDInsight cluster on Azure that can be released when no longer required, minimizing the overall cost of the project.

To further minimize cost, the team began with a single-node HDInsight cluster rather than accepting the default of four nodes when initially provisioning it. If necessary, the team can provision a larger cluster should the processing load indicate that more nodes are required. At the end of the investigation, if the results indicate that valuable information can be extracted from the data, the team can fine-tune the solution to use a cluster with the appropriate number of nodes.

Note

For details of the cost of running an HDInsight cluster, see HDInsight Pricing Details. In addition to the cost of the cluster you must pay for a storage account to hold the data for the cluster.

Iterative data processing

After capturing a suitable volume of data and uploading it to Azure blob storage, the analysts can configure an HDInsight cluster associated with the blob container holding the data, and then begin processing it. In the absence of a specific analytical goal, the data processing follows an iterative pattern in which the analysts explore the data to see if they find anything that indicates specific topics of interest to customers, and then build on what they find to refine the analysis.

At a high level, the iterative process breaks down into three phases:

  • Explore – the analysts explore the data to determine what potentially useful information it contains.
  • Refine – when some potentially useful data is found, the data processing steps used to query the data are refined to maximize the analytical value of the results.
  • Stabilize – when a data processing solution that produces useful analytical results has been identified, it is stabilized to make it robust and repeatable.

Although many big data solutions will be developed using the stages described here, following them is not mandatory. You may know exactly what information you want from the data and how to extract it. Alternatively, if you don’t intend to repeat the process, there’s no point in refining or stabilizing it.
