Phase 3: Stabilizing the solution
From: Developing big data solutions on Microsoft Azure HDInsight
With the data processing solution refined to produce useful insights, the analysts at Blue Yonder Airlines decided to stabilize the solution to make it more robust and repeatable. Key concerns were the dependencies on hard-coded file paths and data schemas in the Pig Latin scripts, and the data ingestion process.
With the current solution, any changes to the location or format of the tweets.txt, noisewords.txt, or synonyms.txt files would break the current scripts. Such changes are particularly likely to occur if the solution gains acceptance among users and Hive tables are created on top of the files to provide a more convenient query interface.
Using HCatalog to create Hive tables
HCatalog provides an abstraction layer between data processing tools such as Hive, Pig, and custom map/reduce code, and the physical storage of the data. By using HCatalog to manage data storage, the analysts hoped to remove the location and format dependencies from the Pig Latin scripts, while making it easy for users to access the analytical data directly through Hive.
The analysts had already created a table named Tweets for the source data, and now decided to create tables for the noise words file and the synonyms file. They also decided to create a table for the results generated by the WordCount.pig script, to make it easier to perform further queries against the aggregated word counts. To accomplish this, the analysts created the following script file containing HiveQL statements, which they saved as CreateTables.hcatalog on the HDInsight server.
-- External table over the noise words list.
CREATE EXTERNAL TABLE NoiseWords (noiseword STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/noisewords';

-- External table over the synonyms list.
CREATE EXTERNAL TABLE Synonyms (leadingvalue STRING, synonym STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/synonyms';

-- External table for the aggregated word count results.
CREATE EXTERNAL TABLE TopWords (word STRING, count BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/topwords';
To execute this script with HCatalog, the analysts ran the following command in the Hadoop command window on the cluster.
%HCATALOG_HOME%\bin\hcat.py -f C:\Scripts\CreateTables.hcatalog
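Once the script completes, the new table definitions can be confirmed from the Hive command shell with standard HiveQL; for example:

SHOW TABLES;
DESCRIBE TopWords;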
After the script had successfully created the new tables, the noisewords.txt and synonyms.txt files were moved into the folders used by the corresponding tables.
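For example, assuming the files had originally been uploaded directly to the /twitterdata folder (an assumption; the actual source paths may differ), commands such as the following would move them into the table locations:

hadoop fs -mv /twitterdata/noisewords.txt /twitterdata/noisewords/noisewords.txt
hadoop fs -mv /twitterdata/synonyms.txt /twitterdata/synonyms/synonyms.txt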
Using HCatalog in a Pig Latin script
Moving the noisewords.txt and synonyms.txt files made the data available through the corresponding Hive tables, but broke the WordCount.pig script, which includes hard-coded paths to these files. The script also uses hard-coded paths to load the tweets.txt file from the folder used by the Tweets table, and to save the results.
This dependency on hard-coded paths makes the data processing solution vulnerable to changes in the way the data is stored. For example, if an administrator later modified the Tweets table to partition the data, the code that loads the source data would no longer work; and if the source data changed to a different format or schema, the script would need to be updated to match. One of the major advantages of using HCatalog is that data can be relocated and redefined as required without breaking the scripts and code that access it.
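For illustration, a hard-coded load of this kind ties the script to a specific path and schema (the path and schema shown here are assumptions, not the actual statement from WordCount.pig):

-- Hard-coded path and schema: breaks if the file moves or changes format.
Tweets = LOAD '/twitterdata/tweets/tweets.txt' USING PigStorage('\t') AS (id:chararray, tweet:chararray);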
To eliminate these dependencies, the analysts modified the WordCount.pig script to use HCatalog classes to load and save data through the Hive tables instead of accessing the source files directly. The modified sections of the script are shown below.
-- Load tweets from the Tweets table using HCatalog.
Tweets = LOAD 'Tweets' USING org.apache.hcatalog.pig.HCatLoader();
...
-- Load the noise words using HCatalog.
NoiseWords = LOAD 'NoiseWords' USING org.apache.hcatalog.pig.HCatLoader();
...
-- Match synonyms using data loaded through HCatalog.
Synonyms = LOAD 'Synonyms' USING org.apache.hcatalog.pig.HCatLoader();
...
-- Store the results in the TopWords table using HCatalog.
STORE Top100Words INTO 'TopWords' USING org.apache.hcatalog.pig.HCatStorer();
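Note that when a script uses HCatLoader or HCatStorer, Pig must be started with HCatalog support enabled. For example, the modified script can be run from the Hadoop command window with the -useHCatalog option (the script path shown here is an assumption):

pig -useHCatalog C:\Scripts\WordCount.pig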
The script no longer includes any hard-coded file paths or data schemas; instead it uses HCatalog to reference the Hive tables created earlier. The results of the script are stored in the TopWords table, and can be viewed by executing a HiveQL query such as the following example.
SELECT * FROM TopWords;
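Because the aggregated results are now in a Hive table, further analysis is also straightforward. For example, a query such as the following (illustrative only) returns the ten most frequent words; the backticks guard against the column name clashing with the built-in count function:

SELECT word, `count` FROM TopWords ORDER BY `count` DESC LIMIT 10;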
Finalizing the data ingestion process
Having stabilized the solution, the team finally examined ways to improve and automate the process of extracting data from Twitter and loading it into Azure blob storage. The solution they adopted uses a SQL Server Integration Services (SSIS) package that connects to the Twitter streaming API each day to download new tweets, and uploads them as a file to blob storage.
The team will also review the columns that are stored; for example, keeping only the Tweet column rather than other columns that are not used in the current process. This might be achieved by adding a Hive staging table and preprocessing the data, along the lines of the sketch below. However, the team needs to weigh this against the possibility that the data in the other columns may be useful in the future.
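As a sketch of this staging approach, assuming the Tweets table exposes the tweet text in a column named tweet (the TweetText table, its column, and its location are illustrative assumptions, not part of the team's actual solution):

-- Hypothetical staging table that holds only the tweet text.
CREATE EXTERNAL TABLE TweetText (tweet STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/tweettext';

-- Copy just the tweet column from the full source table.
INSERT OVERWRITE TABLE TweetText
SELECT tweet FROM Tweets;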
To complete the examination of the data, the team next explored how the results would be used, as described in Consuming the results.