Scenario 2: Data warehouse on demand
From: Developing big data solutions on Microsoft Azure HDInsight
In a data warehousing scenario, HDInsight is used as a data source for big data analysis and reporting. This scenario also discusses how you can minimize running costs by shutting down the cluster when it’s not is use. Specifically, the scenario describes:
- Introduction to A. Datum
- Analytical goals and data sources
- Creating the data warehouse
- Loading data into the data warehouse
- Analyzing data from the data warehouse
Introduction to A. Datum
This scenario is based on a fictional company named A. Datum, which conducts research into tornadoes and other weather-related phenomena in the Unites States. In the scenario, data analysts at A. Datum want to use HDInsight as a central repository for historical tornado data in order to analyze and visualize previous tornados, and to try to identify trends in terms of geographical locations and times.
Analytical goals and data sources
The data analysts at A. Datum have obtained historical data about tornadoes in the United States since 1934. The data includes the date and time, geographical start and end point, category, size, number of fatalities and casualties, and damage costs of each tornado. The goal of the data warehouse is to enable analysts to “slice and dice” this data and visualize aggregated values on a map. This will act as a resource for research into severe weather patterns, and will support further investigation in the future.
The data analysts also implemented a mechanism to continuously collect new data as it is published, and this data is added to the existing data by being uploaded to Azure blob storage. However, as a research organization, A. Datum does not monitor the data on a daily basis—this is the job of the storm warning agencies in the United States. Instead, A. Datum’s primary aim is to analyze the data only after significant tornado events have occurred, or on a quarterly basis to provide updated results for use in their scientific publications. Therefore, to minimize running costs, the analysts want to be able to shut down the cluster when it is not being used for analysis, and recreate it when required.
This scenario demonstrates:
- How you can define and create a data warehouse containing a database and Hive tables in HDInsight.
- How you can automate the loading of data into the tables in the data warehouse.
- How you can define queries to extract the data from the data warehouse.
- How you can view and analyze the data, and generate compelling visualizations using a range of tools