Scenario 3: ETL automation

From: Developing big data solutions on Microsoft Azure HDInsight

In this scenario, HDInsight is used to perform an Extract, Transform, and Load (ETL) process that filters and shapes the source data, and then uses it to populate a database table. Specifically, the scenario demonstrates how you can:

  • Use the .NET Library for Avro to serialize data for processing in HDInsight.
  • Use the classes in the .NET API for Hadoop WebClient package to upload files to Azure storage (see the sketch after this list).
  • Use an Oozie workflow to define an ETL process that includes Pig, Hive, and Sqoop tasks.
  • Use the classes in the .NET API for Hadoop WebClient package to automate execution of an Oozie workflow.
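
To give a sense of the upload step mentioned in the second bullet, the following is a minimal sketch of copying a local file to the cluster's Azure blob storage. It assumes the WebHDFSClient and BlobStorageAdapter classes from the .NET API for Hadoop WebClient package (exact class and namespace names can vary between SDK versions), and the storage account, key, container, user name, and file paths are all placeholder values.

using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;

internal static class TelemetryUploader
{
  // Sketch only: all of the names below are placeholders for your own cluster.
  internal static async Task UploadSensorFileAsync()
  {
    const string storageAccount = "mystorageaccount";
    const string storageKey = "<storage-account-key>";
    const string container = "mycontainer";
    const string hadoopUser = "hadoopuser";

    // The adapter directs WebHDFS-style calls at the cluster's Azure blob storage
    // container; the final argument indicates whether the container should be
    // created if it does not already exist.
    var adapter = new BlobStorageAdapter(storageAccount, storageKey, container, false);
    var hdfsClient = new WebHDFSClient(hadoopUser, adapter);

    // Upload a local telemetry file to the folder the ETL workflow will read from.
    await hdfsClient.CreateFile(@"C:\Telemetry\gps.avro", "/racecar/source/gps.avro");
  }
}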

Introduction to racecar telemetry

This scenario is based on a fictitious motor racing team that captures and monitors real-time telemetry from sensors on a racecar as it is driven around a racetrack. To perform further analysis of the telemetry data, the team plans to use HDInsight to filter and shape the data before loading it into Azure SQL Database, from where it will be consumed and visualized in Excel. Loading the data into a database for analysis enables the team to decommission the HDInsight cluster after the ETL process is complete, and makes the data easily consumable from any client application that can connect to a SQL Server data source.

In this simplified example, the racecar has three sensors that are used to capture telemetry readings at one second intervals: a global positioning system (GPS) sensor, an engine sensor, and a brake sensor. The telemetry data captured from the sensors on the racecar includes:

  • From the GPS sensor:
    • The date and time the sensor reading was taken.
    • The geographical position of the car (its latitude and longitude coordinates).
    • The current speed of the car.
  • From the engine sensor:
    • The date and time the sensor reading was taken.
    • The revolutions per minute (RPM) of the crankshaft.
    • The oil temperature.
  • From the brake sensor:
    • The date and time the sensor reading was taken.
    • The temperature of the brakes.

Note

The sensors used in this scenario are deliberately simplistic. Real racecars include hundreds of sensors emitting thousands of telemetry readings at sub-second intervals.

ETL process goals and data sources

Motor racing is a highly technical sport, and analysis of how the critical components of a car are performing is a major part of how teams refine the car's design and how drivers optimize their driving style. The data captured over a single lap consists of many telemetry readings, which must be analyzed to find correlations and patterns in the car's performance.

In this scenario, a console application is used to capture and display the sensor readings in real time. The application is shown in Figure 1.

Figure 1 - A console application displaying racecar telemetry

The application captures the sensor readings as objects based on the following classes. Note that the Position property of the GpsReading class is based on the Location struct.

[DataContract]
internal struct Location
{
  [DataMember]
  public double lat { get; set; }
  [DataMember]
  public double lon { get; set; }
}

[DataContract(Name = "GpsReading", Namespace = "CarSensors")]
internal class GpsReading
{
  [DataMember(Name = "Time")]
  public string Time { get; set; }

  [DataMember(Name = "Position")]
  public Location Position { get; set; }

  [DataMember(Name = "Speed")]
  public double Speed { get; set; }
}

[DataContract(Name = "EngineReading", Namespace = "CarSensors")]
internal class EngineReading
{
  [DataMember(Name = "Time")]
  public string Time { get; set; }

  [DataMember(Name = "Revs")]
  public double Revs { get; set; }

  [DataMember(Name = "OilTemp")]
  public double OilTemp { get; set; }
}

[DataContract(Name = "BrakeReading", Namespace = "CarSensors")]
internal class BrakeReading
{
  [DataMember(Name = "Time")]
  public string Time { get; set; }

  [DataMember(Name = "BrakeTemp")]
  public double BrakeTemp { get; set; }
}

As the application captures the telemetry data, each sensor reading object is added to the corresponding list, as defined in the following code.

static List<GpsReading> GpsReadings = new List<GpsReading>();
static List<EngineReading> EngineReadings = new List<EngineReading>();
static List<BrakeReading> BrakeReadings = new List<BrakeReading>();
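
Before being uploaded to Azure storage, each list can be serialized to an Avro container file using the Microsoft .NET Library for Avro, as described in the first bullet point at the start of this topic. The following is a minimal sketch of this step for the GPS readings; the output file name is a placeholder, and the value of 24 objects per sync block is simply the figure used in the library's documentation samples.

using System.IO;
using Microsoft.Hadoop.Avro;
using Microsoft.Hadoop.Avro.Container;

// Sketch only: serialize the captured GPS readings to an Avro container file
// using deflate compression. The file path is a placeholder.
static void SaveGpsReadings()
{
  using (var outputStream = File.Create(@"C:\Telemetry\gps.avro"))
  using (var avroWriter = AvroContainer.CreateWriter<GpsReading>(outputStream, Codec.Deflate))
  using (var sequentialWriter = new SequentialWriter<GpsReading>(avroWriter, 24))
  {
    // Write each captured reading into the container.
    GpsReadings.ForEach(sequentialWriter.Write);
  }
}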

As part of the ETL processing workflow in HDInsight, the captured readings must be filtered to remove any null values caused by sensor transmission problems. At the end of the processing, the data must be restructured into a tabular format that matches the following Azure SQL Database table definition.

CREATE TABLE [LapData]
(
  [LapTime] [varchar](25) NOT NULL PRIMARY KEY CLUSTERED,
  [Lat] [float] NOT NULL,
  [Lon] [float] NOT NULL,
  [Speed] [float] NOT NULL,
  [Revs] [float] NOT NULL,
  [OilTemp] [float] NOT NULL,
  [BrakeTemp] [float] NOT NULL
);

The workflow and its individual components are described in The ETL workflow.
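
As a preview of the automation described in the final bullet point at the start of this topic, the sketch below submits and starts the Oozie workflow by calling the cluster's Oozie REST endpoints directly with HttpClient, rather than through the wrapper classes in the .NET API for Hadoop WebClient package that the guide itself uses. The cluster name, credentials, storage locations, and workflow application path are all placeholder values.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

internal static class OozieAutomationSketch
{
  // Sketch only: substitute your own cluster name, credentials, storage account,
  // container, and workflow application path.
  internal static async Task RunEtlWorkflowAsync()
  {
    var clusterUri = new Uri("https://mycluster.azurehdinsight.net/");
    const string userName = "admin";
    const string password = "<cluster-password>";

    // Oozie job properties, expressed as the XML configuration document that the
    // Oozie REST API expects. The nameNode and jobTracker values are typical
    // placeholders and vary by cluster version.
    string config =
      "<configuration>" +
      "<property><name>nameNode</name><value>wasb://mycontainer@mystorageaccount.blob.core.windows.net</value></property>" +
      "<property><name>jobTracker</name><value>jobtrackerhost:9010</value></property>" +
      "<property><name>user.name</name><value>" + userName + "</value></property>" +
      "<property><name>oozie.use.system.libpath</name><value>true</value></property>" +
      "<property><name>oozie.wf.application.path</name><value>/racecar/oozieworkflow/</value></property>" +
      "</configuration>";

    using (var client = new HttpClient { BaseAddress = clusterUri })
    {
      var credentials = Convert.ToBase64String(Encoding.UTF8.GetBytes(userName + ":" + password));
      client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", credentials);

      // Submitting to /oozie/v2/jobs?action=start creates the job and starts it in one call;
      // the JSON response contains the new job's id, which can be used to poll its status.
      var content = new StringContent(config, Encoding.UTF8, "application/xml");
      var response = await client.PostAsync("oozie/v2/jobs?action=start", content);
      Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
  }
}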
