Use case 3: ETL automation

From: Developing big data solutions on Microsoft Azure HDInsight

In a traditional business environment, the data to power your reporting mechanism will usually come from tables in a database. However, it’s increasingly necessary to supplement this with data obtained from outside your organization. This may be commercially available datasets, such as those available from Azure Marketplace and elsewhere, or it may be data from less structured sources such as social media, emails, log files, and more.

You will, in most cases, need to cleanse, validate, and transform this data before loading it into an existing database. Extract, Transform, and Load (ETL) operations can use Hadoop-based systems such as HDInsight to perform pattern matching, data categorization, de-duplication, and summary operations on unstructured or semi-structured data to generate data in the familiar rows and columns format that can be imported into a database table or a data warehouse. ETL is also the primary way to ensure that the data is valid and contains correct values, while data cleansing is the gatekeeper that protects tables from invalid, duplicated, or incorrect values.
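
To make this concrete, the following is a minimal Python sketch of the kind of pattern matching, de-duplication, and summarization involved. In a real solution you would express these steps as Pig scripts, Hive queries, or map/reduce jobs running on the cluster; the log lines and field names here are invented purely for illustration.

    import re
    from collections import Counter

    # Hypothetical semi-structured log lines; in practice these would come from
    # files in Azure blob storage rather than an in-memory list.
    raw_lines = [
        "2016-01-05 10:01:22 user=alice action=view product=widget",
        "2016-01-05 10:01:22 user=alice action=view product=widget",   # duplicate
        "2016-01-05 10:02:10 user=bob action=buy product=gadget",
        "malformed line that fails validation",
    ]

    # Pattern matching: keep only lines that fit the expected structure.
    pattern = re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2} "
        r"user=(?P<user>\w+) action=(?P<action>\w+) product=(?P<product>\w+)$"
    )

    # De-duplication: a set collapses repeated records.
    records = {m.groups() for m in (pattern.match(line) for line in raw_lines) if m}

    # Summarization: count actions per product, producing the familiar rows and
    # columns that could be loaded into a database table.
    summary = Counter((product, action) for _, _, action, product in records)
    for (product, action), count in sorted(summary.items()):
        print(f"{product}\t{action}\t{count}")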

The following sections of this topic provide more information:

  • Use case and model overview
  • When to choose this model
  • Data sources
  • Output targets
  • Considerations

There is often some confusion between the terms ETL and ELT. ETL, as used here, is the better-known term, and describes transforming incoming data before loading it into a data warehouse. ELT is the process of loading the data into the data warehouse in raw form and then transforming it afterwards. Because the Azure blob storage used by HDInsight can store schema-less data, storing the raw data is not an issue (it might be when the target is a relational data store). The data is then extracted from blob storage, transformed, and the results are loaded back into blob storage. See ETL or ELT… or both? on the Microsoft OLAP blog for a more complete discussion of this topic.

Use case and model overview

Figure 1 shows an overview of the use case and model for ETL automation. Input data is transformed to generate the appropriate output format and data content, and then imported into the target data store, application, or reporting solution. Analysis and reporting can then be done against the data store, often by combining the imported data with existing data in the data store. Applications such as reporting tools and services can then consume this data in an appropriate format, and use it for a variety of purposes.

Figure 1 - High-level view of the ETL automation model

The transformation process may involve just a single query, but it is more likely to require a multi-step process. For example, it might use custom map/reduce components or (more likely) Pig scripts, followed by a Hive query in the final stage to generate a tabular format. However, the final format may be something other than a table. For example, it may be a tab-delimited file or some other format suitable for import into the target application.
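
As a simple illustration, the following Python sketch chains a Pig stage and a Hive stage from an external script running on the cluster head node, using the pig and hive command-line tools. The script and query file paths are placeholders, and a real solution might instead use a workflow engine such as Oozie (discussed under Considerations below).

    import subprocess

    # Placeholder paths; these scripts and queries are assumed to already
    # exist on the head node.
    PIG_SCRIPT = "/home/hdiuser/scripts/cleanse-and-categorize.pig"
    HIVE_QUERY = "/home/hdiuser/queries/summarize-to-table.hql"

    def run(cmd):
        """Run a command and fail fast if it returns a non-zero exit code."""
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Step 1: the Pig script performs the cleansing and categorization pass.
    run(["pig", "-f", PIG_SCRIPT])

    # Step 2: the Hive query reads the Pig output and produces the final
    # tabular result.
    run(["hive", "-f", HIVE_QUERY])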

Note

An example of applying this use case and model can be found in Scenario 3: ETL automation.

When to choose this model

The ETL automation model is typically suited to the following scenarios:

  • Extracting and transforming data before you load it into your existing databases or analytical tools.
  • Performing categorization and restructuring of data, and extracting summary results to remove duplication and redundancy.
  • Preparing data so that it is in the appropriate format and has appropriate content to power other applications or services.

Data sources

Data sources for this model are typically external data that can be matched on a key to existing data in your data store so that it can be used to augment the results of analysis and reporting processes. Some examples are:

  • Social media data, log files, sensors, and applications that generate data files.
  • Datasets obtained from Azure Marketplace and other commercial data providers.
  • Streaming data captured, filtered, and processed through a suitable tool or framework (see Collecting and loading data into HDInsight).

Output targets

This model is designed to generate output that is in the appropriate format for the target data store. Common types of data store are:

  • A database such as SQL Server or Azure SQL Database.
  • A document or file sharing mechanism such as SharePoint Server, or another information management system.
  • A local or remote data repository in a custom format, such as JSON objects.
  • Cloud data stores such as Azure table and blob storage.
  • Applications or services that require data to be processed into specific formats, or as files that contain specific types of information structure.

You may decide to use this model even when you don’t actually want to keep the results of the big data query. You can load the results into your database, generate the reports and analyses you require, and then delete the data from the database. You may need to do this every time if the source data changes between reporting cycles in a way that means simply adding new data is not appropriate.

Considerations

There are some important points to consider when choosing the ETL automation model:

  • This model is typically used when you want to:

    • Load stream data or large volumes of semi-structured or unstructured data from external sources into an existing database or information system.
    • Cleanse, transform, and validate the data before loading it, perhaps using more than one transformation pass through the cluster.
    • Generate reports and visualizations that are regularly updated.
    • Power other applications that require specific types of data, such as using an analysis of previous behavioral information to apply personalization to an application or service.
  • When the output is in tabular format, such as that generated by Hive, the data import process can use the Hive ODBC driver or LINQ to Hive. Alternatively, you can use Sqoop (which is included in the Hadoop distribution installed by HDInsight) to connect a relational database such as SQL Server or Azure SQL Database to your HDInsight data store and export the results of a query into your database. If you are using the Microsoft Analytics Platform System (APS) you can access the data in HDInsight using PolyBase, which acts as a bridge between APS and HDInsight so that HDInsight becomes just another data source available for use in queries and processes in APS. A sketch of reading Hive output through the ODBC driver follows the note below.

    Note

    Some other connectors for accessing Hive data are available from Couchbase, Jaspersoft, and Tableau Software.
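
    As an illustration, the following minimal Python sketch reads Hive query results through the Hive ODBC driver using the pyodbc package. The DSN name, credentials, and the table and column names are placeholders for your own configuration.

        import pyodbc

        # Connect through a system DSN configured for the Hive ODBC driver.
        # "HiveDSN" and the credentials are placeholders for your own settings.
        conn = pyodbc.connect("DSN=HiveDSN;UID=admin;PWD=your-password", autocommit=True)
        cursor = conn.cursor()

        # Read the tabular output produced by the Hive stage of the ETL process.
        cursor.execute("SELECT product, action, total FROM etl_summary")
        for product, action, total in cursor.fetchall():
            print(product, action, total)

        conn.close()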

  • If the target for the data is not a database, you can generate a file in the appropriate format within the query. This might be a tab-delimited format, fixed width columns, or some other format for loading into Excel or a third-party application, or even into Azure storage through a custom data access layer that you create. Azure table storage can be used to store table-formatted data using a key to identify each row. Azure blob storage is more suitable for storing compressed or binary data generated from the HDInsight query if you want to store it for reuse. A sketch of writing and uploading such a file follows.
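
    As an illustration, the following minimal Python sketch writes a result set as a tab-delimited file and uploads it to blob storage using the azure-storage-blob package. The rows, connection string, container, and file names are placeholders.

        from azure.storage.blob import BlobServiceClient

        # Placeholder results; in practice these would come from the query output.
        rows = [("widget", "view", 42), ("gadget", "buy", 7)]

        # Write the results as a tab-delimited file suitable for Excel or other tools.
        with open("etl_summary.tsv", "w") as f:
            for row in rows:
                f.write("\t".join(str(value) for value in row) + "\n")

        # Upload the file to blob storage for reuse; the connection string and
        # container name are placeholders for your own storage account.
        service = BlobServiceClient.from_connection_string("<storage-connection-string>")
        blob = service.get_blob_client(container="etl-output", blob="etl_summary.tsv")
        with open("etl_summary.tsv", "rb") as data:
            blob.upload_blob(data, overwrite=True)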

  • If the intention is to regularly update the target table or data store as the source data changes, you will probably choose to use an automated mechanism to execute the query and data import processes. However, if it is a one-off operation you may decide to execute it interactively, only when required.

  • If you need to execute several operations on the data as part of the ETL process you should consider how you manage these. If they are controlled by an external program, rather than as a workflow within the solution, you will need to decide whether some can be executed in parallel, and you must be able to detect when each job has completed. Using a workflow mechanism such as Oozie within Hadoop may be easier than trying to orchestrate several operations using external scripts or custom programs. See Workflow and job orchestration for more information about Oozie.
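
    If you do orchestrate the jobs from an external program rather than with Oozie, the following minimal Python sketch shows one way to run independent jobs in parallel and detect when each has completed. The job commands and paths are placeholders.

        import subprocess
        from concurrent.futures import ThreadPoolExecutor, as_completed

        # Placeholder commands for jobs that do not depend on each other.
        independent_jobs = [
            ["pig", "-f", "/home/hdiuser/scripts/categorize.pig"],
            ["pig", "-f", "/home/hdiuser/scripts/deduplicate.pig"],
        ]

        def run_job(cmd):
            # check=True raises an exception if the job fails, so the caller can react.
            subprocess.run(cmd, check=True)
            return " ".join(cmd)

        # Run the independent jobs in parallel and wait for each one to finish.
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(run_job, cmd) for cmd in independent_jobs]
            for future in as_completed(futures):
                print("Completed:", future.result())

        # The dependent final step runs only after all of the above have succeeded.
        subprocess.run(["hive", "-f", "/home/hdiuser/queries/summarize.hql"], check=True)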
