January 2016

Volume 31 Number 1

[Big Data]

Creating Big Data Pipelines Using Azure Data Lake and Azure Data Factory

By Gaurav Malhotra | January 2016

This year, Microsoft expanded its Azure Big Data offerings with the announcement of the Azure Data Lake (ADL) service, along with the ability to create end-to-end (E2E) Big Data pipelines using ADL and Azure Data Factory (ADF). In this article, I’ll highlight the use of ADF to schedule both one-time and repeating tasks for moving and analyzing Big Data.

ADL makes processing Big Data simpler and more accessible by providing several key technologies. The U-SQL language is a powerful combination of SQL and C# that supports parallel execution. You can run U-SQL in the ADL Analytics cloud offering, where hundreds or thousands of containers can be reserved, used and released in the lifetime of a single job. Setting up this cloud environment with ADL is easy. Using the Azure Management Portal, you can quickly and easily create ADL accounts for both storage and analytics, and provision an ADF. Within minutes, with the click of a few buttons, you can set up all the necessary accounts in your Azure subscription.

Once the account provisioning is complete, you can create end-to-end Big Data pipelines in ADF using the Azure Management Portal, Windows PowerShell, the C# SDK and Visual Studio tools. ADF is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. The ADF-ADL integration allows you to:

  • Move data from a given source to the ADL Store.
  • Create Big Data ADF pipelines that run U-SQL as a processing step on the ADL Analytics service.

There are a number of common Big Data scenarios that ADL and ADF address, including customer churn analysis, personalized product recommendations and actuarial processing. One that’s interesting to many Azure customers is analyzing Web services or application logs. In this article, I’ll show you how you can create data factory pipelines to analyze Web logs, by first moving your Web logs to ADL and then running U-SQL scripts to process them.

The Web Log Analysis Scenario

A common business insight scenario is the analysis of Web logs to understand the volume and pattern of user requests based on the originating regions or locales around the world. Such analysis informs customer understanding, marketing campaigns and product roadmaps, including localization plans. Logs are emitted by Web applications, network devices, OSes, and all manner of intelligent or programmable devices. Data streams such as error logs, clickstream instrumentation and Web server logs can easily accumulate at the rate of gigabytes or terabytes a week. These Web logs can accumulate in any form of storage, including SQL Azure, Azure Blob Storage, Amazon Simple Storage Service (S3), and on-premises databases such as Oracle. Analyzing these logs quickly and efficiently to discover usage patterns and system issues helps companies understand their customers’ usage better and, ultimately, drive up customer engagement and satisfaction.

It’s easy to store these Web logs in the ADL Store, which can elastically scale to store petabytes of data. When provisioning an ADL Store account, no sizing parameters are needed—the account simply grows to accommodate the size of the files loaded into it. Because you pay only for what’s actually stored, the service is cost-effective, and the lack of fixed limits on account or file size, plus massive throughput capability, make it ideal for performing Big Data analytics. In addition, you don’t have to rewrite code or move your data to a different storage platform as the size of data stored increases or decreases.

Parsing and aggregating Web logs based on certain partitions, like region, are activities that allow a high degree of parallelism. The ideal case is to have subsets of records farmed out to individual servers that can parse, transform and summarize those records. These partial results are then merged in multiple parallel stages until the final aggregated dataset is created. Managing this process manually is extremely complicated and prone to sub-optimal execution based on incomplete information about the system and the changing shape of the data from day to day. However, this is exactly what ADL Analytics and the U-SQL language do automatically. U-SQL lets you express your target aggregations in a declarative, SQL-like query syntax that doesn’t require any parallelism directives. The compiler and scheduler then figure out the degree of parallelism inherent in the job and allocate resources based on that parallelism and any limits specified for maximum resource usage. Using ADF, you can easily build up a pipeline by specifying such U-SQL tasks; chain them together with other tasks; add activities to move data from your Web servers to ADL Store; and create a schedule to regularly process the data. The simplicity of creating the pipeline and its components lets you focus on your business logic instead of on how to optimize the processing and storage of large datasets.
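
To make the declarative style concrete, here is a minimal U-SQL sketch of a per-region aggregation. The log path and two-column schema are purely hypothetical (they aren’t the sample data used later in this article); the point is that the script contains no parallelism directives at all, and the ADL Analytics compiler and scheduler decide how to partition the work.

// Hypothetical log layout: one region string and one duration per request.
@requests =
    EXTRACT Region   string,
            Duration int
    FROM "/weblogs/requests.tsv"
    USING Extractors.Tsv();

// Declarative aggregation: request count and total duration per region.
@byRegion =
    SELECT Region,
           COUNT(*) AS RequestCount,
           SUM(Duration) AS TotalDuration
    FROM @requests
    GROUP BY Region;

OUTPUT @byRegion
    TO "/weblogs/output/requestsbyregion.tsv"
    USING Outputters.Tsv();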

Getting Set Up

You start by creating the ADL Store and Analytics accounts and provisioning a data factory. This is all done through the Azure Management Portal. An ADL Analytics account is the entity that helps you group and manage the queries and programs you run to do Big Data analysis. You can manage billing by associating your account with different Azure subscriptions and choosing a pricing tier. There are options for grouping the account with other resources for tracking purposes. You can even choose the region of the datacenter where your account lives, which can be useful for managing proximity to on-premises data.

The ADL Store service provides storage for Big Data that can be accessed from HDFS-compliant systems, including business intelligence (BI) tools and on-premises applications. The setup is very simple and you don’t have to specify any limits at setup time. As with the analytics service, an important option is the geographic region where you want the data housed. In the case of data storage, this can be critical, as there might be business requirements related to legal compliance for where data about a region’s citizens is stored. ADL Store accounts can be created separately and used with other services, but the most common case is to create an account in tandem with ADL Analytics. Figure 1 shows the screen for creating an ADL account. You provide a name for the account and choose your subscription, resource group and location. The Create New Data Lake Store option lets you create a new store at the same time and with the same options as your Analytics account.

Figure 1 Creating ADL Accounts in the Azure Management Portal

In this example, the Web log data is stored in Azure Storage, which allows you to store blob data on Azure. You can also create Azure Storage through the portal. Whether you use an Azure Web App or a Web site hosted somewhere else, getting your data onto Azure Storage is easy, and it means the data will have high availability and durability. The sample Web log data used in the example can be found at bit.ly/1ONi8c5.

Figure 2 shows how you can provision an Azure data factory. As you can see, the process of setting up these cloud-based services is very easy. You don’t need to use the same geographic region as the ADL services—the data factory can operate with services in any region.

Figure 2 Provisioning an Azure Data Factory

Data factories are a combination of linked services, datasets and pipelines. Linked services and datasets are definitions of external entities, and the data they reference, that usually already exist outside of ADF. Pipelines are a logical grouping of activities in ADF; they group activities into a unit that together performs a task. You’ll see this in more detail as I walk through setting up the data factory for the Web log analytics.

You can create these entities using the Azure Management Portal or Visual Studio. In the portal, under the Data Factory view, an Author and Deploy option allows you to select the individual components of a data factory by type, and provides JSON snippets that can be edited directly and published (see Figure 3). Alternatively, you can take advantage of the ADF tools for Visual Studio and use a project format to identify and define each of the components of the data factory (Figure 4). The project can then also be published to create these entities in your data factory in Azure.

Figure 3 Author and Deploy a Data Factory Using the Web Editor

Figure 4 Author and Deploy a Data Factory Using the Visual Studio Plug-In

Moving Data to Azure Data Lake Store

The first step in the Web log analysis scenario is to move the data to ADL Store. You can move data to ADL Store using the Copy activity in an ADF pipeline. To perform the copy operation, you need to create ADF linked services, datasets and pipelines. Linked services in ADF define the information needed to connect to external resources, and they’re used for two purposes in Data Factory. The first is to represent a data store, including, but not limited to, an on-premises SQL Server, an Oracle database, a file share or an Azure Blob Storage account. The second is to represent a processing resource that can host the execution of an activity; for example, the HDInsight Hive activity executes on an HDInsight Hadoop cluster. In this case, you need to create two linked services, one corresponding to the Azure Storage account and a second representing the ADL Store.
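
As a rough sketch, the two linked service definitions look something like the following JSON, in the ADF format used at the time of this writing. The account names, key and OAuth values (the sessionId and authorization properties, which the Author and Deploy editor fills in when you authorize the service) are placeholders, and the ADL Store linked service name is illustrative; only StorageLinkedService is referenced by name later in this article.

{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<accountkey>"
    }
  }
}

{
  "name": "AzureDataLakeStoreLinkedService",
  "properties": {
    "type": "AzureDataLakeStore",
    "typeProperties": {
      "dataLakeStoreUri": "https://<adlstoreaccountname>.azuredatalakestore.net/webhdfs/v1",
      "sessionId": "<session ID>",
      "authorization": "<authorization URL>"
    }
  }
}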

You also need to create two ADF datasets. Datasets are logical references to data in an Azure Storage account or ADL Store. No user data is stored in ADF itself, so dataset definitions are needed for ADF to identify the structure of data in the external data stores, including tables, files, folders and documents. Because ADF doesn’t know the structure of this data, you need to define it here so the system knows what columns and types to expect. In this case, you need to create one dataset corresponding to the Azure Storage account location that contains the Web log data (source) and a second dataset corresponding to the ADL Store where you want to move the Web logs (sink).
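
Sketched in the same spirit, the source and sink dataset definitions might look roughly like the following. The dataset names match those referenced by the pipeline in Figure 5, and the sink path matches the /datalake/input/SearchLog.tsv location consumed later by the U-SQL step; the source folder path is an illustrative placeholder. The external flag marks the blob data as produced outside ADF, and the availability section lines up with the daily scheduler used by the Copy activity.

{
  "name": "RawBlobDemoTable",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "weblogs/raw/",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t"
      }
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}

{
  "name": "DataLakeTable",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "AzureDataLakeStoreLinkedService",
    "typeProperties": {
      "folderPath": "datalake/input/",
      "fileName": "SearchLog.tsv",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t"
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}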

Finally, for the data copy to happen, you need to create an ADF pipeline that contains a Copy activity. An ADF pipeline is a logical grouping of activities, such as the data Copy activity and the Hive, Pig or U-SQL script activities, that can run at regular intervals: every 15 minutes, hourly, daily or monthly. The Copy activity in ADF is very powerful and allows you to copy data between on-premises or cloud sources and sinks that may have different schemas. You can begin by specifying just a few parameters or accepting the defaults, and you have fine-grained control over settings such as the schedule and the policies for handling error conditions.

Although a pipeline can run on a repeating schedule, in the current example it’s run just once to move the data into the ADL Store. The JSON snippet in Figure 5 shows a definition for a pipeline called EgressBlobToDataLakePipeline. This pipeline has a Copy activity to move data from Azure Blob Storage to Azure Data Lake Store. It’s scheduled to run on 08/08/2015 and will run only once, because the “start” and “end” properties restrict the pipeline’s active period to that single day.

Figure 5 EgressBlobToDataLakePipeline Sample Pipeline Definition

{
  "name": "EgressBlobToDataLakePipeline",
  "properties": {
    "description": "Egress data from blob to azure data lake",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "treatEmptyAsNull": true
          },
          "sink": {
            "type": "AzureDataLakeStoreSink",
            "writeBatchSize": 10000,
            "writeBatchTimeout": "00:10:00"
          }
        },
        "inputs": [
          {
            "name": "RawBlobDemoTable"
          }
        ],
        "outputs": [
          {
            "name": "DataLakeTable"
          }
        ],
        "policy": {
          "timeout": "10:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 1
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        },
        "name": "EgressDataLake",
        "description": "Move data from blob to azure data lake"
      }
    ],
    "start": "2015-08-08T00:00:00Z",
    "end": "2015-08-08T01:00:00Z",
    "isPaused": false
  }
}

When the Copy activity in the ADF pipeline completes successfully, the Web logs have been moved from Azure Blob Storage to Azure Data Lake Store. You can learn more about Azure Data Factory data movement activities at bit.ly/1MNbIqZ, and more about using the Azure Data Lake Store connector in ADF at bit.ly/1MRwvVZ. Now you’re ready to process and analyze the Web logs.

Creating a Pipeline with U-SQL Activities

With the data in ADL Store, you can now run U-SQL scripts on the ADL Analytics service to process and analyze the Web logs. You can create pipelines that consume the data from ADL Store, run U-SQL scripts on the ADL Analytics service as a processing step and produce the output in ADL Store. Downstream applications can then consume the processed output directly from ADL Store, or you can copy the data from ADL Store to Azure SQL Data Warehouse if your BI applications use a SQL data warehouse as a back-end store.

In order to process the Web logs, you need to create ADF linked services, datasets and pipelines again. You can reuse the ADL Store linked service created in the previous step if you want to create a sequence of pipelines that do the data movement first and then perform data analysis by running U-SQL scripts in a single ADF. Or, you can create a new data factory that just performs the data analysis. The ADF pipeline in this case contains an Azure Data Lake Analytics U-SQL activity and runs a U-SQL script to find all events for the Great Britain (“en-gb”) locale with a date earlier than 2012/02/19. Figure 6 contains the JSON definition for ComputeEventsByEnGbRegionPipeline, which defines such a pipeline with a U-SQL activity to do the Web log processing.

Figure 6 ComputeEventsByEnGbRegionPipeline Sample Pipeline Definition

{
  "name": "ComputeEventsByEnGbRegionPipeline",
  "properties": {
    "description": "This is a pipeline to compute events for en-gb locale
      and date less than 2012/02/19.",
    "activities": [
      {
        "type": "DataLakeAnalyticsU-SQL",
        "typeProperties": {
          "scriptPath": "scripts\\usql\\SearchLogProcessing.txt",
          "scriptLinkedService": "StorageLinkedService",
          "degreeOfParallelism": 3,
          "priority": 100,
          "parameters": {
            "in": "/datalake/input/SearchLog.tsv",
            "out": "/datalake/output/Result.tsv"
          }
        },
        "inputs": [
          {
            "name": "DataLakeTable"
          }
        ],
        "outputs": [
          {
            "name": "EventsByEnGbRegionTable"
          }
        ],
        "policy": {
          "timeout": "06:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 1
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        },
        "name": "EventsByRegion",
        "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
      }
    ],
    "start": "2015-08-08T00:00:00Z",
     "end": "2015-08-08T01:00:00Z",
    "isPaused": false
  }
}
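
The pipeline in Figure 6 also references two entities that aren’t shown there: the AzureDataLakeAnalyticsLinkedService compute linked service and the EventsByEnGbRegionTable output dataset. As rough, illustrative sketches (the account name and OAuth values are placeholders; the output folder matches the /datalake/output/ path passed to the script), their definitions might look like this:

{
  "name": "AzureDataLakeAnalyticsLinkedService",
  "properties": {
    "type": "AzureDataLakeAnalytics",
    "typeProperties": {
      "accountName": "<adlanalyticsaccountname>",
      "sessionId": "<session ID>",
      "authorization": "<authorization URL>"
    }
  }
}

{
  "name": "EventsByEnGbRegionTable",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "AzureDataLakeStoreLinkedService",
    "typeProperties": {
      "folderPath": "datalake/output/"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}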

The U-SQL script in Figure 7 that the pipeline runs resides in the scripts/usql folder (the scriptPath property in the pipeline JSON in Figure 6) of the Azure Blob Storage account corresponding to the deployed StorageLinkedService. The values for the @in and @out parameters in the script are passed in dynamically by ADF through the parameters section of the pipeline JSON (see Figure 6). You can also specify other properties, such as degreeOfParallelism or priority, in your pipeline definition for the jobs that run on the ADL Analytics service. This U-SQL script processes the Web logs and returns all events for the “en-gb” locale with a date earlier than 2012/02/19.

Figure 7 The U-SQL Script SearchLogProcessing.txt

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM @in
    USING Extractors.Tsv(nullEscape:"#NULL#");

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "en-gb";

@rs1 =
    SELECT Start, Region, Duration
    FROM @rs1
    WHERE Start <= DateTime.Parse("2012/02/19");

OUTPUT @rs1
    TO @out
    USING Outputters.Tsv(quoting:false, dateTimeFormat:null);

Monitoring Big Data Pipelines

The Data Factory service provides a reliable and complete view of your storage, processing and data movement services. It helps you quickly assess end-to-end data pipeline health, pinpoint issues and take corrective action if needed. You can also visually track operational lineage and the relationships between your data across any of your sources, and see a full historical accounting of job execution, system health and dependencies from a single monitoring dashboard. The ADF Diagram view (see Figure 8) in the Azure Management Portal shows the operational lineage of the data factory. You can see the two pipelines and their corresponding datasets: EgressBlobToDataLakePipeline (copies data from Azure Blob Storage to Azure Data Lake Store) and ComputeEventsByEnGbRegionPipeline (gets all events for the “en-gb” locale with a date earlier than 2012/02/19).

Figure 8 Azure Data Factory Diagram View

The ADF copy pipeline in Figure 8 will start running on 08/08/2015 because the datasets have a daily frequency and the start and end properties in the pipeline definitions both fall on 08/08/2015. Therefore, the pipelines will run only for that day and the U-SQL script will run just once. You can learn more about scheduling ADF pipelines at bit.ly/1lEVjuM. Click on the EventsByEnGbRegionTable in the Diagram view to see the corresponding activity execution and its status (see Figure 9).

Figure 9 Azure Data Factory Activity View

You can see that the U-SQL activity in ComputeEventsByEnGbRegionPipeline has run successfully and created a Result.tsv file (/datalake/output/Result.tsv) in the ADL Store account. The Result.tsv file contains all Web log events for the “en-gb” locale with a date earlier than 2012/02/19. You can also log in to the Azure Management Portal and use the Azure Data Lake Data Explorer to view the Result.tsv file generated as part of the processing step in ADL Store (go back to Figure 4).

You can find detailed documentation about the AzureDataLakeAnalyticsU-SQL activity in Azure Data Factory at bit.ly/1WWtxuy.

Wrapping Up

By following the described steps, you can build an end-to-end Big Data pipeline using Azure Data Factory that moves data to Azure Data Lake Store and then runs a U-SQL script on the Azure Data Lake Analytics service to process the Web logs. The resulting system can dynamically scale according to your needs and can be extended to run on a recurring basis. You can also do further downstream processing on the Web log output and move it to another back-end store so the results can be consumed by Power BI or any other BI application your organization uses. Moreover, if you prefer, you can use the ADF PowerShell cmdlets, the C# SDK or the Visual Studio plug-in to build these E2E Big Data pipelines with ADL. Azure Data Lake, together with Azure Data Factory, takes away the complexities normally associated with Big Data in the cloud, ensuring that your current and future business needs can be met. Watch this space for more on solving Big Data problems using Azure services.


Gaurav Malhotra is a program manager on the Azure Data Factory team. He lives and works in Redmond, Wash. Reach him at gamal@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Omid Afnan, Harish Kumar Agarwal and Sachin Sheth

Omid Afnan is a principal program manager on the Azure Big Data team, working on implementations of distributed computation systems and related developer tool chains. He lives and works in China. Reach him at omafnan@microsoft.com.

Harish Kumar Agarwal is a program manager in the Azure Data Factory team. He can be reached at hariag@microsoft.com.

Sachin Sheth is a program manager on the Azure Data Lake team. You can reach him at sachins@microsoft.com.