2019-01-16

July 2017

Volume 32 Number 7

[Machine Learning]

Cognition at Scale with U-SQL on ADLA

By Hiren Patel | July 2017

Companies providing cloud-scale services have an ever-growing need to store and analyze massive data sets. Such analysis to transform data into insight is becoming increasingly valuable. For instance, analyzing telemetry from a service to derive insights into the investments that most improve service quality, analyzing usage patterns over time to detect changes in user behavior (engagement/churn), analyzing sensor data to perform preventative maintenance—all these are extremely important, as has become very apparent to us while running massive services like Bing and Cosmos. Most of these analyses involve feature engineering and modeling. With storage getting cheaper, there are no longer any constraints on the amount of data we can collect, which means we soon reach the limits of traditional single-node data processing engines and require a distributed processing platform to do machine learning tasks on massive datasets. Furthermore, machine learning models are usually built or used in applications or pipelines that involve processing raw data—deserializing data, filtering out unnecessary rows and columns, extracting features, and transforming them to a form amenable for modeling. To express such operations easily, users need a programming model that offers a degree of compositional freedom that’s typical of declarative languages.

U-SQL is a declarative language that provides the expressibility necessary for advanced analytics tasks, like machine learning and operating seamlessly on cloud-scale data. It also offers the following advantages:

The resemblance of U-SQL to SQL reduces the learning curve for users. It offers easy extensibility with user-defined operators, the ability to reuse existing libraries and the flexibility to choose different languages (C#, Python or R) to develop custom algorithms.
Users can focus on business logic while the system takes care of data distribution and task parallelism, along with execution plan complexities.
U-SQL has built-in support for machine learning.

Machine Learning Using U-SQL in ADLA

Building intelligent features into applications requires some form of prediction capability. There are two ways to go:

Build your own model: You first preprocess the telemetry or any kind of raw data into a shape suitable for modeling, then train a model on the pre-processed data and use this trained model for prediction in applications. The Azure Data Lake Analytics (ADLA) engine makes all the preprocessing possible in an efficient manner. It allows you to build machine learning models that cover a wide variety of scenarios, from building regression models to image classification via R and Python extensions, and enables you to build models using built-in, efficient, massively parallelable distributed machine learning algorithms. (We’ll discuss how to train a model using U-SQL in a future article.)

Using a pre-trained model for scoring: Suppose you have a pre-trained model but want to score large amounts of data efficiently. U-SQL can handle this pleasingly parallel task very well. U-SQL allows user-defined operators (UDOs) where you provide only a per-partition or per-row function and the system takes care of distributing the data and the tasks in the most efficient manner within given resource constraints.

Training a good model requires a lot of data and machine learning expertise—both of which are rare commodities. To help, U-SQL packages many of the machine learning models that power the Microsoft Cognitive Services. Built by some of the leading minds in the industry, these models are trained against massive data sets, and are highly performant and accurate. This integration of the cognitive models in U-SQL lets you easily add intelligent features—such as emotion detection, face and speech recognition; language understanding and sentiment analysis—to applications that work on massive amounts of data.

In this article, we’ll concentrate on how to use these pre-trained models to build intelligent applications using U-SQL, using either a pre-trained model or a built-in cognitive model.

Cognition with U-SQL

U-SQL provides built-in support for the following cognitive models, allowing you to build applications with powerful algorithms using just a few lines of code:

Face detects one or more human faces in an image, along with face attributes that contain machine learning-based predictions based on features such as age, emotion and gender.

Emotion analyzes facial expressions in an image to detect a range of emotions, currently including anger, contempt, disgust, fear, happiness, neutrality, sadness and surprise.

Image tagging returns information about visual content found in an image. It can be used along with descriptions and domain-specific models to identify content in an image.

Optical character recognition (OCR) detects and extracts handwritten text from images of notes, letters, whiteboards and so forth, and returns a machine-readable character stream.

Sentiment analysis detects sentiment using classification techniques based on the input text.

Key phrases extractionidentifies key phrases, topics and language from the input text.

Landmark detectionfinds landmarks in an image. This model recognizes 9,000 natural and man-made landmarks from around the world.

Cognition at Scale with U-SQL

Suppose you want to find out if the human population, in general, is happy when there are animals around them. One way to do this is to examine pictures posted by people who have animals with them and analyze the emotion of those people. In Microsoft, the set of Bing-crawled images would represent a valid dataset for this task, but the sheer scale of the data set would make this simple task cumbersome without an intelligent data-processing engine. Let’s see how U-SQL makes this easy. First, however, you’ll have to manually enable cognitive capabilities in your Azure Data Lake account.

Registering Cognitive Capabilities in U-SQL To get started with the Python, R, and Cognitive extensions, open your Data Lake Analytics account in the Azure Portal and click on Sample Scripts.

If you haven’t installed them already, you’ll see a notification at the top of the Sample Scripts blade for U-SQL Advanced Analytics.

Click it to begin the installation process.

Once you’ve chosen to install the extensions, the system will copy U-SQL extension-related files into the default Azure Data Lake Store (ADLS) associated with your ADLA account. A notification that files are being copied will appear near the Notification icon in the upper right of the page. When the files are copied, you’ll see an updated notification that the file copying was successful and that a special U-SQL Job was submitted to finish the registration. You can find the special job and its status by using View All Jobs in the upper-left corner of the Azure Portal. The Job normally will take a few minutes to complete.

At that point, you can discover what the job did by browsing the catalog for the master database. The job simply registers advanced analytic cognitive assemblies in the master database, which you can see by using the Data Explorer.

Seeing these assemblies in the master database is evidence that your ADLA account is setup correctly and you can now write a U-SQL script that uses cognitive functions to build intelligent applications.

Using Cognitive Functions in U-SQL As described earlier, the assemblies and models that power the Cognitive Services have been integrated with U-SQL, allowing you to run simple query statements over millions of images and process them using cognitive functions. The overall method for using these cognitive capabilities at scale in U-SQL is simply:

Use the REFERENCE ASSEMBLY statement to include the cognitive functions in the U-SQL script.
Use the EXTRACT operation to load data into a rowset.
Use the PROCESS operation to apply various Cognitive functions.
Use the SELECT operation to apply transformations to the predictions.
Use the OUTPUT operation to store the result into persistent store.

Let’s continue with the scenario described earlier, which involves processing a large number of images and analyzing the emotion of people when there are animals in the image. Figure 1 shows an example script for completing this scenario. In the example, we use the Vision cognitive functions, which enables us to understand what’s in an image and returns a set of tags that identify objects. For the sample script in Figure 1, we’re using a subset of the images from the team’s 1 million images dataset.

Figure 1 Cognition at Scale Example

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
// Load in images
@imgs =
  EXTRACT FileName string, ImgData byte[]
  FROM @"/usqlext/samples/cognition/{FileName:*}.jpg"
  USING new Cognition.Vision.ImageExtractor();
// Extract the number of objects and tags from each image
@objects =
  PROCESS @imgs
    PRODUCE      FileName,
      NumObjects int,
      Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();
// Extract all the images with dog and cat
@pets =
  SELECT FileName, T.Tag
    FROM @objects
    CROSS APPLY
    EXPLODE(SqlArray.Create(Tags.Split(';'))) AS T(Tag)
    WHERE T.Tag.ToString().Contains("dog") OR T.Tag.ToString().Contains("cat");
// Detect emotions from human face 
@emotions =
  PROCESS @imgs
    PRODUCE      FileName string,
      NumFaces int,
      Emotion string
    READONLY FileName
    USING new Cognition.Vision.EmotionAnalyzer();
// Correlation to find all the images which has both human and animals 
@preres =
  SELECT @pets.FileName, Emotion
    FROM @pets
    JOIN @emotions
    ON @pets.FileName == @emotions.FileName;
// Distribution of human emotions when animals are in the image
@result =
  SELECT Emotion, COUNT(FileName) AS frequency
    FROM @preres
    GROUP BY Emotion;
OUTPUT @result
  TO @"/my/cognition/output/sample_dog_cat.csv"
  USING Outputters.Csv();

In this simple U-SQL query, we’re doing some very powerful things. First, we’re extracting images into the byte array column using the system-provided ImageExtractor, and then loading them into rowsets. Next, we extract all the tags from those images using the built-in ImageTagger. Then we filter the images, finding those that have “cat” or “dog” tags. Using the system-provided EmotionAnalyzer, we next extract the faces and associated emotions from these images, then find all the images that have a human along with a dog or a cat. Finally, we output the distribution of human emotions in those images.

To demonstrate the scalability of U-SQL, we executed the same script on the full data set with 1 million images. As soon as we submit the script, in a matter of seconds, thousands of containers in ADLA spring to action to start processing these images, as shown in the Figure 2.

Figure 2 Job Execution Graph

You can easily extend this example to get other interesting insights, like the most frequently occurring pairs of tags, objects that appear together most often and so on. Furthermore, you can also detect age, gender, and landmarks from these images using other cognitive functions. For your reference, we’ve added the code snippets in Figure 3 to describe how to use other built-in cognitive functions in U-SQL applications.

Figure 3 Code Snippets for Other Cognitive APIs Supported in U-SQL

// Estimate age and gender for human faces
@faces =
  PROCESS @imgs
  PRODUCE FileName,
    NumFaces int,
    FaceAge string,
    FaceGender string
    READONLY FileName
  USING new Cognition.Vision.FaceDetector();
// Apply OCR
@ocrs =
  PROCESS @imgs
  PRODUCE FileName,
    Text string
  READONLY FileName
  USING new Cognition.Vision.OcrExtractor();
// Sentiment Analysis on War and Peace
@sentiment =
  PROCESS @WarAndPeace
  PRODUCE No, Year, Book, Chapter,
    Text, Sentiment string,
            Conf double
  READONLY No,
    Year,
    Book,
    Chapter,
    Text
  USING new Cognition.Text.SentimentAnalyzer(true)

Using a Pre-Trained Model

Most traditional machine learning algorithms assume that the data processed to train a model isn’t too large to store in the RAM of one computer. Thus, most of the time users need only a single-box environment to train their models. Furthermore, it’s relatively common to have only a small amount of label data on which a model is trained. The R and Python languages have emerged as the industry standard for open source, as well as proprietary predictive analytics. R and Python together provide many capabilities, such as flexibility, rich graphics and statistics-oriented features, along with an ecosystem of freely available packages that account for much of its growing popularity. Thus, many developers uses R and Python to do single-box predictive analytics.

Once trained, a model is applied to massive data sets that frequently eclipse the size of the training data by orders of magnitude. In the following section, we’ll describe how to use an existing model that was trained to do prediction using a local R/Python environment on a massive amount of data, using the U-SQL extension on ADLA.

Consider an arboretum that has an inventory of many species of flowers from all around the world. Now the organization wants to find and classify types of iris flowers in its inventory. The arboretum’s data scientist trained a model to label types of iris flowers using R on a single machine, but they have a great many species of flowers from all over the world and the pre-trained model can’t complete this simple task of identifying an iris flower. What’s needed is an intelligent, scalable data processing engine. The overall process to use these U-SQL R extensions to do prediction at scale is simply:

Use the REFERENCE ASSEMBLY statement to include the R U-SQL extension to run R code in the U-SQL script.
Use the DEPLOY RESOURCE operation to upload the pre-trained model as a resource on executing nodes.
Use DECLARE to inline the R script in the U-SQL script.
Use the EXTRACT operation to load data into a rowset.
Use the Extension.R.Reduce function to run the R script to score each row in rowset using the uploaded pre-trained model.
Use the OUTPUT operation to store the result into a persistent store.

Figure 4 shows the U-SQL script that carries out this process.

Figure 4 Using Pre-Existing Model in U-SQL Script

REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
// R script to score using pre trained R model
DECLARE @MyRScript =
  @"
    load(""my_model_LM_Iris.rda"")
    outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))
  ";
DECLARE @PartitionCount int = 10;
@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @"/usqlext/samples/R/iris.csv";
  USING Extractors.Csv();
@ExtendedData =
  SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
    SepalLength,
    SepalWidth,
    PetalLength,
    PetalWidth
  FROM @InputData;
// Predict Species
@RScriptOutput= REDUCE @ExtendedData
  ON Par
  PRODUCE Par,
    fit double,
      lwr double,
      upr double
  READONLY Par
  USING
    new Extension.R.Reducer(command:@MyRScript , rReturnType:"dataframe",
      stringsAsFactors:false);
OUTPUT @RScriptOutput
  TO @"/Output/LMPredictionsIris.txt"
  USING Outputters.Tsv();

In this simple U-SQL query, we’re using the U-SQL R extension to do scoring at scale. The R and Python U-SQL extensions get automatically installed and registered with ADLA account database when you install the U-SQL Advance Analytics Extension. In the U-SQL script, we first deploy the pre-existing model, which was trained using R on a single machine. This highlights the fact that it wasn’t trained using ADLA/U-SQL framework. Next, we extract and de-serialize the iris dataset into columns using the system-provided .csv format extractor, Extractors.cvs, and load the data into rowsets. Next, we generate a random number that will be used later to partition data to enable parallel processing. Then, we use the U-SQL R extension UDO Extension.R.Reducer and pass the R script that does the prediction, along with the model. Finally, we output the confidence interval for each flower from the inventory.

We started with a simple U-SQL script to understand the content of images, which is typically considered opaque. The script automatically scales across hundreds of machines to transform images efficiently into actionable insights that can power intelligent applications. We also showcase how you can reuse an existing model that was trained using the popular R/Python environment and apply the model to do prediction on a massive amount of data using U-SQL R Extension. This is what can power the intelligence revolution.

Hiren Patel is a senior technical program manager at Microsoft. He has been part of the Big Data group since 2011 and worked on designing and developing various aspect of the Cosmos/ADLA distributed execution engine, including language, optimizer, runtime and scheduling.

Shravan Matthur Narayanamurthy is a senior engineering manager at Microsoft leading the Big Data Machine Learning team. He has several years of experience researching and developing machine learning algorithms that operate at cloud scale and distributed systems.

Thanks to the following Microsoft technical experts who reviewed this article: Saveen Reddy and Michael Rys
Saveen Reddy is a principal program manager at Microsoft focused on designing and building the Azure Data Lake Platform -- the components and experiences supporting all of Microsoft’s Big Data cloud services. Saveen holds a 100 percent completion rating for Metal Gear Solid V: The Phantom Pain. Follow him on Twitter: @saveenr

Michael Rys is a principal program manager at Microsoft. He has been doing data processing and query languages since the 1980s. He has represented Microsoft on the XQuery and SQL design committees and has taken SQL Server beyond relational with XML, Geospatial and Semantic Search. Currently he’s working on Big Data query languages such as SCOPE and U-SQL when he’s not enjoying time with his family underwater or at autocross. Follow him on Twitter: @MikeDoesBigData.

Share via