Using Azure ML to Build Clickthrough Prediction Models

This blog post is by Girish Nathan, a Senior Data Scientist at Microsoft.

Ad click prediction is a multi-billion dollar industry, and one that is still growing rapidly. In this post, we build ML models on the largest publicly available ad click prediction dataset, from Criteo. The Criteo dataset consists of some 4.4 billion advertising feedback events. In Criteo’s words, “…this dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.”

Azure services provide the tools needed to build a predictive model using this data. We use an Azure HDInsight cluster to load the Criteo data into Hive tables, and use Azure ML to build ML models on the dataset and understand it better. In particular, we show how to use the learning with counts technique to produce compact summaries of high dimensional categorical variables.

Problem Statement

Given Criteo’s click prediction dataset, we ask the following question:

For a given example, will the user click or not?

We model this as a binary classification problem, where a click gets the label “1” and lack of a click gets the label “0”.

We use the Azure ML DRACuLa (learning with counts) modules for building count features on the categorical data, and a two-class boosted decision tree learner for the binary classification problem. For manipulating the data prior to building counts, we use Azure HDInsight clusters.

Getting Access to the Data

The Criteo dataset is available here. After you accept the terms of use, you are directed to a page with details on the dataset and information on how to access it publicly. Note that this dataset is hosted in a public Azure Blob Storage account.

The Criteo Dataset

The Criteo dataset consists of 24 days of data, one file for each day. The data is in gzip format. For each row, the first column “Col1” indicates whether a click occurred (“1”) or not (“0”). Each row has 39 features in remaining columns; of these the first 13 are numeric while the last 26 are categorical features.

An excerpt of the data:

Schema: column names Col1 – Col40 (Col1 is the label column)

For reasons of space, we show a truncated example below:

Col1, Col2, Col3,…,Col15,Col16,Col17,…,Col40

Per the data description, the numeric features typically represent various counts, while the categorical features are 4-byte hashed values of categorical strings. This hashing is done primarily to protect privacy. As a result, while we can count the unique values in each column (and in combinations of columns), we cannot interpret what those values mean.
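To make the column layout concrete, here is a minimal R sketch that sets up the 40-column schema described above and parses two made-up rows. The tab delimiter, the toy values, and the variable names (col_classes, criteo_toy, and so on) are assumptions for illustration only; they are not taken from the Criteo files.

```r
# Column layout described in the post: Col1 is the label, followed by
# 13 numeric features (Col2..Col14) and 26 categorical features (Col15..Col40).
col_names   <- paste0("Col", 1:40)
col_classes <- c("integer", rep("numeric", 13), rep("character", 26))
names(col_classes) <- col_names

# Two made-up rows. The tab delimiter and all values are assumptions.
toy_rows <- c(
  paste(c(1, 1:13,  sprintf("%08x", 1:26)),  collapse = "\t"),
  paste(c(0, 14:26, sprintf("%08x", 27:52)), collapse = "\t")
)

# Parse the rows with explicit names and types so hashed values stay strings.
criteo_toy <- read.table(text = toy_rows, sep = "\t", header = FALSE,
                         col.names = col_names, colClasses = col_classes,
                         na.strings = "", stringsAsFactors = FALSE)

label_col        <- "Col1"
numeric_cols     <- col_names[2:14]
categorical_cols <- col_names[15:40]
```

Reading the categorical columns as character keeps the hashed values as opaque strings rather than letting R try to interpret them as numbers.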

A few points related to the data:

  1. Out of 24 days, the first 21 (day_0 to day_20) are stored in the directory raw/count; we use these 21 days of data for generating count features on the high dimensional categorical variables (explained in some detail in what follows).

  2. We use the data from day_21 as our training dataset; this is located in the directory raw/train.

  3. Finally, we use day_22 and day_23 as two separate test datasets. These are located in the directory raw/test. 

Data Preprocessing and Exploration Using Azure HDInsight Clusters

To explore the dataset and perform basic preprocessing before building ML models with Azure ML, we use an Azure HDInsight Hadoop cluster. For more information on how to create Hive tables on this dataset using an HDInsight cluster, please follow the walkthrough in this link; the rest of this section refers to sections in this walkthrough. The outcome of this walkthrough is to obtain the downsampled train and test datasets that are used in our model building below.

The section “Create Hive database and tables” describes creating Hive tables over the count, train, and test datasets. In what follows, we assume we have these tables built. Since other sections of the above walkthrough also describe feature exploration and how to downsample the data for use in Azure ML (in the section “Down sample the datasets for Azure ML”), we will not go into these topics in this blog post. Below is a short summary of the salient features of this dataset.

Number of examples: there are approximately 4.4 billion total examples in the dataset.

Number of training examples: we put approximately 192 million examples in the training dataset.

Number of test examples from the two datasets: For test data from day_22, we have about 189 million examples, while for test data from day_23, we have approximately 188 million examples.

Label distribution in the data: In this dataset, we have 3.3% positive examples (clicks with label “1” in Col1) and 96.7% negative examples (no-clicks with label “0” in Col1).

Understanding the number of unique values that the categorical variables take is of interest when building ML models, since high-dimensional categorical features can be challenging for some algorithms to handle. This dataset has 26 different categorical variables, so let’s take a look at a few of them to see how many unique values they take. To see how to do this, please refer to section “Find number of unique values for some categorical columns in the train dataset” in the above walkthrough. A summary for some columns follows.

Total number of unique values of a few categorical features:

Col15: 19011825

Col16: 30935

Col17: 15200

We note that some categorical features have a large number of unique values. In the next subsection, we suggest an effective technique for dealing with such high dimensional categorical variables.

Dealing with high dimensional categorical features

Traditionally, categorical features are dealt with via one-hot encoding. While this approach works well when the categorical features have few values, it results in feature space explosion for high-cardinality categorical features and is thus unsuitable for them.
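The feature space explosion can be seen in a small R sketch. The toy column and its values are made up; the unique-value counts are the ones reported above for Col15, Col16, and Col17. This is only an illustration of the arithmetic, not part of the Azure ML experiment.

```r
# Toy categorical column with 5 distinct (made-up) hashed values.
toy <- data.frame(
  Col15 = factor(c("a1f3", "09bc", "a1f3", "77e0", "5d21", "c0de"))
)

# One-hot encoding: one indicator column per distinct value (no intercept).
one_hot <- model.matrix(~ Col15 - 1, data = toy)
n_indicator_cols <- ncol(one_hot)  # equals the number of distinct values

# Indicator columns implied by the unique-value counts reported above
# for just three of the 26 categorical columns.
unique_counts <- c(Col15 = 19011825, Col16 = 30935, Col17 = 15200)
one_hot_width <- sum(unique_counts)
```

Since one-hot encoding creates one column per distinct value, these three columns alone would contribute roughly 19 million indicator features.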

An efficient way of dealing with high-cardinality categorical features is a method called DRACuLa based on label-conditional counts for the categorical features. These conditional counts can then be directly used as feature vectors together with log-odds values derived from them.
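As a rough illustration of label-conditional counts, the following R sketch computes, for each value of one categorical column, the number of clicks and non-clicks and a smoothed log-odds. The toy data, the variable names, and the additive smoothing constant are assumptions; the exact formulas used by the Azure ML modules are not given in this post.

```r
# Toy "count" data: one categorical column and a binary click label.
count_data <- data.frame(
  Col15 = c("a", "a", "a", "b", "b", "c"),
  Col1  = c(1, 0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Class-conditional counts for each categorical value.
n_pos <- tapply(count_data$Col1 == 1, count_data$Col15, sum)
n_neg <- tapply(count_data$Col1 == 0, count_data$Col15, sum)

# Log-odds with additive smoothing. The constant 1 is an illustrative
# choice that avoids log(0) and division by zero.
smoothing <- 1
count_table <- data.frame(
  Col15    = names(n_pos),
  n_pos    = as.vector(n_pos),
  n_neg    = as.vector(n_neg),
  log_odds = as.vector(log((n_pos + smoothing) / (n_neg + smoothing))),
  stringsAsFactors = FALSE
)
```

Whatever the number of distinct values, each categorical column is summarized by the same handful of numeric features, which is what makes the representation compact.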

Generating count features on the high dimensional categorical features using Azure ML

In Azure ML experiments that we illustrate below, we use the Build Counting Transform and Apply Transformation modules to build count features from categorical variables, and featurize the train and test datasets with them.

We build the counts on the count dataset (day_0 to day_20) and use them as features on our train dataset (day_21) and the test datasets (day_22 and day_23). For the Build Counting Transform module, we use the MapReduce option and set the number of classes to 2. We then provide credentials to the HDInsight cluster to be used for this computation (more information on how to do this is available in the documentation referenced above).
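The build-then-apply flow can be mimicked locally in R: counts come from one dataset and are looked up for the rows of another. This is a sketch with made-up values and hypothetical names, not the behavior of the Apply Transformation module itself. In particular, assigning zero counts to unseen values is an assumption of the sketch.

```r
# Hypothetical count table built from the count dataset (toy values).
count_table <- data.frame(
  Col15 = c("a", "b", "c"),
  n_pos = c(1, 2, 0),
  n_neg = c(2, 0, 1),
  stringsAsFactors = FALSE
)

# Toy train rows. The value "z" never appeared in the count dataset.
train <- data.frame(Col1 = c(0, 1, 0), Col15 = c("b", "z", "a"),
                    stringsAsFactors = FALSE)

# Look up counts by categorical value. match() preserves the row order.
idx <- match(train$Col15, count_table$Col15)
train$Col15_n_pos <- count_table$n_pos[idx]
train$Col15_n_neg <- count_table$n_neg[idx]

# Values unseen in the count dataset get zero counts (sketch assumption).
train$Col15_n_pos[is.na(idx)] <- 0
train$Col15_n_neg[is.na(idx)] <- 0

# Drop the raw categorical column and keep the compact numeric features.
featurized <- train[, c("Col1", "Col15_n_pos", "Col15_n_neg")]
```

Because the counts are built on earlier days than the train and test data, the label of a train or test row never contributes to its own count features.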

Model Building in Azure ML

After subsampling the data to create a training set for Azure ML (please refer to the walkthrough), we save the resulting datasets in the Azure ML workspace as Datasets. More information on how to do this is available via the Reader module documentation.

Modeling the problem using DRACuLa features and experimental results

Due to the skewed class distribution in our dataset, it is useful to downsample the negative examples in the train dataset so as to have a 1:1 ratio of positive to negative examples. We use this downsampled dataset as our training data for the experiment. After the class-balanced train dataset is created, we are ready to apply the previously constructed count-based featurization to it.

As mentioned in an earlier blog post, the count dataset is used for building count tables, which are then used to featurize the train and test datasets. Because all data from days 0 to 20 is used for building counts, the resulting featurization draws on the complete class-conditional statistics from all available historical data.

We show a sample of the experiment to illustrate the set-up:

The Build Counting Transform module constructs the count table for the categorical features on a 970GB dataset allocated for counts using MapReduce on HDInsight (Hadoop). This module yields the data set “Criteo_count_table_transform” shown below. We then apply the resulting featurization using the Apply Transformation module to the train and test datasets. Applying the transformation essentially transforms the categorical features into class-conditional counts (and optionally log-odds if so desired). We show a part of the experiment that illustrates how these modules connect to each other.

Once our training and test datasets are ready, we can learn a model; the learner chosen for this is the Two-Class Boosted Decision Tree. The trained model may then be applied to the test data to score it, as shown below.

The scoring portion of the experiment looks like so:

The overall experiment looks as follows:

Before we explore the results, we explain what the modules in the experiment mean.

The first set of R scripts applied to both the training and the test datasets just serves to give the data columns their correct names, like so:

dataset1 <- maml.mapInputPort(1) # class: data.frame
# Name the columns Col1..Col40, per the schema above (Col1 is the label)
colnames(dataset1) <- paste0("Col", 1:40)
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1")

The second set of R scripts applied to both the training and the test datasets serves to balance the data so that the positive and negative classes are present in a ratio 1:1. For completeness, we include it below:

dataset1 <- maml.mapInputPort(1) # class: data.frame

# balance the classes: keep (number of positives * pos_neg_ratio) negatives

pos_neg_ratio <- 1

d_class0 <- subset(dataset1, Col1 == "0")
d_class1 <- subset(dataset1, Col1 == "1")

numRows_class0 <- nrow(d_class0)
numRows_class1 <- nrow(d_class1)

# downsample the negative class 0
numRows_class0_downsampled <- numRows_class1 * pos_neg_ratio

d_class0_downsampled <- d_class0[sample(numRows_class0, numRows_class0_downsampled, replace = FALSE), ]

# new output data frame containing 1:1 class ratios

data.set <- data.frame()

data.set <- rbind(data.set, d_class0_downsampled)

data.set <- rbind(data.set, d_class1)

# shuffle the rows

numRows_data.set <- nrow(data.set)
data.set <- data.set[sample(numRows_data.set, numRows_data.set, replace = FALSE), ]

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set")

A final set of R scripts is used after the “Score Model” module. This is used for computing the log loss and the script is shown below for convenience:

# Compute the log loss

# add guardrails when labels are 0,1

epsilon <- 1e-10
epsilon1 <- 1.0 - epsilon

ll <- function(actual, predicted)
{
    # clamp predictions elementwise into [epsilon, 1 - epsilon] to avoid log(0)
    predicted <- pmax(epsilon, predicted)
    predicted <- pmin(epsilon1, predicted)

    score <- -(actual*log(predicted) + (1-actual)*log(1-predicted))
    score[actual==predicted] <- 0
    score[is.nan(score)] <- Inf
    score
}

#' Compute the mean log loss
logLoss <- function(actual, predicted) mean(ll(actual, predicted))

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame

actual <- dataset1[,1]
predicted <- dataset1[,41]

mll <- logLoss(actual, predicted)

# Compute relative cross entropy (percent improvement over the prior log loss)
prior <- sum(actual)/length(actual)
priorLogLoss <- -(prior*log(prior) + (1-prior)*log(1-prior))
rap <- 100*(priorLogLoss-mll)/priorLogLoss

#cat("Relative Cross Entropy = ", rap)
out<- data.frame(mll,rap, prior, priorLogLoss)
colnames(out)<-c("Mean log-loss","Relative log-loss", "Prior", "Prior log-loss")

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("out")
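As a quick check on the quantities the script above reports, here is a small standalone R sketch on made-up labels and scores. The function name log_loss and the toy values are assumptions for illustration. It is not the Azure ML script itself and needs no input ports.

```r
# Mean log loss with predictions clipped elementwise away from 0 and 1.
log_loss <- function(actual, predicted, eps = 1e-10) {
  p <- pmin(pmax(predicted, eps), 1 - eps)
  mean(-(actual * log(p) + (1 - actual) * log(1 - p)))
}

# Toy labels and scored probabilities (made-up values).
actual    <- c(1, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.6, 0.4)

mll <- log_loss(actual, predicted)

# Baseline: always predict the positive-class rate (the prior).
prior    <- mean(actual)
prior_ll <- -(prior * log(prior) + (1 - prior) * log(1 - prior))

# Relative improvement over the baseline, in percent.
rel <- 100 * (prior_ll - mll) / prior_ll
```

A relative log-loss of 0 means the model is no better than predicting the prior for every example, while values approaching 100 indicate near-perfect probability estimates.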

To cleanse the data of any missing values, we use the Clean Missing Data module. This module offers many options for cleaning data, as can be seen from the extensive documentation. In addition to the learner and the “Apply Transformation” modules that we already discussed, the experiment has a Train Model module for training and a subsequent Score Model module for scoring on the test dataset. Finally, we use an Evaluate Model module for evaluating model performance.

Although we have two days of test data, for simplicity, we show the testing experiment on the day_22 test dataset only. Since the problem is one of binary classification, an appropriate metric is the AUC. We use a confusion matrix, AUC, and ROC curves to summarize the prediction accuracy for this approach.
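For readers who want to see what the AUC measures, it can be computed from labels and scores with the rank-sum identity: the AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The R sketch below uses made-up values and a hypothetical helper named auc. In the experiment itself, this metric comes from the Evaluate Model module.

```r
# AUC via the rank-sum (Mann-Whitney) identity. Tied scores get average ranks.
auc <- function(actual, score) {
  r  <- rank(score)
  n1 <- sum(actual == 1)  # number of positives
  n0 <- sum(actual == 0)  # number of negatives
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy labels and scores (made-up values).
actual <- c(1, 0, 1, 0, 1)
score  <- c(0.9, 0.3, 0.6, 0.7, 0.8)

toy_auc <- auc(actual, score)
```

In this toy example, five of the six positive-negative pairs are ordered correctly, so toy_auc is 5/6.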

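We mentioned that an R script computes the log loss. To see how our final log loss compares to the prior log loss (the loss from always predicting the positive-class rate), the script also computes the relative improvement over that baseline. The result of this computation is shown below: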


Modeling with no DRACuLa features and some experimental results

Having looked at the effect of using class-conditional count features (also referred to as DRACuLa features), we now show the corresponding experiment where no count features are used. Instead, in this case, the categorical features are modeled using one-hot encoding. We also note that the R scripts perform exactly the same function as in the above experiment.

A snapshot of the full experiment looks as below:


We note that this experiment does not generate the count-based features, as mentioned. Since the rest of the modules are the same as our first experiment, we do not explain them in more detail here. 

Again, as this is a binary classification problem, we use the AUC as a reliable metric. We find:


The AUC when not using counts is actually lower than when count-based features are used in this experiment.

We mentioned that the final R script takes the output of the “Score Model” module and computes the log loss. We now show the results of this computation below:


The use of conditional count (DRACuLa) features results in a compact representation of the high-cardinality categorical features present in the Criteo dataset. Moreover, we find that training is about twice as fast with count features as without them. In addition, model performance is better with the count-based DRACuLa features than without, because they provide a compact representation of the otherwise sparse high-dimensional categorical features.