How to use BigDL on Apache Spark for Azure HDInsight

Deep learning is impacting everything from healthcare and transportation to manufacturing and beyond. Companies are turning to deep learning to solve hard problems such as image classification, speech recognition, object recognition, and machine translation. This blog post describes how to enable Intel’s BigDL Deep Learning Library for Apache Spark on Microsoft’s Azure HDInsight platform.

In 2016, Intel released its BigDL distributed Deep Learning project to the open-source community on the BigDL GitHub. It integrates natively into Spark, supports popular neural net topologies, and achieves feature parity with other open-source deep learning frameworks. BigDL also provides 100+ basic neural network building blocks, allowing users to create novel topologies to suit their unique applications. Thus, with Intel’s BigDL, users can leverage their existing Spark infrastructure to enable Deep Learning applications without having to invest in bringing up separate frameworks to take advantage of neural network capabilities.

Since BigDL is an integral part of Spark, a user does not need to explicitly manage distributed computations. While providing high-level control “knobs”, such as the number of compute nodes, cores, and batch size, a BigDL application leverages the stable Spark infrastructure for node communications and resource management during its execution. BigDL applications can be written in either Python or Scala and achieve high performance through both algorithm optimization and tight integration with Intel’s Math Kernel Library (MKL). Check out Intel’s BigDL portal for more details.

Azure HDInsight is the only fully-managed cloud Hadoop offering that provides optimized open-source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server, backed by a 99.9% SLA. Beyond that, HDInsight is an open platform for third-party big data applications from ISVs, as well as for custom applications such as BigDL. In this blog post, the BigDL and Azure HDInsight teams will walk you through how to use BigDL on top of HDInsight Spark.

Getting BigDL to work on HDInsight Spark

BigDL is very easy to build and integrate. The section below is largely based on the BigDL Documentation and there are two major steps:

  • Get BigDL source code and build it to get the required jar file
  • Use Jupyter Notebook to write your first BigDL application in Scala

This blog post adds a few extra steps to illustrate how BigDL works with the MNIST dataset.

Before getting into the details, you can follow the HDInsight documentation to create an HDInsight Spark cluster on Azure in just a few minutes.

Step 1: Build BigDL libraries

The first step is to build the BigDL libraries. Simply SSH into the cluster head node and run the following script there:
#!/bin/bash
# The following steps take around 10 minutes to run. In the end, they generate a jar named
# "bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar" which we can use in programs or Jupyter applications.
git clone https://github.com/intel-analytics/BigDL.git

# Install Maven, as it is not installed by default
sudo apt-get install -y maven

# Change Maven settings based on the BigDL documentation
export BIGDL_ROOT=$(pwd)/BigDL
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

pushd ${BIGDL_ROOT}
bash make-dist.sh -P spark_2.0

# Put the compiled library into the default container to be consumed from Jupyter Notebooks
hadoop fs -put dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar /
popd

Step 2: Get MNIST dataset

After finishing the first step, we can write our first BigDL application. Before getting into that, we need to download the MNIST dataset, which we will consume later. Run the following script on the head node; see the section below for a detailed explanation of the MNIST dataset.
#!/bin/bash
# This script downloads the MNIST data, unzips it, and uploads it to HDFS.
DIR="$( cd "$(dirname "$0")" ; pwd -P )"
hadoop fs -mkdir /mnistdataset
cd "$DIR"
for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
do
    if [ ! -e $fname ]; then
        wget --no-check-certificate https://yann.lecun.com/exdb/mnist/${fname}.gz
        gunzip ${fname}.gz
        hadoop fs -put ${fname} /mnistdataset
    fi
done

MNIST DataSet

The MNIST dataset can be downloaded using the script provided above, or from the MNIST Database. It is a small computer vision dataset used for writing simple machine learning applications, such as the LeNet example described later, akin to a “hello world” exercise in programming courses. Learn more about LeNet. The MNIST dataset contains handwritten digits which have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

There are four files in the dataset: train-images-idx3-ubyte contains the training images, train-labels-idx1-ubyte is the training label file, t10k-images-idx3-ubyte has the validation images, and t10k-labels-idx1-ubyte contains the validation labels. Put them all in one folder (e.g. mnistdataset).

BigDL Workflow and Major Components for MNIST LeNet Example:

Below is a general workflow of how BigDL trains a deep learning model on Apache Spark:

(Figure: BigDL architecture — BigDL jobs running as standard Spark jobs)

As shown in the figure above, BigDL jobs are standard Spark jobs. In a distributed training process, BigDL launches Spark tasks in each executor, and each task leverages Intel MKL to speed up the training process.

A BigDL program starts with import com.intel.analytics.bigdl._ and then initializes the Engine (including the number of executor nodes and the number of physical cores on each executor):

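In a Scala notebook cell, this initialization might look like the following sketch, based on the BigDL 0.1-era API (the node and core counts, and the application name, are example values):

```scala
import com.intel.analytics.bigdl._
import com.intel.analytics.bigdl.utils.Engine
import org.apache.spark.SparkContext

// Sketch, assuming the BigDL 0.1-era Engine.init signature:
// Engine.init(nodeNumber, coreNumber, onSpark) returns an Option[SparkConf]
// with the BigDL-specific Spark properties populated.
val conf = Engine.init(4, 4, true).get
  .setAppName("Train LeNet on MNIST")   // example app name
val sc = new SparkContext(conf)
```

In a standalone application you would create the SparkContext from this SparkConf yourself; in the Jupyter Notebook case described below, the context already exists.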

If the program runs on Spark, Engine.init() will return a SparkConf with the proper configurations populated, which can then be used to create the SparkContext. In this particular case, the Jupyter Notebook automatically sets up a default Spark context, so you don’t need to do the above configuration; you do, however, need to set a few other Spark-related configurations, which are explained in the Jupyter Notebook.

After the initialization, for the LeNet example we need to:

1. Create the LeNet model by calling LeNet5(), which builds the LeNet-5 convolutional network model as follows:

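A sketch of what LeNet5() builds, following the classic LeNet-5 topology used in BigDL’s lenet example (treat the exact layer sizes as illustrative rather than the shipped source):

```scala
import com.intel.analytics.bigdl.nn._
import com.intel.analytics.bigdl.numeric.NumericFloat

// Sketch of the LeNet-5 convolutional network for 28x28 grey-scale MNIST images.
def lenet5(classNum: Int): Module[Float] = {
  val model = Sequential()
  model.add(Reshape(Array(1, 28, 28)))       // 1 channel, 28x28 input
    .add(SpatialConvolution(1, 6, 5, 5))     // conv1: 6 feature maps, 5x5 kernel
    .add(Tanh())
    .add(SpatialMaxPooling(2, 2, 2, 2))
    .add(SpatialConvolution(6, 12, 5, 5))    // conv2: 12 feature maps
    .add(Tanh())
    .add(SpatialMaxPooling(2, 2, 2, 2))
    .add(Reshape(Array(12 * 4 * 4)))         // flatten for the dense layers
    .add(Linear(12 * 4 * 4, 100))
    .add(Tanh())
    .add(Linear(100, classNum))
    .add(LogSoftMax())
  model
}

val model = lenet5(10)                        // 10 digit classes
```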

2. Load the data by creating the DataSet, and then applying a series of Transformer operations (e.g., SampleToGreyImg, GreyImgNormalizer and GreyImgToBatch):

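The transformer pipeline can be sketched like this; load, trainImagePath, trainLabelPath, trainMean, trainStd, and batchSize are placeholders for the helper function and constants defined in BigDL’s lenet example:

```scala
import com.intel.analytics.bigdl.dataset.DataSet
import com.intel.analytics.bigdl.dataset.image.{SampleToGreyImg, GreyImgNormalizer, GreyImgToBatch}

// Sketch: build a distributed DataSet from the raw MNIST bytes, then chain
// transformers with "->" to decode, normalize, and batch the grey images.
// load(...) is the example's helper that reads the idx-ubyte files.
val trainSet = DataSet.array(load(trainImagePath, trainLabelPath), sc) ->
  SampleToGreyImg(28, 28) ->                   // decode raw bytes to 28x28 grey images
  GreyImgNormalizer(trainMean, trainStd) ->    // normalize with dataset mean/std
  GreyImgToBatch(batchSize)                    // group samples into mini-batches
```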

After that, we create the Optimizer by specifying the DataSet, the model and the Criterion (which, given input and target, computes the gradient for a given loss function):

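Wiring the pieces together might look like the following sketch (ClassNLLCriterion is the negative log-likelihood loss that pairs with the LogSoftMax output layer; model and trainSet are assumed from the earlier steps):

```scala
import com.intel.analytics.bigdl.nn.ClassNLLCriterion
import com.intel.analytics.bigdl.optim.Optimizer

// Sketch: the Optimizer ties the model, the training DataSet, and the
// loss function (Criterion) together for distributed training.
val optimizer = Optimizer(
  model = model,
  dataset = trainSet,
  criterion = ClassNLLCriterion[Float]())
```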

Finally (after optionally specifying the validation data and methods for the Optimizer), we train the model by calling Optimizer.optimize():

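For example (a sketch — validationSet is assumed to be built from the t10k files with the same transformer chain, and the trigger/epoch settings are illustrative):

```scala
import com.intel.analytics.bigdl.optim.{Trigger, Top1Accuracy}

// Sketch: optionally attach validation, set a stopping condition,
// then launch distributed training. optimize() returns the trained model.
val trainedModel = optimizer
  .setValidation(Trigger.everyEpoch, validationSet, Array(new Top1Accuracy[Float]))
  .setEndWhen(Trigger.maxEpoch(10))   // example: stop after 10 epochs
  .optimize()
```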
The complete and fully functional Jupyter Notebook is available to you.

Step 3: Use Jupyter notebook to run the LeNet application on MNIST sample

We provide a sample Jupyter notebook to demonstrate how easily you can use Jupyter Notebook to develop your first BigDL application. This part is adapted from the BigDL MNIST tutorial. Simply upload this Jupyter Notebook to your HDInsight cluster. Learn more about how to access Jupyter Notebook on HDInsight.

Running BigDL in regular Spark application

Sometimes users prefer to submit a jar file which contains the model, rather than running jobs in Jupyter Notebooks. In that case, you can use the example below to submit a job in YARN client mode.

This example is adapted from the BigDL documentation (https://github.com/intel-analytics/BigDL/wiki/Getting-Started). Basically, it downloads the CIFAR-10 dataset (https://www.cs.toronto.edu/\~kriz/cifar.html) and uses the built-in VGG sample code to train against that dataset. You might want to adjust the number of executors and executor cores based on your cluster settings, as well as the batch size for hyper-parameter tuning.
cd ${BIGDL_ROOT}
wget https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
tar xvzf cifar-10-binary.tar.gz
./dist/bin/bigdl.sh -- \
    spark-submit --master yarn --deploy-mode client \
    --executor-cores 4 \
    --num-executors 4 \
    --class com.intel.analytics.bigdl.models.vgg.Train \
    dist/lib/bigdl-VERSION-SNAPSHOT-jar-with-dependencies.jar \
    -f ${BIGDL_ROOT}/cifar-10-batches-bin \
    -b 4

Conclusion

In this blog post, we have demonstrated how easy it is to set up a BigDL environment on HDInsight Spark. Leveraging the BigDL Spark library, a user can easily write scalable distributed Deep Learning applications within the familiar Spark infrastructure, without intimate knowledge of the configuration of the underlying compute cluster. The BigDL and Azure HDInsight teams have been collaborating closely to enable BigDL in the Azure HDInsight environment.

If you have any feedback for HDInsight, feel free to drop an email to hdifeedback@microsoft.com or post it in our MSDN forum. If you have any questions about BigDL, you can raise them in the BigDL Google Group.

Resources