TensorFlow on Azure: Enabling Blob Storage via Alluxio

Many of the customers the Cloud AI Ecosystem team in Microsoft works with choose Azure Blob Storage as their data storage. If one of those customers wants to use TensorFlow to develop deep learning models, unfortunately TensorFlow does not support Azure Blob storage out of the box as a custom file system plugin1. There is no easy way to feed data from Azure block blobs directly into TensorFlow’s input pipeline2. And when setting up a Kubernetes cluster for TensorFlow workloads, Azure Blob Storage is not among Kubernetes’ supported types of volumes3.

Given the above, the options are either to mount an Azure File share to the Kubernetes pods and read the remote Azure files via the mount path, or to manually copy data to a local SSD disk on each pod. Both approaches are summarized in ‘Deep Learning Toolkits with Kubernetes Clusters’, published at https://aka.ms/deeplearningk8s. However, many people prefer Azure Blob storage to the Azure File service because of their different performance, scale4 and pricing5 characteristics; and in the case of manually copying to a local SSD, repeating that operation for a large amount of training data does not scale.

In this blog we introduce the FUSE6 feature newly released in Alluxio 1.7, which mounts Azure Blob storage into the local file system namespace and thereby solves the integration between TensorFlow and Azure blobs. Alluxio7 aims to bridge high-computation workloads, including TensorFlow jobs, with the underlying storage system via its unified data access layer. The Alluxio-FUSE feature opens up the opportunity to feed Azure blobs directly into your tensors. Moreover, with the current pace of progress in GPU computation, the input pipeline can become a bottleneck if the storage is not performant enough, and the work Alluxio has put into optimizing its data access layer benefits the deep learning input pipeline. For more details, please refer to https://alluxio.com/blog/flexible-and-fast-storage-for-deep-learning-with-alluxio.
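To make this concrete: once a blob container is mounted at /alluxio-fuse (the default Alluxio-FUSE mount point used later in this post), TensorFlow can read blobs as if they were local files. Below is a minimal sketch; the blob name is hypothetical and only illustrates the idea.

 import tensorflow as tf

 # With the container mounted at /alluxio-fuse, a blob looks like an ordinary
 # local file to TensorFlow -- no Azure-specific file system plugin is needed.
 # "train-00000-of-01024" is a hypothetical blob name used for illustration.
 with tf.gfile.GFile("/alluxio-fuse/train-00000-of-01024", "rb") as f:
     data = f.read()

 print("read %d bytes through the Alluxio-FUSE mount" % len(data))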

 

Follow the simple steps below to see how to enable Azure blobs via Alluxio-FUSE and run TensorFlow jobs on Azure.

Set Up a Kubernetes Cluster

A sample k8s cluster on Azure Container Service is deployed using the open-source toolkit DLWorkspace8; the documentation can be found at https://microsoft.github.io/DLWorkspace/. The sample setup includes one master node on a Standard D2 v2 Azure VM (2 vCPUs, 7 GB memory) and two agent nodes on Standard NC12 Azure VMs (12 vCPUs, 112 GB memory each). To check that the GPU driver is correctly installed after deployment, run ‘nvidia-smi’ on each agent node and confirm the driver information is reported.

 

Create Alluxio-FUSE Enabled Pods

For ease of use, Alluxio provides a Docker integration9 and has published its 1.7 Docker images10 on Docker Hub, so we can pull the images into the k8s cluster and create Alluxio-FUSE enabled pods. Alluxio servers consist of two architectural components11: a master and workers. The master is responsible for managing global metadata, while the workers manage the local storage resources allocated to Alluxio. For better data locality, we co-locate the Alluxio master with the TensorFlow parameter server on one pod, and the Alluxio workers with the TensorFlow workers on the other pods.

Sample pod configuration files are posted at https://github.com/jichang1/TensorFlowonAzure/tree/master/Alluxio; use them to create your k8s pods, first the tf-ps pod and then the tf-worker pod. Note that you will need to replace $yourcontainername$, $yourstorageaccountname$, and $yourstorageaccountkey$ with your own values, and replace $yourpsserverip$ with the IP address found in /etc/hosts of the tf-ps pod.

The sample container configuration below says that the Docker image runs /entrypoint.sh upon initialization with the argument “worker”; the worker pod communicates with the master pod over port 19998; a few environment variables, such as the master host name and the storage account, need to be defined upon initialization; and the container runs as privileged with the ‘SYS_ADMIN’ capability, which is required for mounting a FUSE file system inside the container.

      containers:
      - name: tf-worker0
        # Combined Alluxio 1.7.0 + TensorFlow 1.3.0 GPU image from Docker Hub
        image: alluxio/alluxio-tensorflow:1.7.0-1.3.0-gpu
        # entrypoint.sh starts the Alluxio worker process
        command: ["/entrypoint.sh"]
        args: ["worker"]
        ports:
        # Port used to communicate with the Alluxio master
        - containerPort: 19998
          name: alluxioport
        env:
        # IP of the pod hosting the Alluxio master / TF parameter server
        - name: ALLUXIO_MASTER_HOSTNAME
          value: "$yourpsserverip$"
        - name: ALLUXIO_RAM_FOLDER
          value: "/opt/ramdisk"
        - name: ALLUXIO_WORKER_MEMORY_SIZE
          value: "10GB"
        # Azure Blob container used as the Alluxio under file system
        - name: ALLUXIO_UNDERFS_ADDRESS
          value: "wasb://$yourcontainername$@$yourstorageaccountname$.blob.core.windows.net/"
        - name: FS_AZURE_ACCOUNT_KEY_$yourstorageaccountname$_BLOB_CORE_WINDOWS_NET
          value: $yourstorageaccountkey$
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /opt/ramdisk
          name: ramdisk
        - mountPath: /etc/resolv.conf
          name: resolv
        # Privileged mode and SYS_ADMIN are needed to perform the FUSE mount
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
      nodeSelector:
        FragmentGPUJob: active
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /opt/nvidia-driver/current
      - name: ramdisk
        hostPath:
          path: /mnt/ramdisk
      - name: resolv
        hostPath:
          path: /etc/resolv.conf

 

After executing

 sudo kubectl apply -f ./alluxio-fuse-tfgpu-psserver0.yaml

sudo kubectl apply -f ./alluxio-fuse-tfgpu-worker0.yaml

Run

 kubectl get pods

to check that the pods are up and healthy.

Connect to the tf-ps or tf-worker pod and access your blob storage via ‘ls /alluxio-fuse’; the blob container is already mounted there in the local file system.
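Besides a plain ‘ls’, you can also sanity-check the mount from Python inside a pod with TensorFlow’s file APIs. The short sketch below simply assumes the default /alluxio-fuse mount point:

 import tensorflow as tf

 # The Alluxio-FUSE mount should be visible to TensorFlow like any local directory.
 print(tf.gfile.Exists("/alluxio-fuse"))          # True once the mount is up
 print(tf.gfile.ListDirectory("/alluxio-fuse/"))  # top-level blobs and directories in the container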

 

Run TensorFlow Jobs

We take TensorFlow benchmark jobs12 as an example.

On the parameter server pod, run the command below:

 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2  --batch_size=128  --model=googlenet   --variable_update=parameter_server --num_batches=50  --cross_replica_sync=False  --data_name=imagenet --data_dir=file:///alluxio-fuse/  --job_name=ps --ps_hosts=10.244.2.2:2222  --worker_hosts=10.244.0.2:2222 --task_index=0

On the worker pod, run the command below:

 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2  --batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=50 --cross_replica_sync=False --data_name=imagenet --data_dir=file:///alluxio-fuse/  --job_name=worker  --ps_hosts=10.244.2.2:2222  --worker_hosts=10.244.0.2:2222 --task_index=0
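The benchmark script builds its own input pipeline from --data_dir. If you instead want to feed the mounted blobs into your own training script, a minimal sketch of an input pipeline over the same mount might look like the following. The shard naming pattern is hypothetical, and the tf.data API shown is from TensorFlow 1.4+; in TensorFlow 1.3 the equivalent classes live under tf.contrib.data.

 import tensorflow as tf

 # Hypothetical TFRecord shards stored as blobs in the container,
 # read through the local Alluxio-FUSE mount point.
 filenames = tf.gfile.Glob("/alluxio-fuse/train-*")

 dataset = tf.data.TFRecordDataset(filenames)   # serialized tf.Example records
 dataset = dataset.repeat().batch(128)          # match the per-device batch size used above
 iterator = dataset.make_one_shot_iterator()
 next_batch = iterator.get_next()

 with tf.Session() as sess:
     batch = sess.run(next_batch)               # raw records, ready to parse and train on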

You should observe output similar to the following:

 name: Tesla K80

major: 3 minor: 7 memoryClockRate (GHz) 0.8235

pciBusID 0e7c:00:00.0

Total memory: 11.17GiB

Free memory: 11.09GiB

……

2018-01-11 02:22:52.188017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0e7c:00:00.0)

2018-01-11 02:22:52.188038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 27e6:00:00.0)

2018-01-11 02:22:52.402467: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.244.2.2:2222}

2018-01-11 02:22:52.402510: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}

2018-01-11 02:22:52.405246: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222

TensorFlow:  1.3

Model:       googlenet

Mode:        training

Batch size:  256 global

             128 per device

Devices:     ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']

Data format: NCHW

Optimizer:   sgd

Variables:   parameter_server

Sync:        False

==========

Generating model

2018-01-11 02:24:26.461062: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session b26ef5a1286e9840 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true

Running warm up

Done warm up

Waiting for other replicas to finish warm up

Starting real work at step 10 at time Thu Jan 11 02:25:57 2018

Step    Img/sec loss

1       images/sec: 200.3 +/- 0.0 (jitter = 0.0)        7.093

10      images/sec: 189.8 +/- 1.6 (jitter = 4.7)        7.093

20      images/sec: 186.3 +/- 1.3 (jitter = 6.3)        7.093

30      images/sec: 186.6 +/- 1.1 (jitter = 6.0)        7.093

40      images/sec: 186.8 +/- 0.9 (jitter = 5.5)        7.093

Finishing real work at step 59 at time Thu Jan 11 02:27:04 2018

50      images/sec: 187.4 +/- 0.8 (jitter = 5.3)        7.093

----------------------------------------------------------------

total images/sec: 186.67

----------------------------------------------------------------

We hope this blog has shown you a new way of running TensorFlow jobs on Azure with Azure Blob storage as the underlying store.

 

References

1     https://www.tensorflow.org/extend/add_filesys
2     https://www.tensorflow.org/programmers_guide/datasets
3     https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes
4     https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets
5     https://azure.microsoft.com/en-us/pricing/details/storage/
6     https://www.alluxio.org/docs/master/en/Mounting-Alluxio-FS-with-FUSE.html
7     https://www.alluxio.org/docs/master/en/index.html
8     https://github.com/Microsoft/DLWorkspace
9     https://github.com/Alluxio/alluxio/tree/master/integration/docker
10    https://hub.docker.com/r/alluxio/alluxio-tensorflow/
11    https://www.alluxio.org/docs/master/en/Architecture.html
12    https://github.com/tensorflow/benchmarks