Install Machine Learning Server for Hadoop

Important

This content is being retired and may not be updated in the future. Support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?

Applies to: Machine Learning Server 9.4

On a Spark cluster, Machine Learning Server must be installed on the edge node and on all data nodes of a commercial Hadoop distribution: Cloudera, Hortonworks, or MapR. Optionally, you can install operationalization features on edge nodes only.

Machine Learning Server is engineered for the following architecture:

Hadoop Distributed File System (HDFS)
Apache YARN
MapReduce or Spark 2.4

We recommend Spark for the processing framework.

Note

These instructions use package managers to connect to Microsoft sites, download the distributions, and install the server.

System and setup requirements

The native operating system must be 64-bit Linux running a supported version of Hadoop.
Minimum RAM is 8 GB (16 GB or more is recommended). Minimum disk space is 500 MB per node.
An internet connection. If you do not have an internet connection, use the offline installation instructions.
Root or super user permissions
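
The checks below are a minimal sketch of how the requirements above could be verified on a node before running setup. The thresholds come from the list above; the function names (at_least, is_64bit, is_root, verdict, preflight) are illustrative and are not part of Machine Learning Server. The script assumes a Linux node that exposes /proc/meminfo, and it does not test the internet connection.

```shell
# Preflight sketch for one node. Thresholds mirror the requirements listed
# above; all function names are illustrative assumptions.

MIN_RAM_KB=$((8 * 1024 * 1024))   # 8 GB expressed in kB
MIN_DISK_KB=$((500 * 1024))       # 500 MB expressed in kB

# at_least VALUE MINIMUM: succeeds when VALUE >= MINIMUM (integers).
at_least() {
    [ "$1" -ge "$2" ]
}

# is_64bit ARCH: succeeds when a `uname -m` style ARCH is x86_64.
is_64bit() {
    [ "$1" = "x86_64" ]
}

# is_root: succeeds when the effective user ID is 0.
is_root() {
    [ "$(id -u)" -eq 0 ]
}

# verdict LABEL COMMAND...: runs COMMAND and prints "ok" or "FAIL" beside LABEL.
verdict() {
    label=$1; shift
    if "$@"; then echo "ok    $label"; else echo "FAIL  $label"; fi
}

# preflight [DIR]: checks the node; DIR selects the file system to measure.
preflight() {
    # MemTotal is reported in kB and is usually slightly below the nominal
    # RAM size, so a nominal 8 GB machine can fail this strict comparison.
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null) || ram_kb=0
    # Column 4 of `df -Pk` is the available space in kB.
    disk_kb=$(df -Pk "${1:-/}" 2>/dev/null | awk 'NR == 2 {print $4}') || disk_kb=0

    verdict "64-bit Linux"        is_64bit "$(uname -m)"
    verdict "RAM >= 8 GB"         at_least "${ram_kb:-0}" "$MIN_RAM_KB"
    verdict "free disk >= 500 MB" at_least "${disk_kb:-0}" "$MIN_DISK_KB"
    verdict "root permissions"    is_root
}

preflight /
```

Each line of output starts with ok or FAIL, so the result is easy to scan when the script is run on several nodes.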

Package managers

Installation is through package managers. Unlike previous releases, there is no install.sh script.

Package manager	Platform
yum	RHEL, CentOS
apt	Ubuntu online
dpkg	Ubuntu offline
zypper	SUSE
rpm	RHEL, CentOS, SUSE
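
As an illustration, the table above can be expressed as a small lookup function. This is a sketch under stated assumptions: the distribution identifiers (rhel, centos, ubuntu, sles) are the ID values commonly found in /etc/os-release, the function name pkg_manager_for is hypothetical, and treating rpm as the offline tool for RHEL, CentOS, and SUSE is an inference from the Ubuntu rows rather than something the table states.

```shell
# pkg_manager_for DISTRO_ID [online|offline]
# Prints the package manager from the table for the given distribution.
# Assumption: rpm is used when no repository connection is available.
pkg_manager_for() {
    distro=$1
    mode=${2:-online}
    case "$distro" in
        rhel|centos)
            if [ "$mode" = online ]; then echo yum; else echo rpm; fi ;;
        ubuntu)
            if [ "$mode" = online ]; then echo apt; else echo dpkg; fi ;;
        sles)
            if [ "$mode" = online ]; then echo zypper; else echo rpm; fi ;;
        *)
            echo "unsupported distribution: $distro" >&2
            return 1 ;;
    esac
}

# Usage: look up the tool for the current node, if /etc/os-release exists.
# The subshell keeps the variables from os-release out of the caller's scope.
if [ -r /etc/os-release ]; then
    ( . /etc/os-release; pkg_manager_for "${ID:-unknown}" online ) || true
fi
```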

Running setup on existing installations

The installation path for Machine Learning Server is new: /opt/microsoft/mlserver/9.4.7. However, if R Server 9.x is present, Machine Learning Server 9.x finds R Server at the old path (/usr/lib64/microsoft-r/9.1.0) and replaces it with the new version.

There is no support for side-by-side installations of older and newer versions, nor is there support for hybrid versions (such as R Server 9.1 and Machine Learning Server 9.4). An installation is either entirely 9.4 or an earlier version.
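
Because setup replaces an existing R Server, it can be useful to know what is present before you start. The following sketch lists version directories under /usr/lib64/microsoft-r, the parent of the old path mentioned above. The function name existing_rserver_versions and its optional argument are hypothetical, and the sketch assumes each installed version lives in its own subdirectory.

```shell
# existing_rserver_versions [BASE]
# Prints the name of each version directory found under BASE.
# BASE defaults to the parent of the old R Server path named above.
existing_rserver_versions() {
    base=${1:-/usr/lib64/microsoft-r}
    [ -d "$base" ] || return 0
    for d in "$base"/*/; do
        # When nothing matches, the pattern stays literal and is skipped here.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

# Usage: report what setup would replace on this node.
found=$(existing_rserver_versions)
if [ -n "$found" ]; then
    echo "R Server versions present (setup replaces them): $found"
else
    echo "No R Server found under /usr/lib64/microsoft-r"
fi
```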

Installation paths

After installation completes, software can be found at the following paths:

Install root: /opt/microsoft/mlserver/9.4.7
Microsoft R Open root: /opt/microsoft/ropen/3.5.2
Executables such as Revo64 and mlserver-python are at /usr/bin
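
One way to confirm that setup finished is to check for the paths listed above. The following is a sketch only: the function name check_install_paths is hypothetical, and its optional PREFIX argument exists purely so that the check can be tried against a staged directory tree instead of the real file system.

```shell
# check_install_paths [PREFIX]
# Reports whether each documented path exists. Succeeds only if all do.
# PREFIX is an illustrative testing hook and is empty in normal use.
check_install_paths() {
    prefix=${1:-}
    missing=0
    for p in \
        /opt/microsoft/mlserver/9.4.7 \
        /opt/microsoft/ropen/3.5.2 \
        /usr/bin/Revo64 \
        /usr/bin/mlserver-python
    do
        if [ -e "$prefix$p" ]; then
            echo "found    $p"
        else
            echo "missing  $p"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage: run after setup and print a hint if anything is absent.
check_install_paths || echo "Some paths are missing; setup may be incomplete."
```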

1 - Edge node installation

Start here. Machine Learning Server is required on the edge node. You should run full setup, following the installation commands for the Linux operating system used by your cluster: Linux install > How to install.

Full setup gives you core components for both R and Python, machine learning algorithms and pretrained models, and operationalization. Operationalization features run on edge nodes and enable additional ways of deploying and consuming scripts. For example, you can build and deploy web services, which let you invoke and access your solution programmatically through a REST API.

Note

You cannot use operationalization on data nodes. Operationalization does not support YARN queues and cannot run in a distributed manner.

2 - Data node installation

You can continue installation by running Setup on any data node, either sequentially or on multiple data nodes concurrently. There are two approaches for installing Machine Learning Server on data nodes.

Approach 1: Package managers for full installation

Again, we recommend running the full setup on every node. This approach is fast because package managers do most of the work, including adding the Hadoop package (microsoft-mlserver-hadoop-9.4.7) and setting it up for activation.

As before, follow the installation steps for the Linux operating system used by your cluster: Linux install > How to install.

Approach 2: Manual steps for partial installation

Alternatively, you can install a subset of packages. You might do this if you do not want operationalization on your data nodes, or if you want to exclude a specific language. Be prepared for more testing if you choose this approach. The packages are not specifically designed to run as standalone modules. Hence, unexpected problems are more likely if you leave some packages out.

Install as root: sudo su
Refer to the annotated package list and download individual packages from the package repo corresponding to your platform.
Make a directory to contain your packages: hadoop fs -mkdir /tmp/mlsdatanode
Copy the packages: hadoop fs -copyFromLocal /tmp/mlserver /tmp/mlsdatanode
Switch to the directory: cd /tmp/mlsdatanode
Install the packages using the tool and syntax for your platform:
- On Ubuntu online: apt-get install ./*.deb
- On Ubuntu offline: dpkg -i *.deb
- On CentOS and RHEL: yum install *.rpm
Activate the server: /opt/microsoft/mlserver/9.4.7/bin/R/activate.sh

Repeat this procedure on remaining nodes.
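
Repeating the procedure by hand is error-prone on larger clusters, so it can help to generate the per-node commands first and review them. The sketch below only prints commands and executes nothing. It makes several assumptions that are not in the steps above: the packages are already staged in /tmp/mlsdatanode on each node, root SSH access is available, the node names are placeholders, and yum is given -y to run without prompts. The function names install_cmd and plan_for_nodes are hypothetical.

```shell
ACTIVATE=/opt/microsoft/mlserver/9.4.7/bin/R/activate.sh
STAGING=/tmp/mlsdatanode   # assumed to hold the packages on every node

# install_cmd PLATFORM: prints the install command for one platform.
install_cmd() {
    case "$1" in
        centos|rhel) echo "yum install -y *.rpm" ;;   # -y is an added assumption
        ubuntu)      echo "dpkg -i *.deb" ;;
        *)           return 1 ;;
    esac
}

# plan_for_nodes PLATFORM NODE...
# Prints one ssh command per node (dry run only, nothing is executed).
plan_for_nodes() {
    platform=$1; shift
    cmd=$(install_cmd "$platform") || return 1
    for node in "$@"; do
        printf "ssh root@%s 'cd %s && %s && %s'\n" \
            "$node" "$STAGING" "$cmd" "$ACTIVATE"
    done
}

# Usage: datanode1 and datanode2 are placeholder host names.
plan_for_nodes centos datanode1 datanode2
```

Printing the plan rather than running it keeps the sketch safe to try, and the output can be reviewed before it is pasted into a terminal or passed to a distributed shell.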

Packages list

The following packages comprise a full Machine Learning Server installation:

 microsoft-mlserver-packages-r-9.4.7        ** core
 microsoft-mlserver-python-9.4.7            ** core
 microsoft-mlserver-packages-py-9.4.7       ** core
 microsoft-mlserver-hadoop-9.4.7            ** hadoop (required for hadoop)
 microsoft-mlserver-mml-r-9.4.7             ** microsoftml for R (optional)
 microsoft-mlserver-mml-py-9.4.7            ** microsoftml for Python (optional)
 microsoft-mlserver-mlm-r-9.4.7             ** pre-trained models (requires mml)
 microsoft-mlserver-mlm-py-9.4.7            ** pre-trained models (requires mml)
 microsoft-mlserver-adminutil-9.4.7         ** operationalization (optional)
 microsoft-mlserver-computenode-9.4.7       ** operationalization (optional)
 microsoft-mlserver-config-rserve-9.4.7     ** operationalization (optional) 
 microsoft-mlserver-dotnet-9.4.7            ** operationalization (optional)
 microsoft-mlserver-webnode-9.4.7           ** operationalization (optional)
 azure-cli-2.0.25-1.el7.x86_64              ** operationalization (optional)

The microsoft-mlserver-python-9.4.7 package provides Miniconda 4.5.12 with Python 3.7.1. It runs as mlserver-python and is located at /opt/microsoft/mlserver/9.4.7/bin/python/python.

Microsoft R Open is required for R execution:

 microsoft-r-open-foreachiterators-3.5.2 
 microsoft-r-open-mkl-3.5.2
 microsoft-r-open-mro-3.5.2

Microsoft .NET Core 2.0, used for operationalization, must be added to Ubuntu:

 dotnet-host-2.0.0
 dotnet-hostfxr-2.0.0
 dotnet-runtime-2.0.0

Additional open-source packages could be required. The potential list of packages varies for each computer. Refer to offline installation for an example list.
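
After a partial installation, you may want to confirm that no required package was left out. The following sketch compares a list of installed package names against the core, Hadoop, and Microsoft R Open packages listed above. The function name missing_packages is hypothetical, and the sketch assumes that the names shown in the lists match the name field reported by the package manager exactly.

```shell
# Packages marked core or required in the lists above, one per line.
REQUIRED_PACKAGES="microsoft-mlserver-packages-r-9.4.7
microsoft-mlserver-python-9.4.7
microsoft-mlserver-packages-py-9.4.7
microsoft-mlserver-hadoop-9.4.7
microsoft-r-open-foreachiterators-3.5.2
microsoft-r-open-mkl-3.5.2
microsoft-r-open-mro-3.5.2"

# missing_packages
# Reads installed package names on stdin, one per line, and prints each
# required package that does not appear in that input.
missing_packages() {
    installed=$(cat)
    for pkg in $REQUIRED_PACKAGES; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        if ! printf '%s\n' "$installed" | grep -qxF -- "$pkg"; then
            echo "$pkg"
        fi
    done
}

# Usage, depending on the platform (left as comments so nothing runs here):
#   rpm -qa --qf '%{NAME}\n' | missing_packages
#   dpkg-query -W -f='${Package}\n' | missing_packages
```

An empty result means every required package was found in the input.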

Next steps

We recommend starting with How to use RevoScaleR with Spark or How to use RevoScaleR with Hadoop MapReduce.

For a list of functions that use YARN and Hadoop infrastructure to process in parallel across the cluster, see Running a distributed analysis using RevoScaleR functions.

R solutions that execute on the cluster can call functions from any R package. To add new R packages, you can use any of these approaches:

Use the RevoScaleR rxExec function to add new packages.
Manually run install.packages() on all nodes in the Hadoop cluster (using a distributed shell or some other mechanism).