Microsoft ML on Spark and Hadoop

MicrosoftML is a new package for Microsoft R Server that adds state-of-the-art algorithms and data transforms to Microsoft R Server. Until now, the MicrosoftML package was available in Microsoft R Server for Windows and in SQL Server vNext. We are now bringing the power of these algorithms to Spark and Hadoop.

Training on a Hadoop/Spark cluster runs in parallel on the worker nodes using ensembling. You can also use ensembling to combine multiple models, which will be the subject of a future blog post.

Below is an example of how to run a regression with rxFastTrees() in Spark.

In this example, we will:

  1. Create a connection to Spark using rxSparkConnect().
  2. Create a text data source from the AirlineDemoSmall.csv data, which you should already have copied to the /user/RevoScaleR/<your username> directory in HDFS.
  3. Build an ensemble model using rxFastTrees(). To use Spark, you need to pass an ensemble object to rxFastTrees(). The Spark cluster is then used to build the models in the ensemble in parallel.
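
The three steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive listing: it assumes Microsoft R Server 9.1 with the RevoScaleR and MicrosoftML packages on a Spark cluster, it assumes the ensemble object is created with ensembleControl() left at its default settings, and the formula, the missing-value string, and the variable names (hdfsDir, airlineData, model) are illustrative choices, not taken from this post.

```r
# Sketch only: needs Microsoft R Server 9.1 on a Spark cluster to run.
library(RevoScaleR)
library(MicrosoftML)

# 1. Connect to Spark; this also makes Spark the active compute context.
cc <- rxSparkConnect()

# 2. Create a text data source over the CSV file in HDFS.
#    Replace <your username> with your own HDFS user directory.
hdfsDir <- "/user/RevoScaleR/<your username>"
airlineData <- RxTextData(
  file = file.path(hdfsDir, "AirlineDemoSmall.csv"),
  missingValueString = "M",  # assumption: missing values are coded as "M"
  colInfo = list(
    DayOfWeek = list(
      type = "factor",
      levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
    )
  ),
  fileSystem = RxHdfsFileSystem()
)

# 3. Train a regression ensemble with rxFastTrees().
#    Assumption: ensembleControl() builds the ensemble object with its
#    default settings; tune its arguments for real workloads.
model <- rxFastTrees(
  ArrDelay ~ CRSDepTime + DayOfWeek,  # illustrative formula
  data = airlineData,
  type = "regression",
  ensemble = ensembleControl()
)

# Close the Spark connection when done.
rxSparkDisconnect(cc)
```

Passing the ensemble object is what lets the individual models be trained on the worker nodes in parallel, as described in step 3.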

For a comprehensive view of all the capabilities in Microsoft R Server 9.1, refer to this blog.

Authors: Regi John and Premal Shah