Tutorial: Sentiment analysis with .NET for Apache Spark and ML.NET

This tutorial teaches you how to do sentiment analysis of online reviews using ML.NET and .NET for Apache Spark. ML.NET is a free, cross-platform, open-source machine learning framework. You can use ML.NET with .NET for Apache Spark to scale the training and prediction of machine learning algorithms.

In this tutorial, you learn how to:

  • Create a sentiment analysis model using ML.NET Model Builder in Visual Studio.
  • Create a .NET for Apache Spark console app.
  • Write and implement a user-defined function.
  • Run a .NET for Apache Spark console app.

Warning

.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.

Prerequisites

  • If you haven't developed a .NET for Apache Spark application before, start with the Getting Started tutorial to become familiar with the basics. Complete all of the prerequisites for the Getting Started tutorial before you continue with this tutorial.

  • This tutorial uses the ML.NET Model Builder (preview), a visual interface available in Visual Studio. If you don't already have Visual Studio, you can download the Community version of Visual Studio for free.

  • Download and install ML.NET Model Builder (preview).

  • Download the yelptest.csv and yelptrain.csv Yelp review datasets.

Review the data

The Yelp reviews dataset contains online Yelp reviews about various services. Open yelptrain.csv and notice the structure of the data. The first column contains review text, and the second column contains sentiment scores. If the sentiment score is 1, the review is positive, and if the sentiment score is 0, the review is negative.

The following table contains sample data:

ReviewText                  Sentiment
Wow... Loved this place.    1
Crust is not good.          0
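
As a rough illustration of this layout, the following C# sketch splits one line of the file into its review text and a positive/negative flag. The ReviewRow helper is hypothetical and not part of the tutorial project. It assumes the two columns are comma-separated with the score as the last field, and it doesn't implement full CSV quoting rules. In the tutorial itself, Spark parses the file for you.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the tutorial project.
public static class ReviewRow
{
    // Splits one data line at its last comma: everything before it is treated
    // as the review text, everything after it as the 0/1 sentiment score.
    // Assumption: the file is comma-separated and the score is the final field.
    public static (string ReviewText, bool IsPositive) Parse(string line)
    {
        int lastComma = line.LastIndexOf(',');
        if (lastComma < 0)
        {
            throw new FormatException("Expected '<review text>,<0 or 1>'.");
        }

        // Trim whitespace and any surrounding double quotes from the text.
        string text = line.Substring(0, lastComma).Trim().Trim('"');
        string score = line.Substring(lastComma + 1).Trim();

        return score switch
        {
            "1" => (text, true),   // 1 means a positive review
            "0" => (text, false),  // 0 means a negative review
            _ => throw new FormatException($"Unexpected sentiment score '{score}'.")
        };
    }
}
```

For example, ReviewRow.Parse("Crust is not good.,0") yields the text "Crust is not good." with IsPositive set to false.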

Build your machine learning model

  1. Open Visual Studio and create a new C# Console App (.NET Core). Name the project MLSparkModel.

  2. In Solution Explorer, right-click the MLSparkModel project. Then select Add > Machine Learning.

  3. From the ML.NET Model Builder, select the Sentiment Analysis scenario tile.

  4. On the Add data page, upload the yelptrain.csv data set.

  5. Choose Sentiment from the Columns to Predict dropdown.

  6. On the Train page, set the time to train to 60 seconds and select Start training. Notice the status of your training under Progress.

  7. Once Model Builder has finished training, evaluate the training results. You can type phrases into the text box below Try your model and select Predict to see the output.

  8. Select Code and then select Add Projects to add the ML model to the solution.

  9. Notice that two projects are added to your solution: MLSparkModelML.ConsoleApp and MLSparkModelML.Model.

  10. Double-click on your MLSparkModel C# project and notice that the following project reference has been added.

    <ItemGroup>
        <ProjectReference Include="..\MLSparkModelML.Model\MLSparkModelML.Model.csproj" />
    </ItemGroup>
    

Create a console app

Model Builder creates a console app for you.

  1. Right-click on MLSparkModelML.ConsoleApp in Solution Explorer, and select Manage NuGet Packages.

  2. Search for Microsoft.Spark and install the package. Microsoft.ML is automatically installed for you by Model Builder.

Create a SparkSession

  1. Open the Program.cs file for MLSparkModelML.ConsoleApp. This file was autogenerated by Model Builder. Delete the using statements, the contents of the Main() method, and the CreateSingleDataSample region.

  2. Add the following using statements to the top of Program.cs:

    using System;
    using System.Collections.Generic;
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using Microsoft.Spark.Sql;
    using MLSparkModelML.Model;
    
  3. Change the DATA_FILEPATH to the path of your yelptest.csv.

  4. Add the following code to your Main method to create a new SparkSession. The SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.

    SparkSession spark = SparkSession
         .Builder()
         .AppName(".NET for Apache Spark Sentiment Analysis")
         .GetOrCreate();
    

    The spark object created above gives you access to Spark and DataFrame functionality throughout your program.

Create a DataFrame and print to console

Read in the Yelp review data from the yelptest.csv file as a DataFrame. Include header and inferSchema options. The header option reads the first line of yelptest.csv as column names instead of data. The inferSchema option infers column types based on the data.

DataFrame df = spark
    .Read()
    .Option("header", true)
    .Option("inferSchema", true)
    .Csv(DATA_FILEPATH);

df.Show();

Register a user-defined function

You can use UDFs, user-defined functions, in Spark applications to do calculations and analysis on your data. In this tutorial, you use ML.NET with a UDF to evaluate each Yelp review.

Add the following code to your Main method to register a UDF called MLudf.

spark.Udf()
    .Register<string, bool>("MLudf", predict);

This UDF takes a Yelp review string as input, and outputs true or false for positive or negative sentiments, respectively. It uses the predict() method that you define in a later step.
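
Because the UDF is built from an ordinary method that maps a string to a bool, a placeholder with the same shape can stand in while you wire up the Spark side. The sketch below is an assumption for illustration: the PlaceholderSentiment class and its keyword rule are hypothetical and are not the ML.NET-backed predict() method defined later. Treating a null or empty review as negative is also an arbitrary choice made only for this sketch.

```csharp
using System;

// Hypothetical placeholder; not the tutorial's ML.NET-backed predict() method.
public static class PlaceholderSentiment
{
    // Same shape as the method passed to Register<string, bool>:
    // one review string in, one bool out (true = positive sentiment).
    public static bool Predict(string text)
    {
        // Assumption for this sketch: missing or blank reviews count as negative.
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        // Naive keyword rule used only as a stand-in for a trained model.
        string lower = text.ToLowerInvariant();
        return lower.Contains("love") || lower.Contains("great");
    }

    // The delegate type that a string-to-bool UDF is created from.
    public static readonly Func<string, bool> AsUdf = Predict;
}
```

In the tutorial's Main method, such a placeholder could be registered the same way as the real method, for example spark.Udf().Register<string, bool>("MLudf", PlaceholderSentiment.Predict);, and swapped for predict once the model is in place.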

Use Spark SQL to call the UDF

Now that you've read in your data and incorporated ML, use Spark SQL to call the UDF that will run sentiment analysis on each row of your DataFrame. Add the following code to your Main method:

// Use Spark SQL to call ML.NET UDF
// Display results of sentiment analysis on reviews
df.CreateOrReplaceTempView("Reviews");
DataFrame sqlDf = spark.Sql("SELECT ReviewText, MLudf(ReviewText) FROM Reviews");
sqlDf.Show();

// Print out first 20 rows of data
// Prevent data getting cut off by setting truncate = 0
sqlDf.Show(20, 0, false);

spark.Stop();

Create predict() method

Add the following code before your Main() method. This code is similar to what Model Builder produces in ConsumeModel.cs. Moving this method into your console app and loading the model in a static constructor avoids reloading the model every time predict() is called.

private static readonly PredictionEngine<ModelInput, ModelOutput> _predictionEngine;

static Program()
{
    MLContext mlContext = new MLContext();
    ITransformer model = mlContext.Model.Load("MLModel.zip", out DataViewSchema schema);
    _predictionEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
}

static bool predict(string text)
{
    ModelInput input = new ModelInput
    {
        ReviewText = text
    };

    return _predictionEngine.Predict(input).Prediction;
}
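
The load-once behavior comes from the static constructor, which C# runs a single time before the type is first used. The following sketch shows that pattern in isolation. FakeModel and Scorer are hypothetical names, and FakeModel is a stand-in with a trivial rule rather than an ML.NET model, so the example has no package dependencies.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the ML.NET model; it only counts how often it is created.
public sealed class FakeModel
{
    private static int _loadCount;

    // Number of times a FakeModel has been constructed in this process.
    public static int LoadCount => _loadCount;

    public FakeModel()
    {
        // Stands in for the costly step of loading MLModel.zip.
        Interlocked.Increment(ref _loadCount);
    }

    // Trivial rule used only for this sketch: reviews containing "not" are negative.
    public bool Predict(string text) =>
        !text.Contains("not", StringComparison.OrdinalIgnoreCase);
}

public static class Scorer
{
    private static readonly FakeModel _model;

    // Runs once, before the first call to Predict, like "static Program()" above.
    static Scorer()
    {
        _model = new FakeModel();
    }

    // Every call reuses the single model instance created by the static constructor.
    public static bool Predict(string text) => _model.Predict(text);
}
```

Calling Scorer.Predict any number of times leaves FakeModel.LoadCount at 1, which mirrors how the tutorial's prediction engine is created once per process and then reused for each row.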

Add the model to your console app

In Solution Explorer, copy the MLModel.zip file from the MLSparkModelML.Model project and paste it in the MLSparkModelML.ConsoleApp project. A reference is automatically added in MLSparkModelML.ConsoleApp.csproj.

Run your code

Use spark-submit to run your code. Navigate to your console app's root folder using the command prompt and run the following commands.

First, clean and publish your app.

dotnet clean
dotnet publish

Then navigate to the console app's publish folder and run the following spark-submit command. Remember to update the command with the actual path of your Microsoft Spark jar file.

%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2-4_2.11-1.0.0.jar dotnet MLSparkModelML.ConsoleApp.dll

Get the code

The code in this tutorial is similar to the Sentiment Analysis with Big Data example.

Next steps

Advance to the next article to learn how to do Structured Streaming with .NET for Apache Spark.