Tutorial: Sentiment analysis with .NET for Apache Spark and ML.NET
This tutorial teaches you how to do sentiment analysis of online reviews using ML.NET and .NET for Apache Spark. ML.NET is a free, cross-platform, open-source machine learning framework. You can use ML.NET with .NET for Apache Spark to scale the training and prediction of machine learning algorithms.
In this tutorial, you learn how to:
- Create a sentiment analysis model using ML.NET Model Builder in Visual Studio.
- Create a .NET for Apache Spark console app.
- Write and implement a user-defined function.
- Run a .NET for Apache Spark console app.
Warning
.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.
Prerequisites
If you haven't developed a .NET for Apache Spark application before, start with the Getting Started tutorial to become familiar with the basics. Complete all of the prerequisites for the Getting Started tutorial before you continue with this tutorial.
This tutorial uses the ML.NET Model Builder (preview), a visual interface available in Visual Studio. If you don't already have Visual Studio, you can download the Community version of Visual Studio for free.
Download and install ML.NET Model Builder (preview).
Download the yelptest.csv and yelptrain.csv Yelp review datasets.
Review the data
The Yelp reviews dataset contains online Yelp reviews about various services. Open yelptrain.csv and notice the structure of the data. The first column contains review text, and the second column contains sentiment scores. If the sentiment score is 1, the review is positive, and if the sentiment score is 0, the review is negative.
The following table contains sample data:
ReviewText | Sentiment |
---|---|
Wow... Loved this place. | 1 |
Crust is not good. | 0 |
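To make the two-column layout concrete, here is a minimal C# sketch that turns one data row into a typed value. The YelpReview type and its ParseLine method are hypothetical names introduced for illustration; they are not part of the tutorial's project. The sketch assumes the file is comma-delimited and that the sentiment score is always the last field. It does not handle escaped quotes inside the review text.

```csharp
using System;

// Hypothetical type for illustration; the tutorial itself does not define it.
public sealed class YelpReview
{
    public string ReviewText { get; }
    public bool IsPositive { get; }

    public YelpReview(string reviewText, bool isPositive)
    {
        ReviewText = reviewText;
        IsPositive = isPositive;
    }

    // Parses one data row of the form: <review text>,<0 or 1>
    // Assumption: the sentiment score is the last comma-separated field,
    // so commas inside the review text are left untouched.
    public static YelpReview ParseLine(string line)
    {
        int lastComma = line.LastIndexOf(',');
        if (lastComma < 0)
        {
            throw new FormatException("Expected '<review text>,<sentiment>'.");
        }

        // Strip surrounding whitespace and optional surrounding quotes.
        string text = line.Substring(0, lastComma).Trim().Trim('"');
        string score = line.Substring(lastComma + 1).Trim();

        if (score != "0" && score != "1")
        {
            throw new FormatException("Sentiment must be 0 or 1.");
        }

        // 1 means positive, 0 means negative.
        return new YelpReview(text, score == "1");
    }
}
```

Calling YelpReview.ParseLine on a row whose last field is 1 yields a value with IsPositive set to true. In the tutorial itself, Spark reads the file for you, so this kind of manual parsing is not needed.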
Build your machine learning model
Open Visual Studio and create a new C# Console App (.NET Core). Name the project MLSparkModel.
In Solution Explorer, right-click the MLSparkModel project. Then select Add > Machine Learning.
From the ML.NET Model Builder, select the Sentiment Analysis scenario tile.
On the Add data page, upload the yelptrain.csv data set.
Choose Sentiment from the Columns to Predict dropdown.
On the Train page, set the time to train to 60 seconds and select Start training. Notice the status of your training under Progress.
Once Model Builder is finished training, Evaluate the training results. You can type phrases into the text box below Try your model and select Predict to see the output.
Select Code and then select Add Projects to add the ML model to the solution.
Notice that two projects are added to your solution: MLSparkModelML.ConsoleApp and MLSparkModelML.Model.
Double-click on your MLSparkModel C# project and notice that the following project reference has been added.
<ItemGroup>
  <ProjectReference Include="..\MLSparkModelML.Model\MLSparkModelML.Model.csproj" />
</ItemGroup>
Create a console app
Model Builder creates a console app for you.
Right-click on MLSparkModelML.ConsoleApp in Solution Explorer, and select Manage NuGet Packages.
Search for Microsoft.Spark and install the package. Microsoft.ML is automatically installed for you by Model Builder.
Create a SparkSession
Open the Program.cs file for MLSparkModelML.ConsoleApp. This file was autogenerated by Model Builder. Delete the using statements, the contents of the Main() method, and the CreateSingleDataSample region.
Add the following additional using statements to the top of Program.cs:
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.Spark.Sql;
using MLSparkModelML.Model;
Change the DATA_FILEPATH to the path of your yelptest.csv.
Add the following code to your Main method to create a new SparkSession. The Spark Session is the entry point to programming Spark with the Dataset and DataFrame API.
SparkSession spark = SparkSession
    .Builder()
    .AppName(".NET for Apache Spark Sentiment Analysis")
    .GetOrCreate();
Calling the spark object created above allows you to access Spark and DataFrame functionality throughout your program.
Create a DataFrame and print to console
Read in the Yelp review data from the yelptest.csv file as a DataFrame. Include header and inferSchema options. The header option reads the first line of yelptest.csv as column names instead of data. The inferSchema option infers column types based on the data.
// Read() returns a batch DataFrame, which Show() can print directly
DataFrame df = spark
    .Read()
    .Option("header", true)
    .Option("inferSchema", true)
    .Csv(DATA_FILEPATH);
df.Show();
Register a user-defined function
You can use user-defined functions (UDFs) in Spark applications to do calculations and analysis on your data. In this tutorial, you use ML.NET with a UDF to evaluate each Yelp review.
Add the following code to your Main method to register a UDF called MLudf.
spark.Udf()
.Register<string, bool>("MLudf", predict);
This UDF takes a Yelp review string as input, and outputs true or false for positive or negative sentiments, respectively. It uses the predict() method that you define in a later step.
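The type arguments in Register<string, bool> describe the shape of the function being registered: one string in, one bool out. The sketch below shows that shape in plain C#, without Spark. KeywordPredict and UdfShapeSketch are hypothetical names, and the keyword check is only a stand-in; it is not the ML.NET model and says nothing about how the real predict() method scores a review.

```csharp
using System;

// Hypothetical illustration of the delegate shape a <string, bool> UDF needs.
public static class UdfShapeSketch
{
    // Stand-in for predict(): a simple keyword check, NOT an ML.NET model.
    public static bool KeywordPredict(string text)
    {
        return text != null
            && text.IndexOf("loved", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Any method that takes a string and returns a bool converts to
    // Func<string, bool>, which is the shape Register<string, bool> accepts.
    public static readonly Func<string, bool> Udf = KeywordPredict;
}
```

Because the registration only depends on the signature, you can test the prediction logic as an ordinary C# method before wiring it into Spark.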
Use Spark SQL to call the UDF
Now that you've read in your data and incorporated ML, use Spark SQL to call the UDF that will run sentiment analysis on each row of your DataFrame. Add the following code to your Main method:
// Use Spark SQL to call ML.NET UDF
// Display results of sentiment analysis on reviews
df.CreateOrReplaceTempView("Reviews");
DataFrame sqlDf = spark.Sql("SELECT ReviewText, MLudf(ReviewText) FROM Reviews");
sqlDf.Show();
// Print out first 20 rows of data
// Prevent data getting cut off by setting truncate = 0
sqlDf.Show(20, 0, false);
spark.Stop();
Create predict() method
Add the following code before your Main() method. This code is similar to what is produced by Model Builder in ConsumeModel.cs. Moving this method into your console app loads the model once, in the static constructor, each time you run your app.
private static readonly PredictionEngine<ModelInput, ModelOutput> _predictionEngine;

// Static constructor: loads the model and builds the prediction engine.
static Program()
{
    MLContext mlContext = new MLContext();
    ITransformer model = mlContext.Model.Load("MLModel.zip", out DataViewSchema schema);
    _predictionEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
}

// UDF body: returns the model's prediction for a single review.
static bool predict(string text)
{
    ModelInput input = new ModelInput
    {
        ReviewText = text
    };
    return _predictionEngine.Predict(input).Prediction;
}
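The code above does its model loading in a static constructor. In C#, a static constructor runs at most once per application domain, before the first use of the type, so repeated calls to a static method reuse whatever it set up. The following minimal sketch illustrates that behavior in isolation. LoadOnceSketch is a hypothetical class, and the "load" is simulated by incrementing a counter; no ML.NET model is involved.

```csharp
using System;

// Hypothetical stand-in for the Program class; "loading" only bumps a counter.
public static class LoadOnceSketch
{
    // Counts how many times the simulated load has run.
    public static int LoadCount { get; private set; }

    // Simulated prediction engine, created once in the static constructor.
    private static readonly Func<string, bool> _engine;

    static LoadOnceSketch()
    {
        // Runs once, before the first access to any static member.
        LoadCount++;
        _engine = text => text.Length > 0;
    }

    // Reuses the engine created by the static constructor on every call.
    public static bool Predict(string text) => _engine(text);
}
```

After any number of calls to LoadOnceSketch.Predict, LoadCount is still 1, which is the property that keeps per-row predictions from repeating the expensive setup.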
Add the model to your console app
In Solution Explorer, copy the MLModel.zip file from the MLSparkModelML.Model project and paste it in the MLSparkModelML.ConsoleApp project. A reference is automatically added in MLSparkModelML.ConsoleApp.csproj.
Run your code
Use spark-submit to run your code. Navigate to your console app's root folder using the command prompt and run the following commands.
First, clean and publish your app.
dotnet clean
dotnet publish
Then navigate to the console app's publish folder and run the following spark-submit command. Remember to update the command with the actual path of your Microsoft Spark jar file.
%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2-4_2.11-1.0.0.jar dotnet MLSparkModelML.ConsoleApp.dll
Get the code
This tutorial is similar to the code from the Sentiment Analysis with Big Data example.
Next steps
Advance to the next article to learn how to do Structured Streaming with .NET for Apache Spark.