Tutorial: Analyze sentiment of website comments with binary classification in ML.NET

2021-11-18

This tutorial shows you how to create a .NET console application that classifies sentiment from website comments and takes the appropriate action. The binary sentiment classifier uses C# in Visual Studio 2022.

In this tutorial, you learn how to:

Create a console application
Prepare data
Load the data
Build and train the model
Evaluate the model
Use the model to make a prediction
See the results

You can find the source code for this tutorial at the dotnet/samples repository.

Prerequisites

Visual Studio 2022.
UCI Sentiment Labeled Sentences dataset (ZIP file)

Create a console application

Create a C# Console Application called "SentimentAnalysis". Click the Next button.
Choose .NET 8 as the framework to use. Click the Create button.
Create a directory named Data in your project to save your data set files.
Install the Microsoft.ML NuGet Package:

Note

This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.

In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the package source, and then select the Browse tab. Search for Microsoft.ML, select the package you want, and then select Install. Proceed with the installation by agreeing to the license terms for the package you choose.

Prepare your data

Note

The datasets for this tutorial are from the 'From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015, and hosted at the UCI Machine Learning Repository - Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Download UCI Sentiment Labeled Sentences dataset ZIP file, and unzip.
Copy the yelp_labelled.txt file into the Data directory you created.
In Solution Explorer, right-click the yelp_labelled.txt file and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Create classes and define paths

Add the following additional using directives to the top of the Program.cs file:

using Microsoft.ML;
using Microsoft.ML.Data;
using SentimentAnalysis;
using static Microsoft.ML.DataOperationsCatalog;

Add the following code to the line right below the using directives, to create a field to hold the recently downloaded dataset file path:
```
string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "yelp_labelled.txt");
```
Next, create classes for your input data and predictions. Add a new class to your project:
- In Solution Explorer, right-click the project, and then select Add > New Item.
- In the Add New Item dialog box, select Class and change the Name field to SentimentData.cs. Then, select Add.
The SentimentData.cs file opens in the code editor. Add the following using directive to the top of SentimentData.cs:
```
using Microsoft.ML.Data;
```

Remove the existing class definition and add the following code, which has two classes SentimentData and SentimentPrediction, to the SentimentData.cs file:

public class SentimentData
{
    [LoadColumn(0)]
    public string? SentimentText;

    [LoadColumn(1), ColumnName("Label")]
    public bool Sentiment;
}

public class SentimentPrediction : SentimentData
{

    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    public float Probability { get; set; }

    public float Score { get; set; }
}

How the data was prepared

The input dataset class, SentimentData, has a string for user comments (SentimentText) and a bool (Sentiment) value of either 1 (positive) or 0 (negative) for sentiment. Both fields have LoadColumn attributes attached to them, which describes the data file order of each field. In addition, the Sentiment property has a ColumnName attribute to designate it as the Label field. The following example file doesn't have a header row, and looks like this:

SentimentText	Sentiment (Label)
Waitress was a little slow in service.	0
Crust is not good.	0
Wow... Loved this place.	1
Service was very prompt.	1

SentimentPrediction is the prediction class used after model training. It inherits from SentimentData so that the input SentimentText can be displayed along with the output prediction. The Prediction boolean is the value that the model predicts when supplied with new input SentimentText.

The output class SentimentPrediction contains two other properties calculated by the model: Score - the raw score calculated by the model, and Probability - the score calibrated to the likelihood of the text having positive sentiment.

For this tutorial, the most important property is Prediction.

Load the data

Data in ML.NET is represented as an IDataView interface. IDataView is a flexible, efficient way of describing tabular data (numeric and text). Data can be loaded from a text file or in real time (for example, SQL database or log files) to an IDataView object.

The MLContext class is a starting point for all ML.NET operations. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DBContext in Entity Framework.

You prepare the app, and then load data:

Replace the Console.WriteLine("Hello World!") line with the following code to declare and initialize the mlContext variable:
```
MLContext mlContext = new MLContext();
```

Add the following as the next line of code:

TrainTestData splitDataView = LoadData(mlContext);

Create a LoadData() method at the bottom of the Program.cs file using the following code:
```
TrainTestData LoadData(MLContext mlContext)
{

}
```
The LoadData() method executes the following tasks:
- Loads the data.
- Splits the loaded dataset into train and test datasets.
- Returns the split train and test datasets.
Add the following code as the first line of the LoadData() method:
```
IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentData>(_dataPath, hasHeader: false);
```
The LoadFromTextFile() method defines the data schema and reads in the file. It takes in the data path variables and returns an IDataView.

Split the dataset for model training and testing

When preparing a model, you use part of the dataset to train it and part of the dataset to test the model's accuracy.

To split the loaded data into the needed datasets, add the following code as the next line in the LoadData() method:
```
TrainTestData splitDataView = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
```
The previous code uses the TrainTestSplit() method to split the loaded dataset into train and test datasets and return them in the DataOperationsCatalog.TrainTestData class. Specify the test set percentage of data with the testFractionparameter. The default is 10%, in this case you use 20% to evaluate more data.
Return the splitDataView at the end of the LoadData() method:
```
return splitDataView;
```

Build and train the model

Add the following call to the BuildAndTrainModelmethod below the call to the LoadData method:
```
ITransformer model = BuildAndTrainModel(mlContext, splitDataView.TrainSet);
```
The BuildAndTrainModel() method executes the following tasks:
- Extracts and transforms the data.
- Trains the model.
- Predicts sentiment based on test data.
- Returns the model.
Create the BuildAndTrainModel() method, below the LoadData() method, using the following code:
```
ITransformer BuildAndTrainModel(MLContext mlContext, IDataView splitTrainSet)
{

}
```

Extract and transform the data

Call FeaturizeText as the next line of code:

var estimator = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentData.SentimentText))

The FeaturizeText() method in the previous code converts the text column (SentimentText) into a numeric key type Features column used by the machine learning algorithm and adds it as a new dataset column:

SentimentText	Sentiment	Features
Waitress was a little slow in service.	0	[0.76, 0.65, 0.44, …]
Crust is not good.	0	[0.98, 0.43, 0.54, …]
Wow... Loved this place.	1	[0.35, 0.73, 0.46, …]
Service was very prompt.	1	[0.39, 0, 0.75, …]

Add a learning algorithm

This app uses a classification algorithm that categorizes items or rows of data. The app categorizes website comments as either positive or negative, so use the binary classification task.

Append the machine learning task to the data transformation definitions by adding the following as the next line of code in BuildAndTrainModel():

.Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features"));

The SdcaLogisticRegressionBinaryTrainer is your classification training algorithm. This is appended to the estimator and accepts the featurized SentimentText (Features) and the Label input parameters to learn from the historic data.

Train the model

Fit the model to the splitTrainSet data and return the trained model by adding the following as the next line of code in the BuildAndTrainModel() method:

Console.WriteLine("=============== Create and Train the Model ===============");
var model = estimator.Fit(splitTrainSet);
Console.WriteLine("=============== End of training ===============");
Console.WriteLine();

The Fit() method trains your model by transforming the dataset and applying the training.

Return the model trained to use for evaluation

Return the model at the end of the BuildAndTrainModel() method:

return model;

Evaluate the model

After your model is trained, use your test data to validate the model's performance.

Create the Evaluate() method, just after BuildAndTrainModel(), with the following code:
```
void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
{

}
```
The Evaluate() method executes the following tasks:
- Loads the test dataset.
- Creates the BinaryClassification evaluator.
- Evaluates the model and creates metrics.
- Displays the metrics.
Add a call to the new method below the BuildAndTrainModel method call using the following code:
```
Evaluate(mlContext, model, splitDataView.TestSet);
```
Transform the splitTestSet data by adding the following code to Evaluate():
```
Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
IDataView predictions = model.Transform(splitTestSet);
```
The previous code uses the Transform() method to make predictions for multiple provided input rows of a test dataset.

Evaluate the model by adding the following as the next line of code in the Evaluate() method:

CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");

Once you have the prediction set (predictions), the Evaluate() method assesses the model, which compares the predicted values with the actual Labels in the test dataset and returns a CalibratedBinaryClassificationMetrics object on how the model is performing.

Displaying the metrics for model validation

Use the following code to display the metrics:

Console.WriteLine();
Console.WriteLine("Model quality metrics evaluation");
Console.WriteLine("--------------------------------");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
Console.WriteLine("=============== End of model evaluation ===============");

The Accuracy metric gets the accuracy of a model, which is the proportion of correct predictions in the test set.
The AreaUnderRocCurve metric indicates how confident the model is correctly classifying the positive and negative classes. You want the AreaUnderRocCurve to be as close to one as possible.
The F1Score metric gets the model's F1 score, which is a measure of balance between precision and recall. You want the F1Score to be as close to one as possible.

Predict the test data outcome

Create the UseModelWithSingleItem() method, just after the Evaluate() method, using the following code:
```
void UseModelWithSingleItem(MLContext mlContext, ITransformer model)
{

}
```
The UseModelWithSingleItem() method executes the following tasks:
- Creates a single comment of test data.
- Predicts sentiment based on test data.
- Combines test data and predictions for reporting.
- Displays the predicted results.
Add a call to the new method right under the Evaluate() method call using the following code:
```
UseModelWithSingleItem(mlContext, model);
```
Add the following code to create as the first line in the UseModelWithSingleItem() Method:
```
PredictionEngine<SentimentData, SentimentPrediction> predictionFunction = mlContext.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);
```
The PredictionEngine is a convenience API, which allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. See this guide on how to use PredictionEnginePool in an ASP.NET Core Web API.

Note

PredictionEnginePool service extension is currently in preview.
Add a comment to test the trained model's prediction in the UseModelWithSingleItem() method by creating an instance of SentimentData:
```
SentimentData sampleStatement = new SentimentData
{
    SentimentText = "This was a very bad steak"
};
```
Pass the test comment data to the PredictionEngine by adding the following as the next lines of code in the UseModelWithSingleItem() method:
```
var resultPrediction = predictionFunction.Predict(sampleStatement);
```
The Predict() function makes a prediction on a single row of data.

Display SentimentText and corresponding sentiment prediction using the following code:

Console.WriteLine();
Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");

Console.WriteLine();
Console.WriteLine($"Sentiment: {resultPrediction.SentimentText} | Prediction: {(Convert.ToBoolean(resultPrediction.Prediction) ? "Positive" : "Negative")} | Probability: {resultPrediction.Probability} ");

Console.WriteLine("=============== End of Predictions ===============");
Console.WriteLine();

Use the model for prediction

Deploy and predict batch items

Create the UseModelWithBatchItems() method, just after the UseModelWithSingleItem() method, using the following code:
```
void UseModelWithBatchItems(MLContext mlContext, ITransformer model)
{

}
```
The UseModelWithBatchItems() method executes the following tasks:
- Creates batch test data.
- Predicts sentiment based on test data.
- Combines test data and predictions for reporting.
- Displays the predicted results.
Add a call to the new method right under the UseModelWithSingleItem() method call using the following code:
```
UseModelWithBatchItems(mlContext, model);
```

Add some comments to test the trained model's predictions in the UseModelWithBatchItems() method:

IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        SentimentText = "This was a horrible meal"
    },
    new SentimentData
    {
        SentimentText = "I love this spaghetti."
    }
};

Predict comment sentiment

Use the model to predict the comment data sentiment using the Transform() method:

IDataView batchComments = mlContext.Data.LoadFromEnumerable(sentiments);

IDataView predictions = model.Transform(batchComments);

// Use model to predict whether comment data is Positive (1) or Negative (0).
IEnumerable<SentimentPrediction> predictedResults = mlContext.Data.CreateEnumerable<SentimentPrediction>(predictions, reuseRowObject: false);

Combine and display the predictions

Create a header for the predictions using the following code:

Console.WriteLine();

Console.WriteLine("=============== Prediction Test of loaded model with multiple samples ===============");

Because SentimentPrediction is inherited from SentimentData, the Transform() method populated SentimentText with the predicted fields. As the ML.NET process processes, each component adds columns, and this makes it easy to display the results:

foreach (SentimentPrediction prediction  in predictedResults)
{
    Console.WriteLine($"Sentiment: {prediction.SentimentText} | Prediction: {(Convert.ToBoolean(prediction.Prediction) ? "Positive" : "Negative")} | Probability: {prediction.Probability} ");
}
Console.WriteLine("=============== End of predictions ===============");

Results

Your results should be similar to the following. During processing, messages are displayed. You may see warnings, or processing messages. These have been removed from the following results for clarity.

Model quality metrics evaluation
--------------------------------
Accuracy: 83.96%
Auc: 90.51%
F1Score: 84.04%

=============== End of model evaluation ===============

=============== Prediction Test of model with a single sample and test dataset ===============

Sentiment: This was a very bad steak | Prediction: Negative | Probability: 0.1027377
=============== End of Predictions ===============

=============== Prediction Test of loaded model with a multiple samples ===============

Sentiment: This was a horrible meal | Prediction: Negative | Probability: 0.1369192
Sentiment: I love this spaghetti. | Prediction: Positive | Probability: 0.9960636
=============== End of predictions ===============

=============== End of process ===============
Press any key to continue . . .

Congratulations! You've now successfully built a machine learning model for classifying and predicting messages sentiment.

Building successful models is an iterative process. This model has initial lower quality as the tutorial uses small datasets to provide quick model training. If you aren't satisfied with the model quality, you can try to improve it by providing larger training datasets or by choosing different training algorithms with different hyper-parameters for each algorithm.