October 2015

Volume 30 Number 10

Test Run - Linear Discriminate Analysis Using C#

By James McCaffrey

The goal of a binary classification problem is to predict the value of a variable that can take on one of two possible values. For example, you might want to predict the change in the stock price of some company (increase or decrease) based on predictor variables like average number of shares sold, number of insider transactions, price to earnings ratio and so forth. Or you might want to predict the political inclination of a person (liberal or conservative) based on age, income, education level and so on.

There are roughly a dozen major algorithms that can be used for binary classification. Linear discriminate analysis (LDA) is one of the oldest approaches, dating from the 1930s. This article explains what LDA is, describes how it works and shows you how to imple­ment LDA with code.

Realistically, it’s unlikely you’ll ever need to write LDA code. Never­theless, there are three reasons you might find this article useful. First, understanding how to program LDA gives you a thorough grasp of how LDA works in case you ever come across it. Second, some of the coding techniques used when implementing LDA can be useful in other, more common programming scenarios. Finally, the ideas behind LDA are extraordinarily clever and you might find LDA interesting for its own sake.

The best way to get a feel for what LDA binary classification is and to see where this article is headed is to take a look at the demo program in Figure 1.

Linear Descriminate Analysis Binary Classification Demo
Figure 1 Linear Descriminate Analysis Binary Classification Demo

The goal of the demo program is to predict the political inclination, liberal or conservative, of a person based on the person’s age and annual income. The demo uses a tiny training data set with just eight items to create the LDA prediction model. Both age and income have been normalized in some way so that their magnitudes are both roughly the same.

Political inclination has been encoded so that liberal is 0 and conservative is 1. Unlike many binary classification algorithms, LDA can use any type of encoding for the variable to predict. So political inclination could have been encoded as -1 and +1, or “A” and “B.”

Basic LDA binary classification can work with any number of predictor variables. And it’s possible to extend basic LDA to predict a variable that can take one of three or more values. For example, you could predict political inclination where the possible values are liberal, conservative and moderate. This article describes only binary classification.

The key to LDA is something called the linear discriminate, usually represented by lowercase “w.” Using the eight data items, the demo calculates w = (-0.868, -0.496). The w array will have the same number of values as there are predictor variables. When computing w, behind the scenes the demo calculates the means for each of the two classes, then uses the means to calculate scatter matrices for each class, and finally uses the scatter matrices to calculate a combined within-class scatter matrix. The within-class matrix is needed to calculate w.

After calculating w, the demo uses it to predict the class for a new person who has normalized age = 4.0 and normalized income = 5.0. The LDA prediction is that the person has a liberal political inclination.

This article assumes you have at least intermediate programming skills, but doesn’t assume you know anything about the LDA classification algorithm. The demo is coded using C#, but you shouldn’t have much trouble if you want to refactor the code to another language such as Visual Basic .NET or JavaScript.

Understanding LDA Binary Classification

The main ideas of LDA binary classification are illustrated in the two graphs shown in Figure 2. The top graph shows the data from the demo program. The three blue dots represent the three people who are liberal. The five red dots are the conservatives. The yellow dot at (4.0, 5.0) represents a person with unknown political inclination. Even for such a simple problem, it’s not obvious if the unknown person should be classified as a liberal or a conservative.

LDA Prediction Using the Discriminate Vector

Figure 2 LDA Prediction Using the Discriminate Vector

Note that the only reason the demo data could be graphed as shown in Figure 2 is that there are just two predictor variables. If there were more than two predictor variables, it wouldn’t be possible to see the data so nicely in a 2D graph, but the same principles of LDA would apply. No matter how many predictor variables there are in binary LDA, there will only be two classes. It’s easy to confuse the number of predictor variables (two or more) with the number of values that the variable to predict can take (always exactly two for binary classification).

Most binary classification algorithms would try to find a line (technically a vector) that explicitly separates the two classes. For example, in the top graph in Figure 2, you can imagine a hypothetical separating line that runs from about (3, 9) down to (5, 0). Any unknown data point on the left side of the line would be classified as liberal, and any point on the right side would be a conservative. LDA works in an entirely different way.

The first step in LDA is to find a line called the discriminate. In Figure 2, the discriminate line is colored green, and the discriminate is identified as (0.868, 0.496). In LDA, the discriminate line is always identified by a single end point and the starting point is always implicitly (0, 0, . ., 0).

But wait a minute; the output of the demo program in Figure 1 shows the discriminate is (-0.868, -0.496) rather than (+0.868, +0.496). As it turns out, you can multiply the components of the discriminate by any constant and there will be no effect on the result. Therefore, to make the bottom graph in Figure 2 look nicer and to illustrate this important idea, I used (+0.868, +0.496) for w rather than the actual calculated values, which were negative. In other words, I multiplied both components by -1.

Put another way, the discriminate could be identified by any point on the green line in Figure 2, such as -2 * (-0.868, -0.496) = (1.736, 0.992). In LDA, the standard, but by no means universal, approach is to scale the discriminate so that it has length 1 from the origin. Notice that the length from (0, 0) to (0.868, 0.496) = sqrt( (0 - 0.868)^2 + (0 - 0.496)^2 ) = sqrt(0.754 + 0.246) = sqrt(1.000) = 1.

But what’s the significance of the discriminate vector? In Figure 2, there are black dashed lines from each data point projected onto the discriminate line, where the dashed lines are perpendicular to the discriminate. As it turns out, the discriminate is the one line, starting at (0, 0) that simultaneously minimizes the distance of the projected points for each class, and maximizes the distance between the two groups of projected points. This idea is not at all obvious.

OK, but even if it’s possible to calculate the discriminate vector for a set of data, it’s still not clear how the discriminate can be used to make a prediction. In Figure 2, the purple diamond labeled “mean” is the average data point of the two class means. You can see the mean is halfway between the two groups of data points. Now imagine a perpendicular dashed line from the purple mean down to the green discriminate line. The projection line would hit at about (5.2, 3.0).

And now imagine a perpendicular line from the yellow point to classify down to the green discriminate line. Because the projection of the point to classify falls to the left of the projection from the mean, it’s closer to the projections of the liberal data points, and therefore is classified as liberal. If the projection from the unknown point to the discriminate line had fallen on the other side of the projection from the mean, the point would have been classified as conservative. An incredibly clever idea.

Implementing LDA Binary Classification

The overall structure of the demo program, with a few WriteLine statements removed and minor edits to save space, is presented in Figure 3. To create the demo program, I launched Visual Studio and created a new C# console application named LinearDiscriminate. The demo has no significant .NET version dependencies so any version of Visual Studio should work. After the template code loaded into the editor, I deleted all using statements except for the single reference to the top-level System namespace. In the Solution Explorer window, I renamed file Program.cs to LinearDiscriminateProgram.cs and allowed Visual Studio to automatically rename class Program for me.

Figure 3 Overall Demo Program Structure

using System;
namespace LinearDiscriminate
  class LinearDiscriminateProgram
    static void Main(string[] args)
      Console.WriteLine("Begin LDA demo");
      // All program control statements here
      Console.WriteLine("End LDA demo");
    static int Prediction(double[][] data,
      double[] x, double[][] w) { . . }
    static int Prediction(double[][] data,
      double[][] x, double[][] w) { . . }
    static double[][] Mean(double[][] data,
      int c) { . . }
    static double[][] Discriminate(double[][] data,
      bool unitize) { . . }
    static double[][] ScatterWithin(double[][] data) { . . }
    static double[][] Scatter(double[][] data, int c) { . . }
    static double[][] Unitize(double[][] vector) { . . }
    // Matrix helper methods here
} // ns

The demo program is too long to present in its entirety here, but you can find the complete source code in the download that accompanies this article. I used a static method approach rather than an object-oriented programming approach.

The Main method begins by setting up the hardcoded eight-item training data set into an array-of-arrays-style matrix:

double[][] data = new double[8][];
data[0] = new double[] { 1.0, 4.0, 0 };
data[1] = new double[] { 2.0, 2.0, 0 };
// Etc. for [2] through [6]
data[7] = new double[] { 9.0, 8.0, 1 };
ShowMatrix(data, 2, true);

In a non-demo scenario, you’d probably read data from a text file into memory using a method named something like LoadData. Calculating the discriminate is performed like so:

double[][] w = Discriminate(data, true);
Console.WriteLine("Discriminate is: ");
ShowMatrix(w, 6, true);

Notice that the return value of the Discriminate method is an array-of-arrays matrix rather than an array. Most of the operations used in LDA are matrix operations, such as matrix multiplication and matrix inversion. Here, instead of an array with two cells, w is a matrix with two rows and one column. Such one-column matrices are sometimes called column vectors. Unless you work with matrices often, it can take a while to get accustomed to working with them instead of arrays.

In Figure 1, you can see that several intermediate objects are calculated and displayed. The display statements are inside method Discriminate and its helper methods and are intended to help you understand LDA. You’d probably remove the display statements in production code.

The statements setting up the item to predict are:

double[] x = new double[] { 4.0, 5.0 };
Console.WriteLine("Predicting class for Age Income =");
ShowVector(x, 1);

Here, for calling convenience, the item to predict is housed in an ordinary numeric array, even though it will have to be converted to a one-column matrix later. The prediction statements are:

int c = Prediction(data, x, w);
Console.Write("Predicted class: " + c);
if (c == 0)
  Console.WriteLine(" (liberal)");
  Console.WriteLine(" (conservative)");

The prediction method accepts the data matrix in order to compute the mean-of-means as shown in Figure 2. The method also requires the item to predict, x, and the discriminate vector, w.

Calculating the Discriminate

Method Discriminate calculates the LDA discriminate vector. The code, with WriteLine and display statements removed, is surprisingly short because most of the work is done by helper methods:

static double[][] Discriminate(double[][] data, bool unitize)
  double[][] mean0 = Mean(data, 0);
  double[][] mean1 = Mean(data, 1);
  double[][] Sw = ScatterWithin(data);
  double[][] SwInv = MatrixInverse(Sw);
  double[][] diff = MatrixSubtract(mean0, mean1);
  double[][] w = MatrixProduct(SwInv, diff);
  if (unitize == true) return Unitize(w);
  else return w;

The math behind the LDA discriminate is fairly complex, but the results are rather simple. Helper method Mean computes the mean of a given class (0 or 1). For example, the three data items that represent liberal (class 0) are (1, 4), (2, 2), and (3, 3). Their mean is ( (1+2+3) / 3, (4+2+3)/3 ) = (2.0, 3.0).

The scatter-within matrix is a measure of how variable the data set is. With the two class means, mean0 and mean1, and the scatter-­within matrix Sw in hand, the discriminate is the matrix product of the inverse of Sw and the matrix difference of mean0 and mean1.

You can find several resources on the Internet that explain the rather complex math that derives the equation for the discriminate. Be aware that there are many different versions of the derivation for w. In particular, you may see references to covariance and to scatter-between (Sb). Covariance is a mathematical entity that’s equivalent to scatter-within. Scatter-between is a mathematical entity that’s used in the derivation of the equation for the LDA discriminate, but scatter-between is not explicitly used to calculate the discriminate or to make a prediction.

Method Discriminate has a Boolean parameter named unitize that indicates whether or not to scale the discriminate to unit length, that is, a length equal to 1.0. In most situations you want to pass true as the corresponding argument.

Making a Prediction

The demo program has two overloaded Prediction methods. The first accepts the item to predict, x, as a normal numeric array:

static int Prediction(double[][] data, double[] x, double[][] w)
  double[][] xm = MatrixFromVector(x);
  return Prediction(data, xm, w);

This version of Prediction is for calling convenience and is just a wrapper around the version of Prediction that does the work:

static int Prediction(double[][] data, double[][] x, double[][] w)
  int dim = data[0].Length - 1; // at least 2
  double[][] wT = MatrixTranspose(w); // 1 x d
  double[][] m0 = Mean(data, 0); // d x 1
  double[][] m1 = Mean(data, 1); // d x 1
  double[][] m = MatrixAdd(m0, m1); // d x 1
  m = MatrixProduct(m, 0.5); // d x 1 
  double[][] tc = MatrixProduct(wT, m); // ((1xd)(dx1) = 1 x 1
  double[][] wTx = MatrixProduct(wT, x); // (1xd)(dx1) = 1 x 1
  if (wTx[0][0] > tc[0][0]) return 0; else return 1;

Method Prediction computes the transpose of w so it can be used in matrix multiplication. The means of the two classes are calculated, then the average of those two means is calculated by multiplying each component value by 0.5 and then stored in matrix m.

Matrix tc is the threshold constant and is the product of the transpose of the discriminate vector, wT, and the average of the class means, m. Matrix tc will always be a 1x1 matrix, holding a single value. The value in matrix tc represents the projection of the mean onto the discriminate vector.

The projection of the item to predict, x, onto the discriminate vector is computed similarly, as the matrix product of the transpose of the discriminate vector and x. Unfortunately, because the projection of the discriminate can be positive or negative, the Boolean comparison operator to use, less-than or greater-than, will vary from problem to problem. The simplest approach is to try greater-than and see if it gives results that make sense. You can programmatically determine which operator to use, but that approach adds a lot more code than you might expect.

An option when making a prediction is to adjust the projection of the average of the class means by taking into account the probabilities of each class. This adjustment factor is log(p0 / p1), where p0 is the probability of class 0 and p1 is the probability of class 1. For the demo data, p0 = 3/8 = 0.375 and p1 = 5/8 = 0.625, so the adjustment factor is log(0.375 / 0.625) = log(0.6) = -0.22. Notice that if the two probabilities are assumed to be equal, then p0 = p1 and the adjustment factor would be log(1) = 0.

Comments and Limitations

There are actually several different variations of LDA. The one presented in this article is usually called Fisher’s LDA. LDA can also be used for feature selection (identifying which predictor variables are most useful so that less-useful predictors can be ignored), as well as classification. And note there’s an entirely different statistical LDA (latent Dirichlet allocation) used in natural language processing.

Although LDA binary classification is mathematically elegant, it has several critical limitations. Behind the scenes, different versions of LDA make various assumptions, for example that the predictor variables are normally distributed and have similar covariances. In many situations, these assumptions are not true. Even so, in practice, LDA often works quite well even when the math assumptions aren’t met. Another math limitation is that LDA must compute the inverse of the scatter-within matrix. Matrix inversion is a very tricky process and can easily fail.

Perhaps the biggest limitation of LDA binary classification is that it assumes the two classes are linearly separable. Loosely speaking, this means that when graphed as in Figure 2, it’s possible to find a straight line that separates the two classes. The data for many problems just isn’t linearly separable.

In my opinion, LDA binary classification is beautiful, but there are alternative classification algorithms, notably logistic regression classification and neural network classification, which are more practical. But LDA is sure interesting.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following technical experts at Microsoft Research for reviewing this article: Madison Minsk and Amy Shan