2015-07-01

July 2014

Volume 29 Number 7

Test Run : Distorting the MNIST Image Data Set

James McCaffrey The mixed National Institute of Standards and Technology (MNIST) data set is a collection of 70,000 small images of handwritten digits. The data was created to act as a benchmark for image recognition algorithms. Even though MNIST images are small (28 x 28 pixels), and there are only 10 possible digits (zero through nine) to recognize, and there are 60,000 training images for creating an image recognition model (with 10,000 images held out to test the accuracy of a model), experience has shown that recognizing the MNIST images is a difficult problem.

One way to deal with difficult pattern classification problems, including image recognition, is to use more training data. And a clever way to programmatically generate more training data is to distort each original image. Take a look at the demo program in Figure 1. The figure shows the digit 4 in original MNIST form on the left, and the digit after distortion using elastic deformation on the right. The parameters in the upper-right corner of the demo application indicate the distortion depends on values for a displacement randomization seed, a kernel size and standard deviation, and an intensity.

Figure 1 A Distorted MNIST Image

It’s unlikely you’ll need to distort images in most work environments, but you might find the information in this article useful for three reasons. First, understanding exactly how image distortion works by seeing actual code will help you understand many image-recognition articles. Second, several of the programming techniques used in image distortion can be useful in other, more common programming scenarios. And third, you might just find image distortion an interesting topic for its own sake.

This article assumes you have advanced-level programming skills but doesn’t assume you know anything about image distortion. The demo program is coded in C# and makes extensive use of the Microsoft .NET Framework so refactoring to a non-.NET language would be challenging. The demo has most normal error checking removed to keep the size of the code small and the main ideas clear. Because the demo is a Windows Form application created by Visual Studio, much of the code is related to the UI and is spread over several files. However, I’ve refactored the demo code to a single C# source code file, which is available at msdn.microsoft.com/magazine/msdnmag0714. You can find the MNIST data in several places on the Internet. The main repository is at yann.lecun.com/exdb/mnist.

Overall Program Structure

To create the demo program I launched Visual Studio and created a Windows Form application named MnistDistort. The UI has eight TextBox controls for the paths to the unzipped MNIST data files (one pixel file, one label file); the indices of the currently displayed and next image; and a seed value, kernel size, kernel standard deviation and intensity value for the distortion process. A DropDown control holds values to magnify the image view. There are three Button controls to load the 60,000 MNIST images and labels into memory, display an image, and distort the displayed image. There are two PictureBox controls to display regular and distorted images. Finally, a ListBox control is used to display progress and logging messages.

At the top of the source code I removed references to unneeded namespaces and added a reference to the System.IO namespace to be able to read the MNIST data files.

I added a class-scope array named trainImages, which holds references to program-defined DigitImage objects, and variables to hold the locations of the two MNIST data files:

DigitImage[] trainImages = null;
string pixelFile = @"C:\MnistDistort\train-images.idx3-ubyte"; // Edit
string labelFile = @"C:\MnistDistort\train-labels.idx1-ubyte"; // Edit

In the Form constructor I added these six lines of code:

textBox1.Text = pixelFile;
textBox2.Text = labelFile;
comboBox1.SelectedItem = "6"; // Magnification
textBox3.Text = "NA"; // Curr index
textBox4.Text = "0"; // Next index
this.ActiveControl = button1;

The Button1 click event handler loads the 60,000 images into memory:

string pixelFile = textBox1.Text;
string labelFile = textBox2.Text;
trainImages = LoadData(pixelFile, labelFile);
listBox1.Items.Add("MNIST training images loaded into memory");

The Button2 click event handler displays the next image and updates the UI controls:

int nextIndex = int.Parse(textBox4.Text);
DigitImage currImage = trainImages[nextIndex];
int mag = int.Parse(comboBox1.SelectedItem.ToString());
Bitmap bitMap = MakeBitmap(currImage, mag);
pictureBox1.Image = bitMap;
textBox3.Text = textBox4.Text; // Update curr index
textBox4.Text = (nextIndex + 1).ToString(); // next index
textBox8.Text = textBox3.Text;  // Random seed
listBox1.Items.Add("Curr image index = " + textBox3.Text +
  " label = " + currImage.label);

You’ll find the methods LoadData and MakeBitmap in the accompanying code download. Most of the distortion work is performed by the methods called by the Button3 click event handler, which is presented in Figure 2.

Figure 2 Image Distortion Calling Code

private void button3_Click(object sender, EventArgs e)
{
  int currIndex = int.Parse(textBox3.Text);
  int mag = int.Parse(comboBox1.SelectedItem.ToString());
  int kDim = int.Parse(textBox5.Text); // Kernel dimension
  double kStdDev = double.Parse(textBox6.Text); // Kernel std dev
  double intensity = double.Parse(textBox7.Text);
  int rndSeed = int.Parse(textBox8.Text);  // Randomization
  DigitImage currImage = trainImages[currIndex];
  DigitImage distorted = Distort(currImage, kDim, kStdDev,
    intensity, rndSeed);
  Bitmap bitMapDist = MakeBitmap(distorted, mag);
  pictureBox2.Image = bitMapDist;
}

Method Distort calls methods MakeKernel (to create a smoothing matrix), MakeDisplace (directions and distances to deform each pixel in an image) and Displace (to actually deform the source image). Helper method MakeDisplace calls sub-helper ApplyKernel to smooth displacement values. Sub-helper method ApplyKernel calls sub-sub helper method Pad.

Elastic Deformation

The basic idea of distorting an image using elastic deformation is fairly simple. In the case of an MNIST image, you want to move each existing pixel slightly. But the details aren’t so simple. A naive approach that moves each pixel independently leads to new images that appear broken up rather than stretched. For example, consider the conceptual images in Figure 3. The two images represent the distortion of a 5 x 5 section of an image. Each arrow indicates the direction and distance of a move by a corresponding pixel. The left image shows more or less random vectors, which would break up rather than distort an image. The right image shows vectors that are related to each other, leading to a stretched image.

Figure 3 Random vs. Smoothed Displacement Fields

So the trick is to displace each pixel in such a way that pixels close to each other move in a relatively similar, but not exactly the same, direction and distance. This can be accomplished using a matrix called a continuous Gaussian kernel. The overall idea is probably best explained using code. Consider this method in the demo:

private DigitImage Distort(DigitImage dImage, int kDim,
  double kStdDev, double intensity, int seed)
{
  double[][] kernel = MakeKernel(kDim, kStdDev);
  double[][] xField = MakeDisplace(dImage.width, dImage.height,
   seed, kernel, intensity);
  double[][] yField = MakeDisplace(dImage.width, dImage.height,
    seed + 1, kernel, intensity);
  byte[][] newPixels = Displace(dImage.pixels, xField, yField);
  return new DigitImage(dImage.width, dImage.height,
    newPixels, dImage.label);
}

Method Distort accepts a DigitImage object and four numeric parameters related to a kernel. Type DigitImage is a program-defined class that represents the 28 x 28 bytes that make up the pixels of an MNIST image. The method first creates a kernel where kDim is the size of the kernel and kStdDev is a value that influences how similar the displacement of pixels will be.

To displace a pixel, it’s necessary to know how far to move in the left-right direction, and in the up-down direction. This information is stored in arrays xField and yField, respectively, and is computed using helper method MakeDisplace. Helper method Displace accepts the pixel values of a DigitImage image and uses the displacement fields to generate new pixel values. The new pixel values are then fed to a DigitImage constructor, yielding a new, distorted image. To summarize, to distort an image, you create a kernel. The kernel is used to generate x and y direction fields that are related rather than independent. The direction fields are applied to a source image to produce a distorted version of that image.

Gaussian Kernels

A continuous Gaussian kernel is a matrix of values that sum to 1.0; has the largest value in the center; and is radially symmetric. Here is a 5 x 5 Gaussian kernel with a standard deviation of 1.0:

0.0030   0.0133   0.0219   0.0133   0.0030
0.0133   0.0596   0.0983   0.0596   0.0133
0.0219   0.0983   0.1621   0.0983   0.0219
0.0133   0.0596   0.0983   0.0596   0.0133
0.0030   0.0133   0.0219   0.0133   0.0030

Notice that values near each other are different but similar. The standard deviation value determines how close the kernel values are. A larger standard deviation gives values that are closer together. For example, using a standard deviation of 1.5 gives a 5 x 5 kernel with first row values of:

0.0232   0.0338   0.0383   0.0338   0.0232

This may seem odd at first because standard deviation is a measure of data spread and larger standard deviation values for a data set indicate a larger spread. But in the context of a Gaussian kernel, the standard deviation is used to generate the values and is not a measure of spread in the resulting kernel. The method used by the demo program to generate a Gaussian kernel is presented in Figure 4.

Figure 4 Method MakeKernel

private static double[][] MakeKernel(int dim, double sd)
{
  if (dim % 2 == 0)
    throw new Exception("kernel dim must be odd");
  double[][] result = new double[dim][];
  for (int i = 0; i < dim; ++i)
    result[i] = new double[dim];
  int center = dim / 2; // Note truncation
  double coef = 1.0 / (2.0 * Math.PI * (sd * sd));
  double denom = 2.0 * (sd * sd);
  double sum = 0.0; // For more accurate normalization
  for (int i = 0; i < dim; ++i) {
    for (int j = 0; j < dim; ++j) {
      int x = Math.Abs(center - j);
      int y = Math.Abs(center - i);
      double num = -1.0 * ((x * x) + (y * y));
      double z = coef * Math.Exp(num / denom);
      result[i][j] = z;
      sum += z;
    }
  }
  for (int i = 0; i < dim; ++i)
    for (int j = 0; j < dim; ++j)
      result[i][j] = result[i][j] / sum;
  return result;
}

Generating Gaussian kernels can be a somewhat confusing task because there are many algorithm variations depending on how the kernel is intended to be used, and several variations of approximation techniques. The basic math definition for the values in a two-dimensional continuous Gaussian kernel is:

z = (1.0 / (2.0 * pi^2)) * exp((-(x^2 + y^2)) / (2 * sd^2))

Here x and y are the x and y coordinates of a cell in the kernel relative to the center cell; pi is the math constant; exp is the exponential function; and sd is the specified standard deviation. The leading coefficient term of 1.0 / (2.0 * Pi^2) is actually a normalizing term for the one-dimensional version of a Gaussian function. But for 2D kernels, you want to sum all kernel preliminary values and then divide each preliminary value by the sum so that all final values will add to 1.0 (subject to rounding error). In Figure 4, this final normalization is accomplished using the variable named sum. Therefore, the variable named coef is redundant and can be omitted from the code; variable coef was included here because most research papers describe kernels using the coefficient term.

Displacement Fields

To distort an image, each pixel must be moved (virtually, not literally) a certain distance to the left or right, and up or down. Method MakeDisplace is defined in Figure 5. Method MakeDisplace returns an array-of-arrays-style matrix that corresponds to one-half of the conceptual matrix in Figure 3. That is, the values in the cells of the return matrix correspond to a direction and magnitude of a pixel move in either the x-direction or the y-direction. Because the size of an MNIST image is 28 x 28 pixels, the return matrix of MakeDisplace will also be 28 x 28.

Figure 5 Making a Displacement Field

private static double[][] MakeDisplace(int width, int height, int seed,
  double[][] kernel, double intensity)
{
  double[][] dField = new double[height][];
  for (int i = 0; i < dField.Length; ++i)
    dField[i] = new double[width];
  Random rnd = new Random(seed);
  for (int i = 0; i < dField.Length; ++i)
    for (int j = 0; j < dField[i].Length; ++j)
      dField[i][j] = 2.0 * rnd.NextDouble() - 1.0;
  dField = ApplyKernel(dField, kernel); // Smooth
  for (int i = 0; i < dField.Length; ++i)
    for (int j = 0; j < dField[i].Length; ++j)
      dField[i][j] *= intensity;
  return dField;
}

Method MakeDisplace generates a matrix with initial random values between -1 and +1. Helper ApplyKernel smooths the random values as suggested by Figure 3. The smoothed values are essentially direction components with distance between 0 and 1. Then all the values are multiplied by an intensity parameter to increase the stretch distance.

Applying a Kernel and a Displacement

Applying a kernel to a matrix of displacements and then using the resulting smoothed displacements to generate new pixel values is rather tricky. The first part of the process is illustrated in Figure 6. The left part of the figure represents the preliminary random displacement values between -1 and +1 in the x direction for an 8 x 8 image. The value at row [3], column [6] (0.40) is being smoothed using a 3 x 3 kernel. The new displacement is a weighted average of the current value and the values of the eight closest neighbors. Because each new displacement value is in essence an average of its neighbors, the net effect is to generate values that are related to each other.

After smoothing, the displacement values are multiplied by an intensity factor (sometimes called alpha in research literature). For example, if the intensity factor is 20, then the final x-displacement in Figure 6 for the image pixel at (3, 6) would be 0.16 * 20 = +3.20. There would be a similar y-displacements matrix. Suppose the final value at (3, 6) in the y-displacements matrix was -1.50. The values +3.20 and -1.50 that correspond to the pixel at (3, 6) are now applied to the source image to produce a distorted image, but not in an entirely obvious way.

Figure 6 Applying a Kernel to a Displacement Matrix

First, lower and upper boundaries are determined. For the +3.20 x-displacement, these are 3 and 4. For the -1.50 y-displacement, they’re -2 and -1. The four boundaries generate four (x, y) displacement pairs: (3, -2), (3, -1), (4, -2), (4, -1). Recall these values correspond to the original image pixel value at indices (3, 6). Combining the pixel indices with the four displacement pairs generates four indices pairs: (6, 4), (6, 5), (7, 4), (7, 5). Finally, the pixel value for the distorted image at (3, 6) is the average of the original pixel values at indices (6, 4), (5, 3), (7, 4) and (7, 5).

Because of the geometry used, it’s most common to restrict kernels to an odd dimension such as 3, 5 and so on. Notice that there’d be a problem trying to smooth any preliminary displacement values near the edge of the displacements matrix because the kernel would extend, so to speak, beyond the edge of the matrix. There are several ways to deal with the edge issue. One way is to pad the preliminary displacement matrix with dummy rows and columns. The number of rows or columns to pad will be equal to one-half (using integer truncation) of the dimension of the kernel.

Wrapping Up

The image elastic deformation process described in this article is just one of many possible approaches. Most, but not all, of the distortion algorithm presented here was adapted from the research article, ”Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis,” which is available online at bit.ly/REzsnM.

The demo program generates deformed images in order to create additional training data for an image-recognition system. If you’re actually training an image-recognition system you can refactor the demo code to generate new training data on the fly, or you can refactor the code to generate and then save the distorted images as a text or binary file.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Reach him at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Wolf Kienzle (Microsoft Research)

Share via