March 2018

Volume 33 Number 3

[Test Run]

Neural Binary Classification Using CNTK

By James McCaffrey

The goal of a binary classification problem is to make a prediction where the value to predict can take one of just two possible values. For example, you might want to predict if a hospital patient has heart disease or not, based on predictor variables such as age, blood pressure, sex and so on. There are many techniques that can be used to tackle a binary classification problem. In this article I’ll explain how to use the Microsoft Cognitive Toolkit (CNTK) library to create a neural network binary classification model.

Take a look at Figure 1 to see where this article is headed. The demo program creates a prediction model for the Cleveland Heart Disease dataset. The dataset has 297 items. Each item has 13 predictor variables: age, sex, pain type, blood pressure, cholesterol, blood sugar, ECG, heart rate, angina, ST depression, ST slope, number of vessels and thallium. The value to predict is the presence or absence of heart disease.

Figure 1 Binary Classification Using a CNTK Neural Network

Behind the scenes, the raw data was normalized and encoded, resulting in 18 predictor variables. The demo creates a neural network with 18 input nodes, 20 hidden processing nodes and two output nodes. The neural network model is trained using stochastic gradient descent with a learning rate set to 0.005 and a mini-batch size of 10.

During training, the average loss/error and the average classification accuracy on the current batch of 10 items are displayed every 500 iterations. You can see that, in general, loss/error gradually decreased and accuracy increased over the 5,000 iterations. After training, the classification accuracy of the model on all 297 data items was computed to be 84.18% (250 correct, 47 incorrect).

This article assumes you have intermediate or better programming skill, but doesn’t assume you know much about CNTK or neural networks. The demo is coded using Python, but even if you don’t know Python, you should be able to follow along without too much difficulty. The code for the demo program is presented in its entirety in this article. The data file used is available in the accompanying download.

Understanding the Data

There are several versions of the Cleveland Heart Disease dataset at bit.ly/2EL9Leo. The demo uses the processed version, which has 13 of the original 76 predictor variables. The raw data has 303 items and looks like:

[001] 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
[002] 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
[003] 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
...
[302] 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
[303] 38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0

The first 13 values in each line are predictors. The last item in each line is a value between 0 and 4, where 0 means absence of heart disease and 1, 2, 3 or 4 means presence of heart disease. In most machine learning scenarios, the most time-consuming part is preparing your data. Because there are more than two predictor variables, it's not possible to fully graph the raw data. But you can get a rough idea of the problem by looking at just age and blood pressure, as shown in Figure 2.

Figure 2 Cleveland Heart Disease Partial Raw Data

The first step is to deal with missing data—notice the “?” in item [303]. Because there are only six items with missing values, those six items were just tossed out, leaving 297 items.

The next step is to normalize the numeric predictor values, such as age in the first column. The demo used min-max normalization where the value in a column is replaced by (value - min) / (max - min). For example, the minimum age value is 29 and the maximum is 77, so the first age value, 63, is normalized to (63 - 29) / (77 - 29) = 34 / 48 = 0.70833.
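A minimal sketch of min-max normalization using NumPy (the array values are hypothetical; the demo's normalization was done as a separate preprocessing step, not in the demo program):

import numpy as np
# sketch: min-max normalize the age column
ages = np.array([63.0, 67.0, 37.0, 41.0], dtype=np.float32)
mn = 29.0  # minimum age over the entire dataset
mx = 77.0  # maximum age over the entire dataset
norm_ages = (ages - mn) / (mx - mn)  # first value = 0.70833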

The next step is to encode the categorical predictor values, such as sex (0 = female, 1 = male) in the second column and pain type (1, 2, 3, 4) in the third column. The demo used 1-of-(N-1) encoding so sex is encoded as female = -1, male = +1. Pain type is encoded as 1 = (1, 0, 0), 2 = (0, 1, 0), 3 = (0, 0, 1), 4 = (-1, -1, -1).

The last step is to encode the value to predict. When using a neural network for binary classification, you can encode the value to predict using just one node with a value of 0 or 1, or you can use two nodes with values of (0, 1) or (1, 0). For a reason I’ll explain shortly, when using CNTK, it’s much better to use the two-node technique. So, 0 (no heart disease) was encoded as (0, 1) and values 1 through 4 (heart disease) were encoded as (1, 0).
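A minimal sketch of these encodings, using hypothetical Python mapping structures (like normalization, the demo's encoding was performed as a preprocessing step):

# sketch: encode sex, pain type and the value to predict
sex_map = { 0.0 : [-1], 1.0 : [+1] }     # 1-of-(N-1): female = -1, male = +1
pain_map = { 1.0 : [1, 0, 0], 2.0 : [0, 1, 0],
             3.0 : [0, 0, 1], 4.0 : [-1, -1, -1] }
def encode_disease(raw):                 # two-node technique
  return [0, 1] if raw == 0 else [1, 0]  # 0 = absence, 1-4 = presence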

The final normalized and encoded data was tab-delimited and looks like:

|symptoms  0.70833  1  1  0  0  0.48113 ... |disease  0  1
|symptoms  0.79167  1 -1 -1 -1  0.62264 ... |disease  1  0
...

Tags “|symptoms” and “|disease” were inserted so the data could be easily read by a CNTK data reader object.
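If you prepare the data programmatically, one line of the tagged file could be produced with code along these lines (the make_ctf_line helper is hypothetical, not part of the demo):

# hypothetical helper: format one data item for a CNTK data reader
def make_ctf_line(symptoms, disease):
  feats = " ".join("%0.5f" % v for v in symptoms)  # 18 predictor values
  label = " ".join(str(v) for v in disease)        # (0, 1) or (1, 0)
  return "|symptoms  " + feats + " |disease  " + label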

The Demo Program

The complete demo program, with a few minor edits to save space, is presented in Figure 3. All normal error checking has been removed. I indent with two space characters instead of the usual four as a matter of personal preference and to save space. The “\” character is used by Python for line continuation.

Figure 3 Demo Program Structure

# cleveland_bnn.py
# CNTK 2.3 with Anaconda 4.1.1 (Python 3.5, NumPy 1.11.1)
import numpy as np
import cntk as C
def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
  x_strm = C.io.StreamDef(field='symptoms', shape=input_dim,
   is_sparse=False)
  y_strm = C.io.StreamDef(field='disease', shape=output_dim,
    is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, \
    max_sweeps=sweeps)
  return mb_src
# ===================================================================
def main():
  print("\nBegin binary classification (two-node technique) \n")
  print("Using CNTK version = " + str(C.__version__) + "\n")
  input_dim = 18
  hidden_dim = 20
  output_dim = 2
  train_file = ".\\Data\\cleveland_cntk_twonode.txt"
  # 1. create network
  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim, np.float32)
  print("Creating a 18-20-2 tanh-softmax NN ")
  with C.layers.default_options(init=C.initializer.uniform(scale=0.01,\
    seed=1)):
    hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
      name='hidLayer')(X) 
    oLayer = C.layers.Dense(output_dim, activation=None,
     name='outLayer')(hLayer)
  nnet = oLayer
  model = C.ops.softmax(nnet)
  # 2. create learner and trainer
  print("Creating a cross entropy batch=10 SGD LR=0.005 Trainer ")
  tr_loss = C.cross_entropy_with_softmax(nnet, Y)
  tr_clas = C.classification_error(nnet, Y)
  max_iter = 5000
  batch_size = 10
  learn_rate = 0.005
  learner = C.sgd(nnet.parameters, learn_rate)
  trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])
  # 3. create reader for train data
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  heart_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }
  # 4. train
  print("\nStarting training \n")
  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, \
      input_map=heart_input_map)
    trainer.train_minibatch(curr_batch)
    if i % int(max_iter/10) == 0:
      mcee = trainer.previous_minibatch_loss_average
      macc = (1.0 - trainer.previous_minibatch_evaluation_average) \
        * 100
      print("batch %4d: mean loss = %0.4f, accuracy = %0.2f%% " \
        % (i, mcee, macc))
  print("\nTraining complete")
  # 5. evaluate model using all data
  print("\nEvaluating accuracy using built-in test_minibatch() \n")
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=False, sweeps=1)
  heart_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }
  num_test = 297
  all_test = rdr.next_minibatch(num_test, input_map=heart_input_map)
  acc = (1.0 - trainer.test_minibatch(all_test)) * 100
  print("Classification accuracy on the %d data items = %0.2f%%" \
    % (num_test,acc))
  # (could save model here)
  # (use trained model to make prediction)
  print("\nEnd Cleveland Heart Disease classification ")
# ===================================================================
if __name__ == "__main__":
  main()

The cleveland_bnn.py demo has one helper function, create_reader. All control logic is in a single main function. Because CNTK is young and under vigorous development, it’s a good idea to add a comment detailing which version is being used (2.3 in this case).

Installing CNTK can be a bit tricky. First, you install the Anaconda distribution of Python, which contains the required Python interpreter, required packages such as NumPy and SciPy, and useful utilities such as pip. I used Anaconda3 4.1.1 64-bit, which includes Python 3.5. After installing Anaconda, you install CNTK as a Python package, not as a standalone system, using the pip utility. From an ordinary shell, the command I used was:

>pip install https://cntk.ai/PythonWheel/CPU-Only/cntk-2.3-cp35-cp35m-win_amd64.whl

Almost all CNTK installation failures I’ve seen have been due to Anaconda-CNTK version incompatibilities.

The demo begins by preparing to create the neural network:

input_dim = 18
hidden_dim = 20
output_dim = 2
train_file = ".\\Data\\cleveland_cntk_twonode.txt"
X = C.ops.input_variable(input_dim, np.float32)
Y = C.ops.input_variable(output_dim, np.float32)

The number of input and output nodes is determined by your data, but the number of hidden processing nodes is a free parameter and must be determined by trial and error. Using 32-bit variables is typical for neural networks because the precision gained by using 64 bits isn’t worth the performance penalty incurred.

The network is created like so:

with C.layers.default_options(init=C.initializer.uniform(scale=0.01,\
  seed=1)):
  hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
    name='hidLayer')(X) 
  oLayer = C.layers.Dense(output_dim, activation=None,
   name='outLayer')(hLayer)
nnet = oLayer
model = C.ops.softmax(nnet)

The Python with statement is a syntactic shortcut to apply a set of common arguments to multiple functions. The demo uses tanh activation on the hidden layer nodes; a common alternative is the sigmoid function. Notice that there’s no activation applied to the output nodes. This is a quirk of CNTK because the CNTK training function expects raw, un-activated values. The nnet object is just a convenience alias. The model object has softmax activation so it can be used after training to make predictions. Because Python assigns by reference, training the nnet object also trains the model object.

Training the Neural Network

The neural network is prepared for training with:

tr_loss = C.cross_entropy_with_softmax(nnet, Y)
tr_clas = C.classification_error(nnet, Y)
max_iter = 5000
batch_size = 10
learn_rate = 0.005
learner = C.sgd(nnet.parameters, learn_rate)
trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])

The tr_loss (“training loss”) object tells CNTK how to measure error when training. An alternative to cross entropy with softmax is squared error. The tr_clas (“training classification error”) object can be used to automatically compute the percentage of incorrect predictions during or after training.
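If you wanted to experiment with squared error, a minimal sketch follows; because squared error compares outputs directly against the 0/1-encoded labels, it's applied to the softmax model object, and the Trainer would be built from model rather than nnet:

# sketch: squared error as an alternative training loss
tr_loss = C.squared_error(model, Y)         # model includes softmax
tr_clas = C.classification_error(model, Y)  # argmax unchanged by softmax
trainer = C.Trainer(model, (tr_loss, tr_clas), [learner])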

The values for the maximum number of training iterations, the number of items in a batch to train at a time, and the learning rate are all free parameters that must be determined by trial and error. You can think of the learner object as an algorithm, and the trainer object as the object that uses the learner to find good values for the neural network’s weights and biases.

A reader object is created with these statements:

rdr = create_reader(train_file, input_dim, output_dim,
  rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
heart_input_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}

If you examine the create_reader definition in Figure 3, you’ll see that it specifies the tag names (“symptoms” and “disease”) used in the data file. You can consider create_reader and the code to create a reader object as boilerplate code for neural binary classification problems. All you have to change are the tag names and the name of the mapping dictionary (heart_input_map).

After everything is prepared, training is performed like so:

for i in range(0, max_iter):
  curr_batch = rdr.next_minibatch(batch_size, \
    input_map=heart_input_map)
  trainer.train_minibatch(curr_batch)
  if i % int(max_iter/10) == 0:
    mcee = trainer.previous_minibatch_loss_average
    macc = (1.0 - trainer.previous_minibatch_evaluation_average) \
      * 100
    print("batch %4d: mean loss = %0.4f, accuracy = %0.2f%% " \
      % (i, mcee, macc))

An alternative to training with a fixed number of iterations is to stop training when loss/error drops below a threshold. It’s important to display loss/error during training because training failure is the rule rather than the exception. Cross-entropy error is a bit difficult to interpret directly, but you want to see values that tend to get smaller. Instead of displaying average classification loss/error, the demo computes and prints the average classification accuracy, which is a more natural metric in my opinion.
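A sketch of that alternative, using a hypothetical loss threshold, could replace the fixed-count loop (all the objects here come from the demo program):

# sketch: stop training when the mini-batch loss drops below a threshold
loss_threshold = 0.25  # hypothetical value, found by trial and error
for i in range(0, max_iter):
  curr_batch = rdr.next_minibatch(batch_size, \
    input_map=heart_input_map)
  trainer.train_minibatch(curr_batch)
  if trainer.previous_minibatch_loss_average < loss_threshold:
    print("Loss below threshold at iteration %d " % i)
    break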

Evaluating and Using the Model

After a network has been trained, you’ll usually want to determine the loss/error and classification accuracy for the entire dataset that was used for training. The demo evaluates overall classification accuracy with:

rdr = create_reader(train_file, input_dim, output_dim,
  rnd_order=False, sweeps=1)
heart_input_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}
num_test = 297
all_test = rdr.next_minibatch(num_test, input_map=heart_input_map)
acc = (1.0 - trainer.test_minibatch(all_test)) * 100
print("Classification accuracy on the %d data items = %0.2f%%" \
  % (num_test,acc))

A new data reader is created. Notice that unlike the reader used for training, the new reader doesn’t traverse the data in random order, and that the number of sweeps is set to 1. The heart_input_map dictionary object is recreated. A common mistake is to try to use the original object, but the rdr object has changed, so you need to recreate the mapping. The test_minibatch function returns the average classification error for its mini-batch argument, which in this case is the entire dataset.

The demo program doesn’t compute the loss/error for the entire dataset. You can use the previous_minibatch_loss_average property, but you have to be careful not to perform an additional training iteration, which would change the network.
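One hedged workaround, assuming the normalized features and encoded labels are also available as NumPy arrays (named all_x and all_y here purely for illustration), is to evaluate the loss function directly:

# sketch: average loss over all data without performing a training step
# all_x (297x18) and all_y (297x2) are hypothetical NumPy arrays
per_item_loss = tr_loss.eval({X : all_x, Y : all_y})
print("Mean loss on all items = %0.4f " % np.mean(per_item_loss))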

After training, or during training, you’ll usually want to save the model. In CNTK, saving looks like:

mdl_name = ".\\Models\\cleveland_bnn.model"
model.save(mdl_name)

This saves using the default CNTK v2 format. An alternative is to use the Open Neural Network Exchange (ONNX) format. Notice that you’ll generally want to save the model object (with softmax activation) rather than the nnet object.
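For example, a sketch of an ONNX save, assuming your CNTK version supports ONNX export:

# sketch: save in ONNX format instead of the default CNTK v2 format
mdl_name = ".\\Models\\cleveland_bnn.onnx"
model.save(mdl_name, format=C.ModelFormat.ONNX)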

From a different program, a saved model could be loaded into memory along the lines of:

mdl_name = ".\\Models\\cleveland_bnn.model"
model = C.ops.functions.Function.load(mdl_name)

After loading, the model can be used as if it had just been trained.

The demo program doesn’t use the trained model to make a prediction. You can write code like this:

unknown = np.array([0.5555, -1, ... ], dtype=np.float32)
predicted = model.eval(unknown)

The result returned into variable predicted would be a 1x2 matrix with values that sum to 1.0, for example [[0.2500, 0.7500]]. Because the second value is larger, the result would map to (0, 1), which in turn would map to “no disease.”
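A short follow-up sketch that maps the raw output to a friendly string (the label strings are mine, not part of the demo):

# sketch: interpret the two output values
labels = ["disease", "no disease"]  # index 0 = (1, 0), index 1 = (0, 1)
idx = np.argmax(predicted[0])
print("Predicted: " + labels[idx])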

Wrapping Up

Most deep learning code libraries perform neural network binary classification using the one-node technique. Using this approach, the values of the variable to predict are encoded as either 0 or 1. The output dimension would be set to 1 instead of 2. You’d have to use binary cross-entropy error instead of ordinary cross-entropy error. CNTK doesn’t have a built-in classification error function that works with one node, so you’d have to implement your own function from scratch. When training, less information is typically gained on each iteration (although training is a bit faster), so you’ll usually have to train a one-node model for more iterations than when using the two-node technique. For these reasons, I prefer using the two-node technique for neural binary classification.
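For completeness, here’s a hedged sketch of how the one-node technique could be set up in CNTK; the class_error_one_node helper is my own, not a library function:

# sketch of the one-node technique; labels are a single 0 or 1 value
Y1 = C.ops.input_variable(1, np.float32)
oLayer1 = C.layers.Dense(1, activation=C.ops.sigmoid,
  name='outLayer')(hLayer)  # sigmoid keeps the output in (0, 1)
tr_loss1 = C.binary_cross_entropy(oLayer1, Y1)
def class_error_one_node(mdl, y):
  # counts a prediction as wrong when the rounded output != label
  return C.ops.not_equal(C.ops.round(mdl), y)
tr_clas1 = class_error_one_node(oLayer1, Y1)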


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd, Ken Tran

