January 2019

Volume 34 Number 1

[Machine Learning]

Introduction to PyTorch on Windows

By James McCaffrey

It’s possible, though quite difficult, to create neural networks from raw code. Luckily, there are many open source code libraries you can use to speed up the process. These libraries include CNTK (Microsoft), TensorFlow (Google) and scikit-learn. Most neural network libraries are written in C++ for performance but have a Python API for convenience.

In this article I’ll show you how to get started with the popular PyTorch library. Compared to many other neural network libraries, PyTorch operates at a lower level of abstraction. This gives you more control over your code and allows you to customize more easily, at the expense of having to write additional code.

The best way to see where this article is headed is to take a look at the demo program in Figure 1. The demo program reads the well-known Iris dataset into memory. The goal is to predict the species of an Iris flower (setosa, versicolor or virginica) from four predictor values: sepal length, sepal width, petal length and petal width. A sepal is a leaf-like structure.

Figure 1 The Iris Dataset Example Using PyTorch

The complete Iris dataset has 150 items. The demo program uses 120 items for training and 30 items for testing. The demo first creates a neural network using PyTorch, then trains the network using 600 iterations. After training, the model is evaluated using the test data. The trained model has an accuracy of 90.00 percent, which means the model correctly predicts the species of 27 of the 30 test items.

The demo concludes by predicting the species for a new, previously unseen Iris flower that has sepal and petal values (6.1, 3.1, 5.1, 1.1). The prediction probabilities are (0.0454, 0.6798, 0.2748), which maps to a prediction of versicolor.

This article assumes you have intermediate or better programming skills with a C-family language, but doesn’t assume you know anything about PyTorch. The complete demo code is presented in this article. The source code and the two data files used by the demo are also available in the download that accompanies this article. All normal error checking has been removed to keep the main ideas as clear as possible.

Installing PyTorch

Installing PyTorch involves two steps. First you install Python and several required auxiliary packages such as NumPy and SciPy, then you install PyTorch as an add-on package. Although it’s possible to install Python and the packages required to run PyTorch separately, it’s much better to install a Python distribution. I strongly recommend using the Anaconda distribution of Python, which has all the packages you need to run PyTorch, plus many other useful packages. In this article, I address installation on a Windows 10 machine. Installation on macOS and Linux systems is similar.

Coordinating compatible versions of Python, required auxiliary packages and PyTorch is a non-trivial challenge. Almost all the installation failures I’ve seen have been due to version incompatibilities. At the time I’m writing this article, I’m using Anaconda3 5.2.0 (which contains Python 3.6.5, NumPy 1.14.3 and SciPy 1.1.0) and PyTorch 0.4.1. These are all quite stable, but because PyTorch is relatively new and under continuous development, by the time you read this article there could be a newer version available.

Before starting, I recommend you uninstall any existing Python systems you have on your machine, using the Windows Control Panel | Programs and Features. I also suggest creating a C:\PyTorch directory to hold installation files and project (code and data) files.

To install the Anaconda distribution, go to repo.continuum.io/archive and look for file Anaconda3-5.2.0-Windows-x86_64.exe, which is a self-extracting executable. If you click on the link, you’ll get a dialog with buttons to Run or Save. You can click on the Run button.

The Anaconda installer is very user-friendly. You’ll be presented with a set of eight installation wizard screens. You can accept all defaults and just click the Next button on each screen, with one exception. When you reach the screen that asks you if you want to add Python to your system PATH environment variable, the default is unchecked (no). I recommend checking that option so you don’t have to manually edit your system PATH. The default settings will place the Python interpreter and 500+ compatible packages in the C:\Users\<user>\AppData\Local\Continuum\Anaconda3 directory.

To install the PyTorch library, go to pytorch.org and find the “Previous versions of PyTorch” link and click on it. Look for a file named torch-0.4.1-cp36-cp36m-win_amd64.whl. This is a Python “wheel” file. You can think of a .whl file as somewhat similar to a Windows .msi file. If you click on the link, you’ll get an option to Open or Save. Do a Save As and place the .whl file in your C:\PyTorch directory. If you can’t locate the PyTorch .whl file, try bit.ly/2SUiAuj, which is where the file was when I wrote this article.

You can install PyTorch using the Python pip utility, which you get with the Anaconda distribution. Open a Windows command shell and navigate to the directory where you saved the PyTorch .whl file. Then enter the following command:

C:\PyTorch> pip install torch-0.4.1-cp36-cp36m-win_amd64.whl

Installation is quick, but there’s a lot that can go wrong. If installation fails, read the error messages in the shell carefully. The problem will almost certainly be a version compatibility issue.

To verify that Python and PyTorch have been successfully installed, open a command shell and enter “python” to launch the Python interpreter. You’ll see the “>>>” Python prompt. Then enter the following commands (note there are two consecutive underscore characters in the version command):

C:\>python
>>> import torch as T
>>> T.__version__
'0.4.1'
>>> exit()
C:\>

If you see the responses shown here, congratulations, you’re ready to start writing neural network machine learning code using PyTorch.

Preparing the Iris Dataset

The raw Iris dataset can be found at bit.ly/1N5br3h. The data looks like this:

5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
...
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
...
6.2, 3.4, 5.4, 2.3, Iris-virginica
5.9, 3.0, 5.1, 1.8, Iris-virginica

The first four values on each line are a flower’s sepal length, sepal width, petal length and petal width. The fifth item is the species to predict. The raw data has 50 setosa, followed by 50 versicolor, followed by 50 virginica. The training file is the first 40 of each species (120 items), and the test file is the last 10 of each species (30 items). Because there are four predictor variables, it’s not feasible to graph the dataset. But you can get a rough idea of the structure of the data by examining the graph in Figure 2.

Figure 2 Partial Iris Data

Neural networks only understand numbers, so the species must be encoded. With most neural network libraries, you’d replace setosa with (1, 0, 0), versicolor with (0, 1, 0) and virginica with (0, 0, 1). This is called 1-of-N encoding or one-hot encoding. However, PyTorch performs one-hot encoding behind the scenes and expects 0, 1 or 2 for the three classes. Therefore, the encoded data for PyTorch looks like:

5.1, 3.5, 1.4, 0.2, 0
4.9, 3.0, 1.4, 0.2, 0
...
7.0, 3.2, 4.7, 1.4, 1
6.4, 3.2, 4.5, 1.5, 1
...
6.2, 3.4, 5.4, 2.3, 2
5.9, 3.0, 5.1, 1.8, 2
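
If you want to generate these encoded files yourself, the species names can be replaced with 0, 1 or 2 using a short one-time conversion script. The following is just a minimal sketch; the iris_encode.py name and the file names are my own assumptions, not part of the demo download:

# iris_encode.py (hypothetical one-time conversion of raw Iris data)
mapping = {"Iris-setosa": "0", "Iris-versicolor": "1",
  "Iris-virginica": "2"}
with open("iris_raw.txt", "r") as fin, \
     open("iris_encoded.txt", "w") as fout:
  for line in fin:
    line = line.strip()
    if len(line) == 0: continue        # skip blank lines
    parts = [p.strip() for p in line.split(",")]
    parts[4] = mapping[parts[4]]       # replace species name with index
    fout.write(",".join(parts) + "\n")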

In most situations you should normalize the predictor variables, typically by scaling so that all values are between 0.0 and 1.0, using what’s called min-max normalization. I didn’t normalize the Iris data in order to keep the demo a bit simpler. When working with neural networks, I usually create a root folder for the problem, such as C:\PyTorch\Iris, and then a subdirectory named Data to hold the data files.
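
For reference, a minimal min-max normalization sketch is shown below. It assumes train_x and test_x are the NumPy arrays loaded by the demo program (shown later); the _norm variable names are my own:

# min-max normalize each predictor column to [0.0, 1.0]
# (not used by the demo; assumes train_x and test_x NumPy arrays)
col_min = train_x.min(axis=0)   # per-column minimums
col_max = train_x.max(axis=0)   # per-column maximums
train_x_norm = (train_x - col_min) / (col_max - col_min)
test_x_norm = (test_x - col_min) / (col_max - col_min)  # use training min/max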

The Demo Program

The complete demo program, with a few minor edits to save space, is presented in Figure 3. I indent two spaces rather than the usual four spaces to save space. I used Notepad to edit the demo program, but there are dozens of Python editors that have advanced features. Note that Python uses the ‘\’ character for line continuation. 

Figure 3 The Iris Dataset Demo Program

# iris_nn.py
# PyTorch 0.4.1 Anaconda3 5.2.0 (Python 3.6.5)
import numpy as np
import torch as T
# -----------------------------------------------------------
class Batch:
  def __init__(self, num_items, bat_size, seed=0):
    self.num_items = num_items; self.bat_size = bat_size
    self.rnd = np.random.RandomState(seed)
  def next_batch(self):
    return self.rnd.choice(self.num_items, self.bat_size,
      replace=False)
# -----------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.fc1 = T.nn.Linear(4, 7)
    T.nn.init.xavier_uniform_(self.fc1.weight)  # glorot
    T.nn.init.zeros_(self.fc1.bias)
    self.fc2 = T.nn.Linear(7, 3)
    T.nn.init.xavier_uniform_(self.fc2.weight)
    T.nn.init.zeros_(self.fc2.bias)
  def forward(self, x):
    z = T.tanh(self.fc1(x))
    z = self.fc2(z)  # see CrossEntropyLoss() below
    return z
# -----------------------------------------------------------
def accuracy(model, data_x, data_y):
  X = T.Tensor(data_x)
  Y = T.LongTensor(data_y)
  oupt = model(X)
  (_, arg_maxs) = T.max(oupt.data, dim=1)
  num_correct = T.sum(Y==arg_maxs)
  acc = (num_correct * 100.0 / len(data_y))
  return acc.item()
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin Iris Dataset with PyTorch demo \n")
  T.manual_seed(1);  np.random.seed(1)
  # 1. load data
  print("Loading Iris data into memory \n")
  train_file = ".\\Data\\iris_train.txt"
  test_file = ".\\Data\\iris_test.txt"
  train_x = np.loadtxt(train_file, usecols=range(0,4),
    delimiter=",",  skiprows=0, dtype=np.float32)
  train_y = np.loadtxt(train_file, usecols=[4],
    delimiter=",", skiprows=0, dtype=np.float32)
  test_x = np.loadtxt(test_file, usecols=range(0,4),
    delimiter=",",  skiprows=0, dtype=np.float32)
  test_y = np.loadtxt(test_file, usecols=[4],
    delimiter=",", skiprows=0, dtype=np.float32)
  # 2. define model
  net = Net()
# -----------------------------------------------------------
  # 3. train model
  net = net.train()  # set training mode
  lrn_rate = 0.01; b_size = 12
  max_i = 600; n_items = len(train_x)
  loss_func = T.nn.CrossEntropyLoss()  # applies softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  batcher = Batch(num_items=n_items, bat_size=b_size)
  print("Starting training")
  for i in range(0, max_i):
    if i > 0 and i % (max_i/10) == 0:
      print("iteration = %4d" % i, end="")
      print("  loss = %7.4f" % loss_obj.item(), end="")
      acc = accuracy(net, train_x, train_y)
      print("  accuracy = %0.2f%%" % acc)
    curr_bat = batcher.next_batch()
    X = T.Tensor(train_x[curr_bat])
    Y = T.LongTensor(train_y[curr_bat])
    optimizer.zero_grad()
    oupt = net(X)
    loss_obj = loss_func(oupt, Y)
    loss_obj.backward()
    optimizer.step()
  print("Training complete \n")
  # 4. evaluate model
  net = net.eval()  # set eval mode
  acc = accuracy(net, test_x, test_y)
  print("Accuracy on test data = %0.2f%%" % acc) 
  # 5. save model
  # TODO
# -----------------------------------------------------------
  # 6. make a prediction
  unk = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  unk = T.tensor(unk)  # to Tensor
  logits = net(unk)  # values do not sum to 1.0
  probs_t = T.softmax(logits, dim=1)  # as Tensor
  probs = probs_t.detach().numpy()    # to numpy array
  print("\nSetting inputs to:")
  for x in unk[0]: print("%0.1f " % x, end="")
  print("\nPredicted: (setosa, versicolor, virginica)")
  for p in probs[0]: print("%0.4f " % p, end="")
  print("\n\nEnd Iris demo")
if __name__ == "__main__":
  main()

The structure of a PyTorch program differs somewhat from that of other libraries. In the demo, program-defined class Batch serves up a specified number of training items for training. Class Net defines a 4-7-3 neural network. Function accuracy computes the classification accuracy (percentage of correct predictions) of data using a specified model/network. All of the control logic is contained in a main function.

Because PyTorch and Python are being developed so quickly, you should include a comment that indicates what versions are being used. Many programmers who are new to Python are surprised to learn that base Python doesn’t support arrays. NumPy arrays are used by PyTorch so you’ll almost always import the NumPy package.

Defining the Neural Network

The definition of the neural network begins with:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.fc1 = T.nn.Linear(4, 7)
    T.nn.init.xavier_uniform_(self.fc1.weight)
    T.nn.init.zeros_(self.fc1.bias)
...

The first line of code indicates that the class inherits from a T.nn.Module class, which contains functions for creating a neural network. You can think of the __init__ function as a class constructor. Object fc1 ("fully connected layer 1") is the network hidden layer, which expects four input values (the predictor values) and has seven processing nodes. The number of hidden nodes is a hyperparameter and must be determined by trial and error. The hidden layer weights are initialized using the Xavier uniform algorithm, which is called Glorot uniform in most other libraries. The hidden layer biases are all initialized to zero.

The network output layer is defined by:

self.fc2 = T.nn.Linear(7, 3)
T.nn.init.xavier_uniform_(self.fc2.weight)
T.nn.init.zeros_(self.fc2.bias)

The output layer expects seven inputs (from the hidden layer) and produces three output values, one for each possible species. Notice that the hidden and output layer aren’t logically connected at this point. The connection is established by the required forward function:

def forward(self, x):
  z = T.tanh(self.fc1(x))
  z = self.fc2(z)  # no softmax!
  return z

The function accepts x, which holds the input predictor values. These values are passed to the hidden layer and the results are then passed to the tanh activation function. That result is passed to the output layer, and the final results are returned. Unlike many neural network libraries, with PyTorch you don’t apply softmax activation to the output layer because softmax will be automatically applied by the training loss function. If you did apply softmax to the output layer, your network would still work, but training would be slower because you’d be applying softmax twice.
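
If you want to verify this behavior, you can check that CrossEntropyLoss produces the same value as applying log-softmax followed by negative log likelihood loss. This is a small interactive sketch; the literal values are arbitrary:

import torch as T
logits = T.tensor([[2.0, 0.5, 0.1]])   # raw output values, no softmax
target = T.tensor([0])                 # true class index
ce = T.nn.CrossEntropyLoss()(logits, target)
nll = T.nn.NLLLoss()(T.nn.functional.log_softmax(logits, dim=1), target)
print(ce.item(), nll.item())           # the two loss values are equal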

Loading the Data into Memory

When using PyTorch, you load data into memory in NumPy arrays and then convert the arrays to PyTorch Tensor objects. You can loosely think of a Tensor as a sophisticated array that can be handled by a GPU processor.

There are several ways to load data into a NumPy array. Among my colleagues, the most common technique is to use the Python Pandas (originally “panel data,” now “Python data analysis”) package. However, Pandas has a bit of a learning curve, so for simplicity the demo program uses the NumPy loadtxt function. The training data is loaded like so:

train_file = ".\\Data\\iris_train.txt"
train_x = np.loadtxt(train_file, usecols=range(0,4),
  delimiter=",",  skiprows=0, dtype=np.float32)
train_y = np.loadtxt(train_file, usecols=[4],
  delimiter=",", skiprows=0, dtype=np.float32)

PyTorch expects the predictor values to be in an array-of-arrays-style matrix and the class values to predict to be in an array. After these statements are executed, matrix train_x will have 120 rows and four columns, and train_y will be an array with 120 values. Most neural network libraries, including PyTorch, use float32 data as the default data type because the precision gained by using 64-bit variables isn’t worth the performance penalty incurred.
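
A quick way to confirm that the data loaded as expected is to print the array shapes and data type:

print(train_x.shape)   # (120, 4)
print(train_y.shape)   # (120,)
print(train_x.dtype)   # float32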

Training the Neural Network

The demo creates the neural network and then prepares training with these statements:

net = Net()
net = net.train()  # set training mode
lrn_rate = 0.01; b_size = 12
max_i = 600; n_items = len(train_x)
loss_func = T.nn.CrossEntropyLoss()  # applies softmax()
optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
batcher = Batch(num_items=n_items, bat_size=b_size)

Setting the network into training mode isn’t required for the demo because training doesn’t use dropout or batch normalization, which have different execution flows for training and evaluation. The learning rate (0.01), batch size (12) and maximum training iterations (600) are hyperparameters. The demo uses iterations rather than epochs because one epoch usually refers to processing all training items one time each. Here, one iteration means processing only 12 of the training items, so 600 iterations corresponds to roughly 600 * 12 / 120 = 60 passes through the 120-item training data.

The CrossEntropyLoss function is used to measure error for multiclass classification problems where there are three or more classes to predict. A common error is to try to use it for binary classification. The demo uses stochastic gradient descent, which is the most rudimentary form of training optimization. For realistic problems, PyTorch supports sophisticated algorithms, including adaptive moment estimation (Adam), adaptive gradient (Adagrad) and root mean square propagation (RMSprop).
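
Swapping in one of these optimizers requires changing only the optimizer statement. For example (a sketch only; the learning rate shown is just a typical starting point, not a tuned value):

optimizer = T.optim.Adam(net.parameters(), lr=0.01)
# or
optimizer = T.optim.RMSprop(net.parameters(), lr=0.01)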

The program-defined Batch class implements the simplest possible batching mechanism. On each call to its next_batch function, 12 randomly selected indices from the 120 possible training data indices are returned. This approach doesn’t guarantee that all training items will be used the same number of times. In a non-demo scenario, you’d likely want to implement a more sophisticated batcher that randomly selects different indices until all have been selected once, and then resets itself.
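
A minimal sketch of such a batcher is shown below. The EpochBatcher name and details are my own, not part of the demo; it serves each training index exactly once per pass and then reshuffles:

import numpy as np
class EpochBatcher:
  def __init__(self, num_items, bat_size, seed=0):
    self.num_items = num_items; self.bat_size = bat_size
    self.rnd = np.random.RandomState(seed)
    self.order = self.rnd.permutation(num_items); self.ptr = 0
  def next_batch(self):
    if self.ptr + self.bat_size > self.num_items:  # pass finished
      self.order = self.rnd.permutation(self.num_items); self.ptr = 0
    batch = self.order[self.ptr : self.ptr+self.bat_size]
    self.ptr += self.bat_size
    return batch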

Training is performed exactly 600 times. Every 600 / 10 = 60 iterations, the demo displays progress information:

for i in range(0, max_i):
  if i > 0 and i % (max_i/10) == 0:
    print("iteration = %4d" % i, end="")
    print("  loss = %7.4f" % loss_obj.item(), end="")
    acc = accuracy(net, train_x, train_y)
    print("  accuracy = %0.2f%%" % acc)

The average cross entropy loss/error value for the current batch of 12 training items can be accessed through the object’s item function. In general, cross entropy loss is difficult to interpret during training, but you should monitor it to make sure that it’s gradually decreasing, which indicates training is working.

Somewhat unusually, at the time I’m writing this article, PyTorch doesn’t have a built-in function to give you classification accuracy. The program-defined accuracy function computes the classification accuracy of the model using the current weights and biases values. Accuracy is much easier to interpret than loss or error, but it’s a cruder metric.

Inside the training loop, a batch of items is selected from the 120-item dataset and converted into Tensor objects:

curr_bat = batcher.next_batch()
X = T.Tensor(train_x[curr_bat])
Y = T.LongTensor(train_y[curr_bat])

Recall that curr_bat is an array of 12 indices into the training data so train_x[curr_bat] has 12 rows and four columns. This matrix is converted into PyTorch Tensor objects by passing the matrix to the Tensor function. For a classification problem, you must convert the encoded class label values into LongTensor objects rather than Tensor objects.

The actual training is performed by these five statements:

optimizer.zero_grad()
oupt = net(X)
loss_obj = loss_func(oupt, Y)
loss_obj.backward()
optimizer.step()

You can essentially consider these statements as magic PyTorch incantations that perform training using back-propagation. You must first zero-out the weight and bias gradient values from the previous iteration. The call to the net function passes the current batch of 12 Tensor objects to the network and computes the 12 output values using the forward function. The calls to backward and step compute the gradient values and use them to update weights and biases.

Evaluating and Using the Model

After training completes, the demo computes model accuracy on the test data:

net = net.eval()  # set eval mode
acc = accuracy(net, test_x, test_y)
print("Accuracy on test data = %0.2f%%" % acc)

As before, setting the model to evaluation mode isn’t necessary in this example, but it doesn’t hurt to be explicit. The demo program doesn’t save the trained model, but in a non-demo scenario you might want to do so. PyTorch, along with most other neural network libraries (with the notable exception of TensorFlow), supports the Open Neural Network Exchange (ONNX) format.
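
For completeness, one common approach (a sketch only, not part of the demo) is to save just the trained weights and biases with state_dict, and optionally export to ONNX by tracing the network with a dummy input. The file names and dummy values here are my own assumptions:

T.save(net.state_dict(), "iris_model.pth")       # save weights and biases
net2 = Net()
net2.load_state_dict(T.load("iris_model.pth"))   # restore into a new Net
# optional ONNX export (requires a dummy input for tracing)
dummy = T.tensor(np.array([[5.0, 3.5, 1.3, 0.2]], dtype=np.float32))
T.onnx.export(net, dummy, "iris.onnx")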

The demo uses the trained model to predict the species of a new, previously unseen Iris flower:

unk = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
unk = T.tensor(unk)  # to Tensor
logits = net(unk)  # values do not sum to 1.0
probs_t = T.softmax(logits, dim=1)  # as Tensor
probs = probs_t.detach().numpy()    # to numpy array

The call to the net function returns three values that don’t necessarily sum to 1.0, for example (3.2, 4.5, 0.3), so the demo applies softmax to coerce the output values so that they sum to 1.0 and can be loosely interpreted as probabilities. The values are Tensor objects, so they’re converted into a NumPy array for easier display.

Wrapping Up

This article has just barely scratched the surface of the PyTorch library, but should give you all the information you need to start experimenting. As this article demonstrates, PyTorch is quite different from and operates at a lower level than CNTK, TensorFlow and scikit-learn. A common question is, “Which neural network library is best?” In a perfect world you could dedicate time and learn all the major libraries. But because these libraries are quite complicated, realistically most of my colleagues have one primary library. In my opinion, from a technical point of view, the three best libraries are CNTK, Keras/TensorFlow and PyTorch. But they’re all excellent and picking one library over another really depends mostly on your programming style and which one is most used by your colleagues or company.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Brian Broll, Yihe Dong, Chris Lee

