October 2017

Volume 32 Number 10

### [Test Run]

# Time-Series Regression Using a C# Neural Network

The goal of a time-series regression problem is to make predictions based on historical time data. For example, if you have monthly sales data (over the course of a year or two), you might want to predict sales for the upcoming month. Time-series regression is usually very difficult, and there are many different techniques you can use.

In this article, I’ll demonstrate how to perform a time-series regression analysis using rolling-window data combined with a neural network. The idea is best explained by an example. Take a look at the demo program in **Figure 1**. The demo program analyzes the number of airline passengers who traveled each month between January 1949 and December 1960.

**Figure 1 Rolling-Window Time-Series Regression Demo**

The demo data comes from a well-known benchmark dataset that you can find in many places on the Internet and is included with the download that accompanies this article. The raw data looks like:

```
"1949-01";112
"1949-02";118
"1949-03";132
"1949-04";129
"1949-05";121
"1949-06";135
"1949-07";148
"1949-08";148
...
"1960-11";390
"1960-12";432
```

There are 144 raw data items. The first field is the year and month. The second field is the total number of international airline passengers for the month, in thousands. The demo creates training data using a rolling window of size 4 to yield 140 training items. The training data is normalized by dividing each passenger count by 100:

```
[ 0] 1.12 1.18 1.32 1.29 1.21
[ 1] 1.18 1.32 1.29 1.21 1.35
[ 2] 1.32 1.29 1.21 1.35 1.48
[ 3] 1.29 1.21 1.35 1.48 1.48
...
[139] 6.06 5.08 4.61 3.90 4.32
```

Notice that the explicit time values in the data are removed. The first window consists of the first four passenger counts (1.12, 1.18, 1.32, 1.29), which are used as predictor values, followed by the fifth count (1.21), which is a value to predict. The next window consists of the second through fifth counts (1.18, 1.32, 1.29, 1.21), which are the next set of predictor values, followed by the sixth count (1.35), the value to predict. In short, each set of four consecutive passenger counts is used to predict the next count.

The demo creates a neural network with four input nodes, 12 hidden processing nodes and a single output node. The number of input nodes corresponds to the number of predictors in the rolling window. The window size must be determined by trial and error, which is the biggest drawback to this technique. The number of neural network hidden nodes must also be determined by trial and error, which is always true for neural networks. There’s just one output node because time-series regression predicts one time unit ahead.

The neural network has (4 * 12) + (12 * 1) = 60 node-to-node weights and (12 + 1) = 13 biases, which essentially define the neural network model. Using the rolling-window data, the demo program trains the network using the basic stochastic back-propagation algorithm with a learning rate set to 0.01 and a fixed number of iterations set to 10,000.

During training, the demo displays the mean squared error between predicted output values and correct output values, every 2,000 iterations. Training error is difficult to interpret and is monitored mostly to see if something really strange happens (which is fairly common). In this case, the error seems to stabilize after about 4,000 iterations.
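The error routine used by the demo isn't listed in this article, so the following is only a minimal sketch of a mean squared error computation. It assumes the predicted and target values are already available as two arrays. The names ErrorDemo and MeanSquaredError are hypothetical.

```csharp
using System;

static class ErrorDemo
{
  // Mean squared error between predicted and target values.
  // Hypothetical helper, not taken from the demo source.
  public static double MeanSquaredError(double[] predicted,
    double[] targets)
  {
    if (predicted.Length != targets.Length || predicted.Length == 0)
      throw new ArgumentException("Arrays must be non-empty, same length");
    double sumSquared = 0.0;
    for (int i = 0; i < predicted.Length; ++i)
    {
      double diff = predicted[i] - targets[i];
      sumSquared += diff * diff;
    }
    return sumSquared / predicted.Length;
  }

  static void Main()
  {
    // A few normalized values, used only to exercise the method
    double[] predicted = { 1.29, 1.28, 1.37 };
    double[] targets = { 1.21, 1.35, 1.48 };
    Console.WriteLine(MeanSquaredError(predicted, targets).ToString("F4"));
  }
}
```

Because the data is normalized by dividing by 100, an error computed this way is on the normalized scale, which is one reason the raw number is hard to interpret.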

After training, the demo code displays the 73 weight and bias values, again mostly as a sanity check. For time-series regression problems, you must typically use a custom accuracy metric. Here, a correct prediction is one where the unnormalized predicted passenger count is within plus or minus 30 of the actual count. With that definition, the demo program achieved 91.43 percent accuracy, which is 128 correct and 12 wrong for the 140 predicted passenger counts.

The demo concludes by using the trained neural network to predict the passenger count for January 1961, the first time period past the range of the training data. This is called extrapolation. The prediction is 433, which means about 433,000 passengers because the counts are in thousands. That value could be used as a predictor variable to forecast February 1961, and so on.

This article assumes you have intermediate or higher programming skills and have a basic knowledge of neural networks, but doesn’t assume you know anything about time-series regression. The demo program is coded using C#, but you shouldn’t have too much trouble refactoring the code to another language, such as Java or Python. The demo program is too long to present in its entirety, but the complete source code is available in the file download that accompanies this article.

## Time-Series Regression

Time-series regression problems are often displayed using a line chart such as the one in **Figure 2**. The blue line indicates the 144 actual, unnormalized, passenger counts in thousands, from January 1949 through December 1960. The light red line indicates the predicted passenger counts generated by the neural network time-series model. Notice that because the model uses a rolling window with four predictor values, the first predicted passenger count doesn’t occur until month = 5. Additionally, I made forecasts for nine months beyond the range of the training data. These are indicated by the dashed red line.

**Figure 2 Time-Series Regression Line Chart**

In addition to making predictions for times beyond the training data range, time-series regression analyses can be used to identify anomalous data points. This doesn’t occur with the demo passenger count data—you can see the predicted counts match the actual counts quite closely. For example, the actual passenger count for month t = 67 is 302 (the blue dot near the center in **Figure 2**) and the predicted count is 272. But suppose the actual count for month t = 67 was 400. There’d be an obvious visual indication that the actual count for month 67 was an outlier value.

You can also use a programmatic approach for spotting anomalous data with time-series regression. For example, you could flag any time value where the actual data value and the predicted value differed by more than some fixed threshold, such as four times the standard deviation of the predicted versus actual data values.
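One way to sketch such a threshold check is shown below. It assumes "the standard deviation of the predicted versus actual data values" means the standard deviation of the differences between the two series, which is my interpretation. The names AnomalyDemo and FlagAnomalies are hypothetical, and the data is synthetic rather than the airline data.

```csharp
using System;
using System.Collections.Generic;

static class AnomalyDemo
{
  // Returns the indices where |actual - predicted| exceeds
  // numStdDevs times the standard deviation of the differences.
  public static List<int> FlagAnomalies(double[] actual,
    double[] predicted, double numStdDevs)
  {
    int n = actual.Length;
    double[] diffs = new double[n];
    double mean = 0.0;
    for (int i = 0; i < n; ++i)
    {
      diffs[i] = actual[i] - predicted[i];
      mean += diffs[i];
    }
    mean /= n;

    double sumSq = 0.0;
    for (int i = 0; i < n; ++i)
      sumSq += (diffs[i] - mean) * (diffs[i] - mean);
    double stdDev = Math.Sqrt(sumSq / n); // population form

    List<int> flagged = new List<int>();
    for (int i = 0; i < n; ++i)
      if (Math.Abs(diffs[i]) > numStdDevs * stdDev)
        flagged.Add(i);
    return flagged;
  }

  static void Main()
  {
    // Synthetic data: 30 points where each prediction misses by 1,
    // plus one simulated outlier at index 15.
    int n = 30;
    double[] actual = new double[n];
    double[] predicted = new double[n];
    for (int i = 0; i < n; ++i)
    {
      actual[i] = 100 + i;
      predicted[i] = actual[i] + (i % 2 == 0 ? 1 : -1);
    }
    actual[15] += 100;

    foreach (int idx in FlagAnomalies(actual, predicted, 4.0))
      Console.WriteLine("Possible anomaly at index " + idx);
  }
}
```

With this synthetic data only index 15 is flagged. Note that an outlier inflates the standard deviation it's compared against, so with a short series a multiplier of four may never be exceeded and a smaller value might be needed.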

## The Demo Program

To code the demo program, I launched Visual Studio and created a new C# console application and named it NeuralTimeSeries. I used Visual Studio 2015, but the demo program has no significant .NET Framework dependencies, so any recent version will work fine.

After the template code loaded into the editor window, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to NeuralTimeSeriesProgram.cs, then allowed Visual Studio to automatically rename class Program for me. At the top of the template-generated code, I deleted all unnecessary using statements, leaving just the one that references the top-level System namespace.

The overall program structure, with a few minor edits to save space, is presented in **Figure 3**.

**Figure 3 NeuralTimeSeries Program Structure**

```
using System;
namespace NeuralTimeSeries
{
  class NeuralTimeSeriesProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin time series demo");
      Console.WriteLine("Predict airline passengers ");
      Console.WriteLine("January 1949 to December 1960 ");
      double[][] trainData = GetAirlineData();
      trainData = Normalize(trainData);
      Console.WriteLine("Normalized training data:");
      ShowMatrix(trainData, 5, 2, true); // first 5 rows
      int numInput = 4; // Number predictors
      int numHidden = 12;
      int numOutput = 1; // Regression
      Console.WriteLine("Creating a " + numInput + "-" + numHidden +
        "-" + numOutput + " neural network");
      NeuralNetwork nn = new NeuralNetwork(numInput, numHidden,
        numOutput);
      int maxEpochs = 10000;
      double learnRate = 0.01;
      double[] weights = nn.Train(trainData, maxEpochs, learnRate);
      Console.WriteLine("Model weights and biases: ");
      ShowVector(weights, 2, 10, true);
      double trainAcc = nn.Accuracy(trainData, 0.30);
      Console.WriteLine("\nModel accuracy (+/- 30) on training " +
        "data = " + trainAcc.ToString("F4"));
      double[] future = new double[] { 5.08, 4.61, 3.90, 4.32 };
      double[] predicted = nn.ComputeOutputs(future);
      Console.WriteLine("January 1961 (t=145): ");
      Console.WriteLine((predicted[0] * 100).ToString("F0"));
      Console.WriteLine("End time series demo ");
      Console.ReadLine();
    } // Main
    static double[][] Normalize(double[][] data) { . . }
    static double[][] GetAirlineData() { . . }
    static void ShowMatrix(double[][] matrix, int numRows,
      int decimals, bool indices) { . . }
    static void ShowVector(double[] vector, int decimals,
      int lineLen, bool newLine) { . . }
  } // Program
  public class NeuralNetwork { . . }
} // ns
```

The demo uses a simple single-hidden-layer neural network, implemented from scratch. Alternatively, you can use the techniques presented in this article along with a neural network library such as Microsoft Cognitive Toolkit (CNTK).

The demo begins by setting up the training data, as shown in **Figure 4**.

**Figure 4 Setting up the Training Data**

```
double[][] trainData = GetAirlineData();
trainData = Normalize(trainData);
Console.WriteLine("Normalized training data:");
ShowMatrix(trainData, 5, 2, true);
```

Method GetAirlineData is defined as:

```
static double[][] GetAirlineData()
{
  double[][] airData = new double[140][];
  airData[0] = new double[] { 112, 118, 132, 129, 121 };
  airData[1] = new double[] { 118, 132, 129, 121, 135 };
  ...
  airData[139] = new double[] { 606, 508, 461, 390, 432 };
  return airData;
}
```

Here, the rolling-window data is hardcoded with a window size of 4. Before writing the time-series program, I wrote a short utility program to generate the rolling-window data from the raw data. In most non-demo scenarios you’d read raw data from a text file, and then programmatically generate rolling-window data, where the window size is parameterized so you could experiment with different sizes.
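The utility program mentioned above isn't included in this article, so the following is only a sketch of how rolling-window data might be generated with a parameterized window size. It assumes the raw lines have the `"1949-01";112` format shown earlier. The names RollingWindowDemo, ParseCounts and MakeWindows are hypothetical.

```csharp
using System;
using System.Globalization;

static class RollingWindowDemo
{
  // Parses lines such as "\"1949-01\";112" and returns the counts.
  public static double[] ParseCounts(string[] lines)
  {
    double[] counts = new double[lines.Length];
    for (int i = 0; i < lines.Length; ++i)
    {
      string[] fields = lines[i].Split(';');
      counts[i] = double.Parse(fields[1], CultureInfo.InvariantCulture);
    }
    return counts;
  }

  // Each row holds windowSize predictors followed by the value to predict.
  public static double[][] MakeWindows(double[] series, int windowSize)
  {
    if (windowSize < 1 || windowSize >= series.Length)
      throw new ArgumentException("Invalid window size");
    int numRows = series.Length - windowSize;
    double[][] result = new double[numRows][];
    for (int r = 0; r < numRows; ++r)
    {
      result[r] = new double[windowSize + 1];
      Array.Copy(series, r, result[r], 0, windowSize + 1);
    }
    return result;
  }

  static void Main()
  {
    // First eight raw items from the article. A real program would
    // likely use File.ReadAllLines on the data file instead.
    string[] lines = {
      "\"1949-01\";112", "\"1949-02\";118", "\"1949-03\";132",
      "\"1949-04\";129", "\"1949-05\";121", "\"1949-06\";135",
      "\"1949-07\";148", "\"1949-08\";148" };
    double[][] windows = MakeWindows(ParseCounts(lines), 4);
    foreach (double[] row in windows)
      Console.WriteLine(string.Join(" ", row));
  }
}
```

A series of n items and a window of size w yields n - w rows, which is consistent with the 144 raw items and 140 training items used by the demo.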

The Normalize method just divides all data values by a constant 100. I did this purely for practical reasons. My first attempts with non-normalized data led to very poor results, but after normalization, my results were much better. In theory, when working with neural networks, your data doesn’t need to be normalized, but in practice normalization often makes a big difference.
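The body of Normalize is elided in the program listing, so the following is only a sketch of the divide-by-a-constant idea. Unlike the demo method, this hypothetical version takes the divisor as a parameter and returns a scaled copy rather than changing the input.

```csharp
using System;

static class NormalizeDemo
{
  // Returns a copy of data with every value divided by divisor.
  // Hypothetical variant of the demo's Normalize method.
  public static double[][] Normalize(double[][] data, double divisor)
  {
    double[][] result = new double[data.Length][];
    for (int i = 0; i < data.Length; ++i)
    {
      result[i] = new double[data[i].Length];
      for (int j = 0; j < data[i].Length; ++j)
        result[i][j] = data[i][j] / divisor;
    }
    return result;
  }

  static void Main()
  {
    double[][] raw = { new double[] { 112, 118, 132, 129, 121 } };
    double[][] scaled = Normalize(raw, 100.0);
    foreach (double v in scaled[0])
      Console.Write(v.ToString("F2") + " ");
    Console.WriteLine();
  }
}
```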

The neural network is created like so:

```
int numInput = 4;
int numHidden = 12;
int numOutput = 1;
NeuralNetwork nn =
  new NeuralNetwork(numInput, numHidden, numOutput);
```

The number of input nodes is set to four because each rolling window has four predictor values. The number of output nodes is set to one because each set of window values is used to make a prediction for the next month. The number of hidden nodes is set to 12 and was determined by trial and error.

The neural network is trained and evaluated with these statements:

```
int maxEpochs = 10000;
double learnRate = 0.01;
double[] weights = nn.Train(trainData, maxEpochs, learnRate);
ShowVector(weights, 2, 10, true);
```

The Train method uses basic back-propagation. There are many variations, including using momentum or adaptive learning rates to increase training speed, and using L1 or L2 regularization or dropout to prevent model over-fitting. The helper method ShowVector displays a vector with real values formatted to two decimal places, 10 values per line.

After the neural network time-series model has been created, its prediction accuracy is evaluated:

```
double trainAcc = nn.Accuracy(trainData, 0.30);
Console.WriteLine("\nModel accuracy (+/- 30) on " +
  "training data = " + trainAcc.ToString("F4"));
```

For time-series regression, deciding whether a predicted value is correct depends on the problem being investigated. For the airline passenger data, method Accuracy marks a predicted passenger count as correct if the unnormalized predicted count is within plus or minus 30 of the actual raw count. For the demo data, the first five predictions, for t = 5 to t = 9, are all within 11 of the actual counts. The prediction for t = 10 misses by 22, which is the largest of the six errors but still inside the tolerance of 30:

```
 t  actual  predicted
= = = = = = = = = = =
 5    121      129
 6    135      128
 7    148      137
 8    148      153
 9    136      140
10    119      141
```
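The Accuracy method belongs to the NeuralNetwork class and isn't listed here. The sketch below shows only the tolerance test itself, applied to two arrays of unnormalized values. The name AccuracyDemo is hypothetical, and treating a difference exactly equal to the tolerance as correct is an assumption on my part.

```csharp
using System;

static class AccuracyDemo
{
  // Fraction of predictions within +/- tolerance of the actual value.
  // Hypothetical helper, not the demo's Accuracy method.
  public static double Accuracy(double[] actual, double[] predicted,
    double tolerance)
  {
    int numCorrect = 0;
    for (int i = 0; i < actual.Length; ++i)
      if (Math.Abs(predicted[i] - actual[i]) <= tolerance)
        ++numCorrect;
    return (double)numCorrect / actual.Length;
  }

  static void Main()
  {
    // Unnormalized values for t = 5 to t = 10 from the table above
    double[] actual = { 121, 135, 148, 148, 136, 119 };
    double[] predicted = { 129, 128, 137, 153, 140, 141 };
    Console.WriteLine(Accuracy(actual, predicted, 30.0).ToString("F4"));
    Console.WriteLine(Accuracy(actual, predicted, 20.0).ToString("F4"));
  }
}
```

The second call illustrates how sensitive a custom metric is to the tolerance you pick: tightening it to 20 would count the t = 10 prediction, which is off by 22, as a miss.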

The demo program finishes by using the last four passenger counts (t = 141 to 144) to predict the passenger count for the first time period beyond the range of the training data (t = 145 = January 1961):

```
double[] predictors = new double[] { 5.08, 4.61, 3.90, 4.32 };
double[] forecast = nn.ComputeOutputs(predictors);
Console.WriteLine("Predicted for January 1961 (t=145): ");
Console.WriteLine((forecast[0] * 100).ToString("F0"));
Console.WriteLine("End time series demo");
```

Notice that because the time-series model was trained using normalized data (divided by 100), the predictions will also be normalized, so the demo displays the predicted values times 100.
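To forecast more than one period ahead, each prediction can be fed back in as the newest predictor value. The following is a minimal sketch of that loop, not code from the demo. The names ForecastDemo and ForecastAhead are hypothetical, and a simple stand-in model (the mean of the window) replaces the trained network so the example runs by itself.

```csharp
using System;

static class ForecastDemo
{
  // Repeatedly predicts one step ahead, sliding each prediction
  // into the window so it becomes a predictor for the next step.
  public static double[] ForecastAhead(
    Func<double[], double> predictNext, double[] lastWindow,
    int numSteps)
  {
    double[] window = (double[])lastWindow.Clone();
    double[] forecasts = new double[numSteps];
    for (int step = 0; step < numSteps; ++step)
    {
      double next = predictNext(window);
      forecasts[step] = next;
      // Shift left by one and append the new prediction
      Array.Copy(window, 1, window, 0, window.Length - 1);
      window[window.Length - 1] = next;
    }
    return forecasts;
  }

  static void Main()
  {
    // Stand-in model so the sketch is self-contained. With the
    // demo network you might pass w => nn.ComputeOutputs(w)[0].
    Func<double[], double> meanModel = w =>
    {
      double sum = 0.0;
      foreach (double v in w) sum += v;
      return sum / w.Length;
    };
    double[] lastWindow = { 5.08, 4.61, 3.90, 4.32 };
    double[] forecasts = ForecastAhead(meanModel, lastWindow, 3);
    foreach (double f in forecasts)
      Console.WriteLine((f * 100).ToString("F0"));
  }
}
```

Keep in mind that errors compound with this scheme, because later forecasts are built on earlier forecasts rather than on observed values.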

## Neural Networks for Time-Series Analyses

When you define a neural network you must specify the activation functions used by the hidden-layer nodes and by the output-layer nodes. Briefly, I recommend using the hyperbolic tangent (tanh) function for hidden activation, and the identity function for output activation.

When using a neural network library or system such as Microsoft CNTK or Azure Machine Learning, you must explicitly specify the activation functions. The demo program hardcodes these activation functions. The key code occurs in method ComputeOutputs. The hidden node values are computed like so:

```
for (int j = 0; j < numHidden; ++j)
  for (int i = 0; i < numInput; ++i)
    hSums[j] += this.iNodes[i] * this.ihWeights[i][j];
for (int i = 0; i < numHidden; ++i) // Add biases
  hSums[i] += this.hBiases[i];
for (int i = 0; i < numHidden; ++i) // Apply activation
  this.hNodes[i] = HyperTan(hSums[i]); // Hardcoded
```

Here, function HyperTan is program-defined to avoid extreme values:

```
private static double HyperTan(double x) {
  if (x < -20.0) return -1.0; // tanh(-20) rounds to -1.0 as a double
  else if (x > 20.0) return 1.0;
  else return Math.Tanh(x);
}
```

A reasonable, and common, alternative to using tanh for hidden-node activation is to use the closely related logistic sigmoid function. For example:

```
private static double LogSig(double x) {
  if (x < -20.0) return 0.0; // Close approximation
  else if (x > 20.0) return 1.0;
  else return 1.0 / (1.0 + Math.Exp(-x));
}
```

Because the identity function is just f(x) = x, using it for output-node activation is just a fancy way of saying don’t use any explicit activation. The demo code in method ComputeOutputs is:

```
for (int j = 0; j < numOutput; ++j)
  for (int i = 0; i < numHidden; ++i)
    oSums[j] += hNodes[i] * hoWeights[i][j];
for (int i = 0; i < numOutput; ++i) // Add biases
  oSums[i] += oBiases[i];
Array.Copy(oSums, this.oNodes, oSums.Length);
```

The sum of products for an output node is copied directly into the output node without applying an explicit activation. Note that the oNodes member of the NeuralNetwork class is an array with one cell, rather than a single variable.
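To see the hidden-layer and output-layer pieces working together, here is a self-contained sketch of a forward pass with tanh hidden activation and identity output activation. It isn't the demo's ComputeOutputs method: the weights are passed in as parameters instead of being class members, Math.Tanh is called directly instead of HyperTan, and the tiny 2-2-1 network and its weight values are made up for illustration.

```csharp
using System;

static class ForwardPassDemo
{
  // Forward pass for a single-hidden-layer network with tanh hidden
  // activation and identity output activation.
  public static double[] ComputeOutputs(double[] inputs,
    double[][] ihWeights, double[] hBiases,
    double[][] hoWeights, double[] oBiases)
  {
    int numInput = inputs.Length;
    int numHidden = hBiases.Length;
    int numOutput = oBiases.Length;

    double[] hNodes = new double[numHidden];
    for (int j = 0; j < numHidden; ++j)
    {
      double sum = hBiases[j];
      for (int i = 0; i < numInput; ++i)
        sum += inputs[i] * ihWeights[i][j];
      hNodes[j] = Math.Tanh(sum);
    }

    double[] outputs = new double[numOutput];
    for (int k = 0; k < numOutput; ++k)
    {
      double sum = oBiases[k];
      for (int j = 0; j < numHidden; ++j)
        sum += hNodes[j] * hoWeights[j][k];
      outputs[k] = sum; // Identity activation: no transformation
    }
    return outputs;
  }

  static void Main()
  {
    // Tiny 2-2-1 network with made-up weights, for illustration only
    double[] inputs = { 1.0, 2.0 };
    double[][] ihWeights = {
      new double[] { 0.1, 0.1 },
      new double[] { 0.1, 0.1 } };
    double[] hBiases = { 0.0, 0.0 };
    double[][] hoWeights = {
      new double[] { 0.5 },
      new double[] { 0.5 } };
    double[] oBiases = { 0.1 };
    double[] result = ComputeOutputs(inputs, ihWeights, hBiases,
      hoWeights, oBiases);
    Console.WriteLine(result[0].ToString("F4"));
  }
}
```

In this example both hidden nodes have a pre-activation sum of 0.3, so the output is 0.1 + tanh(0.3), with no squashing applied at the output node.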

The choice of activation functions affects the code in the back-propagation algorithm implemented in method Train. Method Train uses the calculus derivatives of each activation function. The derivative of y = tanh(x) is (1 + y) * (1 - y). In the demo code:

```
// Hidden node signals
for (int j = 0; j < numHidden; ++j) {
  derivative = (1 + hNodes[j]) * (1 - hNodes[j]); // tanh
  double sum = 0.0;
  for (int k = 0; k < numOutput; ++k)
    sum += oSignals[k] * hoWeights[j][k];
  hSignals[j] = derivative * sum;
}
```

If you use logistic sigmoid activation, the derivative of y = logsig(x) is y * (1 - y). For output activation, the calculus derivative of y = x is just the constant 1. The relevant code in method Train is:

```
for (int k = 0; k < numOutput; ++k) {
  errorSignal = tValues[k] - oNodes[k];
  derivative = 1.0; // For Identity activation
  oSignals[k] = errorSignal * derivative;
}
```

Obviously, multiplying by 1 has no effect. I coded as I did to act as a form of documentation.
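If you swap activation functions, it's easy to get the matching derivative wrong. One way to guard against that is to compare the closed-form derivative with a numeric estimate. The following sketch does this for tanh and logistic sigmoid using a central difference. The names DerivativeCheckDemo and NumericDerivative, the step size and the test point are my own choices, not part of the demo.

```csharp
using System;

static class DerivativeCheckDemo
{
  public static double LogSig(double x)
  {
    return 1.0 / (1.0 + Math.Exp(-x));
  }

  // Central-difference estimate of f'(x)
  public static double NumericDerivative(Func<double, double> f,
    double x)
  {
    double h = 1.0e-5; // Step size, chosen arbitrarily
    return (f(x + h) - f(x - h)) / (2.0 * h);
  }

  static void Main()
  {
    double x = 0.7; // Arbitrary test point
    double yTanh = Math.Tanh(x);
    double ySig = LogSig(x);

    // Closed forms are written in terms of the output y, as in Train
    double tanhClosed = (1 + yTanh) * (1 - yTanh);
    double sigClosed = ySig * (1 - ySig);

    Console.WriteLine("tanh:   " + tanhClosed.ToString("F6") +
      " vs " + NumericDerivative(Math.Tanh, x).ToString("F6"));
    Console.WriteLine("logsig: " + sigClosed.ToString("F6") +
      " vs " + NumericDerivative(LogSig, x).ToString("F6"));
  }
}
```

Each line should show two values that agree to several decimal places. A noticeable mismatch would suggest the derivative used in back-propagation doesn't match the activation function.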

## Wrapping Up

There are many different techniques you can use to perform time-series regression analyses. The Wikipedia article on the topic lists dozens of techniques, classified in many ways, such as parametric vs. non-parametric and linear vs. non-linear. In my opinion, the main advantage of using a neural network approach with rolling-window data is that the resulting model is often (but not always) more accurate than non-neural models. The main disadvantage of the neural network approach is that you must experiment with the learning rate to get good results.

Most time-series regression-analysis techniques use rolling-window data, or a similar scheme. However, there are advanced techniques that can use raw data, without windowing. In particular, a relatively new approach uses what’s called a long short-term memory neural network. This approach often produces very accurate predictive models.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: John Krumm, Chris Lee and Adith Swaminathan