April 2018

Volume 33 Number 4

### [Test Run]

# Understanding LSTM Cells Using C#

A long short-term memory (LSTM) cell is a small software component that can be used to create a recurrent neural network that can make predictions relating to sequences of data. LSTM networks have been responsible for major breakthroughs in several areas of machine learning. In this article, I demonstrate how to implement an LSTM cell using C#. Although it’s unlikely you’ll ever need to create a recurrent neural network from scratch, understanding exactly how LSTM cells work will help you if you ever need to create an LSTM network using a code library such as Microsoft CNTK or Google TensorFlow.

The screenshot in **Figure 1** shows where this article is headed. The demo program creates an LSTM cell that accepts an input vector of size n = 2, and generates an explicit output vector of size m = 3 and a cell state vector of size m = 3. A key characteristic of LSTM cells is that they maintain a state.

**Figure 1 LSTM Cell Input-Output Demo**

An LSTM cell has (4 * n * m) + (4 * m * m) weights and (4 * m) biases. Weights and biases are just constants, with values like 0.1234, that define the behavior of the LSTM cell. The demo has 60 weights and 12 biases that are set to arbitrary values.
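As a quick sanity check, the counts for the demo's n = 2 and m = 3 can be computed directly from the formula:

```
int n = 2;  // Size of the input vector
int m = 3;  // Size of the output and cell state vectors
int numWeights = (4 * n * m) + (4 * m * m);  // 24 + 36 = 60
int numBiases = 4 * m;                       // 12
Console.WriteLine(numWeights + " weights, " + numBiases + " biases");
```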

The demo sends input (1.0, 2.0) to the LSTM cell. The computed output is (0.0629, 0.0878, 0.1143) and the new cell state is (0.1143, 0.1554, 0.1973). I’ll explain what these numeric values might represent shortly, but for now the point is that an LSTM cell accepts input, produces output and updates its state. The values of the weights and biases haven’t changed.

Next, the demo feeds (3.0, 4.0) to the LSTM cell. The new computed output is (0.1282, 0.2066, 0.2883). The new cell state is (0.2278, 0.3523, 0.4789). Although it’s not apparent, the new cell state value contains information about all previous input and output values. This is the “long” part of long short-term memory.

This article assumes you have intermediate or better programming skills, but doesn’t assume you know anything about LSTM cells. The demo is coded using C#, but you should be able to refactor the code to another language, such as Python or Visual Basic, if you wish. The demo code is just a bit too long to present in its entirety, but the complete code is available in the accompanying file download.

## Understanding LSTM Cells

There are three ways to describe LSTM cells: using an architecture diagram as shown in **Figure 2**; using math equations as shown in **Figure 3**; and using code as presented in this article. Your initial impression of the architecture diagram in **Figure 2** is probably something along the lines of, "What the heck?" The key ideas are that an LSTM cell accepts an input vector x(t), where the t stands for a time. The explicit output is h(t). The unusual use of h (rather than o) to represent output is historical and comes from the fact that neural systems were often described as functions g and h.

The LSTM cell also uses h(t-1) and c(t-1) as inputs. Here, the t-1 means from the previous time step. The c represents the cell state, so h(t-1) and c(t-1) are the previous output values and the previous state values. The interior of the LSTM cell architecture looks complicated. And it is, but not as complicated as you might guess.

The math equations in **Figure 3** define the behavior of the demo program LSTM cell. If you don’t regularly work with math definitions, your reaction to **Figure 3** is likely, again, "What the heck?" Equations (1), (2) and (3) define three gates: a forget gate, an input gate and an output gate. Each gate is a vector of values between 0.0 and 1.0, which are used to determine how much information to forget (or, equivalently, remember) in each input-output cycle. Equation (4) computes the new cell state, and equation (5) computes the new output.

**Figure 2 LSTM Cell Architecture**

**Figure 3 LSTM Cell Math Equations**
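The equations themselves aren't reproduced in the text, but in the article's notation (where ∘ denotes element-wise multiplication and sigmoid is the logistic-sigmoid function), the standard LSTM formulation that corresponds to the demo code is:

```
(1) f(t) = sigmoid( Wf * x(t) + Uf * h(t-1) + bf )   // Forget gate
(2) i(t) = sigmoid( Wi * x(t) + Ui * h(t-1) + bi )   // Input gate
(3) o(t) = sigmoid( Wo * x(t) + Uo * h(t-1) + bo )   // Output gate
(4) c(t) = f(t) ∘ c(t-1) + i(t) ∘ tanh( Wc * x(t) + Uc * h(t-1) + bc )
(5) h(t) = o(t) ∘ tanh( c(t) )
```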

These equations are simpler than they appear. For example, equation (1) is implemented by the demo code as:

```
float[][] ft = MatSig(MatSum(MatProd(Wf, xt),
  MatProd(Uf, h_prev), bf));
```

Here, MatSig is a program-defined function that applies logistic-sigmoid to each value in a matrix. MatSum adds three matrices. MatProd multiplies two matrices. Once you understand basic matrix and vector operations, implementing an LSTM cell is quite easy.

## Overall Demo Program Structure

The structure of the demo program, with a few minor edits to save space, is presented in **Figure 4**. The demo uses a static method approach rather than an OOP approach to keep the main ideas as clear as possible.

**Figure 4 Demo Program Structure**

```
using System;
namespace LSTM_IO
{
  class LSTM_IO_Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin LSTM IO demo");
      // Set up inputs
      // Set up weights and biases
      float[][] ht, ct;  // Outputs, new state
      float[][][] result;
      result = ComputeOutputs(xt, h_prev, c_prev,
        Wf, Wi, Wo, Wc, Uf, Ui, Uo, Uc, bf, bi, bo, bc);
      ht = result[0];  // Outputs
      ct = result[1];  // New state
      Console.WriteLine("Output is:");
      MatPrint(ht, 4, true);
      Console.WriteLine("New cell state is:");
      MatPrint(ct, 4, true);
      // Set up new inputs
      // Call ComputeOutputs again
      Console.WriteLine("End LSTM demo");
    }

    static float[][][] ComputeOutputs(float[][] xt,
      float[][] h_prev, float[][] c_prev,
      float[][] Wf, float[][] Wi, float[][] Wo, float[][] Wc,
      float[][] Uf, float[][] Ui, float[][] Uo, float[][] Uc,
      float[][] bf, float[][] bi, float[][] bo, float[][] bc)
    { . . }

    // Helper matrix functions defined here
  }
}
```

To create the demo, I launched Visual Studio and created a new C# console application named LSTM_IO. I used Visual Studio 2015, but the demo has no significant .NET Framework dependencies, so any version of Visual Studio will work fine.

After the template code loaded, in the Solution Explorer window I renamed file Program.cs to LSTM_IO_Program.cs and allowed Visual Studio to automatically rename class Program for me. At the top of the editor window, I deleted all unneeded references to namespaces, leaving just the one to the top-level System namespace. All the work is performed by function ComputeOutputs.

## Matrices Using C#

To implement an LSTM cell, you must have a solid grasp of working with C# matrices. In C#, a matrix is usually implemented as an array-of-arrays. In machine learning, it’s common to use 32-bit type float rather than 64-bit type double.

The demo defines a helper to create a matrix as:

```
static float[][] MatCreate(int rows, int cols)
{
  float[][] result = new float[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new float[cols];
  return result;
}
```

The first statement creates an array with the specified number of rows, where each row is an array of type float. The for-loop statement allocates each row as an array with the specified number of columns. Note that, unlike many programming languages, C# supports a true two-dimensional array type (float[,]), but using an array-of-arrays approach is much more common.

In machine learning, the term column vector, or just vector for short, refers to a matrix with one column. Most machine learning code works with vectors rather than one-dimensional arrays. The demo defines a function to generate a matrix/vector from a one-dimensional array:

```
static float[][] MatFromArray(float[] arr, int rows, int cols)
{
  float[][] result = MatCreate(rows, cols);
  int k = 0;
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      result[i][j] = arr[k++];
  return result;
}
```

The function can be called to create a 3x1 (3 rows, 1 column) vector like this:

```
float[][] v = MatFromArray(new float[] {1.0f, 9.0f, 5.0f}, 3,1);
```

The demo defines a MatCopy function that can duplicate a matrix. MatCopy can be called:

```
float[][] B = MatCopy(A);
```

Here, B is a new, independent matrix with the same values as A. Be careful of code like:

```
float[][] B = A;
```

This creates B as a reference to A, so any change made to either matrix affects the other. This may be the behavior you want, but probably not.
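The implementation of MatCopy isn't shown in the article, but a minimal version consistent with the other helper functions might look like this:

```
static float[][] MatCopy(float[][] m)
{
  int rows = m.Length; int cols = m[0].Length;
  float[][] result = new float[rows][];
  for (int i = 0; i < rows; ++i)
  {
    result[i] = new float[cols];  // Allocate a new row
    for (int j = 0; j < cols; ++j)
      result[i][j] = m[i][j];     // Copy each value
  }
  return result;
}
```

Because each row array is newly allocated, modifying the copy doesn't affect the original matrix.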

## Matrix Element-Wise Operations

An LSTM cell implementation uses several element-wise functions on matrices, where each value in a matrix is used or modified. For example, function MatTanh is defined:

```
static float[][] MatTanh(float[][] m)
{
  int rows = m.Length; int cols = m[0].Length;
  float[][] result = MatCreate(rows, cols);
  for (int i = 0; i < rows; ++i)    // Each row
    for (int j = 0; j < cols; ++j)  // Each col
      result[i][j] = Tanh(m[i][j]);
  return result;
}
```

The function traverses its input matrix m and applies the hyperbolic tangent (tanh) to each value. Helper function Tanh is defined:

```
static float Tanh(float x)
{
  if (x < -10.0) return -1.0f;
  else if (x > 10.0) return 1.0f;
  return (float)(Math.Tanh(x));
}
```

The demo also defines a MatSig function that’s exactly like MatTanh except logistic-sigmoid is applied to each value. The logistic-sigmoid function is closely related to tanh but returns a value between 0.0 and 1.0 instead of between -1.0 and +1.0.
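The MatSig implementation isn't shown in the article, but a sketch that mirrors MatTanh, with a hypothetical Sigmoid helper clamped the same way the demo's Tanh is, might look like this:

```
static float[][] MatSig(float[][] m)
{
  int rows = m.Length; int cols = m[0].Length;
  float[][] result = new float[rows][];
  for (int i = 0; i < rows; ++i)
  {
    result[i] = new float[cols];
    for (int j = 0; j < cols; ++j)
      result[i][j] = Sigmoid(m[i][j]);  // Apply logistic-sigmoid
  }
  return result;
}

static float Sigmoid(float x)
{
  if (x < -10.0f) return 0.0f;  // Clamp to avoid Exp overflow
  else if (x > 10.0f) return 1.0f;
  return (float)(1.0 / (1.0 + Math.Exp(-x)));
}
```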

The demo defines a function MatSum that adds the values in two matrices of the same shape. If you look at math equation (1) in **Figure 3**, you’ll see that an LSTM adds three matrices. The demo overloads MatSum to work with two or three matrices.
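MatSum isn't shown either; a two-matrix version consistent with the other helpers, plus a three-matrix overload that simply chains it, might look like this:

```
static float[][] MatSum(float[][] a, float[][] b)
{
  int rows = a.Length; int cols = a[0].Length;
  float[][] result = new float[rows][];
  for (int i = 0; i < rows; ++i)
  {
    result[i] = new float[cols];
    for (int j = 0; j < cols; ++j)
      result[i][j] = a[i][j] + b[i][j];  // Element-wise addition
  }
  return result;
}

static float[][] MatSum(float[][] a, float[][] b, float[][] c)
{
  return MatSum(MatSum(a, b), c);  // Overload for three matrices
}
```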

Function MatHada multiplies corresponding values in two matrices that have the same shape:

```
static float[][] MatHada(float[][] a, float[][] b)
{
  int rows = a.Length; int cols = a[0].Length;
  float[][] result = MatCreate(rows, cols);
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      result[i][j] = a[i][j] * b[i][j];
  return result;
}
```

Element-wise multiplication is sometimes called the Hadamard function. In the math equations (4) and (5) in **Figure 3**, the Hadamard function is indicated by the open dot symbol. Don’t confuse the element-wise Hadamard function with matrix multiplication, which is a very different function.

## Matrix Multiplication

If you haven’t seen matrix multiplication before, the operation is not at all obvious. Suppose A is a 3x2 matrix:

```
1.0, 2.0
3.0, 4.0
5.0, 6.0
```

And suppose B is a 2x4 matrix:

```
10.0, 11.0, 12.0, 13.0
14.0, 15.0, 16.0, 17.0
```

The result of C = AB (multiplying A and B) is a 3x4 matrix:

```
38.0 41.0 44.0 47.0
86.0 93.0 100.0 107.0
134.0 145.0 156.0 167.0
```

The demo implements matrix multiplication as function MatProd. Note that when using C#, for very large matrices you can use the Parallel.For statement in the Task Parallel Library.
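The MatProd implementation isn't shown in the article; a standard triple-loop version consistent with the other helpers might look like this:

```
static float[][] MatProd(float[][] a, float[][] b)
{
  int aRows = a.Length; int aCols = a[0].Length;
  int bRows = b.Length; int bCols = b[0].Length;
  if (aCols != bRows)
    throw new Exception("Non-conformable matrices");
  float[][] result = new float[aRows][];
  for (int i = 0; i < aRows; ++i)
  {
    result[i] = new float[bCols];  // C# initializes cells to 0.0f
    for (int j = 0; j < bCols; ++j)
      for (int k = 0; k < aCols; ++k)
        result[i][j] += a[i][k] * b[k][j];
  }
  return result;
}
```

Applied to the 3x2 matrix A and the 2x4 matrix B above, this produces the 3x4 result shown.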

To summarize, implementing an LSTM cell using C# requires several helper functions that create and operate on matrices (arrays-of-arrays) and vectors (matrices with one column). The demo code defines functions MatCreate, MatFromArray, MatCopy(m), MatSig(m), MatTanh(m), MatHada(a, b), MatSum(a, b), MatSum(a, b, c) and MatProd(a, b). Although not essential for creating an LSTM cell, it’s useful to have a function to display a C# matrix. The demo defines a MatPrint function.

## Implementing and Calling LSTM Input-Output

The code for function ComputeOutputs is presented in **Figure 5**. The function has 15 parameters, but 12 of them are essentially the same.

**Figure 5 Function ComputeOutputs**

```
static float[][][] ComputeOutputs(float[][] xt,
  float[][] h_prev, float[][] c_prev,
  float[][] Wf, float[][] Wi, float[][] Wo, float[][] Wc,
  float[][] Uf, float[][] Ui, float[][] Uo, float[][] Uc,
  float[][] bf, float[][] bi, float[][] bo, float[][] bc)
{
  float[][] ft = MatSig(MatSum(MatProd(Wf, xt),
    MatProd(Uf, h_prev), bf));
  float[][] it = MatSig(MatSum(MatProd(Wi, xt),
    MatProd(Ui, h_prev), bi));
  float[][] ot = MatSig(MatSum(MatProd(Wo, xt),
    MatProd(Uo, h_prev), bo));
  float[][] ct = MatSum(MatHada(ft, c_prev),
    MatHada(it, MatTanh(MatSum(MatProd(Wc, xt),
    MatProd(Uc, h_prev), bc))));
  float[][] ht = MatHada(ot, MatTanh(ct));
  float[][][] result = new float[2][][];
  result[0] = MatCopy(ht);
  result[1] = MatCopy(ct);
  return result;
}
```

Vector xt is the input, such as (1.0, 2.0). Vectors h_prev and c_prev are the previous output vector and previous cell state. The four W matrices are the gate weights associated with input values, where f is a forget gate, i is an input gate and o is an output gate. The four U matrices are the weights associated with cell output. The four b vectors are biases.

In the body of the function, the first five statements are a one-to-one mapping to the five math equations in **Figure 3**. Notice that the ft, it and ot gates all use the MatSig function. Therefore, all three are vectors with values between 0.0 and 1.0. You can think of these as filters that are applied to input, output or state, where the value in the gate is the percentage retained. For example, if one of the values in ft is 0.75, then 75 percent of the corresponding value in the combined input and previous-output vector is retained. Or, equivalently, 25 percent of the information is forgotten.

The computation of the new cell state, ct, is simple to implement but conceptually quite deep. At a high level, the new cell state depends on a gated combination of the input vector xt, and the previous output vector and cell state, h_prev and c_prev. The new output, ht, depends on the new cell state and the output gate. Quite remarkable.

The function returns the new output and new cell state in an array. This leads to a return type of float[][][], where result[0] is an array-of-arrays matrix holding the output and result[1] holds the new cell state.

Calling ComputeOutputs is mostly a matter of setting up the parameter values. The demo begins the preparation with:

```
float[][] xt = MatFromArray(new float[] {
  1.0f, 2.0f }, 2, 1);
float[][] h_prev = MatFromArray(new float[] {
  0.0f, 0.0f, 0.0f }, 3, 1);
float[][] c_prev = MatFromArray(new float[] {
  0.0f, 0.0f, 0.0f }, 3, 1);
```

Both the previous output and the cell state are explicitly initialized to zero. Next, two sets of arbitrary weight values are created:

```
float[][] W = MatFromArray(new float[] {
  0.01f, 0.02f,
  0.03f, 0.04f,
  0.05f, 0.06f }, 3, 2);
float[][] U = MatFromArray(new float[] {
  0.07f, 0.08f, 0.09f,
  0.10f, 0.11f, 0.12f,
  0.13f, 0.14f, 0.15f }, 3, 3);
```

Notice that the two matrices have different shapes. The weight values are copied to the input parameters:

```
float[][] Wf = MatCopy(W); float[][] Wi = MatCopy(W);
float[][] Wo = MatCopy(W); float[][] Wc = MatCopy(W);
float[][] Uf = MatCopy(U); float[][] Ui = MatCopy(U);
float[][] Uo = MatCopy(U); float[][] Uc = MatCopy(U);
```

Because the weights don’t change, the demo could have assigned by reference instead of using MatCopy. The biases are set up using the same pattern:

```
float[][] b = MatFromArray(new float[] {
  0.16f, 0.17f, 0.18f }, 3, 1);
float[][] bf = MatCopy(b); float[][] bi = MatCopy(b);
float[][] bo = MatCopy(b); float[][] bc = MatCopy(b);
```

Function ComputeOutputs is called like this:

```
float[][] ht, ct;
float[][][] result;
result = ComputeOutputs(xt, h_prev, c_prev,
  Wf, Wi, Wo, Wc, Uf, Ui, Uo, Uc,
  bf, bi, bo, bc);
ht = result[0]; // Output
ct = result[1]; // New cell state
```

The whole point of an LSTM cell is to feed a sequence of input vectors, so the demo sets up and sends a second input vector:

```
h_prev = MatCopy(ht);
c_prev = MatCopy(ct);
xt = MatFromArray(new float[] {
  3.0f, 4.0f }, 2, 1);
result = ComputeOutputs(xt, h_prev, c_prev,
  Wf, Wi, Wo, Wc, Uf, Ui, Uo, Uc,
  bf, bi, bo, bc);
ht = result[0];
ct = result[1];
```

Note that the demo explicitly sends the previous output and state vectors to ComputeOutputs. An alternative is to feed just the new input vector, because the previous output and cell state are still stored in ht and ct.

## Connecting the Dots

So, what’s the point? An LSTM cell can be used to construct an LSTM recurrent neural network—an LSTM cell with some additional plumbing. These networks have been responsible for major advances in prediction systems that work with sequence data. For example, suppose you were asked to predict the next word in the sentence, “In 2017, the championship was won by __.” With just that information, you’d be hard pressed to make a prediction. But suppose your system had state, and remembered that part of a previous sentence was, “The NBA has held a championship game since 1947.” You’d now be in a position to predict one of the 30 NBA teams.

There are dozens of variations of LSTM architectures. Additionally, because LSTM cells are complex, there are dozens of implementation variations for every architecture. But if you understand the basic LSTM cell mechanism, you can easily understand the variations.

The demo program sets the LSTM weights and biases to arbitrary values. The weights and biases for a real-world LSTM network would be determined by training the network. You’d obtain a set of training data with known input values and known, correct output values. Then you’d use an algorithm such as back-propagation to find values for the weights and biases that minimize error between computed outputs and correct outputs.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Ricky Loynd and Adith Swaminathan