Times and TransposeTimes

CNTK matrix product.

``````
A * B
Times (A, B, outputRank=1)
TransposeTimes (A, B, outputRank=1)
``````

Parameters

• `A`: first argument of the matrix product. Can be a time sequence.
• `B`: second argument of the matrix product. Can be a time sequence.
• `outputRank` (default: 1): number of axes of `A` that constitute the output dimension. See 'Extended interpretation of matrix product for tensors of rank > 2' below.

Return Value

Resulting matrix product (tensor). This is a time sequence if either input is a time sequence.

Description

The `Times()` function implements the matrix product, with extensions for tensors. The `*` operator is shorthand for it. `TransposeTimes()` transposes the first argument.

If `A` and `B` are matrices (rank-2 tensors) or column vectors (rank-1 tensors), `A * B` computes the ordinary matrix product, just as one would expect.

`TransposeTimes (A, B)` computes the matrix product `A^T * B`, where `^T` denotes transposition. `TransposeTimes (A, B)` has the same result as `Transpose (A) * B`, but it is more efficient as it avoids a temporary copy of the transposed version of `A`.
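
The equivalence between `TransposeTimes (A, B)` and `Transpose (A) * B` can be sketched in plain Python. This is only an illustration of the arithmetic, not CNTK code: the helper names `transpose`, `times` and `transpose_times` are hypothetical, and matrices are modelled as lists of rows.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def times(a, b):
    """Ordinary matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose_times(a, b):
    """Compute a^T * b by swapping the index order on a.

    No transposed copy of a is built, which mirrors the reason the text
    gives for preferring TransposeTimes().
    """
    return [[sum(a[k][i] * b[k][j] for k in range(len(a)))
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]


A = [[1, 2], [3, 4], [5, 6]]        # 3 x 2
B = [[1, 0], [0, 1], [2, 2]]        # 3 x 2

direct = transpose_times(A, B)      # 2 x 2, no temporary
via_copy = times(transpose(A), B)   # 2 x 2, builds A^T first
```

Both `direct` and `via_copy` hold the same 2 x 2 result; only the second one creates an intermediate transposed matrix.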

Time sequences

Both `A` and `B` can be either single matrices or time sequences. A common case for recurrent networks is that `A` is a weight matrix, while `B` is a sequence of inputs.

Note: If `A` is a time sequence, the operation is not efficient, as it will launch a separate GEMM invocation for every time step. The exception is `TransposeTimes()` where both inputs are column vectors, for which a special optimization exists.
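
To illustrate the common case of a fixed weight matrix applied to a sequence, here is a minimal Python sketch. It is not CNTK code: `mat_vec` and `mat_mat` are hypothetical helpers, and a sequence is modelled as a list of vectors. It shows that, when `A` is the same at every step, the per-step products equal the columns of a single product with all steps stacked side by side. That stacking is not possible when `A` itself changes from step to step.

```python
def mat_vec(w, x):
    """Matrix-vector product w * x (w is a list of rows)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def mat_mat(a, b):
    """Ordinary matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


W = [[1, 0, 2],
     [0, 1, -1]]                          # 2 x 3 weight matrix
steps = [[1, 2, 3], [0, 0, 1], [4, 5, 6]]  # sequence of three input vectors

# One product per time step.
per_step = [mat_vec(W, x) for x in steps]

# Same computation as one product: stack the steps as columns of X.
X = [list(row) for row in zip(*steps)]     # 3 x T
batched = mat_mat(W, X)                    # 2 x T
batched_columns = [list(col) for col in zip(*batched)]
```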

Sparse support

`Times()` and `TransposeTimes()` support sparse matrices. The result is a dense matrix unless both inputs are sparse. The two most important use cases are:

• `B` being a one-hot representation of an input word (or, more commonly, an entire sequence of one-hot vectors). Then, `A * B` denotes a word embedding, where the columns of `A` are the embedding vectors of the words. The following is the recommended way of realizing embeddings in CNTK:

``````
Embedding (x, dim) = Parameter (dim, 0/*inferred*/) * x
e = Embedding (input, 300)
``````
• `A` being a one-hot representation of a label word. The popular cross-entropy criterion and the error counter can be written using `TransposeTimes()` as follows, respectively, where `z` is the input to the top-level `Softmax()` classifier, and `L` is the label sequence, which may be sparse:

``````
CrossEntropyWithSoftmax (L, z) = ReduceLogSum (z) - TransposeTimes (L,          z)
ErrorPrediction         (L, z) = BS.Constants.One - TransposeTimes (L, Hardmax (z))
``````
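
The arithmetic behind both use cases can be checked with a small Python sketch. This is not CNTK code: the helpers `times`, `transpose_times`, `hardmax` and `reduce_log_sum` are hypothetical stand-ins written for this example, using dense lists where CNTK would use sparse one-hot data. The sketch assumes a single sample rather than a sequence.

```python
import math


def times(a, x):
    """Matrix-vector product a * x (a is a list of rows)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def transpose_times(u, v):
    """u^T * v for two column vectors, i.e. their inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def hardmax(z):
    """1 at the position of the first maximum, 0 elsewhere."""
    best = z.index(max(z))
    return [1.0 if i == best else 0.0 for i in range(len(z))]


def reduce_log_sum(z):
    """log(sum(exp(z))), shifted by the maximum for numerical stability."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


# Embedding: each column of E is one word vector (dimension 2, 3 words).
E = [[0.1, 0.2, 0.3],
     [1.0, 2.0, 3.0]]
one_hot_word = [0, 1, 0]
embedding = times(E, one_hot_word)      # selects column 1 of E

# Criteria: the one-hot label picks out a single entry of z.
label = [0, 0, 1]
z = [1.0, 2.0, 3.0]
cross_entropy = reduce_log_sum(z) - transpose_times(label, z)
error = 1.0 - transpose_times(label, hardmax(z))
```

Here the product with a one-hot vector reduces to a column lookup, and `transpose_times(label, z)` reduces to reading `z` at the label position, which is why sparse inputs make these operations cheap.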

Multiplying with a scalar

The matrix product cannot be used to multiply a matrix with a scalar; attempting it produces an error about mismatching dimensions. To multiply with a scalar, use the element-wise product `.*` instead. For example, the weighted average of two matrices could be written like this:

``````
z = Constant (alpha) .* x + Constant (1-alpha) .* y
``````
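
As a quick numeric check of the weighted average, the following Python sketch mimics the element-wise product with a scalar. It is an illustration only: `scale` and `add` are hypothetical helpers, and the values of `alpha`, `x` and `y` are made up for the example.

```python
def scale(alpha, m):
    """Element-wise product of a scalar with every entry of a matrix."""
    return [[alpha * v for v in row] for row in m]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[p + q for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


alpha = 0.25
x = [[4.0, 8.0], [0.0, 4.0]]
y = [[0.0, 4.0], [4.0, 0.0]]

# Corresponds to: Constant (alpha) .* x + Constant (1-alpha) .* y
z = add(scale(alpha, x), scale(1 - alpha, y))
```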

Multiplying with a diagonal matrix

If your input matrix is diagonal and stored as a vector, do not use `Times()`; use an element-wise multiplication instead (`ElementTimes()` or the `.*` operator). For example:

``````
dMat = ParameterTensor {(100:1)}
z = dMat .* v
``````

This leverages broadcasting semantics to multiply every element of `v` with the respective row of `dMat`.
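
The following Python sketch shows why the element-wise form gives the same result as a product with the full diagonal matrix. It is not CNTK code: `diag_times` and `broadcast_times` are hypothetical helpers, and the diagonal is kept as a plain list standing in for the `[n x 1]` column.

```python
def diag_times(d, v):
    """diag(d) * v, computed by building the dense n x n diagonal matrix."""
    n = len(d)
    dense = [[d[i] if i == j else 0 for j in range(n)] for i in range(n)]
    return [[sum(dense[i][k] * v[k][c] for k in range(n))
             for c in range(len(v[0]))]
            for i in range(n)]


def broadcast_times(d, v):
    """Element-wise product of an [n x 1] column with an [n x m] matrix.

    Entry d[i] is broadcast across all columns of row i of v.
    """
    return [[d[i] * x for x in v[i]] for i in range(len(d))]


d = [2, 3, 4]                       # diagonal stored as a vector
v = [[1, 10], [1, 10], [1, 10]]     # 3 x 2 input

full = diag_times(d, v)             # n*n*m multiplications
cheap = broadcast_times(d, v)       # n*m multiplications
```

The two results are identical, but the broadcast version never touches the zero entries off the diagonal.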

Extended interpretation of matrix product for tensors of rank > 2

If `A` and/or `B` are tensors of higher rank, the `*` operation denotes a generalized matrix product where all but the first dimension of `A` must match the leading dimensions of `B`, and are interpreted by flattening. For example, a product of a `[I x J x K]` and a `[J x K x L]` tensor (which we will abbreviate henceforth as `[I x J x K] * [J x K x L]`) gets reinterpreted by reshaping the two tensors as matrices as `[I x (J * K)] * [(J * K) x L]`, for which the matrix product is defined and yields a result of dimension `[I x L]`. This makes sense if one considers the rows of a weight matrix to be patterns that activation vectors are matched against. The above generalization allows these patterns themselves to be multi-dimensional, such as images or running windows of speech features.
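
The flattening interpretation can be verified with a small Python sketch. This is an illustration under assumptions, not CNTK code: tensors are nested lists indexed as `A[i][j][k]` and `B[j][k][l]`, the sizes are arbitrary, and CNTK's actual memory layout is not modelled. The only requirement shown is that both tensors merge `(j, k)` in the same order.

```python
I, J, K, L = 2, 2, 3, 2

# Small integer tensors for illustration: A is [I x J x K], B is [J x K x L].
A = [[[i + j + k for k in range(K)] for j in range(J)] for i in range(I)]
B = [[[j * k + l for l in range(L)] for k in range(K)] for j in range(J)]

# Tensor view: sum over both matched axes j and k.
tensor_result = [[sum(A[i][j][k] * B[j][k][l]
                      for j in range(J) for k in range(K))
                  for l in range(L)]
                 for i in range(I)]

# Matrix view: merge (j, k) into one index of size J * K.
pairs = [(j, k) for j in range(J) for k in range(K)]
A_flat = [[A[i][j][k] for (j, k) in pairs] for i in range(I)]   # I x (J*K)
B_flat = [[B[j][k][l] for l in range(L)] for (j, k) in pairs]   # (J*K) x L
matrix_result = [[sum(A_flat[i][m] * B_flat[m][l] for m in range(J * K))
                  for l in range(L)]
                 for i in range(I)]
```

Both views produce the same `[I x L]` result, which is the sense in which the tensor product is "interpreted by flattening".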

It is also possible to have more than one non-matched dimension in `B`. For example, `[I x J] * [J x K x L]` is interpreted as the matrix product `[I x J] * [J x (K * L)]`, which yields a result of dimensions `[I x K x L]`. This makes it possible, for instance, to apply a matrix to all vectors inside a rolling window of `L` speech features of dimension `J`.
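
A minimal Python sketch of this case follows. It is not CNTK code: tensors are nested lists indexed as `B[j][k][l]`, the sizes and values are made up, and `mat_vec` is a hypothetical helper. It shows that the `[I x K x L]` result is the same as applying `A` separately to each length-`J` vector found at a fixed `(k, l)` position of `B`.

```python
I, J, K, L = 2, 3, 2, 2

A = [[1, 0, 1],
     [0, 2, 0]]                                   # I x J
B = [[[j + 10 * k + 100 * l for l in range(L)]
      for k in range(K)]
     for j in range(J)]                           # J x K x L


def mat_vec(a, x):
    """Matrix-vector product a * x (a is a list of rows)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


# Generalized product: result[i][k][l] = sum_j A[i][j] * B[j][k][l].
result = [[[sum(A[i][j] * B[j][k][l] for j in range(J))
            for l in range(L)]
           for k in range(K)]
          for i in range(I)]                      # I x K x L

# Same thing, one vector at a time: A applied to B[:, k, l].
per_vector = {(k, l): mat_vec(A, [B[j][k][l] for j in range(J)])
              for k in range(K) for l in range(L)}
```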

If the result of the product should have multiple dimensions (such as arranging a layer's activations as a 2D field), then instead of using the `*` operator, one must say `Times (A, B, outputRank=m)` where `m` is the number of dimensions in which the 'patterns' are arranged, and which are kept in the output. For example, `Times (tensor of dim [I x J x K], tensor of dim [K x L], outputRank=2)` will be interpreted as the matrix product `[(I * J) x K] * [K x L]` and yield a result of dimensions `[I x J x L]`.
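
The dimension rules of this section can be summarized as a small shape calculation. The Python function below is a sketch of those rules as stated in the text, not CNTK's implementation: `times_shape` is a hypothetical name, shapes are plain tuples, and only the cases discussed above are covered.

```python
def times_shape(a_shape, b_shape, output_rank=1):
    """Result shape of the generalized product described above.

    The first `output_rank` axes of A are kept in the output. The remaining
    axes of A must equal the leading axes of B and are summed over. Any
    further axes of B are kept.
    """
    kept = tuple(a_shape[:output_rank])
    reduced = tuple(a_shape[output_rank:])
    if tuple(b_shape[:len(reduced)]) != reduced:
        raise ValueError(
            "dimension mismatch: %r of A vs. leading axes of B %r"
            % (reduced, tuple(b_shape)))
    return kept + tuple(b_shape[len(reduced):])


# The three examples from the text, with I=2, J=3, K=4, L=5.
shape_1 = times_shape((2, 3, 4), (3, 4, 5))                 # [I x L]
shape_2 = times_shape((2, 3), (3, 4, 5))                    # [I x K x L]
shape_3 = times_shape((2, 3, 4), (4, 5), output_rank=2)     # [I x J x L]
```

With `output_rank=2`, the first two axes of `A` survive into the output and only the last axis is matched against `B`, which is the difference from the default `*` behavior.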