Matrix multiplication is common, and the algorithm is easy to implement. Here is one example:
Version 1:
template<typename T>
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}
This implementation is straightforward; you can find it in textbooks and many online samples.
Version 2:
template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}
This version uses a temporary variable to hold the intermediate result, saving many unnecessary memory writes. Note that the optimizer cannot do this for us, because it doesn't know whether "result" is an alias of "m1" or "m2".
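If the caller can guarantee that the output never aliases the inputs, that promise can be passed to the optimizer with a restrict qualifier. The sketch below (not from the original post) uses `__restrict`, a non-standard extension supported by MSVC, GCC, and Clang; with it, the compiler may keep `result[i][j]` in a register across the inner loop even without the explicit temporary.

```cpp
// Sketch: Version 1 with a no-aliasing promise (assumption: the caller
// guarantees result does not alias m1 or m2). __restrict is a compiler
// extension, not standard C++.
template<typename T>
void SeqMatrixMult1R(int size, T** __restrict m1, T** __restrict m2,
                     T** __restrict result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                // with the aliasing promise, the compiler is free to
                // accumulate this in a register and store it once
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}
```

Passing aliased pointers to such a function is undefined behavior, so this trades safety for speed; the temporary in Version 2 achieves the same effect portably.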
Version 3:
template<typename T>
void Transpose(int size, T** m)
{
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}

template<typename T>
void SeqMatrixMult3(int size, T** m1, T** m2, T** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[j][k];
            }
            result[i][j] = c;
        }
    }
    Transpose(size, m2);
}
This optimization is trickier. If you profile Version 2, you'll find a lot of data cache misses: the inner loop reads m2 column by column, jumping a whole row's worth of memory on every step. We transpose the matrix so that both m1[i] and m2[j] are accessed sequentially, which greatly improves memory read performance.
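A related cache-friendly alternative (a sketch, not from the original post): reordering the loops to i-k-j also makes every access sequential, without transposing or otherwise mutating m2.

```cpp
// Sketch: i-k-j loop order. result[i] and m2[k] are both walked
// sequentially, so no transposition of m2 is needed.
template<typename T>
void SeqMatrixMultIKJ(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++)
            result[i][j] = 0;
        for (int k = 0; k < size; k++) {
            T a = m1[i][k];                    // invariant in the inner loop
            for (int j = 0; j < size; j++)
                result[i][j] += a * m2[k][j];  // sequential reads and writes
        }
    }
}
```

The trade-off versus Version 3 is that each result row is written size times instead of once, so which variant wins depends on cache size and data type.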
Version 4:
#include <pmmintrin.h> // SSE3 intrinsics: _mm_hadd_pd, _mm_hadd_ps

template<typename T>
void SeqMatrixMult4(int size, T** m1, T** m2, T** result);

// assume size % 2 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddpd)
template<>
void SeqMatrixMult4(int size, double** m1, double** m2, double** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128d c = _mm_setzero_pd();
            for (int k = 0; k < size; k += 2) {
                c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]), _mm_load_pd(&m2[j][k])));
            }
            c = _mm_hadd_pd(c, c);
            _mm_store_sd(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

// assume size % 4 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddps)
template<>
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < size; k += 4) {
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c);
            c = _mm_hadd_ps(c, c);
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}
For float and double types, we can use the SIMD instruction set to process several elements in parallel.
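A scalar model (illustration only, not from the post) of what the SSE inner loop in SeqMatrixMult4 computes: four partial sums accumulate in independent "lanes", and the trailing _mm_hadd_ps calls merely fold those lanes into a single scalar at the end.

```cpp
// Scalar analogue of the SSE float dot product in SeqMatrixMult4.
// Assumption (mirroring the float specialization): size % 4 == 0.
template<typename T>
T DotProductLanes(int size, const T* a, const T* b)
{
    T c0 = 0, c1 = 0, c2 = 0, c3 = 0;   // four accumulator "lanes"
    for (int k = 0; k < size; k += 4) {
        c0 += a[k]     * b[k];
        c1 += a[k + 1] * b[k + 1];
        c2 += a[k + 2] * b[k + 2];
        c3 += a[k + 3] * b[k + 3];
    }
    // the two _mm_hadd_ps steps correspond to this reduction:
    // first (c0 + c1) and (c2 + c3), then their sum
    return (c0 + c1) + (c2 + c3);
}
```

For doubles there are only two lanes, which is why the double specialization needs a single _mm_hadd_pd.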
Parallel version using the PPL (Parallel Patterns Library) and lambdas in the VC2010 CTP:

#include <ppl.h> // Concurrency::parallel_for

template<typename T>
void ParMatrixMult1(int size, T** m1, T** m2, T** result)
{
    using namespace Concurrency;
    for (int i = 0; i < size; i++) {
        parallel_for(0, size, 1, [&](int j) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        });
    }
}
Results
Here are the test results (what really matters is the relative time between the different versions):
Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)
 | int | long long | float | double |
Version 1 | 0.931119s | 2.945134s | 0.774894s | 0.984585s |
Version 2 | 0.571003s | 2.310568s | 0.724161s | 0.929064s |
Version 3 | 0.239538s | 0.823095s | 0.570772s | 0.241691s |
Version 4 | N/A | N/A | 0.063196s | 0.187614s |
Version 1 + PPL | 0.847534s | 1.683765s | 0.589513s | 0.994161s |
Version 2 + PPL | 0.380174s | 1.190713s | 0.409321s | 0.594859s |
Version 3 + PPL | 0.135760s | 0.495152s | 0.370499s | 0.185800s |
Version 4 + PPL | N/A | N/A | 0.041959s | 0.157932s |
Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)
 | int | long long | float | double |
Version 1 | 0.514330s | 1.434509s | 0.455168s | 0.608127s |
Version 2 | 0.314554s | 1.231696s | 0.447607s | 0.593517s |
Version 3 | 0.180176s | 0.591002s | 0.432129s | 0.149511s |
Version 4 | N/A | N/A | 0.042900s | 0.083286s |
Version 1 + PPL | 0.308766s | 0.482934s | 0.175585s | 0.309159s |
Version 2 + PPL | 0.105717s | 0.325413s | 0.124862s | 0.164156s |
Version 3 + PPL | 0.073418s | 0.193824s | 0.116971s | 0.061268s |
Version 4 + PPL | N/A | N/A | 0.017891s | 0.031734s |
From the results, you can see that:
- Parallelism only helps if you carefully tune your code to maximize its effect (Version 1 gains relatively little from PPL alone)
- Eliminating unnecessary memory writes (Version 2) helps the parallelism
- Data cache misses can be a big issue when there are many memory accesses (Version 3)
- Using SIMD instead of the FPU on aligned data is beneficial (Version 4)
- Different data types, data sizes, and host architectures may have different bottlenecks
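The post does not show how the matrices were allocated or how the timings were taken. The sketch below is one portable way to set up the T**-shaped inputs the functions above expect and to time a run; std::chrono is an assumption here, used as a stand-in for whatever timer the original measurements used.

```cpp
#include <chrono>
#include <vector>

// Build a size x size matrix in the T** shape used throughout the post,
// backed by one contiguous buffer and filled with 1s (arbitrary test data).
template<typename T>
std::vector<T*> MakeMatrix(int size, std::vector<T>& storage)
{
    storage.assign(static_cast<size_t>(size) * size, T(1));
    std::vector<T*> rows(size);
    for (int i = 0; i < size; i++)
        rows[i] = &storage[static_cast<size_t>(i) * size];
    return rows;
}

// Run a callable once and return the elapsed wall-clock time in seconds.
template<typename F>
double TimeIt(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Usage would look like: `double secs = TimeIt([&] { SeqMatrixMult2(size, m1.data(), m2.data(), r.data()); });` Note that the SSE versions additionally require 16-byte-aligned rows, which this simple helper does not guarantee.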
Comments
Anonymous
December 19, 2012
For the matrix multiplications AA^T + BB^T and BA^T - AB^T, is there any optimization method? Here, A and B are m*n matrices and ^T means matrix transpose. Thanks.
Anonymous
December 28, 2012
The result of A * Tran(A) is symmetric, so you save about 50% of the computation. Similarly, B * Tran(A) - A * Tran(B) = B * Tran(A) - Tran(B * Tran(A)), so only one matrix multiplication is needed.
Anonymous
August 07, 2014
Nice article, it helped me a lot, thank you.
Anonymous
February 23, 2015
I did it in a different way, using OO, pointers, and increment operations, as follows:

// Optimized matrix multiplication R = A * B
const Matrix Matrix::operator*(const Matrix &other) const
{
    assert(this->colNum == other.rowNum);
    // initialize an empty matrix to return
    Matrix result(this->rowNum, other.colNum, 0.0);
    // initialize our flags
    int positionR = 0, positionX = -1, positionA = -1, positionB = -result.colNum;
    int rowsCount = 0, colsCount = 0;
    // result elements times iterations needed
    int limit = result.rowNum * result.colNum * this->colNum;
    for (int step = 0; step < limit; ++step) {
        // iteration counter
        positionX++;
        // if we reached a complete iteration
        if (positionX == this->colNum) {
            positionX = 0;
            positionR++;
            colsCount++;
            // avoid using the modulo operator
            if (colsCount == result.colNum) {
                colsCount = 0;
                rowsCount += result.colNum;
            }
            // redefine start points
            positionB = colsCount;
            positionA = rowsCount;
        } else {
            // move to the next positions
            positionA++;
            positionB += result.colNum;
        }
        // access the arrays in the order closest to how the machine runs
        result.matrixArray[positionR] += this->matrixArray[positionA] * other.matrixArray[positionB];
    }
    /*
    // TODO: optimize matrix multiplication
    for (int positionY = 0; positionY < result.rowNum; ++positionY)
        for (int positionX = 0; positionX < this->colNum; ++positionX)
            for (int positionZ = 0; positionZ < other.rowNum; ++positionZ)
                result.element(positionY, positionX) += this->element(positionY, positionZ) * other.element(positionZ, positionX);
    */
    return result;
}

Maybe it could be better, without those negative initializations. But it's enough for me. =D
Anonymous
February 25, 2015
Change the following line: int colSize = this->colNum; to: int colSize = other.colNum; I achieved a multiplication of MatrixA[1][3] x MatrixA[3][2], 1,000,000 times in 242 milliseconds, using the E7500 processor.
Anonymous
September 20, 2016
I followed most of the vectorization code, but can you please explain the use of these: c = _mm_hadd_pd(c, c); or c = _mm_hadd_ps(c, c); Thank you.
Anonymous
April 12, 2017
This article helped me a lot, thank you.