다음을 통해 공유


C# Parallel Processing Case Study - Linear Systems Library

In this brief article we add parallel processing to a C# linear systems library using the parallel loops found in C# .NET Framework 4.5.2. The linear systems library has three public classes and eleven public methods. The function of library is to solve potentially large systems of linear equations. The structure of the library is as shown in the following multilevel list where public classes are numbered and publicly accessible methods are lettered:

  1. ClassicalIterativeMethods
    1. Jacobi
    2. GaussSeidel
    3. SOR
    4. GradientMethod
  2. DirectMethods
    1. GaussianElimination
    2. LU Decomposition
    3. CholeskyDecomposition
    4. SimpleGaussianElimination
  3. ModernIterativeMethods
    1. ConjugateGradient
    2. ModifiedRichardson

So we have ten different ways of potentially solving a system of linear equations. The library testing application uses one of five different types of highly scalable matrices:

  1. Pascal
  2. Other
  3. Cauchy
  4. Lehmer
  5. Tri-Diag 1

The matrix creation C# snippet is displayed next:

 

if (matrix == 1)
{
    // Pascal matrix
 
    for (int  i = 0; i < n; i++)
        b[i] = Math.Pow(2, i + 1);
 
    for (int  i = 0; i < n; i++)
        for (int  j = 0; j < n; j++)
            a[i, j] = Binomial(i + j, j);
}
 
else if (matrix == 2)
{
    // other
 
    b[0] = -0.5;
 
    for (int  i = 1; i < n - 1; i++)
        b[i] = -1.5;
 
    b[n - 1] = +0.5;
 
    for (int  i = 0; i < n; i++)
    {
        a[i, i] = -2.0;
 
        if (i < n - 1)
            a[i, i + 1] = 1.0;
 
        if (i > 0)
            a[i, i - 1] = 1.0;
    }
}
 
else if (matrix == 3)
{
    // Cauchy matrix
 
    for (int  i = 0; i < n; i++)
    {
        double xi = 2 * i;
 
        for (int  j = 0; j < n; j++)
        {
            double yj = 2 * j + 1;
 
            a[i, j] = 1.0 / (xi + yj);
        }
 
        b[i] = i + 1;
    }
}
 
else if (matrix == 4)
{
    // Lehmer matrix
 
    for (int  i = 0; i < n; i++)
    {
        for (int  j = 0; j < n; j++)
            a[i, j] = Math.Min(i + 1, j + 1) + Math.Max(i + 1, j + 1);
 
        a[i, i] = i + 1;
        b[i] = i;
    }
}
 
else if (matrix == 5)
{
    // tri-diagnonal 1
 
    a[0, 0] = +2.0;
    a[0, 1] = -1.0;
    b[0] = 1;
 
    for (int  i = 1; i < n - 1; i++)
    {
        a[i, i - 1] = -1.0;
        a[i, i] = 2.0;
        a[i, i + 1] = -1.0;
        b[i] = 0;
    }
 
    a[n - 1, n - 2] = -1.0;
    a[n - 1, n - 1] = +2.0;
    b[n - 1] = 0;
}

We require that the maximum dimension of the system equations be such that the system is usually solved if possible in less about a minute or so via the sequential methods and the root-mean-square error be of the order 1.0e-10 or less. The following table summarizes the results of these experiments for the direct methods:

Method/Matrix Type

Pascal

Other

Cauchy

Lehmer

Tri-Diag 1

Gaussian Elimination

10

500

8

2000

2500

LU Decomposition

10

1000

8

900

2500

Cholesky Decomposition

10

Failed

8

Failed

2500

Simple Gaussian Elimination

10

Failed

2

2

Failed

 

From the above table we see that Simple Gaussian Elimination can be eliminated from further consideration so to speak. We also can get rid of the Pascal and Cauchy matrix type for the direct methods. Next we give an analogous table for the classical iterative methods.

Method/Matrix Type

Pascal

Other

Cauchy

Lehmer

Tri-Diag 1

Jacobi

Failed

Failed

Failed

Failed

Failed

Gauss-Seidel

6

4

4

Failed

5

SOR

5

5

5

Failed

5

Gradient Method

5

5

3

Failed

5

 

We now throw out the classical iterative methods due to their overwhelming dependence upon good initial guesses of the solutions. Next we move forward in time to the modern iterative methods:

Method/Matrix Type

Pascal

Other

Cauchy

Lehmer

Tri-Diag 1

Modified Richardson

4

5

3

Failed

5

Conjugate Gradient

10

Failed

6

Failed

2500

 

At this point we are left with two candidates for the "Other" matrix type and four candidates for the "Tri-Diag 1" matrix type. So we try to introduce parallel processing into the following sequential methods:

  1. Gaussian Elimination
  2. LU Decomposition
  3. Cholesky Decomposition
  4. Conjugate Gradient

Our Gaussian elimination algorithm uses scaled partial pivoting and was translated from the FORTRAN 77 code given in Elementary Numerical Analysis an Algorithmic Approach by S.D. Conte and Carl de Boor. There is only one for loop in whole method that could be benefit from a parallel loop construct and is found to be the working storage element swap loop shown in its parallel glory below:

 

// make k the pivot row by
// interchanging with i_star
flag *= -1;
i = pivot[i_star];
pivot[i_star] = pivot[k];
pivot[k] = i;
temp = d[i_star];
d[i_star] = d[k];
d[k] = temp;
 
Parallel.For(0, n, u =>
{
    temp = w[i_star, u];
    w[i_star, u] = w[k, u];
    w[k, u] = temp;
});

A study of the LU Decomposition code shows that it is not a good method for parallel processing. The Cholesky Decomposition code was translated from the C code found in A Numerical Library in C for Scientists and Engineers by H.T. Lau, PhD. There are three little private methods that lend themselves to parallel loops and are illustrated next:

 

private double matvec(int l, int  u, int i, double[,] a, double[] b)
{
    double s = 0.0;
 
    Parallel.For(l, u + 1, k => { s += a[i, k] * b[k]; });
    return (s);
}
 
private double tammat(int l, int  u, int i, int j,  double[,] a,  double[,] b)
{
    double s = 0.0;
 
    Parallel.For(l, u + 1, k => { s += a[k, i] * b[k, j]; });
    return (s);
}
 
double tamvec(int l, int  u, int i, double[,] a, double[] b)
{
    double s = 0.0;
 
    Parallel.For(l, u + 1, k => { s += a[k, i] * b[k]; });
    return (s);
}

The Conjugate Gradient method is from equations given in Numerical Approximation of Partial Differential Equations by Alfio Quarteroni and Alberto Valli. The parallel loop version is given in the following code block:

 

private double InnerProduct(int n, double[] a, double[] b)
{
    double sum = 0.0;
 
    for (int  i = 0; i < n; i++)
        sum += a[i] * b[i];
 
    return sum;
}
 
public double ConjugateGradient(
    int n,
    int maxIterations,
    double tolerance,
    double[,] a,
    double[] b,
    double[] x)
{
    double t = 0.0;
    double alpha = 0;
    double rzold = 0.0;
    double rznew = 0.0;
    double[] p = new  double[n];
    double[] r = new  double[n];
    double[] z = new  double[n];
    double[] ap = new  double[n];
    double[,] pinv = new  double[n, n];
    int its = 0;
    DirectMethods dm = new DirectMethods();
 
    //for (int i = 0; i < n; i++)
    //  for (int j = 0; j < n; j++)
    //      pinv[i, j] = a[i, j];
 
    Parallel.For(0, n, u =>
    {
        Parallel.For(0, n, v =>
        {
            pinv[u, v] = a[u, v];
        });
    });     
 
    dm.CholeskyInverse(pinv, n);
 
    for (int  i = 0; i < n; i++)
        for (int  j = i + 1; j < n; j++)
            pinv[j, i] = pinv[i, j];
 
    for (int  i = 0; i < n; i++)
    {
        double sum = 0.0;
 
        for (int  j = 0; j < n; j++)
            sum += pinv[i, j] * b[j];
 
        x[i] = sum;
    }
 
    for (int  i = 0; i < n; i++)
    {
        double sum = 0.0;
 
        for (int  j = 0; j < n; j++)
            sum += a[i, j] * x[j];
 
        r[i] = b[i] - sum;
 
        sum = 0.0;
 
        for (int  j = 0; j < n; j++)
            sum += pinv[i, j] * r[j];
 
        z[i] = sum;
        p[i] = z[i];
    }
 
    rzold = InnerProduct(n, r, z);
 
    while (its < maxIterations)
    {
        its++;
 
        for (int  i = 0; i < n; i++)
            for (int  j = 0; j < n; j++)
                ap[i] = a[i, j] * p[j];
 
        double denom = InnerProduct(n, p, ap);
 
        if (denom < tolerance)
            break;
 
        alpha = rzold / denom;
 
        for (int  i = 0; i < n; i++)
            x[i] += alpha * p[i];
 
        for (int  i = 0; i < n; i++)
            r[i] -= alpha * ap[i];
 
        t = Math.Sqrt(InnerProduct(n, r, r));
 
        if (t < tolerance)
            break;
 
        for (int  i = 0; i < n; i++)
        {
            double sum = 0.0;
 
            for (int  j = 0; j < n; j++)
                sum += pinv[i, j] * r[j];
 
            z[i] = sum;
        }
 
        rznew = InnerProduct(n, r, z);
 
        //for (int i = 0; i < n; i++)
        //  p[i] = z[i] + p[i] * rznew / rzold;
 
        Parallel.For(0, n, u =>
        {
            p[u] = z[u] + p[u] * rznew / rzold;
        });
 
        rzold = rznew;
    }
 
    return t;
}

Now we are to present the results of our experiments. First we give the runtimes in seconds for the Direct Methods on the "Other" matrix type:

Method/n

100

200

300

400

500

Gaussian Elimination

0.003

0.019

0.061

0.143

0.283

Parallel Gaussian Elimination

0.003

0.020

0.071

0.155

0.354

LU Decomposition

0.003

0.015

0.056

0.132

0.257

 

Finally, the results for the "Tri-Diag 1" matrix type:

Method/n

500

1000

1500

2000

2500

Gaussian Elimination

0.280

2.216

7.461

17.629

34.445

Parallel Gaussian Elimination

0.300

2.417

8.077

19.101

37.380

LU Decomposition

0.257

2.128

8.017

20.725

49.082

Cholesky Decomposition

0.141

1.203

4.527

13.634

31.199

Parallel Cholesky Decomposition

4.429

21.718

--------

--------

--------

Conjugate Gradient

0.356

2.944

11.313

32.099

69.329

Parallel Conjugate Gradient

0.476

2.947

11.586

32.300

69.586

 

In conclusion for the methods studied, the parallel implementations created, and the test matrices the parallel versions are outperformed by the sequential variants in almost every case. Whether this outperformance is statistically significant is left as an exercise for an aspiring statistician (hint: f-tests first then Student's t-tests). The parallel Cholesky Decomposition code was ill advised and just bad code analysis and programming. We mention in passing that block diagonal systems of equations lend themselves to the multitasking paradigm.