C# Parallel Processing Case Study - Linear Systems Library

아티클
1/17/2024

In this brief article we add parallel processing to a C# linear systems library using the parallel loops found in C# .NET Framework 4.5.2. The linear systems library has three public classes and eleven public methods. The function of library is to solve potentially large systems of linear equations. The structure of the library is as shown in the following multilevel list where public classes are numbered and publicly accessible methods are lettered:

ClassicalIterativeMethods
1. Jacobi
2. GaussSeidel
3. SOR
4. GradientMethod
DirectMethods
1. GaussianElimination
2. LU Decomposition
3. CholeskyDecomposition
4. SimpleGaussianElimination
ModernIterativeMethods
1. ConjugateGradient
2. ModifiedRichardson

So we have ten different ways of potentially solving a system of linear equations. The library testing application uses one of five different types of highly scalable matrices:

Pascal
Other
Cauchy
Lehmer
Tri-Diag 1

The matrix creation C# snippet is displayed next:

if (matrix == 1)
{
    // Pascal matrix
 
    for (int  i = 0; i < n; i++)
        b[i] = Math.Pow(2, i + 1);
 
    for (int  i = 0; i < n; i++)
        for (int  j = 0; j < n; j++)
            a[i, j] = Binomial(i + j, j);
}
 
else if (matrix == 2)
{
    // other
 
    b[0] = -0.5;
 
    for (int  i = 1; i < n - 1; i++)
        b[i] = -1.5;
 
    b[n - 1] = +0.5;
 
    for (int  i = 0; i < n; i++)
    {
        a[i, i] = -2.0;
 
        if (i < n - 1)
            a[i, i + 1] = 1.0;
 
        if (i > 0)
            a[i, i - 1] = 1.0;
    }
}
 
else if (matrix == 3)
{
    // Cauchy matrix
 
    for (int  i = 0; i < n; i++)
    {
        double xi = 2 * i;
 
        for (int  j = 0; j < n; j++)
        {
            double yj = 2 * j + 1;
 
            a[i, j] = 1.0 / (xi + yj);
        }
 
        b[i] = i + 1;
    }
}
 
else if (matrix == 4)
{
    // Lehmer matrix
 
    for (int  i = 0; i < n; i++)
    {
        for (int  j = 0; j < n; j++)
            a[i, j] = Math.Min(i + 1, j + 1) + Math.Max(i + 1, j + 1);
 
        a[i, i] = i + 1;
        b[i] = i;
    }
}
 
else if (matrix == 5)
{
    // tri-diagnonal 1
 
    a[0, 0] = +2.0;
    a[0, 1] = -1.0;
    b[0] = 1;
 
    for (int  i = 1; i < n - 1; i++)
    {
        a[i, i - 1] = -1.0;
        a[i, i] = 2.0;
        a[i, i + 1] = -1.0;
        b[i] = 0;
    }
 
    a[n - 1, n - 2] = -1.0;
    a[n - 1, n - 1] = +2.0;
    b[n - 1] = 0;
}

We require that the maximum dimension of the system equations be such that the system is usually solved if possible in less about a minute or so via the sequential methods and the root-mean-square error be of the order 1.0e-10 or less. The following table summarizes the results of these experiments for the direct methods:

Method/Matrix Type	Pascal	Other	Cauchy	Lehmer	Tri-Diag 1
Gaussian Elimination	10	500	8	2000	2500
LU Decomposition	10	1000	8	900	2500
Cholesky Decomposition	10	Failed	8	Failed	2500
Simple Gaussian Elimination	10	Failed	2	2	Failed

From the above table we see that Simple Gaussian Elimination can be eliminated from further consideration so to speak. We also can get rid of the Pascal and Cauchy matrix type for the direct methods. Next we give an analogous table for the classical iterative methods.

Method/Matrix Type	Pascal	Other	Cauchy	Lehmer	Tri-Diag 1
Jacobi	Failed	Failed	Failed	Failed	Failed
Gauss-Seidel	6	4	4	Failed	5
SOR	5	5	5	Failed	5
Gradient Method	5	5	3	Failed	5

We now throw out the classical iterative methods due to their overwhelming dependence upon good initial guesses of the solutions. Next we move forward in time to the modern iterative methods:

Method/Matrix Type	Pascal	Other	Cauchy	Lehmer	Tri-Diag 1
Modified Richardson	4	5	3	Failed	5
Conjugate Gradient	10	Failed	6	Failed	2500

At this point we are left with two candidates for the "Other" matrix type and four candidates for the "Tri-Diag 1" matrix type. So we try to introduce parallel processing into the following sequential methods:

Gaussian Elimination
LU Decomposition
Cholesky Decomposition
Conjugate Gradient

Our Gaussian elimination algorithm uses scaled partial pivoting and was translated from the FORTRAN 77 code given in Elementary Numerical Analysis an Algorithmic Approach by S.D. Conte and Carl de Boor. There is only one for loop in whole method that could be benefit from a parallel loop construct and is found to be the working storage element swap loop shown in its parallel glory below:

// make k the pivot row by
// interchanging with i_star
flag *= -1;
i = pivot[i_star];
pivot[i_star] = pivot[k];
pivot[k] = i;
temp = d[i_star];
d[i_star] = d[k];
d[k] = temp;
 
Parallel.For(0, n, u =>
{
    temp = w[i_star, u];
    w[i_star, u] = w[k, u];
    w[k, u] = temp;
});

A study of the LU Decomposition code shows that it is not a good method for parallel processing. The Cholesky Decomposition code was translated from the C code found in A Numerical Library in C for Scientists and Engineers by H.T. Lau, PhD. There are three little private methods that lend themselves to parallel loops and are illustrated next:

private double matvec(int l, int  u, int i, double[,] a, double[] b)
{
    double s = 0.0;
 
    Parallel.For(l, u + 1, k => { s += a[i, k] * b[k]; });
    return (s);
}
 
private double tammat(int l, int  u, int i, int j,  double[,] a,  double[,] b)
{
    double s = 0.0;
 
    Parallel.For(l, u + 1, k => { s += a[k, i] * b[k, j]; });
    return (s);
}
 
double tamvec(int l, int  u, int i, double[,] a, double[] b)
{
    double s = 0.0;
 
    Parallel.For(l, u + 1, k => { s += a[k, i] * b[k]; });
    return (s);
}

The Conjugate Gradient method is from equations given in Numerical Approximation of Partial Differential Equations by Alfio Quarteroni and Alberto Valli. The parallel loop version is given in the following code block:

private double InnerProduct(int n, double[] a, double[] b)
{
    double sum = 0.0;
 
    for (int  i = 0; i < n; i++)
        sum += a[i] * b[i];
 
    return sum;
}
 
public double ConjugateGradient(
    int n,
    int maxIterations,
    double tolerance,
    double[,] a,
    double[] b,
    double[] x)
{
    double t = 0.0;
    double alpha = 0;
    double rzold = 0.0;
    double rznew = 0.0;
    double[] p = new  double[n];
    double[] r = new  double[n];
    double[] z = new  double[n];
    double[] ap = new  double[n];
    double[,] pinv = new  double[n, n];
    int its = 0;
    DirectMethods dm = new DirectMethods();
 
    //for (int i = 0; i < n; i++)
    //  for (int j = 0; j < n; j++)
    //      pinv[i, j] = a[i, j];
 
    Parallel.For(0, n, u =>
    {
        Parallel.For(0, n, v =>
        {
            pinv[u, v] = a[u, v];
        });
    });     
 
    dm.CholeskyInverse(pinv, n);
 
    for (int  i = 0; i < n; i++)
        for (int  j = i + 1; j < n; j++)
            pinv[j, i] = pinv[i, j];
 
    for (int  i = 0; i < n; i++)
    {
        double sum = 0.0;
 
        for (int  j = 0; j < n; j++)
            sum += pinv[i, j] * b[j];
 
        x[i] = sum;
    }
 
    for (int  i = 0; i < n; i++)
    {
        double sum = 0.0;
 
        for (int  j = 0; j < n; j++)
            sum += a[i, j] * x[j];
 
        r[i] = b[i] - sum;
 
        sum = 0.0;
 
        for (int  j = 0; j < n; j++)
            sum += pinv[i, j] * r[j];
 
        z[i] = sum;
        p[i] = z[i];
    }
 
    rzold = InnerProduct(n, r, z);
 
    while (its < maxIterations)
    {
        its++;
 
        for (int  i = 0; i < n; i++)
            for (int  j = 0; j < n; j++)
                ap[i] = a[i, j] * p[j];
 
        double denom = InnerProduct(n, p, ap);
 
        if (denom < tolerance)
            break;
 
        alpha = rzold / denom;
 
        for (int  i = 0; i < n; i++)
            x[i] += alpha * p[i];
 
        for (int  i = 0; i < n; i++)
            r[i] -= alpha * ap[i];
 
        t = Math.Sqrt(InnerProduct(n, r, r));
 
        if (t < tolerance)
            break;
 
        for (int  i = 0; i < n; i++)
        {
            double sum = 0.0;
 
            for (int  j = 0; j < n; j++)
                sum += pinv[i, j] * r[j];
 
            z[i] = sum;
        }
 
        rznew = InnerProduct(n, r, z);
 
        //for (int i = 0; i < n; i++)
        //  p[i] = z[i] + p[i] * rznew / rzold;
 
        Parallel.For(0, n, u =>
        {
            p[u] = z[u] + p[u] * rznew / rzold;
        });
 
        rzold = rznew;
    }
 
    return t;
}

Now we are to present the results of our experiments. First we give the runtimes in seconds for the Direct Methods on the "Other" matrix type:

Method/n	100	200	300	400	500
Gaussian Elimination	0.003	0.019	0.061	0.143	0.283
Parallel Gaussian Elimination	0.003	0.020	0.071	0.155	0.354
LU Decomposition	0.003	0.015	0.056	0.132	0.257

Finally, the results for the "Tri-Diag 1" matrix type:

Method/n	500	1000	1500	2000	2500
Gaussian Elimination	0.280	2.216	7.461	17.629	34.445
Parallel Gaussian Elimination	0.300	2.417	8.077	19.101	37.380
LU Decomposition	0.257	2.128	8.017	20.725	49.082
Cholesky Decomposition	0.141	1.203	4.527	13.634	31.199
Parallel Cholesky Decomposition	4.429	21.718	--------	--------	--------
Conjugate Gradient	0.356	2.944	11.313	32.099	69.329
Parallel Conjugate Gradient	0.476	2.947	11.586	32.300	69.586

In conclusion for the methods studied, the parallel implementations created, and the test matrices the parallel versions are outperformed by the sequential variants in almost every case. Whether this outperformance is statistically significant is left as an exercise for an aspiring statistician (hint: f-tests first then Student's t-tests). The parallel Cholesky Decomposition code was ill advised and just bad code analysis and programming. We mention in passing that block diagonal systems of equations lend themselves to the multitasking paradigm.

다음을 통해 공유

C# Parallel Processing Case Study - Linear Systems Library

추가 리소스