Auto-Parallelization and Auto-Vectorization

The Auto-Parallelizer and Auto-Vectorizer are designed to provide automatic performance gains for loops in your code.

Auto-Parallelizer

The /Qpar compiler switch enables automatic parallelization of loops in your code. When you specify this flag without changing your existing code, the compiler evaluates the code to find loops that might benefit from parallelization. The compiler is conservative in selecting the loops that it parallelizes, for two reasons. First, it might find loops that don't do much work and therefore won't benefit from parallelization. Second, every unnecessary parallelization can cause the spawning of a thread pool, extra synchronization, or other processing that would slow performance instead of improving it. For example, consider the following loop, in which the upper bound is not known at compile time:

void loop_test(int u) {
   for (int i=0; i<u; ++i)
      A[i] = B[i] * C[i];
}

Because u could be a small value, the compiler won't automatically parallelize this loop. However, you might still want it parallelized because you know that u will always be large. To request parallelization of the loop, specify #pragma loop(hint_parallel(n)), where n is the number of threads to parallelize across. In the following example, the compiler will attempt to parallelize the loop across 8 threads.

void loop_test(int u) {
#pragma loop(hint_parallel(8))
   for (int i=0; i<u; ++i)
      A[i] = B[i] * C[i];
}

As with all pragma directives, the alternate pragma syntax __pragma(loop(hint_parallel(n))) is also supported.
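One practical use of the __pragma form is that it can appear inside a macro definition, which a #pragma directive cannot. The following is a minimal sketch under stated assumptions: HINT_PARALLEL is a hypothetical macro name, and the arrays A, B, and C and their size are illustrative. On compilers other than MSVC the macro expands to nothing, so the loop simply runs serially.

```cpp
// Hypothetical helper macro (not part of any Microsoft header): expands to the
// parallelization hint on MSVC and to nothing on other compilers.
#if defined(_MSC_VER)
#define HINT_PARALLEL(n) __pragma(loop(hint_parallel(n)))
#else
#define HINT_PARALLEL(n)
#endif

// Illustrative arrays; the size is an assumption made for this sketch.
constexpr int N = 100000;
int A[N], B[N], C[N];

void loop_test(int u) {
    HINT_PARALLEL(8)  // request parallelization across 8 threads (MSVC with /Qpar)
    for (int i = 0; i < u; ++i)
        A[i] = B[i] * C[i];
}
```

The hint only takes effect when the file is compiled with /Qpar; without that switch the loop is compiled as an ordinary serial loop.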

There are some loops that the compiler can't parallelize even if you want it to. Here's an example:

#pragma loop(hint_parallel(8))
for (int i=0; i<upper_bound(); ++i)
    A[i] = B[i] * C[i];

The value returned by upper_bound() might change every time it's called. Because the upper bound cannot be known, the compiler can emit a diagnostic message that explains why it can't parallelize this loop. The following example demonstrates a loop that can be parallelized, a loop that cannot be parallelized, the compiler syntax to use at the command prompt, and the compiler output for each command-line option. The first loop uses hint_parallel(0), which asks for the maximum number of threads at run time:

int A[2000];   // sized to cover both loops (indices 0 through 1999)
void test() {
#pragma loop(hint_parallel(0))
    for (int i=0; i<1000; ++i) {
        A[i] = A[i] + 1;
    }

    for (int i=1000; i<2000; ++i) {
        A[i] = A[i] + 1;
    }
}

Compiling by using this command:

cl d:\myproject\mytest.cpp /O2 /Qpar /Qpar-report:1

yields this output:

--- Analyzing function: void __cdecl test(void)
d:\myproject\mytest.cpp(4) : loop parallelized

Compiling by using this command:

cl d:\myproject\mytest.cpp /O2 /Qpar /Qpar-report:2

yields this output:

--- Analyzing function: void __cdecl test(void)
d:\myproject\mytest.cpp(4) : loop parallelized
d:\myproject\mytest.cpp(4) : loop not parallelized due to reason '1008'

Notice the difference in output between the two different /Qpar-report (Auto-Parallelizer Reporting Level) options. /Qpar-report:1 outputs parallelizer messages only for loops that are successfully parallelized. /Qpar-report:2 outputs parallelizer messages for both successful and unsuccessful loop parallelizations.

For more information about reason codes and messages, see Vectorizer and Parallelizer Messages.
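Returning to the earlier loop whose bound is a call to upper_bound(): if you know that the function returns the same value for the whole loop, one common workaround is to call it once and loop against a local copy, so that the trip count is fixed before the loop starts. The following is a sketch under that assumption. The upper_bound() body, the arrays, and the function name loop_test_hoisted are hypothetical stand-ins. Whether the compiler then parallelizes the loop still depends on its own analysis, so it's worth confirming with /Qpar-report:2.

```cpp
// Illustrative arrays; the size is an assumption made for this sketch.
constexpr int N = 4096;
int A[N], B[N], C[N];

// Hypothetical stand-in for the upper_bound() function in the text.
int upper_bound() { return N; }

void loop_test_hoisted() {
    // Evaluate the bound once. This is only valid if upper_bound() would have
    // returned the same value on every iteration of the original loop.
    const int n = upper_bound();

#if defined(_MSC_VER)
#pragma loop(hint_parallel(8))
#endif
    for (int i = 0; i < n; ++i)
        A[i] = B[i] * C[i];
}
```

If upper_bound() can legitimately change while the loop runs, this rewrite changes the program's behavior and should not be used.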

Auto-Vectorizer

The Auto-Vectorizer analyzes loops in your code, and uses the vector registers and instructions on the target computer to execute them, if it can. This can improve the performance of your code. The compiler targets the SSE2, AVX, and AVX2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors, according to the /arch switch.

The Auto-Vectorizer may generate instructions other than those specified by the /arch switch. These instructions are guarded by a runtime check to make sure that the code still runs correctly. For example, when you compile with /arch:SSE2, SSE4.2 instructions may be emitted. A runtime check verifies that SSE4.2 is available on the target processor and jumps to a non-SSE4.2 version of the loop if the processor does not support those instructions.

By default, the Auto-Vectorizer is enabled. If you want to compare the performance of your code with and without vectorization, you can use #pragma loop(no_vector) to disable vectorization of any given loop.

#pragma loop(no_vector)
for (int i = 0; i < 1000; ++i)
   A[i] = B[i] + C[i];

As with all pragma directives, the alternate pragma syntax __pragma(loop(no_vector)) is also supported.
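To make such a comparison concrete, the following sketch times the same loop in two functions, one left at the default and one marked no_vector. The names time_ms, add_default, and add_no_vector are hypothetical and chosen for illustration. The pragma is guarded so that the code also builds with compilers that don't recognize it, in which case both functions are compiled the same way.

```cpp
#include <chrono>

// Hypothetical helper: runs fn once and returns the elapsed time in milliseconds.
template <typename Fn>
double time_ms(Fn fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Default behavior: the compiler may vectorize this loop.
void add_default(int* a, const int* b, const int* c, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

// Same loop with vectorization disabled on MSVC.
void add_no_vector(int* a, const int* b, const int* c, int n) {
#if defined(_MSC_VER)
#pragma loop(no_vector)
#endif
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

A call such as time_ms([&] { add_default(a, b, c, n); }) followed by the same call for add_no_vector gives a rough comparison. A single run is noisy, so repeat the measurement several times over a large array and check that both functions produce identical results.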

As with the Auto-Parallelizer, you can specify the /Qvec-report (Auto-Vectorizer Reporting Level) command-line option to report either successfully vectorized loops only (/Qvec-report:1) or both successfully and unsuccessfully vectorized loops (/Qvec-report:2).

For more information about reason codes and messages, see Vectorizer and Parallelizer Messages.

For an example showing how the vectorizer works in practice, see Project Austin Part 2 of 6: Page Curling.

See also

loop
Parallel Programming in Native Code
/Qpar (Auto-Parallelizer)
/Qpar-report (Auto-Parallelizer Reporting Level)
/Qvec-report (Auto-Vectorizer Reporting Level)
Vectorizer and Parallelizer Messages