

.NET Matters

False Sharing

Stephen Toub

Unless you've been living under a rock, you've likely heard about the "manycore shift." Processor manufacturers such as Intel and AMD are increasing the power of their hardware by scaling out the number of cores on a processor rather than by attempting to continue to provide exponential increases in clock speed. This shift demands that software developers start to write all of their applications with concurrency in mind in order to benefit from these significant increases in computing power.

To address this issue, a multitude of concurrency libraries and languages are beginning to emerge, including Parallel Extensions to the Microsoft .NET Framework, the Parallel Pattern Library (PPL), the Concurrency & Coordination Runtime (CCR), Intel's Threading Building Blocks (TBB), and others. These libraries all aim to decrease the amount of boilerplate code necessary to write efficient parallel applications by providing constructs such as Parallel.For and AsParallel. Unfortunately, while these constructs represent a monumental step forward in expressing parallelism, they don't obviate the need for developers to be aware of what their code is doing, how it's structured, and how hardware can have a significant impact on the performance of the application.

While the software industry is making strides in developing new programming models for concurrency, there does not seem to be a programming model on the horizon that would magically eliminate all concurrency-related issues. At least in the near term, understanding how memory and caches work will be important in order to write efficient parallel programs.

It All Comes Down to Hardware

The concept of knowing what's happening at the lower levels of an application is, of course, not new. To achieve optimal performance, developers need to have a good understanding of how things like memory accesses affect the performance of an application.

When we talk about reading and writing from memory, we typically gloss over the fact that, in modern hardware, it's rare to read from and write to the machine's memory banks directly. Memory access is slow—orders of magnitude slower than mathematical calculations, though orders of magnitude faster than accessing hard disks and network resources.

To account for this slow memory access, most processors today use memory caches to improve application performance. Caches come in multiple levels, with most consumer machines today having at least two levels, referred to as L1 and L2, and some having more than that. L1 is the fastest, but it's also the most expensive, so machines will typically have a small amount of it. (The laptop on which we're writing this column has 128KB of L1 cache.) L2 is a bit slower, but is less expensive, so machines will have more of it. (The same laptop has 2MB of L2 cache.)

When data is read from memory, the requested data as well as data around it (referred to as a cache line) is loaded from memory into the caches, then the program is served from the caches. This loading of a whole cache line rather than individual bytes can dramatically improve application performance. On our laptop the cache line size for both L1 and L2 is 64 bytes. Since applications frequently read bytes sequentially in memory (common when accessing arrays and the like), applications can avoid hitting main memory on every request by loading a series of data in a cache line, since it's likely that the data about to be read has already been loaded into the cache. However, this does mean that a developer needs to be cognizant of how the application accesses memory in order to take the greatest advantage of the cache.

Consider the C# program in Figure 1. We've created a two-dimensional array to which the application then writes in two different ways. In the first code segment (labeled Faster in the comments), the application loops over each row, and within each row it loops over each column. In the second code segment (labeled Slower), the application loops over each column, and within each column it loops over each row. The labels in this case give away the punch line, but on the laptop we tested this on, the Faster version runs almost twice as fast as the Slower version, even though the only difference in code between the two is that the order of the for loops has been swapped.

Figure 1 Memory Access Patterns Are Important

//C#

using System;
using System.Diagnostics;

class Program {
  public static void Main() {
    const int SIZE = 10000;
    int[,] matrix = new int[SIZE, SIZE];

    while (true) {
      // Faster
      Stopwatch sw = Stopwatch.StartNew();
      for (int row = 0; row < SIZE; row++) {
        for (int column = 0; column < SIZE; column++) {
          matrix[row, column] = (row * SIZE) + column;
        }
      }
      Console.WriteLine(sw.Elapsed);

      // Slower
      sw = Stopwatch.StartNew();
      for (int column = 0; column < SIZE; column++) {
        for (int row = 0; row < SIZE; row++) {
          matrix[row, column] = (row * SIZE) + column;
        }
      }
      Console.WriteLine(sw.Elapsed);

      Console.WriteLine("=================");
      Console.ReadLine();
    }
  }
}

'Visual Basic

Imports System
Imports System.Diagnostics

Friend Class Program
    Public Shared Sub Main()
        Const SIZE = 10000
        Dim matrix(SIZE - 1, SIZE - 1) As Integer

        Do
            ' Faster
            Dim sw = Stopwatch.StartNew()
            For row = 0 To SIZE - 1
                For column = 0 To SIZE - 1
                    matrix(row, column) = (row * SIZE) + column
                Next column
            Next row
            Console.WriteLine(sw.Elapsed)

            ' Slower
            sw = Stopwatch.StartNew()
            For column = 0 To SIZE - 1
                For row = 0 To SIZE - 1
                    matrix(row, column) = (row * SIZE) + column
                Next row
            Next column
            Console.WriteLine(sw.Elapsed)

            Console.WriteLine("=================")
            Console.ReadLine()
        Loop
    End Sub
End Class

This performance differential occurs because of the way the memory for the matrix array is organized. The array is stored contiguously in memory in row-major order, meaning that you can envision each of the rows from the matrix laid out in memory linearly one after the other. When the application accesses a value from the array, the relevant cache line will contain the value as well as other values directly surrounding the one requested, which are likely to be the values immediately before or after in the same row. Thus, accessing the next value from the row should be significantly faster than accessing a value in a different row. In the slower version, we're always accessing values in different rows by jumping from row to row, accessing the same column location in each row. For a real-world example of this phenomenon, see Kenny Kerr's column in this issue of MSDN Magazine at msdn.microsoft.com/magazine/cc850829.
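
To make the layout concrete, here is a rough sketch of the arithmetic: the byte offset of element (row, column) within the matrix, measured from the start of the array data (the helper name is just for illustration).

static long ByteOffset(int row, int column, int size) {
  // Moving to the next column advances 4 bytes, but moving to the next row
  // advances size * 4 bytes, skipping past many 64-byte cache lines at once
  // when size is large.
  return (((long)row * size) + column) * sizeof(int);
}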

Contrary to Kindergarten Teaching, Sharing Can Be Bad

Parallelism makes this issue even more relevant. Not only do we need to be aware of how one processor accesses data in memory and takes advantage of its caches, but we also need to be aware of how multiple threads and processors access that data.

In many systems today, some of the caches are not shared between cores. Since a cache maintains copies of data from main memory, processors need to implement a mechanism for keeping cache contents in sync with both main memory and with other caches; such a mechanism is known as a "cache coherency" or "memory coherency" protocol.

The protocol used by most Intel processors today is known as MESI, named as such to represent the four states that a particular cache line can be in: Modified, Exclusive, Shared, and Invalid. The protocol ensures that a cache monitors what's happening with other caches in the system and takes appropriate action to maintain consistency, such as flushing modified data to main memory when another cache requires that data for a read operation. At a minimum these operations require memory reads/writes that, as we've already stated and demonstrated, can be expensive.

Unfortunately, the problem can be trickier than it might at first seem, as it isn't always obvious when data is being shared. True sharing occurs when two threads specifically require access to a given memory location, which is properly kept consistent by the cache coherency protocol. However, unbeknownst to the developer, other sharing can occur where two threads access distinct sets of data, but, due to hardware details such as the size of a cache line, those sets of data still look shared as far as the cache coherency protocol is concerned. This is known as false sharing (illustrated in Figure 2), and it can lead to significant performance problems in real-world parallel applications.


Figure 2 Cache Coherency and Cache Lines

Consider the code shown in Figure 3. In this code snippet, we're allocating instances of System.Random, one per processor. Then in parallel (using the Parallel.For construct from the Task Parallel Library that's part of Parallel Extensions), we run that many iterations. Each iteration uses one of the Random instances and does some work on it (in this case, continually calling Next on the Random instance).

Figure 3 False Sharing Example

//C#

static void WithFalseSharing() {
  int iterations = Environment.ProcessorCount;

  Random[] rands = new Random[iterations];
  for (int i = 0; i < iterations; i++) rands[i] = new Random();

  Parallel.For(0, iterations, i => {
    Random r = rands[i];
    DoWork(r);
  });
}

static void DoWork(Random rand) {
  for (int i = 0; i < WORKAMOUNT; i++) rand.Next();
}

'Visual Basic

Shared Sub WithFalseSharing()
    Dim iterations = Environment.ProcessorCount

    Dim rands(iterations - 1) As Random
    For i = 0 To iterations - 1
        rands(i) = New Random()
    Next i

    Parallel.For(0, iterations, Sub(i)
                                    Dim r = rands(i)
                                    DoWork(r)
                                End Sub)
End Sub

Shared Sub DoWork(ByVal rand As Random)
    For i = 0 To WORKAMOUNT - 1
        rand.Next()
    Next i
End Sub

It may not be clear in this example where the sharing exists. The application is only ever reading from the rands array, and it's perfectly valid for the same cache line to be in two different caches for reading (with the MESI protocol, the line would likely be marked as Shared with both). The problem is actually in the Random instances themselves. Each instance of Random contains some state in order to produce the next value in its sequence of pseudo-random values, and every call to Next computes the next value, updating that internal state. Due to the way we're allocating these Random instances, they can end up close enough to each other in memory to cause false sharing between instances, even though all of the threads involved in the loop are independent from a source code perspective.

There are several ways to address the false sharing issue, all of which involve allocating the Random instances far enough apart from each other in memory that they won't show up on the same cache line. In .NET, where memory allocation details are left to the runtime, this can require some workarounds.

As one example, the code in Figure 4 shows a version of this code that doesn't suffer from the same problem. Instead, it allocates a bunch of extra Random instances between those that we care about, thus ensuring that the ones we do care about are far enough apart so as to not be subject to false sharing (at least on this machine). For our tests, this updated version produced significantly better results, running up to six times faster than the code from Figure 3 on our dual-core test machine.

Figure 4 Fix for False Sharing Example

//C#

static void WithoutFalseSharing() {
  const int BUFFER = 64;
  int iterations = Environment.ProcessorCount * BUFFER;

  Random[] rands = new Random[iterations];
  for (int i = 0; i < iterations; i++) rands[i] = new Random();

  Parallel.For(0, iterations, BUFFER, i => {
    Random r = rands[i];
    DoWork(r);
  });
}

static void DoWork(Random rand) {
  for (int i = 0; i < WORKAMOUNT; i++) rand.Next();
}

'Visual Basic

Shared Sub WithoutFalseSharing()
    Const BUFFER = 64
    Dim iterations = Environment.ProcessorCount * BUFFER

    Dim rands(iterations - 1) As Random
    For i = 0 To iterations - 1
        rands(i) = New Random()
    Next i

    Parallel.For(0, iterations, BUFFER, Sub(i)
                                            Dim r = rands(i)
                                            DoWork(r)
                                        End Sub)
End Sub

Shared Sub DoWork(ByVal rand As Random)
    For i = 0 To WORKAMOUNT - 1
        rand.Next()
    Next i
End Sub
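
Another way to keep the instances apart is to allocate each Random on the thread that will use it; as discussed later, objects allocated by different threads tend to end up far apart in memory. The following is only a sketch of that idea (it reuses the DoWork helper from Figure 3, and the explicit seed simply keeps concurrently created instances from sharing a time-based seed):

static void WithPerThreadAllocation() {
  int iterations = Environment.ProcessorCount;

  Parallel.For(0, iterations, i => {
    // Allocated on whichever worker thread runs this iteration, so the
    // instances are unlikely to end up on the same cache line.
    Random r = new Random(i);
    DoWork(r);
  });
}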

Methods for Detecting False Sharing

Unfortunately, false sharing can easily go undetected because no single, well-isolated indicator points to it as the performance problem. Fortunately, there are some profiling metrics that can be used comparatively to confirm suspicions of false sharing and to help narrow it down in code. Above all, you need to know when to suspect false sharing, and you need to be aware of the memory access patterns in your concurrent code when hunting it down.

Let's assume that you've identified a concurrent code segment as your performance bottleneck. You may want to focus on shared cache line effects if the following are true of that piece of code:

  • It is relatively CPU and memory bound, such that there are few I/O or blocking OS calls.
  • It doesn't scale well when the concurrency level is increased by running on more powerful hardware.
  • It occasionally performs significantly more slowly for input data that requires exactly the same amount of processing and the same number of memory accesses, but with a different pattern of traversal.

The best hints we've found for identifying false sharing effects come from CPU performance counters. These are hardware-level statistics made available by the CPU itself, and they provide information about how different subsystems of the CPU are performing. Examples include the number of L2 cache misses and the number of branch prediction misses. The exact set of counters differs from one processor model to another, although there is a common subset exposed by most profilers. We are going to demonstrate with a few of these, namely L2 cache misses, instructions retired, and non-halted cycles.

The L2 cache misses counter indicates the number of times a read from the L2 cache resulted in a load from main memory. An L2 miss can happen if the requested data isn't present in the L2 cache, or if the corresponding cache line has been marked as invalid due to an update from another processor. The latter occurs frequently in the case of false sharing, so when diagnosing this problem we need to watch for an undue spike in the L2 cache misses counter.

Cycles Per Instruction (CPI) is one of the most widely used statistics for characterizing the overall performance of algorithmic benchmarks. It denotes how many clock cycles were spent on average by the CPU to execute each instruction during a given period of the algorithm. While this is not exposed as a single CPU performance counter, it can be derived from the Instructions Retired and Non Halted Cycles counters.
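
For example, if a profiling interval recorded 8 billion non-halted cycles and 2 billion retired instructions (made-up numbers, purely for illustration), the CPI would be 4.0:

long nonHaltedCycles = 8000000000;      // cycles during which the CPU was not halted
long instructionsRetired = 2000000000;  // instructions the CPU completed in the same interval
double cpi = (double)nonHaltedCycles / instructionsRetired;  // 4.0 cycles per instruction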

Clearly this metric is highly dependent on hardware (CPU model and clock speed, cache architecture, memory speed, and so on), and CPI numbers are normally quoted alongside these hardware characteristics in order to provide insight into how efficiently an algorithm performs on particular hardware.

In diagnosing false sharing, we can rely on the fact that a CPI measurement includes the impact of memory access latencies and the overhead of cross-processor cache synchronization (both of which cause the CPU to sit idle, wasting valuable clock cycles) in addition to the actual clock cycles that the CPU spends executing instructions. This seems to fit the task at hand nicely, because everything else being equal, false sharing effectively causes an increase in both cross-processor communication overheads and accesses to the main memory. Therefore, the same operation being completed in the presence of false sharing will lead to an increased CPI number.

It's also interesting to note that, unlike the L2 Cache Misses counter, the CPI metric is independent from the execution time. Therefore it can be used to make comparisons between runs with different durations.

Using the Visual Studio Profiler

In Visual Studio, L2 misses can be tracked in both sampling and instrumentation-based profiling modes, while CPI requires instrumentation-based profiling because it involves using two CPU counters.

To profile for L2 misses in sampling mode, first create a sampling-based performance session in Visual Studio for your project. Then use the session's properties to change the sampling event to Performance Counter, and select either L2 Misses from the Portable Events > Memory Events category or L2 Lines In from the Platform Events > L2 Cache category. (On some processors, L2 Misses will be marked as unsupported.)

Now you can profile your application. The reported numbers will be based on profiling samples taken every 1M L2 miss events. This means that when sorted by exclusive samples, the functions causing most L2 misses will be on top.

To profile for L2 misses and CPI in instrumentation mode, start with an instrumentation-based performance session. In the properties of this session, enable collection of CPU counters and select the CPU counters of interest: the L2 Cache Misses event as described earlier, along with the Instructions Retired and Non Halted Cycles from the Portable Events > General category.

The reports generated by the profiler will include a number of new columns showing the inclusive and exclusive measurements for each of these CPU counters, as well as the regular elapsed time-related metrics. The most useful information for our purposes is to be found in the per function exclusive values for each of these counters.

It is natural to receive a lot of L2 misses in functions that access a wide range of memory, especially if it is larger than or comparable to the L2 cache size. But if a particular function causes a significant number of L2 misses even though it is known to access only a small memory range, there's good reason to suspect that those misses are triggered by shared cache lines. This simple test can be applied to both instrumentation and sampling-based profiling data. However, since there is no easy way to tell what kind of results to expect for a good case, this is best applied comparatively, such as by cross-referencing runs of the same algorithm under different settings and concurrency levels.

You can also try more advanced approaches. For the instrumented profiling mode, the exact number of L2 misses incurred for a given function will be reported under the L2 Misses Exclusive column. If you have a rough idea of how much memory is referenced from this function you can also use these numbers to estimate how L2 misses compare against the total expected memory reads. However, this technique requires careful attention to the architecture details such as cache line size and the actual behavior of the CPU counter being used.

To illustrate, the L2 miss statistics collected for a test profiling session are shown in Figure 5. The application under test runs four threads that repeatedly traverse and update an array of structs whose total size in memory is about 100KB. The test machine had four CPUs (dual-socket, dual-core), with a total of 4MB L2 cache.

Figure 5 L2 Miss Statistics for a Test Session

Test Scenario | Total Amount of Memory Reads | Size of Unique Memory Addresses Referenced | "Exclusive L2 Lines In" for the Hot Spot Function
With False Sharing (accesses 4 bytes apart) | ~1GB | 100KB | 310,000,000
Reduced False Sharing (accesses 16 bytes apart) | ~1GB | 100KB | 95,000,000
Without False Sharing (well separated) | ~1GB | 100KB | 240,000

In this example, we can identify the false sharing cases from the L2 statistics without even having to consider architectural details for a precise calculation. It's clear that a significant fraction of reads resulted in L2 misses: while the application references only 100KB of unique memory, which would fit into the L2 cache with ample room to spare, the reported number of L2 Lines In approaches the same order of magnitude as the total number of memory references.
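
A quick back-of-the-envelope check, assuming a 64-byte cache line, makes the gap obvious:

long uniqueBytesTouched = 100 * 1024;   // the ~100KB working set of the hot function
int cacheLineSize = 64;                 // assumed cache line size in bytes
long compulsoryLineFills = uniqueBytesTouched / cacheLineSize;  // 1,600 line fills
// Even allowing generously for ordinary evictions, 1,600 is nowhere near the
// 310,000,000 "L2 Lines In" measured in the false sharing case; the extra fills
// come from cache lines being invalidated by other processors and refetched.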

Sometimes frequent, short function calls can cause instrumented profiling to introduce enough overhead to perturb runtime behavior and, therefore, profiling results. In those cases, it is a better idea to rely on sampling-based profiling to collect L2 statistics. The example shown in Figures 3 and 4 was such a case due to the large number of Random.Next calls involved. When we profiled this example with sampling, we got the results shown in Figure 6. Although not as dramatic as the results shown in Figure 5, our hint for false sharing here is that the L2 miss counts clearly indicate that a considerable fraction of the Random.Next calls (3.5M / 40M) lead to cache reloads, even though the referenced data is known to remain completely static in memory from each thread's point of view. The sampling results for the corrected case, also shown in Figure 6, support this claim.

Figure 6 Sampling Results for the Code in Figures 3 and 4

Test Scenario | Total Calls to Random.Next | "Exclusive L2 Lines In" Top Hits (configured for one sampling hit per 1000 L2 events)
With False Sharing (Figure 3) | 40M (4x10M) | System.Random.InternalSample: 3,525; Program.DoWork: 1,828; System.Random.Next: 271; [mscorwks.dll]: 161; [ntdll.dll]: 78
Without False Sharing (Figure 4) | 40M (4x10M) | [mscorwks.dll]: 101; [ntdll.dll]: 76; System.Random.InternalSample: 53

The CPI metric for a specific function can be calculated as the ratio of two performance counters: Exclusive Non Halted Cycles and Exclusive Instructions Retired. This requires the use of two separate sampling-based profiling sessions, since only one of these counters can be profiled in a given run.

As we mentioned earlier, CPI for a given algorithm will be highly dependent on the hardware it's running on. Therefore, it's important to make comparisons against similar hardware only. One way of using this metric is to compare different runs of the same algorithm with various concurrency levels. This can be helpful in cases where it's difficult to compare L2 statistics as a result of execution time differences when concurrency levels are lowered, because CPI is independent of the running time.

If the same algorithm yields a noticeably smaller CPI value when run on a smaller number of CPUs with the same input data, it implies the algorithm is using the hardware less efficiently for high concurrency levels. If the CPI metric was focused on a mostly computational piece of the code, such that software synchronization effects are excluded, then this is a good indication that the reduction in hardware efficiency is potentially due to cache effects.

Is False Sharing Just Theoretical?

The description of the false sharing problem may sound rather esoteric, like something you will never encounter in practice. However, the problem is real and occurs surprisingly often in real-world scenarios.

While it is possible to write your parallel application without thinking about false sharing, depending on performance testing to detect such problems, it is generally a good idea to keep the false sharing issue in mind throughout the design and implementation phases of your application development. Since you are bothering to write a parallel application, presumably you care about performance. It would be a shame to go through the effort of designing your application so that it scales nicely across multiple cores, only to see much of the parallel speed-up wiped away by false sharing.

Another thing to keep in mind is that the false sharing problem is not always easy to catch via testing. The behavior is somewhat nondeterministic, may vary across different hardware (due to differences such as cache line sizes), and can change drastically after innocent changes to the code. In fact, many optimizations you might be used to implementing to improve application performance when writing single-threaded applications can actually lead to worse false sharing problems in multithreaded applications. If you are developing an application that demands high performance, thinking about false sharing early in the process will likely pay off in the end.

Memory Locations at Risk of False Sharing

False sharing only becomes a problem if the inefficiencies it introduces dominate the computation time. If a cached value is unnecessarily invalidated only once in a while, the effect on the overall execution time likely will be negligible. But if a thread spends much of its time reading from and writing to a particular memory location, and the cached copy of that memory location keeps getting invalidated by another thread due to false sharing, then resolving the false sharing issue is likely to significantly improve the performance of the application.

Due to the principle of locality, programs often spend much of their time accessing memory locations that are close to each other. For example, consider a common parallel-computing scenario: several threads are cooperatively computing a result. Each thread computes an intermediate result, and intermediate results from different threads are combined to produce the final result. If each thread spends much of the time reading and writing its intermediate result, and the intermediate results from different threads are close in memory and fit on the same cache line, false sharing may significantly impact the performance of the algorithm.

A simple example fitting this pattern is computing the sum of integers in an array in parallel. Each thread processes a section of the array, adding up the elements in that section into an accumulator. If the accumulators from different threads are located on the same cache line, performance is likely going to suffer as threads spend much of their time stomping over each other's cache, in effect serializing the execution of the application at the cache level.
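
Here is a rough sketch of that pattern (the method and variable names are hypothetical): each thread accumulates directly into its own slot of a shared array, and because adjacent slots typically share a cache line, the threads invalidate each other's cached copies on nearly every write.

static long SumWithAdjacentAccumulators(int[] data) {
  int threads = Environment.ProcessorCount;
  long[] partialSums = new long[threads];   // adjacent accumulators: a false-sharing hazard
  int chunk = data.Length / threads;

  Parallel.For(0, threads, t => {
    int start = t * chunk;
    int end = (t == threads - 1) ? data.Length : start + chunk;
    for (int i = start; i < end; i++) {
      partialSums[t] += data[i];            // repeated writes to a shared cache line
    }
  });

  long total = 0;
  foreach (long sum in partialSums) total += sum;
  return total;
}

Accumulating into a local variable and writing the partial result to the shared array only once at the end of each thread's loop removes nearly all of the contended writes, which is just another way of keeping the intermediate results apart.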

As an example of a more complex real-world situation, we hit the false sharing issue when working on the Parallel LINQ (PLINQ) project. PLINQ is a LINQ provider similar to LINQ-to-Objects, except that it distributes the work of evaluating the query across all cores available on the machine. Without getting into too much detail, PLINQ can represent a parallel computation using a set of enumerators (classes implementing the IEnumerator<T> interface). When iterating, each enumerator walks over a partition of the input and computes corresponding outputs. Different enumerators are moved forward by different threads and, as a result, the computation is distributed across multiple threads. If multiple cores are available on which to schedule the threads, parallel speedup will likely be achieved.

As it turns out, this scenario is susceptible to false sharing. We have several threads, each of which will be reading and modifying fields on its own enumerator. If the enumerators end up close in memory—and in a naive implementation where the enumerators are created and then handed out to threads (as in the Random example earlier), they are likely to—false sharing will significantly eat into the parallel speed-up.

Avoid Contiguous Allocation of Memory

While it is true that the .NET memory allocator hides many of the details of memory layout, in many cases we can make good guesses about certain properties of the layout, or even know some of them for sure. We can assume the following:

  • Two instance fields in the same class instance are close in memory location.
  • Two static fields in the same type are close in memory.
  • Two elements with adjacent indices in an array are adjacent in memory.
  • Objects allocated consecutively are likely to be close in memory, and may get closer over time.
  • Local variables used together in a closure are likely captured in instance fields, and thus, by the first point above, are likely close in memory.

In other cases, we know that two memory locations will likely be far apart. For example, objects allocated by different threads are likely to be far apart in memory.

If a thread allocates one object, then a number of medium or large objects, and then another object, the first and last objects are likely to be separated by a distance roughly equal to the combined size of the objects allocated between them. This behavior is not guaranteed, but it often occurs in practice. (However, if the objects in the middle are too large, they may end up in completely different heaps, in which case the first and last objects will likely be close in memory.)

If two objects are padded with unused fields whose total size is at least the size of the cache line, then the two objects will not end up on the same cache line. The padding fields should come last in memory. If we know the order in which objects appear in memory, it is sufficient to pad just the one that comes first.
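
As a rough sketch of that idea (assuming a 64-byte cache line, and with a hypothetical type name), a per-thread state object could be padded like this:

class PaddedCounter {
  public long Value;   // the state each thread actually reads and writes

  // 64 bytes of unused padding intended to push the next contiguously
  // allocated object onto a different cache line. The runtime ultimately
  // controls field layout, so treat this as best effort rather than a guarantee.
#pragma warning disable 0169
  private long pad1, pad2, pad3, pad4, pad5, pad6, pad7, pad8;
#pragma warning restore 0169
}

The threads in the earlier summation example could then each update the Value field of their own PaddedCounter rather than adjacent slots of a shared array.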

Based on these assumptions, we can design a parallel application to minimize the number of false sharing issues that we would otherwise have to debug later.

There is hope that future programming models will hide some of this complexity from the user. In the near term, however, understanding how memory and caches work will continue to be important for writing efficient parallel programs.

Send your questions and comments to netqa@microsoft.com.


Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team at Microsoft. He is also a Contributing Editor for MSDN Magazine.

Igor Ostrovsky is a Software Development Engineer on the Parallel Computing Platform team at Microsoft. He is the primary developer for Parallel LINQ (PLINQ).

Huseyin Yildiz is a Software Development Engineer on the Parallel Computing Platform team at Microsoft. He works on the Task Parallel Library (TPL).