October 2011

Volume 26 Number 10

Asynchronous Programming - Async Performance: Understanding the Costs of Async and Await

By Stephen Toub | October 2011

Asynchronous programming has long been the realm of only the most skilled and masochistic of developers—those with the time, inclination and mental capacity to reason about callback after callback of non-linear control flow. With the Microsoft .NET Framework 4.5, C# and Visual Basic deliver asynchronicity for the rest of us, such that mere mortals can write asynchronous methods almost as easily as writing synchronous methods. No more callbacks. No more explicit marshaling of code from one synchronization context to another. No more worrying about the flowing of results or exceptions. No more tricks that contort existing language features to ease async development. In short, no more hassle.

Of course, while it’s now easy to get started writing asynchronous methods (see the articles by Eric Lippert and Mads Torgersen in this issue of MSDN Magazine), doing it really well still requires an understanding of what’s happening under the covers. Any time a language or framework raises the level of abstraction at which a developer can program, it invariably also encapsulates hidden performance costs. In many cases, such costs are negligible and can and should be ignored by the vast number of developers implementing the vast number of scenarios. However, it still behooves more advanced developers to really understand what costs exist so they can take any necessary steps to avoid those costs if they do eventually become visible. Such is the case with the asynchronous methods feature in C# and Visual Basic.

In this article, I’ll explore the ins and outs of asynchronous methods, providing you with a solid understanding of how asynchronous methods are implemented under the covers and discussing some of the more nuanced costs involved. Note that this information isn’t meant to encourage you to contort readable code into something that can’t be maintained, all in the name of micro-optimization and performance. It’s simply to give you information that may help you diagnose any problems you may run across, as well as supply a set of tools to help you overcome such potential issues. Note also that this article is based on a preview release of the .NET Framework 4.5, and it’s likely that specific implementation details will change prior to the final release.

Getting the Right Mental Model

For decades, developers have used high-level languages like C#, Visual Basic, F# and C++ to develop efficient applications. This experience has informed those developers about the relevant costs of various operations, and that knowledge has informed best development practices. For example, for most use cases, calling a synchronous method is relatively cheap, even more so when the compiler is able to inline the callee into the call site. Thus, developers learn to refactor code into small, maintainable methods, in general without needing to think about any negative ramifications from the increased method invocation count. These developers have a mental model for what it means to call a method.

With the introduction of asynchronous methods, a new mental model is needed. While the C# and Visual Basic languages and compilers are able to provide the illusion of an asynchronous method being just like its synchronous counterpart, under the covers it’s no such thing. The compiler ends up generating a lot of code on behalf of the developer, code akin to the quantities of boilerplate code that developers implementing asynchronicity in days of yore would’ve had to have written and maintained by hand. Further still, the compiler-generated code calls into library code in the .NET Framework, again increasing the work done on behalf of the developer. To get the right mental model, and then to use that mental model to make appropriate development decisions, it’s important to understand what the compiler is generating on your behalf.

Think Chunky, Not Chatty

When working with synchronous code, methods with empty bodies are practically free. This is not the case for asynchronous methods. Consider the following asynchronous method, which has a single statement in its body (and which due to lack of awaits will end up running synchronously):

public static async Task SimpleBodyAsync() {
  Console.WriteLine("Hello, Async World!");
}

An intermediate language (IL) decompiler will reveal the true nature of this function once compiled, with output similar to what’s shown in Figure 1. What was a simple one-liner has been expanded into two methods, one of which exists on a helper state machine class. First, there’s a stub method that has the same basic signature as that written by the developer (the method is named the same, it has the same visibility, it accepts the same parameters and it retains its return type), but that stub doesn’t contain any of the code written by the developer. Rather, it contains setup boilerplate. The setup code initializes the state machine used to represent the asynchronous method and then kicks it off using a call to the secondary MoveNext method on the state machine. This state machine type holds state for the asynchronous method, allowing that state to be persisted across asynchronous await points, if necessary. It also contains the body of the method as written by the user, but contorted in a way that allows for results and exceptions to be lifted into the returned Task; for the current position in the method to be maintained so that execution may resume at that location after an await; and so on.

Figure 1 Asynchronous Method Boilerplate

[DebuggerStepThrough]     
public static Task SimpleBodyAsync() {
  <SimpleBodyAsync>d__0 d__ = new <SimpleBodyAsync>d__0();
  d__.<>t__builder = AsyncTaskMethodBuilder.Create();
  d__.MoveNext();
  return d__.<>t__builder.Task;
}
 
[CompilerGenerated]
[StructLayout(LayoutKind.Sequential)]
private struct <SimpleBodyAsync>d__0 : <>t__IStateMachine {
  private int <>1__state;
  public AsyncTaskMethodBuilder <>t__builder;
  public Action <>t__MoveNextDelegate;
 
  public void MoveNext() {
    try {
      if (this.<>1__state == -1) return;
      Console.WriteLine("Hello, Async World!");
    }
    catch (Exception e) {
      this.<>1__state = -1;
      this.<>t__builder.SetException(e);
      return;
    }
 
    this.<>1__state = -1;
    this.<>t__builder.SetResult();
  }
 
  ...
}

When thinking through what asynchronous methods cost to invoke, keep this boilerplate in mind. The try/catch block in the MoveNext method will likely prevent it from getting inlined by the just-in-time (JIT) compiler, so at the very least we’ll now have the cost of a method invocation where in the synchronous case we likely would not (with such a small method body). We have multiple calls into Framework routines (like SetResult). And we have multiple writes to fields on the state machine type. Of course, we need to weigh all of this against the cost of the Console.WriteLine, which will likely dominate all of the other costs involved (it takes locks, it does I/O and so forth). Further, notice that there are optimizations the infrastructure does for you. For example, the state machine type is a struct. That struct will only be boxed to the heap if this method ever needs to suspend its execution because it’s awaiting an instance that’s not yet completed, and in this simple method, it never will complete. As such, the boilerplate of this asynchronous method won’t incur any allocations. The compiler and runtime work hard together to minimize the number of allocations involved in the infrastructure.

Know When Not to Use Async

The .NET Framework attempts to generate efficient asynchronous implementations for asynchronous methods, applying multiple optimizations. However, developers often have domain knowledge than can yield optimizations that would be risky and unwise for the compiler and runtime to apply automatically, given the generality they target. With this in mind, it can actually benefit a developer to avoid using async methods in a certain, small set of use cases, particularly for library methods that will be accessed in a more fine-grained manner. Typically, this is the case when it’s known that the method may actually be able to complete synchronously because the data it’s relying on is already available.

When designing asynchronous methods, the Framework developers spent a lot of time optimizing away object allocations. This is because allocations represent one of the largest performance costs possible in the asynchronous method infrastructure. The act of allocating an object is typically quite cheap. Allocating objects is akin to filling your shopping cart with merchandise, in that it doesn’t cost you much effort to put items into your cart; it’s when you actually check out that you need to pull out your wallet and invest significant resources. While allocations are usually cheap, the resulting garbage collection can be a showstopper when it comes to the application’s performance. The act of garbage collection involves scanning through some portion of objects currently allocated and finding those that are no longer referenced. The more objects allocated, the longer it takes to perform this marking. Further, the larger the allocated objects and the more of them that are allocated, the more frequently garbage collection needs to occur. In this manner, then, allocations have a global effect on the system: the more garbage generated by asynchronous methods, the slower the overall program will run, even if micro benchmarks of the asynchronous methods themselves don’t reveal significant costs.

For asynchronous methods that actually yield execution (due to awaiting an object that’s not yet completed), the asynchronous method infrastructure needs to allocate a Task object to return from the method, as that Task serves as a unique reference for this particular invocation. However, many asynchronous method invocations can complete without ever yielding. In such cases, the asynchronous method infrastructure may return a cached, already completed Task, one that it can use over and over to avoid allocating unnecessary Tasks. It’s only able to do this in limited circumstances, however, such as when the asynchronous method is a non-generic Task, a Task<Boolean>, or when it’s a Task<TResult> where TResult is a reference type and the result of the asynchronous method is null. While this set may expand in the future, you can often do better if you have domain knowledge of the operation being implemented.

Consider implementing a type like MemoryStream. MemoryStream derives from Stream, and thus can override Stream’s new .NET 4.5 ReadAsync, WriteAsync and FlushAsync methods to provide optimized implementations for the nature of MemoryStream. Because the operation of reading is simply going against an in-memory buffer and is therefore just a memory copy, better performance results if ReadAsync runs synchronously. Implementing this with an asynchronous method would look something like the following:

public override async Task<int> ReadAsync(
  byte [] buffer, int offset, int count,
  CancellationToken cancellationToken)
{
  cancellationToken.ThrowIfCancellationRequested();
  return this.Read(buffer, offset, count);
}

Easy enough. And because Read is a synchronous call, and because there are no awaits in this method that will yield control, all invocations of ReadAsync will actually complete synchronously. Now, let’s consider a standard usage pattern of streams, such as a copy operation:

byte [] buffer = new byte[0x1000];
int numRead;
while((numRead = await source.ReadAsync(buffer, 0, buffer.Length)) > 0) {
  await source.WriteAsync(buffer, 0, numRead);
}

Notice here that ReadAsync on the source stream for this particular series of calls is always invoked with the same count parameter (the buffer’s length), and thus it’s very likely that the return value (the number of bytes read) will also be repeating. Except in some rare circumstances, it’s very unlikely that the asynchronous method implementation of ReadAsync will be able to use a cached Task for its return value, but you can.

Consider rewriting this method as shown in Figure 2. By taking advantage of the specific aspects of this method and its common usage scenarios, we’ve now been able to optimize allocations away on the common path in a way we couldn’t expect the underlying infrastructure to do. With this, every time a call to ReadAsync retrieves the same number of bytes as the previous call to ReadAsync, we’re able to completely avoid any allocation overhead from the ReadAsync method by returning the same Task we returned on the previous invocation. And for a low-level operation like this that we expect to be very fast and to be invoked repeatedly, such an optimization can make a noticeable difference, especially in the number of garbage collections that occur.

Figure 2 Optimizing Task Allocations

private Task<int> m_lastTask;
 
public override Task<int> ReadAsync(
  byte [] buffer, int offset, int count,
  CancellationToken cancellationToken)
{
  if (cancellationToken.IsCancellationRequested) {
    var tcs = new TaskCompletionSource<int>();
    tcs.SetCanceled();
    return tcs.Task;
  }
 
  try {
      int numRead = this.Read(buffer, offset, count);
      return m_lastTask != null && numRead == m_lastTask.Result ?
        m_lastTask : (m_lastTask = Task.FromResult(numRead));
  }
  catch(Exception e) {
    var tcs = new TaskCompletionSource<int>();
    tcs.SetException(e);
    return tcs.Task;
  }
}

A related optimization to avoid the task allocation may be done when the scenario dictates caching. Consider a method whose purpose it is to download the contents of a particular Web page and then cache its successfully downloaded contents for future accesses. Such functionality might be written using an asynchronous method as follows (using the new System.Net.Http.dll library in .NET 4.5):

private static ConcurrentDictionary<string,string> s_urlToContents;
 
public static async Task<string> GetContentsAsync(string url)
{
  string contents;
  if (!s_urlToContents.TryGetValue(url, out contents))
  {
    var response = await new HttpClient().GetAsync(url);
    contents = response.EnsureSuccessStatusCode().Content.ReadAsString();
    s_urlToContents.TryAdd(url, contents);
  }
  return contents;
}

This is a straightforward implementation. And for calls to GetContentsAsync that can’t be satisfied from the cache, the overhead of constructing a new Task<string> to represent this download will be negligible when compared to the network-related costs. However, for cases where the contents may be satisfied from the cache, it could represent a non-negligible cost, an object allocation simply to wrap and hand back already available data.

To avoid that cost (if doing so is required to meet your performance goals), you could rewrite this method as shown in Figure 3. We now have two methods: a synchronous public method, and an asynchronous private method to which the public method delegates. The dictionary is now caching the generated tasks rather than their contents, so future attempts to download a page that’s already been successfully downloaded can be satisfied with a simple dictionary access to return an already existing task. Internally, we also take advantage of the ContinueWith methods on Task that allow us to store the task into the dictionary once the Task has completed—but only if the download succeeded. Of course, this code is more complicated and requires more thought to write and maintain, so as with any performance optimizations, avoid spending time making them until performance testing proves that the complications make an impactful and necessary difference. Whether such optimizations make a difference really depends on usage scenarios. You’ll want to come up with a suite of tests that represent common usage patterns, and use analysis of those tests to determine whether these complications improve your code’s performance in a meaningful way.

Figure 3 Manually Caching Tasks

private static ConcurrentDictionary<string,Task<string>> s_urlToContents;
 
public static Task<string> GetContentsAsync(string url) {
  Task<string> contents;
  if (!s_urlToContents.TryGetValue(url, out contents)) {
      contents = GetContentsInternalAsync(url);
      contents.ContinueWith(delegate {
        s_urlToContents.TryAdd(url, contents);
      }, CancellationToken.None,
        TaskContinuationOptions.OnlyOnRanToCompletion |
         TaskContinuatOptions.ExecuteSynchronously,
        TaskScheduler.Default);
  }
  return contents;
}
 
private static async Task<string> GetContentsInternalAsync(string url) {
  var response = await new HttpClient().GetAsync(url);
  return response.EnsureSuccessStatusCode().Content.ReadAsString();
}

Another task-related optimization to consider is whether you even need the returned Task from an asynchronous method. C# and Visual Basic both support the creation of asynchronous methods that return void, in which case no Task is allocated for the method, ever. Asynchronous methods exposed publicly from libraries should always be written to return a Task or Task<TResult>, because you as a library developer don’t know whether the consumer desires to wait on the completion of that method. However, for certain internal usage scenarios, void-returning asynchronous methods can have their place. The primary reason void-returning asynchronous methods exist is to support existing event-driven environments, like ASP.NET and Windows Presentation Foundation (WPF). They make it easy to implement button handlers, page-load events and the like through the use of async and await. If you do consider using an async void method, be very careful around exception handling: exceptions that escape an async void method bubble out into whatever SynchronizationContext was current at the time the async void method was invoked.

Care About Context

There are many kinds of “context” in the .NET Framework: LogicalCallContext, SynchronizationContext, HostExecutionContext, SecurityContext, ExecutionContext and more (from the sheer number you might expect that the developers of the Framework are monetarily incentivized to introduce new contexts, but I assure you we’re not). Some of these contexts are very relevant to asynchronous methods, not only in functionality, but also in their impact on asynchronous method performance. 

SynchronizationContext SynchronizationContext plays a big role in asynchronous methods. A “synchronization context” is simply an abstraction over the ability to marshal delegate invocation in a manner specific to a given library or framework. For example, WPF provides a DispatcherSynchronizationContext to represent the UI thread for a Dispatcher: posting a delegate to this synchronization context causes that delegate to be queued for execution by the Dispatcher on its thread. ASP.NET provides an AspNetSynchronizationContext, which is used to ensure that asynchronous operations that occur as part of the processing of an ASP.NET request are executed serially and are associated with the right HttpContext state. And so on. All told, there are around 10 concrete implementations of SynchronizationContext within the .NET Framework, some public, some internal.

When awaiting Tasks and other awaitable types provided by the .NET Framework, the “awaiters” for those types (like TaskAwaiter) capture the current SynchronizationContext at the time the await is issued. Upon completion of the awaitable, if there was a current SynchronizationContext that got captured, the continuation representing the remainder of the asynchronous method is posted to that SynchronizationContext. With that, developers writing an asynchronous method called from a UI thread don’t need to manually marshal invocations back to the UI thread in order to modify UI controls: such marshaling is handled automatically by the Framework infrastructure.

Unfortunately, such marshaling also involves cost. For application developers using await to implement their control flow, this automatic marshaling is almost always the right solution. Libraries, however, are often a different story. Application developers typically need such marshaling because their code cares about the context under which it’s running, such as being able to access UI controls, or being able to access the HttpContext for the right ASP.NET request. Most libraries, however, do not suffer this constraint. As a result, this automatic marshaling is frequently an entirely unnecessary cost. Consider again the code shown earlier to copy data from one stream to another:

byte [] buffer = new byte[0x1000];
int numRead;
while((numRead = await source.ReadAsync(buffer, 0, buffer.Length)) > 0) {
  await source.WriteAsync(buffer, 0, numRead);
}

If this copy operation is invoked from a UI thread, every awaited read and write operation will force the completion back to the UI thread. For a megabyte of source data and Streams that complete reads and writes asynchronously (which is most of them), that means upward of 500 hops from background threads to the UI thread. To address this, the Task and Task<TResult> types provide a ConfigureAwait method. ConfigureAwait accepts a Boolean continueOnCapturedContext parameter that controls this marshaling behavior. If the default of true is used, the await will automatically complete back on the captured SynchronizationContext. If false is used, however, the SynchronizationContext will be ignored and the Framework will attempt to continue the execution wherever the previous asynchronous operation completed. Incorporating this into the stream-copying code results in the following more efficient version:

byte [] buffer = new byte[0x1000];
int numRead;
while((numRead = await
  source.ReadAsync(buffer, 0, buffer.Length).ConfigureAwait(false)) > 0) {
  await source.WriteAsync(buffer, 0, numRead).ConfigureAwait(false);
}

For library developers, this performance impact alone is sufficient to warrant always using ConfigureAwait, unless it’s the rare circumstance where the library has domain knowledge of its environment and does need to execute the body of the method with access to the correct context.

There’s another reason, beyond performance, to use ConfigureAwait in library code. Suppose the preceding code, without ConfigureAwait, was in a method called CopyStreamToStreamAsync, which was invoked from a WPF UI thread, like so:

private void button1_Click(object sender, EventArgs args) {
  Stream src = …, dst = …;
  Task t = CopyStreamToStreamAsync(src, dst);
  t.Wait(); // deadlock!
}

Here, the developer should have written button1_Click as an async method and then await-ed the Task instead of using its synchronous Wait method. The Wait method has its important uses, but it’s almost always wrong to use it for waiting in a UI thread like this. The Wait method won’t return until the Task has completed. In the case of CopyStreamToStreamAsync, the contained awaits try to Post back to the captured SynchronizationContext, and the method can’t complete until those Posts complete (because the Posts are used to process the remainder of the method). But those Posts won’t complete, because the UI thread that would process them is blocked in the call to Wait. This is a circular dependency, resulting in a deadlock. If CopyStreamToStreamAsync had instead been written using ConfigureAwait(false), there would be no circular dependency and no deadlock.

ExecutionContext ExecutionContext is an integral part of the .NET Framework, yet most developers are blissfully unaware of its existence. ExecutionContext is the granddaddy of contexts, encapsulating multiple other contexts like SecurityContext and LogicalCallContext, and representing everything that should be automatically flowed across asynchronous points in code. Any time you’ve used ThreadPool.QueueUserWorkItem, Task.Run, Delegate.BeginInvoke, Stream.BeginRead, WebClient.DownloadStringAsync or any other asynchronous operation in the Framework, under the covers ExecutionContext was captured if possible (via ExecutionContext.Capture), and that captured context was then used to process the provided delegate (via ExecutionContext.Run). For example, if the code invoking ThreadPool.QueueUserWorkItem was impersonating a Windows identity at the time, that same Windows identity would be impersonated in order to run the supplied WaitCallback delegate. And if the code invoking Task.Run had first stored data into the LogicalCallContext, that same data would be accessible through the LogicalCallContext within the supplied Action delegate. ExecutionContext is also flowed across awaits on tasks.

There are multiple optimizations in place in the Framework to avoid capturing and running under a captured ExecutionContext when doing so is unnecessary, as doing so can be quite expensive. However, actions like impersonating a Windows identity or storing data into LogicalCallContext will thwart these optimizations. Avoiding operations that manipulate ExecutionContext, such as WindowsIdentity.Impersonate and CallContext.LogicalSetData, results in better performance when using asynchronous methods, and when using asynchrony in general.

Lift Your Way out of Garbage Collection

Asynchronous methods provide a nice illusion when it comes to local variables. In a synchronous method, local variables in C# and Visual Basic are stack-based, such that no heap allocations are necessary to store those locals. However, in asynchronous methods, the stack for the method goes away when the asynchronous method is suspending at an await point. For data to be available to the method after an await resumes, that data must be stored somewhere. Thus, the C# and Visual Basic compilers “lift” locals into a state machine struct, which is then boxed to the heap at the first await that suspends so that locals may survive across await points.

Earlier in this article, I discussed how the cost and frequency of garbage collection is influenced by the number of objects allocated, while the frequency of garbage collection is also influenced by the size of objects allocated. The bigger the objects being allocated, the more often garbage collection will need to run. Thus, in an asynchronous method, the more locals that need to be lifted to the heap, the more often garbage collections will occur.

As of the time of this writing, the C# and Visual Basic compilers sometimes lift more than is truly necessary. For example, consider the following code snippet:

public static async Task FooAsync() {
  var dto = DateTimeOffset.Now;
  var dt  = dto.DateTime;
  await Task.Yield();
  Console.WriteLine(dt);
}

The dto variable isn’t read at all after the await point, and thus the value written to it before the await doesn’t need to survive across the await. However, the state machine type generated by the compiler to store locals still contains the dto reference, as shown in Figure 4.

Figure 4 Local Lifting

[StructLayout(LayoutKind.Sequential), CompilerGenerated]
private struct <FooAsync>d__0 : <>t__IStateMachine {
  private int <>1__state;
  public AsyncTaskMethodBuilder <>t__builder;
  public Action <>t__MoveNextDelegate;
 
  public DateTimeOffset <dto>5__1;
  public DateTime <dt>5__2;
  private object <>t__stack;
  private object <>t__awaiter;
 
  public void MoveNext();
  [DebuggerHidden]
  public void <>t__SetMoveNextDelegate(Action param0);
}

This slightly bloats the size of that heap object beyond what’s truly necessary. If you find that garbage collections are occurring more frequently than you expect, take a look at whether you really need all of the temporary variables you’ve coded into your asynchronous method. This example could be rewritten as follows to avoid the extra field on the state machine class:

public static async Task FooAsync() {
  var dt = DateTimeOffset.Now.DateTime;
  await Task.Yield();
  Console.WriteLine(dt);
}

Moreover, the .NET garbage collector (GC) is a generational collector, meaning that it partitions the set of objects into groups, known as generations: at a high-level, new objects are allocated in generation 0, and then all objects that survive a collection are promoted up a generation (the .NET GC currently uses generations 0, 1 and 2). This enables faster collections by allowing the GC to frequently collect only from a subset of the known object space. It’s based on the philosophy that objects newly allocated will also go away quickly, while objects that have been around for a long time will continue to be around for a long time. What this means is that if an object survives generation 0, it will likely end up being around for a while, continuing to put pressure on the system for that additional time. And that means we really want to ensure that objects are made available to garbage collection as soon as they’re no longer needed.

With the aforementioned lifting, locals get promoted to fields of a class that stays rooted for the duration of the asynchronous method’s execution (as long as the awaited object properly maintains a reference to the delegate to invoke upon completion of the awaited operation). In synchronous methods, the JIT compiler is able to keep track of when locals will never again be accessed, and at such points can help the GC to ignore those variables as roots, thus making the referenced objects available for collection if they’re not referenced anywhere else. However, in asynchronous methods, these locals remain referenced, which means the objects they reference may survive much longer than if these had been real locals. If you find that objects are remaining alive well past their use, consider nulling out the locals referencing those objects when you’re done with them. Again, this should be done only if you find that it’s actually the cause of a performance problem, as it otherwise complicates the code unnecessarily. Furthermore, the C# and Visual Basic compilers could be updated by final release or otherwise in the future to handle more of these scenarios on the developer’s behalf, so any such code written today is likely to become obsolete in the future.

Avoid Complexity

The C# and Visual Basic compilers are fairly impressive in terms of where you’re allowed to use awaits: almost anywhere. Await expressions may be used as part of larger expressions, allowing you to await Task<TResult> instances in places you might have any other value-returning expression. For example, consider the following code, which returns the sum of three tasks’ results:

public static async Task<int> SumAsync(
  Task<int> a, Task<int> b, Task<int> c)
{
  return Sum(await a, await b, await c);
}
 
private static int Sum(int a, int b, int c)
{
  return a + b + c;
}

The C# compiler allows you to use the expression “await b” as an argument to the Sum function. However, there are multiple awaits here whose results are passed as parameters to Sum, and due to order of evaluation rules and how async is implemented in the compiler, this particular example requires the compiler to “spill” the temporary results of the first two awaits. As you saw previously, locals are preserved across await points by having them lifted into fields on the state machine class. However, for cases like this one, where the values are on the CLR evaluation stack, those values aren’t lifted into the state machine but are instead spilled to a single temporary object and then referenced by the state machine. When you complete the await on the first task and go to await the second one, the compiler generates code that boxes the first result and stores the boxed object into a single <>t__stack field on the state machine. When you complete the await on the second task and go to await the third one, the compiler generates code that creates a Tuple<int,int> from the first two values, storing that tuple into the same <>__stack field. This all means that, depending on how you write your code, you could end up with very different allocation patterns. Consider instead writing SumAsync as follows:

public static async Task<int> SumAsync(
  Task<int> a, Task<int> b, Task<int> c)
{
  int ra = await a;
  int rb = await b;
  int rc = await c;
  return Sum(ra, rb, rc);
}

With this change, the compiler will now emit three more fields onto the state machine class to store ra, rb and rc, and no spilling will occur. Thus, you have a trade-off: a larger state machine class with fewer allocations, or a smaller state machine class with more allocations. The total amount of memory allocated will be larger in the spilling case, as each object allocated has its own memory overhead, but in the end performance testing could reveal that’s still better. In general, as mentioned previously, you shouldn’t think through these kinds of micro-optimizations unless you find that the allocations are actually the cause of grief, but regardless, it’s helpful to know where these allocations are coming from.

Of course, there’s arguably a much larger cost in the preceding examples that you should be aware of and proactively consider. The code isn’t able to invoke Sum until all three awaits have completed, and no work is done in between the awaits. Each of these awaits that yields requires a fair amount of work, so the fewer awaits you need to process, the better. It would behoove you, then, to combine all three of these awaits into just one by waiting on all of the tasks at once with Task.WhenAll:

public static async Task<int> SumAsync(
  Task<int> a, Task<int> b, Task<int> c)
{
  int [] results = await Task.WhenAll(a, b, c);
  return Sum(results[0], results[1], results[2]);
}

The Task.WhenAll method here returns a Task<TResult[]> that won’t complete until all of the supplied tasks have completed, and it does so much more efficiently than just waiting on each individual task. It also gathers up the result from each task and stores it into an array. If you want to avoid that array, you can do that by forcing binding to the non-generic WhenAll method that works with Task instead of Task<TResult>. For ultimate performance, you could also take a hybrid approach, where you first check to see if all of the tasks have completed successfully, and if they have, get their resultsindividually—but if they haven’t, then await a WhenAll of those that haven’t. That will avoid any allocations involved in the call to WhenAll when it’s unnecessary, such as allocating the params array to be passed into the method. And, as previously mentioned, we’d want this library function to also suppress context marshaling. Such a solution is shown in Figure 5.

Figure 5 Applying Multiple Optimizations

public static Task<int> SumAsync(
  Task<int> a, Task<int> b, Task<int> c)
{
  return (a.Status == TaskStatus.RanToCompletion &&
          b.Status == TaskStatus.RanToCompletion &&
          c.Status == TaskStatus.RanToCompletion) ?
    Task.FromResult(Sum(a.Result, b.Result, c.Result)) :
    SumAsyncInternal(a, b, c);
}
 
private static async Task<int> SumAsyncInternal(
  Task<int> a, Task<int> b, Task<int> c)
{
  await Task.WhenAll((Task)a, b, c).ConfigureAwait(false);
  return Sum(a.Result, b.Result, c.Result);
}

Asynchronicity and Performance

Asynchronous methods are a powerful productivity tool, enabling you to more easily write scalable and responsive libraries and applications. It’s important to keep in mind, though, that asynchronicity is not a performance optimization for an individual operation. Taking a synchronous operation and making it asynchronous will invariably degrade the performance of that one operation, as it still needs to accomplish everything that the synchronous operation did, but now with additional constraints and considerations. A reason you care about asynchronicity, then, is performance in the aggregate: how your overall system performs when you write everything asynchronously, such that you can overlap I/O and achieve better system utilization by consuming valuable resources only when they’re actually needed for execution. The asynchronous method implementation provided by the .NET Framework is well-optimized, and often ends up providing as good or better performance than well-written asynchronous implementations using existing patterns and volumes more code. Any time you’re planning to develop asynchronous code in the .NET Framework from now on, asynchronous methods should be your tool of choice. Still, it’s good for you as a developer to be aware of everything the Framework is doing on your behalf in these asynchronous methods, so you can ensure the end result is as good as it can possibly be.


Stephen Toub is a principal architect on the Parallel Computing Platform team at Microsoft.

Thanks to the following technical experts for reviewing this article: Joe HoagEric Lippert, Danny Shih and Mads Torgersen