Leader Election Pattern

Article
08/26/2015

Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances. This pattern can help to ensure that task instances do not conflict with each other, cause contention for shared resources, or inadvertently interfere with the work that other task instances are performing.

Context and Problem

A typical cloud application consists of many tasks acting in a coordinated manner. These tasks could all be instances running the same code and requiring access to the same resources, or they might be working together in parallel to perform the individual parts of a complex calculation.

The task instances might run autonomously for much of the time, but it may also be necessary to coordinate the actions of each instance to ensure that they don’t conflict, cause contention for shared resources, or inadvertently interfere with the work that other task instances are performing. For example:

In a cloud-based system that implements horizontal scaling, multiple instances of the same task could be running simultaneously with each instance servicing a different user. If these instances write to a shared resource, it may be necessary to coordinate their actions to prevent each instance from blindly overwriting the changes made by the others.
If the tasks are performing individual elements of a complex calculation in parallel, the results will need to be aggregated when they all complete.

Because the task instances are all peers, there is no natural leader that can act as the coordinator or aggregator.

Solution

A single task instance should be elected to act as the leader, and this instance should coordinate the actions of the other subordinate task instances. If all of the task instances are running the same code, they could all be capable of acting as the leader. Therefore, the election process must be managed carefully to prevent two or more instances taking over the leader role at the same time.

The system must provide a robust mechanism for selecting the leader. This mechanism must be able to cope with events such as network outages or process failures. In many solutions, the subordinate task instances monitor the leader through some type of heartbeat mechanism, or by polling. If the designated leader terminates unexpectedly, or a network failure renders the leader inaccessible by the subordinate task instances, it will be necessary for them to elect a new leader.

There are several strategies available for electing a leader amongst a set of tasks in a distributed environment, including:

Selecting the task instance with the lowest-ranked instance or process ID.
Racing to obtain a shared distributed mutex. The first task instance that acquires the mutex is the leader. However, the system must ensure that, if the leader terminates or becomes disconnected from the rest of the system, the mutex is released to allow another task instance to become the leader.
Implementing one of the common leader election algorithms such as the Bully Algorithm or the Ring Algorithm. These algorithms are relatively straightforward, but there are also a number of more sophisticated techniques available. These algorithms assume that each candidate participating in the election has a unique ID, and that they can communicate with the other candidates in a reliable manner.

Issues and Considerations

Consider the following points when deciding how to implement this pattern:

The process of electing a leader should be resilient to transient and persistent failures.
It must be possible to detect when the leader has failed or has become otherwise unavailable (perhaps due to a communications failure). The speed at which such detection is required will be system dependent. Some systems may be able to function for a short while without a leader, during which time a transient fault that caused the leader to become unavailable may have been rectified. In other cases, it may be necessary to detect leader failure immediately and trigger a new election.
In a system that implements horizontal autoscaling, the leader could be terminated if the system scales back and shuts down some of the computing resources.
Using a shared distributed mutex introduces a dependency on the availability of the external service that provides the mutex. This service may constitute a single point of failure. If this service should become unavailable for any reason, the system will not be able to elect a leader.
Using a single dedicated process as the leader is a relatively straightforward approach. However, if the process fails there may be a significant delay while it is restarted, and the resultant latency may affect the performance and response times of other processes if they are waiting for the leader to coordinate an operation.
Implementing one of the leader election algorithms manually provides the greatest flexibility for tuning and optimizing the code.

When to Use this Pattern

Use this pattern when the tasks in a distributed application, such as a cloud-hosted solution, require careful coordination and there is no natural leader.

Note

Avoid making the leader a bottleneck in the system. The purpose of the leader is to coordinate the work performed by the subordinate tasks, and it does not necessarily have to participate in this work itself—although it should be capable of doing so if the task is not elected as the leader.

This pattern might not be suitable:

If there is a natural leader or dedicated process that can always act as the leader. For example, it may be possible to implement a singleton process that coordinates the task instances. If this process fails or becomes unhealthy, the system can shut it down and restart it.
If the coordination between tasks can be easily achieved by using a more lightweight mechanism. For example, if several task instances simply require coordinated access to a shared resource, a preferable solution might be to use optimistic or pessimistic locking to control access to that resource.
If a third-party solution is more appropriate. For example, the Microsoft Azure HDInsight service (based on Apache Hadoop) uses the services provided by Apache Zookeeper to coordinate the map/reduce tasks that aggregate and summarize data. It’s also possible to install and configure Zookeeper on a Azure Virtual Machine and integrate it into your own solutions, or use the Zookeeper prebuilt virtual machine image available from Microsoft Open Technologies. For more information, see Apache Zookeeper on Microsoft Azure on the Microsoft Open Technologies website.

Example

The DistributedMutex project in the LeaderElection solution included in the sample code available for this guide shows how to use a lease on a Azure storage blob to provide a mechanism for implementing a shared distributed mutex. This mutex can be used to elect a leader amongst a group of role instances in a Azure cloud service. The first role instance to acquire the lease is elected the leader, and remains the leader until it releases the lease or until it is unable to renew the lease. Other role instances can continue to monitor the blob lease in the event that the leader is no longer available.

Note

A blob lease is an exclusive write lock over a blob. A single blob can be the subject of a maximum of one lease at any one point in time. A role instance can request a lease over a specified blob, and it will be granted the lease if no other lease over the same blob is currently held by this or any other role instance, otherwise the request will throw an exception.
To reduce the possibility that a faulted role instance retains the lease indefinitely, specify a lifetime for the lease. When this expires, the lease becomes available. However, while a role instance holds the lease it can request that the lease is renewed, and it will be granted the lease for a further period of time. The role instance can continually repeat this process if it wishes to retain the lease.
For more information on how to lease a blob, see Lease Blob (REST API) on MSDN.

The BlobDistributedMutex class in the example contains the RunTaskWhenMutexAquired method that enables a role instance to attempt to obtain a lease over a specified blob. The details of the blob (the name, container, and storage account) are passed to the constructor in a BlobSettings object when the BlobDistributedMutex object is created (this object is a simple struct that is included in the sample code). The constructor also accepts a Task that references the code that the role instance should run if it successfully acquires the lease over the blob and is elected the leader. Note that the code that handles the low-level details of obtaining the lease is implemented in a separate helper class named BlobLeaseManager.

public class BlobDistributedMutex{
  ...
  private readonly BlobSettings blobSettings;  private readonly Func<CancellationToken, Task> taskToRunWhenLeaseAcquired;
  ...

  public BlobDistributedMutex(BlobSettings blobSettings,            Func<CancellationToken, Task> taskToRunWhenLeaseAquired)  {    this.blobSettings = blobSettings;    this.taskToRunWhenLeaseAquired = taskToRunWhenLeaseAquired;  }public async Task RunTaskWhenMutexAcquired(CancellationToken token)  {    var leaseManager = new BlobLeaseManager(blobSettings);    await this.RunTaskWhenBlobLeaseAcquired(leaseManager, token);      } 
  ...

The RunTaskWhenMutexAquired method in the code sample above invokes the RunTaskWhenBlobLeaseAcquired method shown in the following code sample to actually acquire the lease. The RunTaskWhenBlobLeaseAcquired method runs asynchronously. If the lease is successfully acquired, the role instance has been elected the leader. The purpose of the taskToRunWhenLeaseAcquired delegate is to perform the work that coordinates the other role instances. If the lease is not acquired, another role instance has been elected as the leader and the current role instance remains a subordinate. Note that the TryAcquireLeaseOrWait method is a helper method that uses the BlobLeaseManager object to obtain the lease.

  ...
  private async Task RunTaskWhenBlobLeaseAcquired(    BlobLeaseManager leaseManager, CancellationToken token)  {while (!token.IsCancellationRequested)    {      // Try to acquire the blob lease.       // Otherwise wait for a short time before trying again.      string leaseId = await this.TryAquireLeaseOrWait(leaseManager, token);if (!string.IsNullOrEmpty(leaseId))      {        // Create a new linked cancellation token source so that if either the         // original token is cancelled or the lease cannot be renewed, the        // leader task can be cancelled.using (var leaseCts =           CancellationTokenSource.CreateLinkedTokenSource(new[] { token }))        {// Run the leader task.          var leaderTask = this.taskToRunWhenLeaseAquired.Invoke(leaseCts.Token);          
          ...
}}}}
  ...

The task started by the leader also executes asynchronously. While this task is running, the RunTaskWhenBlobLeaseAquired method shown in the following code sample periodically attempts to renew the lease. This action helps to ensure that the role instance remains the leader. In the sample solution, the delay between renewal requests is less than the time specified for the duration of the lease in order to prevent another role instance from being elected the leader. If the renewal fails for any reason, the task is cancelled.

If the lease fails to be renewed or the task is cancelled (possibly as a result of the role instance shutting down), the lease is released. At this point, this or another role instance might be elected as the leader. The code extract below shows this part of the process.

  ...
  private async Task RunTaskWhenBlobLeaseAcquired(    BlobLeaseManager leaseManager, CancellationToken token)  {    while (...){
      ...
      if (...)
      {
        ...
        using (var leaseCts = ...)
        {
          ...
          // Keep renewing the lease in regular intervals.           // If the lease cannot be renewed, then the task completes.          var renewLeaseTask =             this.KeepRenewingLease(leaseManager, leaseId, leaseCts.Token);          // When any task completes (either the leader task itself or when it could          // not renew the lease) then cancel the other task.          await CancelAllWhenAnyCompletes(leaderTask, renewLeaseTask, leaseCts);}
      }
    }
  }
  ...
}

The KeepRenewingLease method is another helper method that uses the BlobLeaseManager object to renew the lease. The CancelAllWhenAnyCompletes method cancels the tasks specified as the first two parameters.

Figure 1 illustrates the functions of the BlobDistributedMutex class.

Figure 1 - Using the BlobDistributedMutex class to elect a leader and run a task that coordinates operations

The following code example shows how to use the BlobDistributedMutex class in a worker role. This code obtains a lease over a blob named MyLeaderCoordinatorTask in the leases container in development storage, and specifies that the code defined in the MyLeaderCoordinatorTask method should run if the role instance is elected the leader.

var settings = new BlobSettings(CloudStorageAccount.DevelopmentStorageAccount,   "leases", "MyLeaderCoordinatorTask");var cts = new CancellationTokenSource();var mutex = new BlobDistributedMutex(settings, MyLeaderCoordinatorTask);mutex.RunTaskWhenMutexAcquired(this.cts.Token);
...

// Method that runs if the role instance is elected the leaderprivate static async Task MyLeaderCoordinatorTask(CancellationToken token){
  ...
}

Note the following points about the sample solution:

The blob is a potential single point of failure. If the blob service becomes unavailable, or the blob is inaccessible, the leader will be unable to renew the lease and no other role instance will be able to obtain the lease. In this case, no role instance will be able to act as the leader. However, the blob service is designed to be resilient, so complete failure of the blob service is considered to be extremely unlikely.
If the task being performed by the leader stalls, the leader might continue to renew the lease, preventing any other role instance from obtaining the lease and taking over the leader role in order to coordinate tasks. In the real world, the health of the leader should be checked at frequent intervals.
The election process is non-deterministic. You cannot make any assumptions about which role instance will obtain the blob lease and become the leader.
The blob used as the target of the blob lease should not be used for any other purpose. If a role instance attempts to store data in this blob, this data will not be accessible unless the role instance is the leader and holds the blob lease.

The following guidance may also be relevant when implementing this pattern:

Autoscaling Guidance. It may be possible to start and stop instances of the task hosts as the load on the application varies. Autoscaling can help to maintain throughput and performance during times of peak processing.
Compute Partitioning Guidance. This guidance describes how to allocate tasks to hosts in a cloud service in a way that helps to minimize running costs while maintaining the scalability, performance, availability, and security of the service.

More Information

The Task-based Asynchronous Pattern on MSDN.
An example illustrating the Bully Algorithm.
An example illustrating the Ring Algorithm.
The article Apache Zookeeper on Microsoft Azure on the Microsoft Open Technologies website.
The article Lease Blob (REST API) on MSDN.

This pattern has a sample application associated with it. You can download the "Cloud Design Patterns – Sample Code" from the Microsoft Download Center at https://aka.ms/cloud-design-patterns-sample.

Next Topic | Previous Topic | Home | Community

Share via

Leader Election Pattern

Context and Problem

Solution

Issues and Considerations

When to Use this Pattern

Example

More Information

Additional resources

Share via

Leader Election Pattern

Context and Problem

Solution

Issues and Considerations

When to Use this Pattern

Example

Related Patterns and Guidance

More Information

Additional resources