Implement custom storage for your bot

Article
11/01/2022

APPLIES TO: SDK v4

A bot's interactions fall into three areas: the exchange of activities with Azure AI Bot Service, the loading and saving of bot and dialog state with a memory store, and integration with back-end services.

Interaction diagram outlining relationship between the Azure AI Bot Service, a bot, a memory store, and other services.

This article explores how to extend the semantics between the Azure AI Bot Service and the bot's memory state and storage.

Note

The Bot Framework JavaScript, C#, and Python SDKs will continue to be supported, however, the Java SDK is being retired with final long-term support ending in November 2023.

Existing bots built with the Java SDK will continue to function.

For new bot building, consider using Microsoft Copilot Studio and read about choosing the right copilot solution.

For more information, see The future of bot building.

Prerequisites

Knowledge of Basics of the Microsoft Bot Framework, Event-driven conversations using an activity handler, and Managing state.
A copy of the scale-out sample in C#, Python, or Java.

This article focuses on the C# version of the sample.

Background

The Bot Framework SDK includes a default implementation of bot state and memory storage. This implementation fits the needs of applications where the pieces are used together with a few lines of initialization code, as demonstrated in many of the samples.

The SDK is a framework and not an application with fixed behavior. In other words, the implementation of many of the mechanisms in the framework is a default implementation and not the only possible implementation. The framework doesn't dictate the relationship between the exchange of activities with Azure AI Bot Service and the loading and saving of any bot state.

This article describes one way to modify the semantics of the default state and storage implementation when it doesn't quite work for your application. The scale-out sample provides an alternate implementation of state and storage that has different semantics than the default ones. This alternate solution sits equally well in the framework. Depending on your scenario, this alternate solution may be more appropriate for the application you're developing.

Behavior of the default adapter and storage provider

With the default implementation, on receiving an activity, the bot loads the state corresponding to the conversation. It then runs the dialog logic with this state and the inbound activity. In the process of running the dialog, one or more outbound activities are created and immediately sent. When the processing of the dialog is complete, the bot saves the updated state, overwriting the old state.

Sequence diagram showing the default behavior of a bot and its memory store.

However, a few things can go wrong with this behavior.

If the save operation fails for some reason, then state has implicitly slipped out of sync with what the user sees on the channel. The user has seen responses from the bot and believes that the state has moved forward, but it hasn't. This error can be worse than if the state update succeeded but the user didn't receive the response messages.

Such state errors can have implications for your conversation design. For example, the dialog might require extra, otherwise redundant, confirmation exchanges with the user.
If the implementation is deployed scaled out across multiple nodes, the state can accidentally get overwritten. This error can be confusing because the dialog will likely have sent activities to the channel carrying confirmation messages.

Consider a pizza order bot, where the bot asks user for topping choices, and the user sends two rapid messages: one to add mushrooms and one to add cheese. In a scaled-out scenario, multiple instances of the bot might be active, and the two user messages could be handled by two separate instances on separate machines. Such a conflict is referred to as a race condition, where one machine might overwrite the state written by another. However, because the responses were already sent, the user received confirmation that both mushroom and cheese were added to their order. Unfortunately, when the pizza arrives, it only contains mushroom or cheese, but not both.

Optimistic locking

The scale-out sample introduces some locking around the state. The sample implements optimistic locking, which lets each instance run as if it were the only one running and then check for any concurrency violations. This locking may sound complicated, but known solutions exist, and you can use cloud storage technologies and the right extension points in the Bot Framework.

The sample uses a standard HTTP mechanism based on the entity tag header (ETag). Understanding this mechanism is crucial to understanding the code that follows. The following diagram illustrates the sequence.

Sequence diagram showing a race condition, with the second update failing.

The diagram has two clients that are performing an update to some resource.

When a client issues a GET request and a resource is returned from the server, the server includes an ETag header.

The ETag header is an opaque value that represents the state of the resource. If a resource is changed, the server updates its ETag for the resource.
When the client wants to persist a state change, it issues a POST request to the server, with the ETag value in an If-Match precondition header.
If the request's ETag value doesn't match the server's, then the precondition check fails with a 412 (Precondition Failed) response.

This failure indicates that the current value on server no longer matches the original value the client was operating on.
If the client receives a precondition failed response, the client typically gets a fresh value for the resource, applies the update it wanted, and attempts to post the resource update again.

This second POST request succeeds if no other client has updated the resource. Otherwise, the client can try again.

This process is called optimistic because the client, once it has a resource, proceeds to do its processing—the resource itself isn't locked, as other clients can access it without any restriction. Any contention between clients over what the state of the resource should be isn't determined until the processing has been done. In a distributed system, this strategy is often more optimal than the opposite pessimistic approach.

The optimistic locking mechanism as described assumes that your program logic can be safely retried. The ideal situation is where these service requests are idempotent. In computer science, an idempotent operation is one that has no extra effect if it's called more than once with the same input parameters. Pure HTTP REST services that implement the GET, PUT, and DELETE requests are often idempotent. If a service request won't produce extra effects, then requests can be safely re-executed as part of a retry strategy.

The scale-out sample and the remainder of this article assume that the backend services your bot uses are all idempotent HTTP REST services.

Buffering outbound activities

Sending an activity isn't an idempotent operation. The activity is often a message that relays information to the user, and repeating the same message two or more times might be confusing or misleading.

Optimistic locking implies that your bot logic may need to be rerun multiple times. To avoid sending any given activity multiple times, wait for the state update operation to succeed before sending activities to the user. Your bot logic should look something like the following diagram.

Sequence diagram with messages being sent after dialog state is saved.

Once you build a retry loop into your dialog execution, you have the following behavior when there's a precondition failure on the save operation.

Sequence diagram with messages being sent after a retry attempt succeeds.

With this mechanism in place, the pizza bot from the earlier example should never send an erroneous positive acknowledgment of a pizza topping being added to an order. Even with the bot deployed across multiple machines, the optimistic locking scheme effectively serializes the state updates. In the pizza bot, the acknowledgment from adding an item can now even reflect the full state accurately. For example, if the user quickly types "cheese" and then "mushroom", and these messages are handled by two different instances of the bot, the last instance to complete can include "a pizza with cheese and mushroom" as part of its response.

This new custom storage solution does three things that the default implementation in the SDK doesn't do:

It uses ETags to detect contention.
It retries the processing when an ETag failure is detected.
It waits to send outbound activities until it has successfully saved state.

The remainder of this article describes the implementation of these three parts.

Implement ETag support

First, define an interface for our new store that includes ETag support. The interface helps use the dependency injection mechanisms in ASP.NET. Starting with the interface allows you to implement separate versions for unit tests and for production. For example, the unit test version might run in memory and not require a network connection.

The interface consists of load and save methods. Both methods will use a key parameter to identify the state to load from or save to storage.

Load will return the state value and associated ETag.
Save will have parameters for the state value and associated ETag and return a Boolean value that indicates whether the operation succeeded. The return value won't serve as a general error indicator, but instead as a specific indicator of precondition failure. Checking the return code will part of the logic of the retry loop.

To make the storage implementation widely applicable, avoid placing serialization requirements on it. However, many modern storage services support JSON as the content-type. In C#, you can use the JObject type to represent a JSON object. In JavaScript or TypeScript, JSON is a regular native object.

Here's a definition of the custom interface.

IStore.cs

public interface IStore
{
    Task<(JObject content, string etag)> LoadAsync(string key);

    Task<bool> SaveAsync(string key, JObject content, string etag);
}

Here's an implementation for Azure Blob Storage.

BlobStore.cs

public class BlobStore : IStore
{
    private readonly CloudBlobContainer _container;

    public BlobStore(string accountName, string accountKey, string containerName)
    {
        if (string.IsNullOrWhiteSpace(accountName))
        {
            throw new ArgumentException(nameof(accountName));
        }

        if (string.IsNullOrWhiteSpace(accountKey))
        {
            throw new ArgumentException(nameof(accountKey));
        }

        if (string.IsNullOrWhiteSpace(containerName))
        {
            throw new ArgumentException(nameof(containerName));
        }

        var storageCredentials = new StorageCredentials(accountName, accountKey);
        var cloudStorageAccount = new CloudStorageAccount(storageCredentials, useHttps: true);
        var client = cloudStorageAccount.CreateCloudBlobClient();
        _container = client.GetContainerReference(containerName);
    }

    public async Task<(JObject content, string etag)> LoadAsync(string key)
    {
        if (string.IsNullOrWhiteSpace(key))
        {
            throw new ArgumentException(nameof(key));
        }

        var blob = _container.GetBlockBlobReference(key);
        try
        {
            var content = await blob.DownloadTextAsync();
            var obj = JObject.Parse(content);
            var etag = blob.Properties.ETag;
            return (obj, etag);
        }
        catch (StorageException e)
            when (e.RequestInformation.HttpStatusCode == (int)HttpStatusCode.NotFound)
        {
            return (new JObject(), null);
        }
    }

    public async Task<bool> SaveAsync(string key, JObject obj, string etag)
    {
        if (string.IsNullOrWhiteSpace(key))
        {
            throw new ArgumentException(nameof(key));
        }

        if (obj == null)
        {
            throw new ArgumentNullException(nameof(obj));
        }

        var blob = _container.GetBlockBlobReference(key);
        blob.Properties.ContentType = "application/json";
        var content = obj.ToString();
        if (etag != null)
        {
            try
            {
                await blob.UploadTextAsync(content, Encoding.UTF8, new AccessCondition { IfMatchETag = etag }, new BlobRequestOptions(), new OperationContext());
            }
            catch (StorageException e)
                when (e.RequestInformation.HttpStatusCode == (int)HttpStatusCode.PreconditionFailed)
            {
                return false;
            }
        }
        else
        {
            await blob.UploadTextAsync(content);
        }

        return true;
    }
}

Azure Blob Storage does much of the work. Each method checks for a specific exception to meet the expectations of the calling code.

The LoadAsync method, in response to a storage exception with a not found status code, returns a null value.
The SaveAsync method, in response to a storage exception with a precondition failed code, returns false.

Implement a retry loop

The design of the retry loop implements the behavior shown in the sequence diagrams.

On receiving an activity, create a key for the conversation state.

The relationship between an activity and the conversation state is the same for the custom storage as for the default implementation. Therefore, you can construct the key the same way that the default state implementation does.
Attempt to load the conversation state.
Run the bot's dialogs and capture the outbound activities to send.
Attempt to save the conversation state.
- On success, send the outbound activities and exit.
- On failure, repeat this process from the step to load the conversation state.
  
  The new load of conversation state gets a new and current ETag and conversation state. The dialog is rerun, and the save state step has a chance to succeed.

Here's an implementation for the message activity handler.

ScaleoutBot.cs

protected override async Task OnMessageActivityAsync(ITurnContext<IMessageActivity> turnContext, CancellationToken cancellationToken)
{
    // Create the storage key for this conversation.
    var key = $"{turnContext.Activity.ChannelId}/conversations/{turnContext.Activity.Conversation?.Id}";

    // The execution sits in a loop because there might be a retry if the save operation fails.
    while (true)
    {
        // Load any existing state associated with this key
        var (oldState, etag) = await _store.LoadAsync(key);

        // Run the dialog system with the old state and inbound activity, the result is a new state and outbound activities.
        var (activities, newState) = await DialogHost.RunAsync(_dialog, turnContext.Activity, oldState, cancellationToken);

        // Save the updated state associated with this key.
        var success = await _store.SaveAsync(key, newState, etag);

        // Following a successful save, send any outbound Activities, otherwise retry everything.
        if (success)
        {
            if (activities.Any())
            {
                // This is an actual send on the TurnContext we were given and so will actual do a send this time.
                await turnContext.SendActivitiesAsync(activities, cancellationToken);
            }

            break;
        }
    }
}

Note

The sample implements dialog execution as a function call. A more sophisticated approach might be to define an interface and use dependency injection. For this example, however, the static function emphasizes the functional nature of this optimistic locking approach. In general, when you implement the crucial parts of your code in a functional way, you improve its chances to work successfully on networks.

Implement an outbound activity buffer

The next requirement is to buffer outbound activities until after a successful save operation happens, which requires a custom adapter implementation. The custom SendActivitiesAsync method shouldn't send the activities to the use, but add the activities to a list. Your dialog code won't need modification.

In this particular scenario, the update activity and delete activity operations aren't supported and the associated methods will throw not implemented exceptions.
The return value from the send activities operation is used by some channels to allow a bot to modify or delete a previously sent message, for example, to disable buttons on cards displayed in the channel. These message exchanges can get complicated, particularly when state is required, and are outside the scope of this article.
Your dialog creates and uses this custom adapter, so it can buffer activities.
Your bot's turn handler will use a more standard AdapterWithErrorHandler to send the activities to the user.

Here's an implementation of the custom adapter.

DialogHostAdapter.cs

public class DialogHostAdapter : BotAdapter
{
    private List<Activity> _response = new List<Activity>();

    public IEnumerable<Activity> Activities => _response;

    public override Task<ResourceResponse[]> SendActivitiesAsync(ITurnContext turnContext, Activity[] activities, CancellationToken cancellationToken)
    {
        foreach (var activity in activities)
        {
            _response.Add(activity);
        }

        return Task.FromResult(new ResourceResponse[0]);
    }

    #region Not Implemented
    public override Task DeleteActivityAsync(ITurnContext turnContext, ConversationReference reference, CancellationToken cancellationToken)
    {
        throw new NotImplementedException();
    }

    public override Task<ResourceResponse> UpdateActivityAsync(ITurnContext turnContext, Activity activity, CancellationToken cancellationToken)
    {
        throw new NotImplementedException();
    }
    #endregion
}

Use your custom storage in a bot

The last step is to use these custom classes and methods with existing framework classes and methods.

The main retry loop becomes part of your bot's ActivityHandler.OnMessageActivityAsync method and includes your custom storage through dependency injection.
The dialog hosting code is added to DialogHost class that exposes a static RunAsync method. The dialog host:
- Takes the inbound activity and the old state and then returns the resulting activities and new state.
- Creates the custom adapter and otherwise runs the dialog in the same way as the SDK does.
- Creates a custom state property accessor, a shim that passes the dialog state into the dialog system. The accessor uses reference semantics to pass an accessor handle to the dialog system.

Tip

The JSON serialization is added inline to the hosting code to keep it outside of the pluggable storage layer, so that different implementations can serialize differently.

Here's an implementation of the dialog host.

DialogHost.cs

public static class DialogHost
{
    // The serializer to use. Moving the serialization to this layer will make the storage layer more pluggable.
    private static readonly JsonSerializer StateJsonSerializer = new JsonSerializer() { TypeNameHandling = TypeNameHandling.All };

    /// <summary>
    /// A function to run a dialog while buffering the outbound Activities.
    /// </summary>
    /// <param name="dialog">THe dialog to run.</param>
    /// <param name="activity">The inbound Activity to run it with.</param>
    /// <param name="oldState">Th eexisting or old state.</param>
    /// <returns>An array of Activities 'sent' from the dialog as it executed. And the updated or new state.</returns>
    public static async Task<(Activity[], JObject)> RunAsync(Dialog dialog, IMessageActivity activity, JObject oldState, CancellationToken cancellationToken)
    {
        // A custom adapter and corresponding TurnContext that buffers any messages sent.
        var adapter = new DialogHostAdapter();
        var turnContext = new TurnContext(adapter, (Activity)activity);

        // Run the dialog using this TurnContext with the existing state.
        var newState = await RunTurnAsync(dialog, turnContext, oldState, cancellationToken);

        // The result is a set of activities to send and a replacement state.
        return (adapter.Activities.ToArray(), newState);
    }

    /// <summary>
    /// Execute the turn of the bot. The functionality here closely resembles that which is found in the
    /// IBot.OnTurnAsync method in an implementation that is using the regular BotFrameworkAdapter.
    /// Also here in this example the focus is explicitly on Dialogs but the pattern could be adapted
    /// to other conversation modeling abstractions.
    /// </summary>
    /// <param name="dialog">The dialog to be run.</param>
    /// <param name="turnContext">The ITurnContext instance to use. Note this is not the one passed into the IBot OnTurnAsync.</param>
    /// <param name="state">The existing or old state of the dialog.</param>
    /// <returns>The updated or new state of the dialog.</returns>
    private static async Task<JObject> RunTurnAsync(Dialog dialog, ITurnContext turnContext, JObject state, CancellationToken cancellationToken)
    {
        // If we have some state, deserialize it. (This mimics the shape produced by BotState.cs.)
        var dialogStateProperty = state?[nameof(DialogState)];
        var dialogState = dialogStateProperty?.ToObject<DialogState>(StateJsonSerializer);

        // A custom accessor is used to pass a handle on the state to the dialog system.
        var accessor = new RefAccessor<DialogState>(dialogState);

        // Run the dialog.
        await dialog.RunAsync(turnContext, accessor, cancellationToken);

        // Serialize the result (available as Value on the accessor), and put its value back into a new JObject.
        return new JObject { { nameof(DialogState), JObject.FromObject(accessor.Value, StateJsonSerializer) } };
    }
}

And finally, here's an implementation of the custom state property accessor.

RefAccessor.cs

public class RefAccessor<T> : IStatePropertyAccessor<T>
    where T : class
{
    public RefAccessor(T value)
    {
        Value = value;
    }

    public T Value { get; private set; }

    public string Name => nameof(T);

    public Task<T> GetAsync(ITurnContext turnContext, Func<T> defaultValueFactory = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        if (Value == null)
        {
            if (defaultValueFactory == null)
            {
                throw new KeyNotFoundException();
            }

            Value = defaultValueFactory();
        }

        return Task.FromResult(Value);
    }

    #region Not Implemented
    public Task DeleteAsync(ITurnContext turnContext, CancellationToken cancellationToken = default(CancellationToken))
    {
        throw new NotImplementedException();
    }

    public Task SetAsync(ITurnContext turnContext, T value, CancellationToken cancellationToken = default(CancellationToken))
    {
        throw new NotImplementedException();
    }
    #endregion
}

Additional information

The scale-out sample is available from the Bot Framework samples repo on GitHub in C#, Python, and Java.

Share via