Ways to resolve failures in durable functions

Andrew Syrov 46 Reputation points
2021-11-16T04:05:38.367+00:00

The following reference describes how to handle failures in durable functions:
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-error-handling?tabs=csharp

Take the example of transferring money from one account to another. I need to understand how to make it truly reliable. Say the function fails. The failure can happen anywhere: in Debit, Credit, or Refund, or somewhere in between. The process might crash, or the code inside the exception handler might fail, just before or just after a refund.

My understanding is that we can periodically query the list of failed durable function instances, get their parameters, and either create a report for manual resolution or retry/compensate by running another function. (By the way, can Event Grid trigger something when a durable function fails?) Is this the way to go?

Thanks!

Azure Functions

3 answers

  1. Ryan Hill 16,251 Reputation points Microsoft Employee
    2021-11-18T04:53:42.543+00:00

    Hi @Andrew Syrov ,

    To answer your question, that's certainly a way to go. Think about it as the microservice orchestration pattern.

    Your orchestration, started from a Service Bus message, calls activity functions. You can make the activity functions as broad or as granular as you want, but each one modifies the payload that is returned to the orchestration trigger function and passed to the next activity. If there's a failure within an activity, initiate a rollback state and requeue that payload back to the Service Bus. Your payload can either carry a lastProcessStep or be reprocessed as if it were the first time.

    That way, using the example in the referenced doc, when catch (Exception) runs and you refund the account, take context.GetInput<TransferOperation>() and requeue it for reprocessing. When it's reprocessed, lastProcessStep lets the payload skip activities that have already completed. Alternatively, log the payload to a queue or table for manual investigation and processing.
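
    As a rough illustration of the catch-refund-requeue idea above, here is a minimal, framework-free Python sketch. It does not use the Durable Functions SDK: `run_transfer`, the `debit`/`credit`/`refund` callables, and the `requeue` callback are hypothetical stand-ins for the orchestrator, its activities, and a Service Bus sender, and `last_process_step` mirrors the lastProcessStep field mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TransferOperation:
    """Payload passed between steps; last_process_step records progress."""
    source: str
    destination: str
    amount: int
    last_process_step: Optional[str] = None


def run_transfer(
    op: TransferOperation,
    debit: Callable[[str, int], None],
    credit: Callable[[str, int], None],
    refund: Callable[[str, int], None],
    requeue: Callable[[TransferOperation], None],
) -> bool:
    """Debit, then credit. On failure, refund the debit and requeue the payload.

    Hypothetical stand-in for an orchestrator function, not a real SDK API.
    """
    debited = False
    try:
        debit(op.source, op.amount)
        debited = True
        op.last_process_step = "debit"
        credit(op.destination, op.amount)
        op.last_process_step = "credit"
        return True
    except Exception:
        if debited:
            # Compensate. If refund itself raises, the exception propagates
            # and the run ends as failed, which must be detected separately.
            refund(op.source, op.amount)
            # The refund undid the debit, so a rerun starts from scratch.
            op.last_process_step = None
        requeue(op)
        return False
```

    The function returns normally after a successful refund, so from the platform's point of view the run completes; only the requeued payload records that the transfer still has to be retried.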

    ---
    EDIT:

    "Case 1. If I'm not using Service Bus, (say, my durable function is triggered by HTTP trigger, or some other method, such as time trigger)."

    If you're using an HTTP or timer trigger to tell the function there's data to process, the same principles apply. Whatever data store you're using, the data being read should carry some indicator that tells the function where it last left off, so that reprocessing makes no changes for steps that have already completed.

    "Case 2 (you somehow brought service bus into the picture, yet, let's discuss it):"

    I mentioned Service Bus and TTLExpiredException because you can leverage that trigger type to easily rerun your workflow in the event of a transient error. Your durable function shouldn't be designed in such a way that it can run indefinitely. When the transfer from one bank to another is initiated, that state can be recorded and that operation considered complete. You then await confirmation from the other bank that the transfer has completed, which triggers a separate activity to complete the process on your end. This way you have a report of transactions that have completed and transactions awaiting confirmation, and you can take decisive action.

    "I do not see any value in using a durable function with Service Bus"

    The advantage to using a service bus is the ability to use topics that your durable functions can subscribe to.
    [Image: about-service-bus-topic.png]
    Using the above image as a guide for our banking example, the first row could be a transfer-funds topic, and your durable function can have activities like fundAvailable, bankRegistered, accountVerified, etc., which add a message to the second row. The second row can be regulatory, which holds the logic for all rules necessary for transferring funds between institutions. The third row can be institution-confirmation: once the external bank has confirmed receipt of the funds, a separate function (not part of your durable function) takes the message and adds it to that Service Bus topic.

    On a transient failure at any point, the message is placed back on the topic to be reprocessed. For instance, if it failed at bankRegistered, just rerun from the beginning. But say you made it to accountVerified: your data store will have that transfer marked as pending, and any reruns from there on will bypass those first activities. For business logic failures, you can push those messages to the dead-letter queue, which a separate function subscribes to so it can alert/email/etc. that something went wrong in the process.
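
    The rerun-and-bypass behaviour can be sketched roughly as follows, in plain Python with no Service Bus SDK. Everything here is an assumption made for illustration: `process_message`, the two exception classes, the `completed_store` dict (standing in for your data store), and the `topic` and `dead_letter` lists (standing in for the topic and the dead-letter queue). As a simplification, the sketch records every completed step rather than a single pending marker, so a rerun skips any step that already succeeded.

```python
STEPS = ["fundAvailable", "bankRegistered", "accountVerified"]


class TransientError(Exception):
    """Temporary problem (timeout, throttling); retrying is reasonable."""


class BusinessRuleError(Exception):
    """The request itself is invalid; retrying will not help."""


def process_message(message, handlers, completed_store, topic, dead_letter):
    """Run the steps in order, skipping those already recorded as done.

    message:         dict with a "transferId" key (hypothetical shape)
    handlers:        step name -> callable(message)
    completed_store: transferId -> set of finished step names
    topic:           list standing in for the topic (requeue target)
    dead_letter:     list standing in for the dead-letter queue
    """
    done = completed_store.setdefault(message["transferId"], set())
    for step in STEPS:
        if step in done:
            continue  # already applied in an earlier run
        try:
            handlers[step](message)
        except TransientError:
            topic.append(message)  # put it back for another attempt
            return "requeued"
        except BusinessRuleError as exc:
            dead_letter.append((message, str(exc)))  # needs a human or alert
            return "dead-lettered"
        done.add(step)
    return "completed"
```

    Separating the two exception types is the design choice that matters: only transient failures go back on the topic, so a message that can never succeed does not loop forever.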

    ---
    EDIT (2):

    Have a look at Manage instances in Durable Functions. There are two options for detecting unhandled exceptions:

    1. Use instance query APIs to query the status and check for any failures
    2. Set up Event Grid notifications to receive notifications about failures.

    By default, durable function orchestration state is stored in Azure Storage and remains there until you purge it. Therefore, if you need to reprocess a message after a failure, you can retrieve the data using either of the two methods above.
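
    To make the first option concrete, here is a small Python sketch of the filtering and reporting step. It deliberately does not call the real instance query API: `InstanceStatus` is a hypothetical record shaped like a query result, and `find_failed` and `build_report` are illustrative helpers that operate on a list you would obtain from that API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class InstanceStatus:
    """Hypothetical shape of one orchestration status record."""
    instance_id: str
    runtime_status: str
    created_time: datetime
    input_payload: dict


def find_failed(instances, since, statuses=("Failed", "Terminated")):
    """Return instances created at or after `since` that ended badly."""
    return [
        inst for inst in instances
        if inst.runtime_status in statuses and inst.created_time >= since
    ]


def build_report(failed):
    """Flatten failed instances into rows for a retry job or a human."""
    return [
        {
            "instanceId": inst.instance_id,
            "status": inst.runtime_status,
            "input": inst.input_payload,
        }
        for inst in failed
    ]


# Example: look back one hour from a fixed point in time.
now = datetime(2021, 11, 18, tzinfo=timezone.utc)
sample = [
    InstanceStatus("a", "Completed", now, {"amount": 10}),
    InstanceStatus("b", "Failed", now - timedelta(minutes=5), {"amount": 20}),
]
report = build_report(find_failed(sample, since=now - timedelta(hours=1)))
```

    Because the original input is kept with the instance, each report row carries enough information to submit a compensating action without consulting the caller.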

    If your activity hits an unhandled exception, use a retry policy to keep retrying that activity until it succeeds, and compensate for any errors encountered during the retries. In the event of a process crash, the underlying framework retries automatically. Because Durable Functions (and Functions in general) is queue-based, you must make the code in each activity idempotent: an activity may run more than once if the process crashes after the activity starts executing but before its result is persisted. Each orchestration instance has an instance ID (a GUID by default) that you can use for tracking.
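
    The idempotency requirement can be illustrated with a short Python sketch. It is not the Durable Functions retry policy itself: `call_with_retry` is a hypothetical helper that imitates a retry policy, and `Ledger` is an in-memory stand-in for your account store that remembers which operation IDs it has already applied.

```python
import time


class Ledger:
    """In-memory account store that applies each operation ID at most once."""

    def __init__(self):
        self.balances = {}
        self.applied = set()

    def credit(self, operation_id, account, amount):
        """Credit the account; a repeated operation_id is a no-op."""
        if operation_id in self.applied:
            return False  # duplicate delivery, nothing to do
        self.balances[account] = self.balances.get(account, 0) + amount
        self.applied.add(operation_id)
        return True


def call_with_retry(activity, max_attempts=3, first_delay=0.0,
                    backoff=2.0, sleep=time.sleep):
    """Call activity() until it succeeds or max_attempts is reached.

    Hypothetical imitation of a retry policy, for illustration only.
    """
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; the caller decides how to compensate
            sleep(delay)
            delay *= backoff
```

    If the activity credits the account and then crashes before reporting its result, the retry calls `credit` again with the same operation ID, and the ledger ignores the duplicate instead of paying twice.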


  2. Andrew Syrov 46 Reputation points
    2021-11-18T17:49:02.03+00:00

    Hi @Ryan Hill ,

    Your response makes sense. Yet, can you also advise in more detail on the unlucky case when the app crashed (including the failure of the entire region)?

    (Just to re-iterate) Case 1: we caught the exception (as you describe), submitted a "refund" back to the service bus, and gracefully exited. In this case, my durable function instance is marked as "Completed", and for us, it is as good as if it processed a payment without any exception.

    (The case we need to resolve) Case 2: my app (the durable function) just crashed, or the exception type was not caught. In this case, my function is marked "Failed" (for an unhandled exception) or "Terminated" (the process crashed; this assumes the Azure infrastructure has a way to detect this, which I hope it does). So, say we have another function that is triggered on a timer, every second or so. This function queries the list of instances that are "Failed" or "Terminated" within some time window. Now we have all failed instances. We also know their parameters, so we can reduce this case to the first one by submitting a "refund" message to the service bus. In addition, we can track the time of the original request or a request count. If "Refund" fails X times in a row, we have a dead letter and need to submit it to a manual resolution system (this can happen if, say, we need to refund to an account that was just closed).
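
    A rough Python sketch of the sweep I have in mind is below. All names are hypothetical and nothing here calls Azure: `failed` is the list returned by the status query, `state` is whatever durable store tracks attempts per instance, `submit_refund` stands in for sending the "refund" message, and `manual_resolution` stands in for the manual resolution system.

```python
def sweep_failed(failed, state, submit_refund, manual_resolution,
                 max_attempts=3):
    """One timer tick: try to refund each failed instance, with a retry cap.

    failed:            list of dicts with "instanceId" and "input" keys
    state:             instanceId -> {"attempts": int, "done": bool}
    submit_refund:     callable(input) that raises on failure
    manual_resolution: list receiving instances that exhausted their retries
    """
    for inst in failed:
        entry = state.setdefault(
            inst["instanceId"], {"attempts": 0, "done": False}
        )
        if entry["done"]:
            continue  # refunded or escalated on an earlier tick
        entry["attempts"] += 1
        try:
            submit_refund(inst["input"])
            entry["done"] = True
        except Exception:
            if entry["attempts"] >= max_attempts:
                # Dead letter: stop retrying and hand over to a human.
                manual_resolution.append(inst)
                entry["done"] = True
```

    The `done` flag is what keeps repeated ticks from refunding or escalating the same instance twice, so `state` has to survive restarts of the timer function.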

    Lastly, let us assume we want a system that is geo-redundant (GZRS). Eventual consistency does not work in our banking case, so we need a way to prefer consistency over availability. We need to avoid the case where we started a transaction (i.e., took money from one account), then the entire region crashed and we switched to another region, yet the failed payment is not there, not even eventually (and never will be). It is better for us to deny a payment than to be inconsistent.

    How would you recommend resolving this? (It looks like I need to log all my important app steps to Cosmos DB with strong consistency between regions). Is this reasonable?

    Thank you very much for helping,
    Andriy


  3. Andrew Syrov 46 Reputation points
    2021-11-20T21:35:24.397+00:00

    @Ryan Hill ,

    Thank you very much for allocating time for this. I really appreciate this and it will help me to build a better product in Azure. Yet, please review my summary here:

    Case 1. I'm not using Service Bus (say, my durable function is started by an HTTP trigger or some other method, such as a timer trigger). I need to carefully watch my durable function's state, and if the function ends with Failed/Terminated status, run some other process/function to compensate. I need to poll for it, as there is no support from Event Grid (or similar) for such cases.

    Case 2 (you somehow brought Service Bus into the picture, yet let's discuss it): I'm using Service Bus. You mention that we could use TTLExpiredException. Frankly, I cannot see how this helps in general. Durable functions may take anywhere from milliseconds to months to finish, and I do not see how that works with a TTL. A transfer of money from one bank to another can take milliseconds, days, or weeks. Setting the TTL to a large value is not reasonable. Maybe there is a re-delivery/retry of a message if the subscriber gets disconnected? (I do not see this in the docs, and keeping connections open for so long may not be possible when a function goes to sleep, or reliable.) My understanding is that TTL relies only on a timeout, and in general a timeout may be months while a failure can happen in milliseconds. So TTL is very good, but only for something very short. Or should I constantly renew the message lock from my durable function? Say I'm waiting for an external system, yet I need to wake up every second/minute to tell the service bus that I'm still alive.

    In summary: I do not see any value in using a durable function with Service Bus. If I'm to rely on Service Bus, why not just use short regular Azure Functions, which react to an event, process it in a short, predictable time, and quickly send another event through the service bus, in a choreography saga manner?

    Makes sense?
