Handling Compensation and Rollback in a Synchronous API-Based Saga Choreography for Microservices

Question

Handling Compensation and Rollback in a Synchronous API-Based Saga Choreography for Microservices

Binu S 0

In a microservices-based online order processing system using a saga choreography approach with synchronous API calls, a customer places an order involving multiple microservices: Order Service, Inventory Service, and Payment Service. After successful order placement, the Inventory Service decrements stock, and the Payment Service processes the payment. If a failure occurs during the compensation transaction to refund the payment amount after multiple retries, how can the system automatically roll back the inventory update without manual intervention or introducing a single point of failure, given that no queue mechanism is used and all communication is synchronous?

Binu S 0 Reputation points

2024-08-01T17:55:59.91+00:00

0

The challenge we're facing is how to handle failures and rollbacks. For example, if the payment process fails after the inventory has been updated, we need to undo the stock update to maintain consistency. We're trying to solve this problem using a saga choreography approach, where each service communicates with the others through synchronous API calls.

The main difficulty is ensuring that if a compensating action, like refunding a payment, fails, we can still roll back other actions (like updating inventory) without manual intervention. We can't use message queues or other asynchronous mechanisms, which makes it harder to coordinate these compensations and prevent inconsistencies. We're looking for a solution that can handle these situations automatically and reliably, without creating a single point of failure.

We tried implementing the saga choreography pattern with synchronous API calls between our microservices (Order Service, Inventory Service, and Payment Service). The idea was that each service would call the next one in line and handle compensations if something went wrong.

Here's what we did:

Order Placement: The Order Service created an order and called the Inventory Service to update the stock. Stock Update: The Inventory Service decremented the stock and then called the Payment Service to process the payment. Payment Processing: The Payment Service processed the payment and confirmed the transaction. We expected that if any step failed, the services would call each other to perform compensating actions. For example, if the payment failed, the Inventory Service would be called to revert the stock update.

What We Expected:

Automatic Compensation: We expected the system to automatically roll back previous actions if something went wrong, ensuring data consistency across all services. Reliability: We aimed for a reliable process without manual intervention, even in cases of failure, and without introducing a single point of failure. However, we encountered issues with handling compensation transactions, especially when they failed after multiple retries. This made it challenging to maintain consistency without manual intervention, and we needed a way to ensure that compensations (like rolling back inventory updates) could be reliably performed.

2 answers

Your answer

Answer 1

Erland Sommarskog 121.9K MVP Volunteer Moderator

The traditional way of doing this is to have a transaction around the whole thing, and which will not commit until every thing is done. How you will implement in your Saga dance number, I don't know, as I am completely unfamiliar with that technology.

What I can say, speaking from a the traditional database perspective that I come from, is that anything with compensating transactions etc, will be a lot more complex to implement, and it will also look less pretty on account statements and similar things it transactions are booked and then counterbooked within seconds.

Binu S 0 Reputation points

2024-08-02T02:50:35.2633333+00:00

Thank you for your response. I understand the traditional approach of using transactions that only commit when everything is done. However, I'm specifically looking into the saga choreography pattern, which is a bit different.

In our system, we want to handle cases where a payment refund might fail after several retries, and we need to automatically roll back the inventory update without manual intervention or introducing a central coordinator (since we're not using queues).

It seems the answer didn't fully cover how to handle this in a saga choreography setup. I appreciate your insights from the traditional database perspective, but I'm looking for a solution that fits the microservices architecture we're using. If you have any additional thoughts on handling compensating transactions automatically in this context, that would be helpful
Erland Sommarskog 121.9K Reputation points MVP Volunteer Moderator

2024-08-02T20:38:02.38+00:00

It is true that my Answer did not really answer your question, and maybe I should have made it as a comment instead.

Using traditional transactions is simple, because the DBMS takes care of the heavy work. But it can also mean that rows are locked for a longer period, and longer than you can permit.

If you roll your own transactions, you will also have to implement your own rollback. You can't expect the "system" to do it for you, because the system have no knowledge about your business logic. That is, the "system" does not know how do undo an inventory update.

Depending on business requirements, auditing of the rollbacks may be required. If the inventory is reduced by a transaction that will fail later, another order cannot be fulfilled in the meanwhile, because the inventory was too small at the point.

I am not saying that it can't be done, but it will be a lot more complex to implement, and it will require a high level of skill from your development team.
Binu S 0 Reputation points

2024-08-03T16:20:09.7866667+00:00

Thank you for clarifying and acknowledging the limitations of using traditional transactions and the challenges of implementing custom rollback mechanisms. I understand that relying on the DBMS simplifies transaction management but may lead to issues like long row locks.

Regarding rolling our own transactions, you're correct that the system won't inherently know how to handle business-specific logic, like undoing an inventory update. The complexity of implementing such a system and the need for careful auditing and skillful execution are clear.

Given these considerations, could you provide any additional insights or best practices for managing custom rollbacks and ensuring data consistency in a microservices architecture? Additionally, how would you approach implementing such a system to minimize complexity and potential issues?
Binu S 0 Reputation points

2024-08-03T16:28:50.2666667+00:00

Thank you for the detailed explanation. I'm also considering using TransactionScope in .NET to manage transactions across multiple microservices. Do you think this approach could help mitigate the transaction issues we've discussed, especially when dealing with scenarios involving two or more microservices? How feasible is it to use TransactionScope in a microservices architecture, and are there specific challenges or limitations I should be aware of?
Erland Sommarskog 121.9K Reputation points MVP Volunteer Moderator

2024-08-03T16:39:28.93+00:00

With TransactionScope you are exactly into the transaction pattern I prefer. And I would suggest that you start there. Yes, if that doesn't fly, because you get lock contention, you will be in for a major redesign. But on the other hand to answer this question:

Given these considerations, could you provide any additional insights or best practices for managing custom rollbacks and ensuring data consistency in a microservices architecture? Additionally, how would you approach implementing such a system to minimize complexity and potential issues?

I would start with asking for the development budget to be tripled.

And designing the system for a load it may never get is quite a waste of money.

Now, I should add that there are more patterns for partial transactions. To take an example from this forum. A poster had a problem with a pattern where they inserted first one row into a master table and then quite a few rows into a detail table in a transaction. Due to a combination of unfortunate circumstances, they saw deadlocks when inserting rows for unrelated master records in parallel.

The final response was that they were considered breaking up the transaction and keep track of that the master record was incomplete. That is, you insert data and commit, but you have a flag that says that data is not complete. Whether this is even applicable in your scenario, but I mention it as a food for thought. (And obviously, this mean that you need to go back and clear these "incomplete" flags at some point. Which may not rhyme well with the concept of "microservices".)
Binu S 0 Reputation points

2024-08-04T15:32:34.6966667+00:00

Thank you for the detailed insights. I appreciate the recommendation to start with TransactionScope and the consideration of potential lock contention issues. The example of handling incomplete transactions is helpful for understanding alternative approaches.

Given these complexities, do you think implementing a two-phase commit could be a viable option in our scenario? What are the potential challenges we should consider if we go down that route?
Erland Sommarskog 121.9K Reputation points MVP Volunteer Moderator

2024-08-04T15:57:59.51+00:00

The only reason you would implement two-phase commit is that you have distributed data sources. If lock contention is your concern, two-phase commit is not what you want, since this typically makes the transaction longer.

And as Bruce points out, you would not implement two-phase commit yourself, but you would rely on your API. I am fairly sure that the TransactionScope classes can deal with distributed data sources and do two-phase commit if needed.

Answer 2

Bruce (SqlWork.com) 78,006 Volunteer Moderator

The saga pattern is for Async eventually consistent transactions. The pattern requires atomic updates to a persistent store to keep context changes, and updates to the event queue.

you are just using some of the techniques to implement the pattern. I’m not sure how your transaction orchestrator handles a crash, loss of network, or down data source. Seems like in your design it just results in a series of orphaned transactions. As the data source can go down, it import to atomically log the transaction to the data source before beginning of each saga transaction.

Binu S 0 Reputation points

2024-08-03T16:17:38.1133333+00:00

Thank you for your insights. You raised some important points about handling failures and ensuring atomic updates in the saga pattern. Could you elaborate on specific strategies or techniques to prevent orphaned transactions and ensure data consistency, especially in cases of network failures or down data sources? Also, is using a two-phase commit a viable option in this scenario? How would you suggest implementing atomic logging before each saga transaction?
Binu S 0 Reputation points

2024-08-03T16:21:45.5033333+00:00

Thank you for your insights. You raised some important points about handling failures and ensuring atomic updates in the saga pattern. Could you elaborate on specific strategies or techniques to prevent orphaned transactions and ensure data consistency, especially in cases of network failures or down data sources? Also, is using a two-phase commit a viable option in this scenario? How would you suggest implementing atomic logging before each saga transaction?
Bruce (SqlWork.com) 78,006 Reputation points Volunteer Moderator

2024-08-03T20:20:36.0733333+00:00

Two phase commit is very difficult to write, so most developers only use the two phase commit supported by a traditional sql engines. It will only be required if your databases span servers. It is still a local transaction if more than one database on the same server is enlisted in the same transaction scope. This will require a homogeneous environment (same vendor) and of course a database that supports two phases commit. As the connections need to stay open until the commit, this difficult with 2 phase commit. Finally 2 phase commit is not 100% reliable. You will need to design support to detect failed commit. There are scaling issues with 2 phase commit.

the saga pattern is the most common alternative to 2 phase commit and matches the microservice model. You need an orchestrater, a reliable store (typically a database), a durable queue, and the microservice transactions should be idempotent (and can safely repeat the transaction) to handle a network error. The orchestration would log each phase before running, so that recovery can happen. As you want a sync orchestration (may cause scaling issues), you will need an async recovery orchestrater The recovery would check the process queue for failed orchestration, and perform recovery. So your sync orchestration should bail on any error or timeout and let the async cleanup.
Binu S 0 Reputation points

2024-08-04T15:37:01.8833333+00:00

Thank you for the detailed explanation of the challenges and considerations with two-phase commit and the saga pattern. I understand the need for an orchestrator, a reliable store, and idempotent transactions in the saga pattern. Given the potential issues with scaling and the need for both sync and async orchestration, could you provide guidance on how to implement synchronous orchestration effectively? Specifically, how can we manage sync orchestration to ensure smooth operation and proper error handling?
Bruce (SqlWork.com) 78,006 Reputation points Volunteer Moderator

2024-08-05T19:11:31.24+00:00

using sync orchestration will limit scaling and reliability. you just code the happy path sync, and any error handling as async recovery. this will require two instead of one orchestrator. if you have the sync queue errors, then be sure there is a separate processes that detects any uncompleted requests (sync orchestrator crashed, lost connectivity, etc), queues the error.

Share via

Handling Compensation and Rollback in a Synchronous API-Based Saga Choreography for Microservices

2 answers

Your answer