The challenge we're facing is how to handle failures and rollbacks. For example, if the payment process fails after the inventory has been updated, we need to undo the stock update to maintain consistency. We're trying to solve this problem using a saga choreography approach, where each service communicates with the others through synchronous API calls.
The main difficulty is ensuring that if a compensating action, like refunding a payment, fails, we can still roll back other actions (like updating inventory) without manual intervention. We can't use message queues or other asynchronous mechanisms, which makes it harder to coordinate these compensations and prevent inconsistencies. We're looking for a solution that can handle these situations automatically and reliably, without creating a single point of failure.
We tried implementing the saga choreography pattern with synchronous API calls between our microservices (Order Service, Inventory Service, and Payment Service). The idea was that each service would call the next one in line and handle compensations if something went wrong.
Here's what we did:

1. Order Placement: The Order Service created an order and called the Inventory Service to update the stock.
2. Stock Update: The Inventory Service decremented the stock and then called the Payment Service to process the payment.
3. Payment Processing: The Payment Service processed the payment and confirmed the transaction.

We expected that if any step failed, the services would call each other to perform compensating actions. For example, if the payment failed, the Inventory Service would be called to revert the stock update.
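To make the flow concrete, here is a minimal sketch of that synchronous choreography with compensation. The function names (`update_stock`, `process_payment`, `revert_stock`) and the in-memory stock dict are stand-ins for our real HTTP calls and data store, purely for illustration:

```python
class PaymentError(Exception):
    """Raised when the Payment Service rejects the transaction."""

def update_stock(stock, qty):
    # Inventory Service: decrement stock for the order
    stock["count"] -= qty

def revert_stock(stock, qty):
    # Compensating action: put the reserved stock back
    stock["count"] += qty

def process_payment(amount, fail=False):
    # Payment Service stand-in; `fail` simulates a declined payment
    if fail:
        raise PaymentError("payment declined")

def place_order(stock, qty, amount, payment_fails=False):
    # Order Service: call Inventory, then Payment, synchronously.
    update_stock(stock, qty)
    try:
        process_payment(amount, fail=payment_fails)
        return "CONFIRMED"
    except PaymentError:
        # Payment failed after the stock was already decremented,
        # so we invoke the compensating call on the Inventory Service.
        revert_stock(stock, qty)
        return "CANCELLED"
```

This works when the compensating call itself succeeds; the failure mode described below is what happens when `revert_stock` also fails.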
What We Expected:

- Automatic Compensation: We expected the system to automatically roll back previous actions if something went wrong, ensuring data consistency across all services.
- Reliability: We aimed for a reliable process without manual intervention, even in cases of failure, and without introducing a single point of failure.

However, we encountered issues with handling compensation transactions, especially when they failed after multiple retries. This made it challenging to maintain consistency without manual intervention, and we needed a way to ensure that compensations (like rolling back inventory updates) could be reliably performed.
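One direction we are considering, sketched below under stated assumptions: instead of giving up after N synchronous retries, each service persists failed compensations to a local durable log and a periodic sweep retries them until they succeed. The in-memory list here stands in for a database table, and the class and method names are hypothetical, not an existing library:

```python
class CompensationLog:
    """Durable retry queue for failed compensating actions.

    In a real service `pending` would be a database table, so
    compensations survive a process crash and no message queue
    or central coordinator is needed.
    """

    def __init__(self):
        self.pending = []  # entries: (callable, args)

    def record(self, action, *args):
        # Called when a compensating action fails after its
        # immediate synchronous retries are exhausted.
        self.pending.append((action, args))

    def retry_all(self, max_attempts=3):
        # Periodic sweep (e.g. a scheduled job) that retries
        # each pending compensation up to max_attempts times.
        still_pending = []
        for action, args in self.pending:
            for _ in range(max_attempts):
                try:
                    action(*args)
                    break  # compensation finally succeeded
                except Exception:
                    continue  # backoff/jitter would go here
            else:
                # Keep it for the next sweep; only alert a human
                # if it stays stuck past some threshold.
                still_pending.append((action, args))
        self.pending = still_pending
```

The trade-off is eventual rather than immediate consistency: a failed compensation is no longer lost, but the system is briefly inconsistent until the sweep drains the log.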