Reliable distributed transactions with Saga

Hello,

Consider the following simple distributed system with two microservices: the first runs Saga A, and the second handles Command X.

[Microservice 1]
Saga A:
– on Event A
    dispatch Command X
– on Event B
    END SAGA

[Microservice 2]
Aggregate X:
– Command X handler

Either microservice can be stopped and restarted at any time by the cloud environment, for any number of reasons. In this setting we want to implement reliable distributed transactions using Sagas. Unfortunately, infrastructure failures lead to unfinished transactions and leave zombie Saga instances behind.

Let's look at the possible infrastructure (non-business-logic) failure points:

  1. Microservice 2 is not up at the time Saga A sends Command X.

  2. To limit the number of failures from point 1, we configured the command gateway's retry mechanism. Retrying the dispatch works well only if Microservice 2 comes back up before Microservice 1 goes down for some reason. Since that is not guaranteed, I do not consider the command retry mechanism a solution here.

  3. Another case that can lead to a broken Saga: the Event A handler of Saga A starts executing, and Microservice 1 is stopped or killed before Command X is dispatched.

In any of these three cases we end up with a broken (unfinished) transaction and a zombie Saga. For now, as a workaround, we have configured command dispatch retries and try to avoid stopping or killing microservices as much as we can. We also set up a script that periodically checks for and reports zombie Sagas, so that someone can fix them manually. None of these are real solutions, only workarounds. Since these are infrastructure issues, I believe they should be resolved by the framework (Axon), so that they do not have to be handled inside the Saga itself, for example by adding deadlines or using long-living Sagas.
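To make the zombie check concrete, here is a minimal sketch of what such a report boils down to. It is plain Java and does not use any Axon API; the SagaSnapshot class, its fields, and the 30-minute threshold are assumptions made up for illustration, and a real script would read this data from wherever the Saga state is stored.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ZombieSagaReport {

    // Hypothetical snapshot of one Saga instance; not an Axon type.
    static final class SagaSnapshot {
        final String sagaId;
        final Instant startedAt;
        final boolean ended;

        SagaSnapshot(String sagaId, Instant startedAt, boolean ended) {
            this.sagaId = sagaId;
            this.startedAt = startedAt;
            this.ended = ended;
        }
    }

    // A Saga is reported as a zombie when it has not ended
    // and has been open for longer than maxAge.
    static List<String> findZombies(List<SagaSnapshot> sagas, Instant now, Duration maxAge) {
        return sagas.stream()
                .filter(s -> !s.ended)
                .filter(s -> Duration.between(s.startedAt, now).compareTo(maxAge) > 0)
                .map(s -> s.sagaId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T12:00:00Z");
        List<SagaSnapshot> sagas = Arrays.asList(
                new SagaSnapshot("saga-1", now.minus(Duration.ofHours(2)), false),   // stuck
                new SagaSnapshot("saga-2", now.minus(Duration.ofHours(2)), true),    // finished
                new SagaSnapshot("saga-3", now.minus(Duration.ofMinutes(1)), false)  // still in flight
        );
        System.out.println(findZombies(sagas, now, Duration.ofMinutes(30))); // prints [saga-1]
    }
}
```

The threshold is the weak spot of this approach: set it too low and healthy Sagas are reported, set it too high and broken transactions stay unnoticed for a long time.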

As far as I know, Axon Framework does not persist outgoing commands and does not guarantee reliable dispatching. I would like to ask about best practices and solutions in Axon Framework that I may have missed.
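To illustrate what I mean by persisting commands for sending, here is a rough sketch of the behaviour I would expect, in plain Java. This is not Axon API: the CommandOutbox class and its methods are hypothetical, and the in-memory map only stands in for a durable table that would be written in the same transaction as the Saga state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical outbox for outgoing commands; not part of Axon Framework.
class CommandOutbox {

    // Stand-in for a durable table. In a real system this would be stored
    // together with the Saga state so it survives a restart of Microservice 1.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // The Saga step records the command instead of dispatching it directly.
    void record(String commandId, String payload) {
        pending.put(commandId, payload);
    }

    // Run periodically and after every restart: try each pending command and
    // remove it only when the transport reports a successful delivery.
    int relay(Predicate<String> transport) {
        int delivered = 0;
        for (String id : new ArrayList<>(pending.keySet())) {
            if (transport.test(pending.get(id))) {
                pending.remove(id);
                delivered++;
            }
        }
        return delivered;
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        CommandOutbox outbox = new CommandOutbox();
        outbox.record("cmd-1", "Command X");

        // Microservice 2 is down: nothing is delivered, the command stays pending.
        System.out.println(outbox.relay(payload -> false) + " delivered, "
                + outbox.pendingCount() + " pending");

        // Microservice 2 is back up: the pending command goes out.
        System.out.println(outbox.relay(payload -> true) + " delivered, "
                + outbox.pendingCount() + " pending");
    }
}
```

A scheme like this gives at-least-once delivery, so the Command X handler in Microservice 2 would have to tolerate duplicates, for example by deduplicating on the command id.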

Best Regards,
Paco