Robust saga handling

My use case is that a remote process (a JMS consumer) will receive an Axon-generated event (in reality a JMS message), which should trigger a Saga so that I can perform the business function robustly.
I’m planning on using AsyncSagaManager with JpaSagaRepository as I want it to be performant and fault tolerant. We cannot afford to lose messages, and the business function should be robust (sagas shouldn’t disappear).

A couple of questions related to failure scenarios:

  1. Looking at the AsyncAnnotatedSagaManager:handle implementation, it appears that we can lose a Saga (i.e. it is never persisted in the DB).
    Steps:

    a. We have set the JMS consumer acknowledgement mode to transacted (or client) so that we control when an ack is sent back for a message.
    b. The container event listener calls the above method; a new saga is created and put in the disruptor queue for async processing.
    c. An ack is sent back because the event listener returned (the default behavior of transacted mode).
    d. The machine crashes and the saga (in the queue) is lost since it wasn’t persisted.
    e. Also, it appears that the Saga is persisted only after the first event (@StartSaga) is handled, which appears to be wrong (rather than persisting it before the event is handled).

How can the above scenario be made more resilient?

  2. It appears that the Saga is persisted only after an event is handled. Is there a way to persist the state of a saga incrementally within a single event handler?

  3. If the machine crashes while sagas are running, does the saga manager restart those running sagas? If not, how is the application supposed to handle this?

Thanks,
Aditya

Hi Aditya,

You’re right: if you use the AsyncSagaManager, there is no 100% guarantee that all events get processed. Even storing intermediate saga state wouldn’t help (it wouldn’t strengthen any guarantees, it would just hurt performance).

In Axon 2.1 (the commit is almost done), there will be a mechanism where you can wait for acknowledgements on async mechanisms, such as the AsyncSagaManager. You would then acknowledge a message on the jms queue only when the cluster has acknowledged the event. That would give you the full guarantee you require.
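To illustrate the idea of acknowledging a message only after the asynchronous side has confirmed it, here is a minimal sketch in plain Java. It does not use Axon or JMS: `Message`, `AsyncProcessor` and `onMessage` are hypothetical stand-ins, and the eventual Axon 2.1 mechanism may look quite different. The listener thread blocks until processing has completed, and only then acknowledges. If processing fails, the acknowledgement is never sent, so a real broker would be expected to redeliver.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: none of these types are Axon or JMS APIs.
class AckAfterProcessingSketch {

    /** Stand-in for a JMS message consumed in client-acknowledge mode. */
    static final class Message {
        final String payload;
        volatile boolean acknowledged = false;

        Message(String payload) { this.payload = payload; }

        void acknowledge() { acknowledged = true; }
    }

    /** Stand-in for an asynchronous saga manager that reports completion. */
    static final class AsyncProcessor {
        // Daemon worker thread so a forgotten shutdown cannot keep the JVM alive.
        private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "saga-worker");
            t.setDaemon(true);
            return t;
        });

        /** Stand-in for state written to a durable saga repository. */
        final List<String> persisted = Collections.synchronizedList(new ArrayList<>());

        CompletableFuture<Void> handle(String event) {
            return CompletableFuture.runAsync(() -> {
                if (event.isEmpty()) {
                    throw new IllegalStateException("simulated processing failure");
                }
                persisted.add(event);
            }, worker);
        }

        void shutdown() { worker.shutdown(); }
    }

    /** Listener: acknowledge only after processing has completed successfully. */
    static void onMessage(Message message, AsyncProcessor processor) {
        // join() blocks the listener thread and rethrows a processing failure,
        // in which case acknowledge() below is never reached.
        processor.handle(message.payload).join();
        message.acknowledge();
    }

    public static void main(String[] args) {
        AsyncProcessor processor = new AsyncProcessor();
        Message message = new Message("OrderPlaced");
        onMessage(message, processor);
        System.out.println("persisted=" + processor.persisted
                + ", acknowledged=" + message.acknowledged);
        processor.shutdown();
    }
}
```

Blocking the listener thread gives up some of the throughput of the async manager, but it keeps the acknowledgement on the thread that received the message, which is how a JMS session expects to be used.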

For now, using the AnnotatedSagaManager is the only way to get full processing guarantees.

Cheers,

Allard

Hi Allard,

Thanks for the update. I will look forward to the 2.1 release.

Also, are there any plans to resume running sagas in case of a process restart? And what is the recommended way to address this concern for now?

Regards,
Aditya

Hi,

The 2.1 release won’t take very long. I’m working on the last few issues.

Sagas don’t really ‘run’. They handle events and take action. After that action, they are persisted and lie dormant until the next event comes in. Sagas should not engage in long-running activity themselves. Instead, they should just coordinate activity. If a Saga needs to wait for a while (e.g. wait 30 days or until a payment has been made), then you would use an Event scheduler to trigger the saga.

When a system crashes or shuts down, persisted sagas simply pick up where they left off and continue processing events once the system is restored.
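As a rough illustration of sagas being dormant state rather than running processes, here is a small sketch in plain Java. It is not Axon code: `PaymentSagaState`, `SagaStore` and `SagaManager` are hypothetical names, the in-memory map stands in for a durable repository, and the scheduler is left out, so the deadline event is simply delivered by hand. Each handler loads the saga, lets it react, and stores it again. A "restart" is then nothing more than a fresh manager over the same store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these types only imitate the idea; they are not Axon's API.
class DormantSagaSketch {

    /** Persisted saga state. Nothing here is a running thread. */
    static final class PaymentSagaState {
        final String orderId;
        boolean paid;
        boolean cancelled;

        PaymentSagaState(String orderId) { this.orderId = orderId; }
    }

    /** Stand-in for a durable saga repository (for example a database table). */
    static final class SagaStore {
        private final Map<String, PaymentSagaState> rows = new HashMap<>();

        PaymentSagaState load(String orderId) { return rows.get(orderId); }

        void save(PaymentSagaState state) { rows.put(state.orderId, state); }
    }

    /** Loads the saga for an event, lets it react, then stores it again. */
    static final class SagaManager {
        private final SagaStore store;

        SagaManager(SagaStore store) { this.store = store; }

        /** Starts a saga, which is stored and then lies dormant. */
        void onOrderPlaced(String orderId) {
            store.save(new PaymentSagaState(orderId));
        }

        void onPaymentReceived(String orderId) {
            PaymentSagaState saga = store.load(orderId);
            if (saga != null && !saga.cancelled) {
                saga.paid = true;
                store.save(saga);
            }
        }

        /** Would be published by a scheduler once the payment deadline passes. */
        void onPaymentDeadlineExpired(String orderId) {
            PaymentSagaState saga = store.load(orderId);
            if (saga != null && !saga.paid) {
                saga.cancelled = true;
                store.save(saga);
            }
        }
    }

    public static void main(String[] args) {
        SagaStore store = new SagaStore(); // the part that survives a crash
        new SagaManager(store).onOrderPlaced("order-1");

        // Simulated restart: a fresh manager over the same store.
        SagaManager afterRestart = new SagaManager(store);
        afterRestart.onPaymentReceived("order-1");
        afterRestart.onPaymentDeadlineExpired("order-1"); // ignored, already paid

        PaymentSagaState saga = store.load("order-1");
        System.out.println("paid=" + saga.paid + ", cancelled=" + saga.cancelled);
    }
}
```

Because the only thing carried across the restart is the stored state, there is nothing to "resume": the next incoming event finds the saga in the store and processing continues from there.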

Cheers,

Allard