Inconsistent saga state with multiple event handlers

I turned off async saga event handling pending a fix for the open-connection problem I reported earlier. But now I’m seeing a different problem.

One of my sagas has a collection of pending interactions with an external system. When one of the interactions finishes, an event is published. The saga handles that event by cleaning up some internal state that tracked the interaction and sending a command to an event-sourced aggregate, which publishes a second event. That second event is consumed both by the aggregate and by the same saga that sent the command.

What I’m seeing is that the saga manager isn’t reusing the existing saga object to handle the second event; instead it loads a fresh copy of the saga from the repository. Unfortunately, in the non-async world, at that point the changes the original saga made to its internal state haven’t been committed to the repository yet, so handling of the second event ends up going wrong because it’s delivered to a saga instance whose state still indicates that the interaction was pending.
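To make the sequence concrete, here is a minimal plain-Java sketch of the stale read described above. It does not use Axon at all: `Saga`, `Repository`, and the identifiers `saga-1` and `interaction-1` are hypothetical stand-ins, and the only assumption is that loading returns a fresh copy of the last committed state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StaleSagaDemo {

    /** Toy saga (hypothetical): tracks the ids of pending interactions. */
    static class Saga {
        final String id;
        final Set<String> pending;

        Saga(String id, Set<String> pending) {
            this.id = id;
            this.pending = new HashSet<>(pending); // defensive copy
        }
    }

    /** Toy repository (hypothetical): load() returns a fresh copy of committed state. */
    static class Repository {
        private final Map<String, Set<String>> committed = new HashMap<>();

        Saga load(String id) {
            return new Saga(id, committed.get(id));
        }

        void commit(Saga saga) {
            committed.put(saga.id, new HashSet<>(saga.pending));
        }
    }

    /** Returns true if the second handler still sees the interaction as pending. */
    static boolean secondHandlerSeesPending() {
        Repository repo = new Repository();
        Set<String> initial = new HashSet<>();
        initial.add("interaction-1");
        repo.commit(new Saga("saga-1", initial));

        // First event: the saga is loaded and cleans up its in-memory state.
        Saga first = repo.load("saga-1");
        first.pending.remove("interaction-1");

        // Second event arrives inline, before the first handler's changes are
        // committed, so a second copy is loaded from the old committed state.
        Saga second = repo.load("saga-1");
        boolean stale = second.pending.contains("interaction-1");

        // Only now does the first handler's change reach the repository.
        repo.commit(first);
        return stale;
    }

    public static void main(String[] args) {
        System.out.println("second handler saw stale state: " + secondHandlerSeesPending());
    }
}
```

The two `Saga` objects share an identifier but not state, which matches what the debugger shows when walking up the call stack.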

It looks like this doesn’t happen in cases where the first event caused the saga to be created, because there’s special handling for newly-created sagas in AbstractSagaManager.loadAndInvoke() and the existing object is reused. But in this case the saga already exists, so it’s loaded from scratch for each event.

In concrete terms, when I put a breakpoint in the handler for the second event and walk up the call stack, I see that the saga instance at the bottom of the stack is a different Java object than the one at the top of the stack; they have the same identifier but different state.

As usual, if this is caused by me using Axon wrong rather than by a bug, happy to adjust my application code.

-Steve

I’ve sent a pull request with a possible fix for this problem. The fix is pretty simple, though it’s conceivable I’m introducing a behavior change that’ll cause problems in a distributed setup. (I don’t think that’s true because appropriate locks should be held.) It takes care of the problem in my environment, so this is no longer a pressing problem for me.

Hi Steven,

Since this is a very common use case, something else might be going wrong. There is a mechanism in Axon (in the UnitOfWork implementations) that ensures that events are handled in the correct order. If an event is delivered to a Saga instance while another event is being handled there, it probably means this mechanism is being circumvented.

How do you publish the event that comes as a result of the command sent by the Saga?

Cheers,

Allard

In the course of writing up a response, I realized what the problem was: I was calling EventTemplate.publish() from a service class without starting a UnitOfWork first. Once I added that, Axon started deferring event handling until after the Saga was committed (rather than doing it inline before the first event handler had returned) and the inconsistent state went away. Chalk one up for user error; thanks for the quick reply.
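For anyone who hits the same thing, the difference between inline and deferred handling can be sketched in plain Java. This is a simplified model, not Axon's implementation: `ToyBus`, `inUnitOfWork`, and the event names are hypothetical, and the only behavior assumed is the one described above, namely that events published while a unit of work is active are held back until the work that started it has finished.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

public class DeferredPublishDemo {

    /** Toy event bus (hypothetical): dispatches inline unless a unit of work is active. */
    static class ToyBus {
        private final List<Consumer<String>> handlers = new ArrayList<>();
        private Deque<String> buffered; // non-null only while a unit of work is active

        void subscribe(Consumer<String> handler) {
            handlers.add(handler);
        }

        void publish(String event) {
            if (buffered != null) {
                buffered.add(event); // defer until the current handler has returned
            } else {
                dispatch(event);     // no unit of work: handle inline, right now
            }
        }

        void inUnitOfWork(Runnable work) {
            if (buffered != null) {  // already inside a unit of work: join it
                work.run();
                return;
            }
            buffered = new ArrayDeque<>();
            try {
                work.run();
                // "Commit": deliver buffered events one at a time. Events published
                // by handlers during this loop are queued behind the current one.
                while (!buffered.isEmpty()) {
                    dispatch(buffered.poll());
                }
            } finally {
                buffered = null;
            }
        }

        private void dispatch(String event) {
            for (Consumer<String> handler : handlers) {
                handler.accept(event);
            }
        }
    }

    /** Records the order in which handler code runs, with or without a unit of work. */
    static List<String> run(boolean useUnitOfWork) {
        List<String> trace = new ArrayList<>();
        ToyBus bus = new ToyBus();
        bus.subscribe(event -> {
            if (event.equals("first")) {
                trace.add("first:start");
                bus.publish("second"); // stands in for the command that triggers event two
                trace.add("first:end");
            } else {
                trace.add(event);
            }
        });
        if (useUnitOfWork) {
            bus.inUnitOfWork(() -> bus.publish("first"));
        } else {
            bus.publish("first");
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println("without unit of work: " + run(false));
        System.out.println("with unit of work:    " + run(true));
    }
}
```

Without the unit of work, the second event is handled in the middle of the first handler; with it, the first handler finishes before the second event is delivered.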

-Steve