Hi all, it’s been a long time since I posted anything…
We have an app that has been using Axon 2.4 for quite some time with no big problems, but we have now deployed another instance of the same application. The big difference is the volume of commands/events sent through Axon, which is much higher in this second instance. As it happens, we are seeing lots of failures due to events not being handled in the order they were dispatched.
Let me describe the situation: we have 1 “service” aggregate, 1 “service” saga and 1 “task” aggregate.
The “task” aggregate (Tasks) is actually a (partial) implementation of the WS-HumanTask specification, so it has a very formal set of workflow states and transitions.
The “service” aggregate (Service) is our business service. The “service” saga (Saga), among other things, is responsible for the coordination between the Service and the Tasks.
So, a typical flow will be:
(start command) --> Service --> (started event) --> Saga --> START --> Tasks (state: READY --> IN_PROGRESS)
(cancel command) --> Service --> (cancelled event) --> Saga --> RELEASE --> Tasks (state: IN_PROGRESS --> READY)
or
(start command) --> Service --> (started event) --> Saga --> START --> Tasks (state: READY --> IN_PROGRESS)
(commit command) --> Service --> (closed event) --> Saga --> COMPLETE --> Tasks (state: IN_PROGRESS --> COMPLETED)
Commands to the Tasks fail if the state of the Task is not the one expected by the WS-HT workflow, so for instance
(start command) --> Service --> (started event) --> Saga --> START --> Tasks (state: READY --> IN_PROGRESS)
(start command) --> Service --> (started event) --> Saga --> START --> Tasks (state: FAILED)
or
(start command) --> Service --> (started event) --> Saga --> START --> Tasks (state: READY --> IN_PROGRESS)
(cancel command) --> Service --> (cancelled event) --> Saga --> RELEASE --> Tasks (state: IN_PROGRESS --> READY)
(commit command) --> Service --> (closed event) --> Saga --> COMPLETE --> Tasks (state: FAILED)
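The state checks above boil down to something like the following minimal sketch (plain Java, not our real Tasks aggregate — the enum values and method names are just assumptions for illustration): a command throws unless the task is in the state the WS-HT workflow expects.

```java
// Minimal sketch of the WS-HT-style transition check (names are assumptions,
// not our real Tasks aggregate): a command fails unless the task is in the
// state the workflow expects.
enum TaskState { READY, IN_PROGRESS, COMPLETED }

class Task {
    private TaskState state = TaskState.READY;

    TaskState state() { return state; }

    // START is only legal from READY
    void start() { transition(TaskState.READY, TaskState.IN_PROGRESS); }

    // RELEASE is only legal from IN_PROGRESS
    void release() { transition(TaskState.IN_PROGRESS, TaskState.READY); }

    // COMPLETE is only legal from IN_PROGRESS
    void complete() { transition(TaskState.IN_PROGRESS, TaskState.COMPLETED); }

    private void transition(TaskState expected, TaskState next) {
        if (state != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but task was " + state);
        }
        state = next;
    }
}
```

With this model, START followed by RELEASE succeeds, but RELEASE arriving before START throws — which is exactly the failure we are seeing.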
Now, what’s happening is that, for some reason, if the UA sends a (start command) + (cancel command) in very quick succession, sometimes the RELEASE command gets to the Tasks before the START command, so effectively we have
(start command) --> Service --> (started event) --> Saga (do some work here before sending the HT command)
(cancel command) --> Service --> (cancelled event) --> Saga --> RELEASE --> Tasks (state: FAILED - it’s in READY instead of IN_PROGRESS)
Saga (finishes the work) --> START --> Tasks (state: FAILED - actually does nothing because it had already FAILED)
We did change the CommandBus from a SimpleCommandBus to an AsynchronousCommandBus, because we also have a batch job that hits the Service aggregate quite heavily and was consuming all the DB connections very quickly.
So, even though this change is more than probably the culprit, I don’t have a clue how to avoid the problem without going back to the SimpleCommandBus, which would bring back the DB connection problems. It seems I’m between a rock and a hard place.
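For what it’s worth, my current understanding of the root cause, as a self-contained sketch with plain java.util.concurrent (this is not Axon code, just an illustration): handing commands to a multi-threaded executor drops the FIFO guarantee between the two dispatches, whereas a single-threaded executor keeps submission order at the cost of concurrency.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the ordering issue (plain java.util.concurrent, not Axon):
// a single-threaded executor processes submitted tasks strictly in FIFO
// order, so START always lands before RELEASE. With a multi-threaded pool
// the two commands may run on different threads and RELEASE can overtake
// START — which is what we are observing.
public class OrderingDemo {
    public static List<String> dispatchInOrder() throws InterruptedException {
        List<String> handled = new CopyOnWriteArrayList<>();
        ExecutorService bus = Executors.newSingleThreadExecutor(); // FIFO
        bus.submit(() -> { handled.add("START"); });
        bus.submit(() -> { handled.add("RELEASE"); });
        bus.shutdown();
        bus.awaitTermination(5, TimeUnit.SECONDS);
        return handled;
    }
}
```

If the AsynchronousCommandBus can be given an executor that serializes commands (globally like the single-threaded one above, or ideally per aggregate), that might preserve ordering without going back to a fully synchronous bus — though I’m not sure whether that would just reintroduce the throughput/DB-connection problem, so I’d welcome corrections here.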
Any help/advice is greatly appreciated.
Cheers.