I have an aggregate and a saga. The saga collects info from the aggregate via events and then schedules a Quartz job, and also informs the aggregate about its status via commands. (The Quartz job triggers the saga to interact with an external system, but I don’t think that’s relevant here.)
The deadlock is between these two threads:
- [user] -> Command -> Aggregate* -> Domain Event -> Saga* -> EventScheduler
- [Quartz job] -> Event -> Saga* -> Command -> Aggregate*
* = IdentifierBasedLock.obtainLock() is called at these points
As you can see, we have two locks acquired in different orders.
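To make the inversion concrete, here is a minimal sketch in plain Java. It is not Axon code: the two ReentrantLocks are hypothetical stand-ins for the per-identifier aggregate and saga locks, and the class and method names are invented for illustration. A timed tryLock replaces the blocking lock call so the sketch finishes and reports the problem instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names come from Axon.
class LockOrderSketch {

    /** Returns how many of the two threads could not get their second lock. */
    static int run() {
        ReentrantLock aggregateLock = new ReentrantLock(); // stand-in for the aggregate's lock
        ReentrantLock sagaLock = new ReentrantLock();      // stand-in for the saga's lock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // "User" path: aggregate lock first, then saga lock.
        Thread user = new Thread(() ->
                lockInOrder(aggregateLock, sagaLock, bothHoldFirst, bothTried, blocked));
        // "Quartz" path: saga lock first, then aggregate lock.
        Thread quartz = new Thread(() ->
                lockInOrder(sagaLock, aggregateLock, bothHoldFirst, bothTried, blocked));

        user.start();
        quartz.start();
        try {
            user.join();
            quartz.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked.get();
    }

    private static void lockInOrder(ReentrantLock first, ReentrantLock second,
                                    CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                    AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // A plain lock() here would wait forever; the timeout makes the cycle observable.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep holding the first lock until the other thread has finished trying.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}
```

Calling LockOrderSketch.run() returns 2: each thread times out waiting for the lock the other one holds. With a blocking lock call in place of tryLock, both threads would wait indefinitely, which matches what I am seeing.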
I think IdentifierBasedLock would detect the deadlock and throw a DeadlockException, except that my CachingEventSourcingRepository (hence PessimisticLockStrategy) and AbstractSagaManager have different IdentifierBasedLock instances, so neither is aware of which threads own the other’s locks.
Would it be possible to change IdentifierBasedLock.locks hashmap to be ‘static’? Or make it possible to wire the same IdentifierBasedLock into both places?
Other options I see:
- I’m reluctant to use an async saga manager because of its non-persistence in case of server shutdown.
- I think this would be fixed by using an AsynchronousCommandBus, which is a possibility, but also kind of a scary change since it means no more nested units of work, and I’d need to be careful about timeouts…I’d have to analyze the whole application to see how that might affect things.
- Disabling ‘synchronizedSagaAccess’ probably doesn’t help because I’d have to litter my saga with ‘synchronized’ blocks, leading to the same potential deadlock.
(Note: I also found another lock, CachingSagaRepository.associationsCacheLock. It does not appear to be subject to deadlock, because nothing inside its critical section calls out to other code, but it’s worth considering too.)
I can provide a thread dump if needed.
I found a couple threads in this group about deadlocks but they didn’t seem related.
- https://groups.google.com/d/msg/axonframework/j9uxqz0Jsfc/x_a_0hpcWfYJ was due to a single command modifying (and locking) multiple aggregates – not the case here
- https://groups.google.com/d/msg/axonframework/ZBP1yQZaPOQ/VGZn_H7Ycd0J was due to a DB locking issue