I am currently working on a distributed system using Axon 3. We have multiple nodes running the same application concurrently, all accessing the same database. The application is configured to use the AsynchronousCommandBus together with a custom LockFactory implementation, which uses ZooKeeper to manage locks with a pessimistic locking strategy. The problem we’re running into is that occasionally two nodes dispatch a command for the same aggregate root at almost the same time. When this happens, one node successfully handles the command and persists an event, but the other node fails to process its command with a BatchUpdateException. The root cause isn’t logged, but I’m certain it’s a primary key violation: when I inspect the domain_event_entry table, the event persisted by the first node has the same sequence_number as the event the second node attempted to persist.
So my question is: why isn’t the lock obtained from the LockFactory forcing the nodes to persist their events one after the other? I would expect the first node to hold the lock until it has completely finished processing its command, then release it so that the other node can process the next command with the next sequence_number.
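To make the question concrete, here is a minimal sketch of what I think is happening, written in plain Java with no Axon classes. Everything in it is hypothetical and only for illustration: `InMemoryEventStore` stands in for the domain_event_entry table, and I am assuming the table rejects a second row with the same aggregate identifier and sequence number. The first scenario is the failure I see (both nodes read the last sequence number before either writes); the second is what I expected the lock to enforce (the second node reads only after the first has written).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SequenceCollisionDemo {

    /** Hypothetical stand-in for domain_event_entry, keyed by aggregate id + sequence number. */
    static class InMemoryEventStore {
        private final Map<String, String> rows = new ConcurrentHashMap<>();

        /** Highest sequence number stored for the aggregate, or -1 if there is none. */
        long lastSequenceNumber(String aggregateId) {
            long last = -1;
            while (rows.containsKey(aggregateId + "#" + (last + 1))) {
                last++;
            }
            return last;
        }

        /** Assumed uniqueness rule: returns false if that sequence number is already taken. */
        boolean append(String aggregateId, long sequenceNumber, String payload) {
            return rows.putIfAbsent(aggregateId + "#" + sequenceNumber, payload) == null;
        }
    }

    /** Both nodes load the aggregate state before either of them has written. */
    static boolean[] simulateStaleWrites() {
        InMemoryEventStore store = new InMemoryEventStore();
        store.append("order-1", 0, "OrderCreated");

        long seenByNodeA = store.lastSequenceNumber("order-1");
        long seenByNodeB = store.lastSequenceNumber("order-1");

        boolean nodeA = store.append("order-1", seenByNodeA + 1, "event from node A");
        boolean nodeB = store.append("order-1", seenByNodeB + 1, "event from node B");
        return new boolean[] {nodeA, nodeB};
    }

    /** Node B loads the aggregate state only after node A has finished writing. */
    static boolean[] simulateSerializedWrites() {
        InMemoryEventStore store = new InMemoryEventStore();
        store.append("order-1", 0, "OrderCreated");

        long seenByNodeA = store.lastSequenceNumber("order-1");
        boolean nodeA = store.append("order-1", seenByNodeA + 1, "event from node A");

        long seenByNodeB = store.lastSequenceNumber("order-1");
        boolean nodeB = store.append("order-1", seenByNodeB + 1, "event from node B");
        return new boolean[] {nodeA, nodeB};
    }

    public static void main(String[] args) {
        boolean[] stale = simulateStaleWrites();
        boolean[] serialized = simulateSerializedWrites();
        System.out.println("stale reads:      A=" + stale[0] + ", B=" + stale[1]);
        System.out.println("serialized reads: A=" + serialized[0] + ", B=" + serialized[1]);
    }
}
```

In the first scenario node B's write is rejected because it computed the same sequence number as node A. That matches what I see in the table, which makes me suspect that the second node reads the aggregate's state before the first node's write is visible to it, even though a lock is in place.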
I’m also wondering if removing the LockFactory altogether and switching to the DistributedCommandBus would potentially fix this issue.
Any insights would be much appreciated. Thanks!