Custom LockFactory on distributed system results in primary key violation

Ben_Walford · February 1, 2018, 10:49pm

Hi all,

I am currently working on a distributed system using Axon 3. We have multiple nodes running the same application concurrently, all of which access the same database. The application is configured to use the AsynchronousCommandBus as well as a custom LockFactory implementation (which uses ZooKeeper to manage locks with a pessimistic lock strategy). The problem we’re running into is that occasionally two nodes will dispatch a command for the same aggregate root at almost the same time. When this happens, one node successfully handles the command and persists an event, but the other node fails to process the command due to a BatchUpdateException. The root cause isn’t logged, but I’m certain it’s a primary key violation: when I inspect the domain_event_entry table, the event persisted by the first node has the same sequence_number as the event the second node attempted to persist.

So my question is: why isn’t the LockManager forcing the nodes to persist their events sequentially? I would expect one node to hold the lock until it has completely finished processing, then release it and allow the other node to process the next command with the next sequence_number.

I’m also wondering if removing the LockFactory altogether and switching to the DistributedCommandBus would potentially fix this issue.

Any insights would be much appreciated. Thanks!

Steven_van_Beelen · February 2, 2018, 10:45am

Hi Ben,

Using the DistributedCommandBus will solve your issue here, as it will use a RoutingStrategy to determine which node is going to handle a command.

The default RoutingStrategy used is the AnnotationRoutingStrategy, which in turn defaults to pulling the @TargetAggregateIdentifier from the command message and using it as the routing key.

That setup will thus guarantee that there’s always exactly one node which handles all the commands for a given aggregate.

That in turn should ensure that no events with identical sequence numbers for a given aggregate will be inserted.
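
The routing idea can be illustrated with a minimal sketch in plain Java (no Axon classes involved). It assumes a fixed list of node names and a simple modulo over the routing key's hash; the class `RoutingSketch`, the method `nodeFor`, and the node names are hypothetical, and the DistributedCommandBus uses its own mechanism to spread routing keys over cluster members. The only point shown is that the same aggregate identifier always lands on the same node.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: this is not how Axon implements routing.
class RoutingSketch {

    /**
     * Picks the node responsible for a routing key. While the node list is
     * unchanged, the same key always maps to the same node.
     */
    static String nodeFor(String routingKey, List<String> nodes) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(routingKey.hashCode(), nodes.size());
        return nodes.get(bucket);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c");
        // "order-1" appears twice and is routed to the same node both times.
        for (String aggregateId : Arrays.asList("order-1", "order-2", "order-1")) {
            System.out.println(aggregateId + " -> " + nodeFor(aggregateId, nodes));
        }
    }
}
```

Because every command for a given aggregate is handled on one node, that node's own lock is enough to serialize them, and no cross-node lock is needed for this purpose.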

Hope this helps you out!

Cheers,

Steven

Ben_Walford · February 2, 2018, 7:39pm

Thanks Steven. I am certainly going to start looking into using the DistributedCommandBus. It sounds like that would be the best solution. However, I’m still curious as to why the LockFactory implementation isn’t working. Shouldn’t a lock be acquired during command/event processing? I ask mostly because we have multiple other services using Axon 2 with the same ZooKeeper-backed locking mechanism, yet we don’t have this issue there.

Steven_van_Beelen · February 5, 2018, 11:58am

Hi Ben,

Glad to have been of help in that part! Please feel free to drop any additional questions if you have them.

For the LockFactory part: a Lock for an Aggregate is acquired once the aggregate is loaded from the Repository, since it’s the LockingRepository implementation which receives the LockFactory implementation.
So it’s indeed true that the command handling part is locked by one instance.

Whether the event processing part is also locked depends on whether the Event Processors you use are called in the same thread as the command handling section.

That said, I’m not sure why your custom LockFactory would fail in some scenarios now that you’ve upgraded to Axon 3.

I am, however, fairly sure I haven’t seen or heard of such a scenario with Axon Framework’s internal LockFactory implementations…

Maybe somebody else with more insight into locking approaches (together with ZooKeeper) could give you a hint, but I’m (sadly) a bit in the dark here.

Cheers,

Steven

Ben_Walford · February 6, 2018, 4:48pm

Hi Steven,

Knowing that the locking should have been happening made me very skeptical that the application configuration was working as I expected, and that was indeed the problem. The project uses Spring Boot, so when we moved to Axon 3 we provided less explicit configuration and relied on Spring auto-configuration instead. It turns out that providing a lock manager in the context was not sufficient: Axon was actually creating its own instance of the lock manager as opposed to using the bean in the context. Once I added explicit configuration for an EventSourcingRepository using the ZK lock manager, everything worked as expected.
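
The failure mode can be reproduced in a small, deterministic sketch in plain Java (no Axon or ZooKeeper classes). It assumes that a lock instance created inside one node is invisible to the other node, which is what effectively happens when each node falls back to its own lock manager. All names here (`LockScopeSketch`, `EventTable`, `LockRegistry`, `Node`, `collides`) are hypothetical stand-ins, not Axon's API: `EventTable` plays the role of the domain_event_entry table with its unique aggregate/sequence key, and one `LockRegistry` instance models one lock scope.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: none of these types exist in Axon.
class LockScopeSketch {

    /** Stand-in for domain_event_entry: (aggregateId, sequenceNumber) must be unique. */
    static class EventTable {
        private final Set<String> keys = new HashSet<>();
        private final Map<String, Long> lastSeq = new HashMap<>();

        long lastSequenceNumber(String aggregateId) {
            return lastSeq.getOrDefault(aggregateId, -1L);
        }

        void insert(String aggregateId, long seq) {
            if (!keys.add(aggregateId + "#" + seq)) {
                throw new IllegalStateException("duplicate key " + aggregateId + "#" + seq);
            }
            lastSeq.merge(aggregateId, seq, Math::max);
        }
    }

    /** One instance models one lock scope (per-node or shared by all nodes). */
    static class LockRegistry {
        private final Set<String> held = new HashSet<>();
        boolean tryLock(String id) { return held.add(id); }
        void unlock(String id) { held.remove(id); }
    }

    /** One application node: locks on load, appends on commit, then unlocks. */
    static class Node {
        private final LockRegistry locks;
        private final EventTable table;
        private long loadedSeq;

        Node(LockRegistry locks, EventTable table) {
            this.locks = locks;
            this.table = table;
        }

        boolean load(String aggregateId) {
            if (!locks.tryLock(aggregateId)) {
                return false; // someone in the same lock scope holds the aggregate
            }
            loadedSeq = table.lastSequenceNumber(aggregateId);
            return true;
        }

        void commit(String aggregateId) {
            try {
                table.insert(aggregateId, loadedSeq + 1);
            } finally {
                locks.unlock(aggregateId);
            }
        }
    }

    /** Interleaving: A loads, B loads, A commits, B commits. True if B hits a duplicate key. */
    static boolean collides(LockRegistry locksA, LockRegistry locksB) {
        EventTable table = new EventTable();
        Node a = new Node(locksA, table);
        Node b = new Node(locksB, table);
        a.load("order-1");
        boolean bLoadedEarly = b.load("order-1");
        a.commit("order-1");
        if (!bLoadedEarly) {
            b.load("order-1"); // lock is free now, so B sees A's event
        }
        try {
            b.commit("order-1");
            return false;
        } catch (IllegalStateException duplicateKey) {
            return true;
        }
    }

    public static void main(String[] args) {
        LockRegistry shared = new LockRegistry();
        System.out.println("separate lock instances collide: "
                + collides(new LockRegistry(), new LockRegistry()));
        System.out.println("shared lock instance collides: " + collides(shared, shared));
    }
}
```

With separate registries both nodes load the aggregate at the same sequence number and the second insert is rejected; with one shared registry the second node has to wait and picks up the next sequence number. That matches the behaviour seen before and after wiring the ZK lock manager into the repository explicitly.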

Thanks for the help!
Ben

Steven_van_Beelen · February 8, 2018, 8:44am

Hi Ben,

Ah good, great to hear you’ve figured it out!

Cheers,

Steven