PessimisticLockFactory throws LockAcquisitionFailedException for deadline triggers

Greetings,

Some background: we have an aggregate that represents a bucket of work to be done. This bucket has a quota of how many tasks it can perform per day. Every day, the quota refreshes, so that the bucket may perform more work that day.

To refresh the bucket size, a new deadline is scheduled every time a refresh deadline runs. Effectively, it is just a never-ending loop of: schedule deadline -> deadline triggered -> refresh quota -> schedule deadline.

We run two instances of our application in production for high availability (but both run against the same DB). Recently, one of them started throwing the following error (and has been doing so for quite a few days now)

org.axonframework.common.lock.LockAcquisitionFailedException: Failed to acquire lock for aggregate identifier(AGG_ID), maximum attempts exceeded (2147483647)

I originally thought, because we run two nodes in production, that somehow node A had acquired a DB lock that node B needed, and node A was not releasing the lock, but then realized that it is not even making any trips to the DB. Rather, node A is unable to acquire the lock when it calls the PessimisticLockFactory.DisposableLock#lock() method, which would indicate that there is an org.axonframework.common.lock.Lock that it cannot acquire, within the same JVM instance.

Later today, I will try and scale production down to just 1 node, to see if that frees the lock.

In the meantime, I was wondering if anyone had seen this happen before. I noticed that the DisposableLock class is an inner class, but that it is not static. I am wondering if there is an instance of the PessimisticLockFactory that should have gotten cleaned up, but didn’t?

Below is the full stack trace (partly cut off due to how new relic displays it, but I think you can still make out what package each class is from):

caused by org.axonframework.messaging.ExecutionException: Failed to send a DeadlineMessage for scope [AggregateScopeDescriptor for type [WorkBasket] and identifier [AGG_ID]]

…e.quartz.DeadlineJob.lambda$executeScheduledDeadline$2 (DeadlineJob.java:167)

java.util.stream.ForEachOps$ForEachOp$OfRef.accept (ForEachOps.java:183)

java.util.stream.ReferencePipeline$2$1.accept (ReferencePipeline.java:175)

…a.util.ArrayList$ArrayListSpliterator.forEachRemaining (ArrayList.java:1382)

java.util.stream.AbstractPipeline.copyInto (AbstractPipeline.java:482)

java.util.stream.AbstractPipeline.wrapAndCopyInto (AbstractPipeline.java:472)

…va.util.stream.ForEachOps$ForEachOp.evaluateSequential (ForEachOps.java:150)

…l.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential (ForEachOps.java:173)

java.util.stream.AbstractPipeline.evaluate (AbstractPipeline.java:234)

java.util.stream.ReferencePipeline.forEach (ReferencePipeline.java:485)

…k.deadline.quartz.DeadlineJob.executeScheduledDeadline (DeadlineJob.java:159)

…framework.deadline.quartz.DeadlineJob.lambda$execute$0 (DeadlineJob.java:128)

…xonframework.messaging.DefaultInterceptorChain.proceed (DefaultInterceptorChain.java:57)

…ssaging.interceptors.CorrelationDataInterceptor.handle (CorrelationDataInterceptor.java:65)

…xonframework.messaging.DefaultInterceptorChain.proceed (DefaultInterceptorChain.java:55)

…ssaging.unitofwork.DefaultUnitOfWork.executeWithResult (DefaultUnitOfWork.java:74)

…work.messaging.unitofwork.UnitOfWork.executeWithResult (UnitOfWork.java:328)

org.axonframework.deadline.quartz.DeadlineJob.execute (DeadlineJob.java:134)

org.quartz.core.JobRunShell.run (JobRunShell.java:202)

org.quartz.simpl.SimpleThreadPool$WorkerThread.run (SimpleThreadPool.java:573)

caused by org.axonframework.common.lock.LockAcquisitionFailedException: Failed to acquire lock for aggregate identifier(AGG_ID), maximum attempts exceeded (2147483647)

…common.lock.PessimisticLockFactory$DisposableLock.lock (PessimisticLockFactory.java:266)

…ramework.common.lock.PessimisticLockFactory.obtainLock (PessimisticLockFactory.java:119)

…onframework.modelling.command.LockingRepository.doLoad (LockingRepository.java:116)

…onframework.modelling.command.LockingRepository.doLoad (LockingRepository.java:52)

…ork.modelling.command.AbstractRepository.lambda$load$4 (AbstractRepository.java:122)

java.util.HashMap.computeIfAbsent (HashMap.java:1127)

…xonframework.modelling.command.AbstractRepository.load (AbstractRepository.java:121)

…xonframework.modelling.command.AbstractRepository.load (AbstractRepository.java:146)

…xonframework.modelling.command.AbstractRepository.send (AbstractRepository.java:297)

…e.quartz.DeadlineJob.lambda$executeScheduledDeadline$2 (DeadlineJob.java:161)

java.util.stream.ForEachOps$ForEachOp$OfRef.accept (ForEachOps.java:183)

java.util.stream.ReferencePipeline$2$1.accept (ReferencePipeline.java:175)

…a.util.ArrayList$ArrayListSpliterator.forEachRemaining (ArrayList.java:1382)

java.util.stream.AbstractPipeline.copyInto (AbstractPipeline.java:482)

java.util.stream.AbstractPipeline.wrapAndCopyInto (AbstractPipeline.java:472)

…va.util.stream.ForEachOps$ForEachOp.evaluateSequential (ForEachOps.java:150)

…l.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential (ForEachOps.java:173)

java.util.stream.AbstractPipeline.evaluate (AbstractPipeline.java:234)

java.util.stream.ReferencePipeline.forEach (ReferencePipeline.java:485)

…k.deadline.quartz.DeadlineJob.executeScheduledDeadline (DeadlineJob.java:159)

…framework.deadline.quartz.DeadlineJob.lambda$execute$0 (DeadlineJob.java:128)

…xonframework.messaging.DefaultInterceptorChain.proceed (DefaultInterceptorChain.java:57)

…ssaging.interceptors.CorrelationDataInterceptor.handle (CorrelationDataInterceptor.java:65)

…xonframework.messaging.DefaultInterceptorChain.proceed (DefaultInterceptorChain.java:55)

…ssaging.unitofwork.DefaultUnitOfWork.executeWithResult (DefaultUnitOfWork.java:74)

…work.messaging.unitofwork.UnitOfWork.executeWithResult (UnitOfWork.java:328)

org.axonframework.deadline.quartz.DeadlineJob.execute (DeadlineJob.java:134)

org.quartz.core.JobRunShell.run (JobRunShell.java:202)

org.quartz.simpl.SimpleThreadPool$WorkerThread.run (SimpleThreadPool.java:573)

Additional observation, it looks like the PessimisticLockFactory is (eventually) used by one of the following:

AnnotatedSagaRepository, EventSourcingRepository, GenericJpaRepository

Has anyone ever seen that one of these can obtain a PessimisticLockFactory$DisposableLock, and then sit on that lock indefinitely? That seems the likely reason why when the quartz job wakes up, it is unable to get the lock that is needs from the PessimisticLockFactory in order to perform the quota refresh on the aggregate.

Hi David,

sorry for the late reply. Which version are you using? We have seen it in the very distant past, where there was an issue in the Unit of Work. That was resolve in the early 3.2 versions, if I recall correctly. Nowadays, we can typically trace these back to an application level deadlock, where a send and wait in a Saga, for example, causes some issues.

Do you see this happening frequently? If so, could you create a thread dump at the time this occurs?