Deadline Manager error while sending message to saga

Hi all!

We are running an Axon Framework application with the Quartz deadline manager and periodically run into the following problem:

Aug 24 16:27:30.074 +03:00 somehost service: [EventScheduler_Worker-4-53189][,][ERROR] : Exception occurred during processing a deadline job which will be retried [ru.mcommerce.ewallet.autopay.domain.service.retriever.events.CardInfoRetrievalTimeout]
org.axonframework.messaging.ExecutionException: Failed to send a DeadlineMessage for scope [SagaScopeDescriptor for type [CardRegistrationSaga] and identifier [0a014742-f673-4de0-a50d-e904c437129b]]
    at org.axonframework.deadline.quartz.DeadlineJob.lambda$executeScheduledDeadline$2(DeadlineJob.java:167)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.axonframework.deadline.quartz.DeadlineJob.executeScheduledDeadline(DeadlineJob.java:159)
    at org.axonframework.deadline.quartz.DeadlineJob.lambda$execute$0(DeadlineJob.java:128)
    at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:57)
    at org.axonframework.messaging.unitofwork.DefaultUnitOfWork.executeWithResult(DefaultUnitOfWork.java:74)
    at org.axonframework.messaging.unitofwork.UnitOfWork.executeWithResult(UnitOfWork.java:328)
    at org.axonframework.deadline.quartz.DeadlineJob.execute(DeadlineJob.java:134)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.axonframework.common.lock.LockAcquisitionFailedException: Failed to acquire lock for aggregate identifier(0a014742-f673-4de0-a50d-e904c437129b), maximum attempts exceeded (100)
    at org.axonframework.common.lock.PessimisticLockFactory$DisposableLock.lock(PessimisticLockFactory.java:269)
    at org.axonframework.common.lock.PessimisticLockFactory.obtainLock(PessimisticLockFactory.java:122)
    at org.axonframework.modelling.saga.repository.LockingSagaRepository.lockSagaAccess(LockingSagaRepository.java:81)
    at org.axonframework.modelling.saga.repository.LockingSagaRepository.load(LockingSagaRepository.java:63)
    at org.axonframework.modelling.saga.AbstractSagaManager.send(AbstractSagaManager.java:211)
    at org.axonframework.deadline.quartz.DeadlineJob.lambda$executeScheduledDeadline$2(DeadlineJob.java:161)
    ... 17 common frames omitted

Only a restart of the service helps. Has anyone faced the same issue?

Versions: JVM 1.8, Axon Framework 4.3.5 and Axon Server 4.3.4
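
For context, the deadline in question is scheduled from the saga roughly like this (a simplified sketch; the saga and payload class names are taken from the log above, while the triggering event, association property, deadline name and timeout are only illustrative):

    import java.time.Duration;

    import org.axonframework.deadline.DeadlineManager;
    import org.axonframework.deadline.annotation.DeadlineHandler;
    import org.axonframework.modelling.saga.SagaEventHandler;
    import org.axonframework.modelling.saga.StartSaga;
    import org.axonframework.spring.stereotype.Saga;
    import org.springframework.beans.factory.annotation.Autowired;

    @Saga
    public class CardRegistrationSaga {

        @Autowired
        private transient DeadlineManager deadlineManager;

        @StartSaga
        @SagaEventHandler(associationProperty = "cardId")
        public void on(CardRegistrationStarted event) {
            // give the card info retrieval a limited amount of time before the saga times out
            deadlineManager.schedule(Duration.ofMinutes(5),
                                     "cardInfoRetrievalTimeout",
                                     new CardInfoRetrievalTimeout(event.getCardId()));
        }

        @DeadlineHandler(deadlineName = "cardInfoRetrievalTimeout")
        public void on(CardInfoRetrievalTimeout timeout) {
            // compensate and end the saga when the card info never arrives
        }
    }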

Kind regards,
Alexey.

Hi Alexey,

it seems it is failing to acquire the lock necessary to access a Saga instance. Could it be that there is a long-running process on the Saga that prevents these deadline jobs from acquiring the lock?

Note that the exception says "Failed to acquire lock for aggregate", which is an incorrect message; we'll fix that :). From the rest of the stack trace, it becomes evident that this is about Sagas.
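
As a side note: if the lock turns out to be genuinely held by a long-running handler (rather than leaked), you can make the deadline job more patient by giving the saga repository a differently configured lock factory. A rough sketch, purely as a mitigation and assuming you wire the repository yourself:

    import org.axonframework.common.lock.PessimisticLockFactory;

    // The stack trace shows the deadline job gave up after 100 acquisition attempts.
    // A more patient lock factory simply waits longer before throwing LockAcquisitionFailedException.
    PessimisticLockFactory patientLockFactory = PessimisticLockFactory.builder()
            .acquireAttempts(1000)      // number of tryLock attempts before giving up
            .lockAttemptTimeout(600L)   // time to wait per attempt, in milliseconds
            .build();
    // Pass this factory to the saga repository, e.g. via AnnotatedSagaRepository.Builder#lockFactory(..),
    // depending on how your sagas are configured.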

Have you seen any other exceptions related to Sagas in your logs?

Allard Buijze

CTO & Founder, AxonIQ

Hi Allard,

thanks for the response. There are some exceptions in the logs:

  • Aug 26 13:18:37.613 +03:00 [EventProcessor[CardRegistrationSagaProcessor]-0-22476][,][WARN] : Error occurred. Starting retry mode.
    org.springframework.transaction.TransactionSystemException: Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection
        at org.springframework.orm.jpa.JpaTransactionManager.doRollback(JpaTransactionManager.java:564)
        at org.springframework.transaction.support.AbstractPlatformTransactionManager.processRollback(AbstractPlatformTransactionManager.java:838)
        at org.springframework.transaction.support.AbstractPlatformTransactionManager.rollback(AbstractPlatformTransactionManager.java:812)
        at org.axonframework.spring.messaging.unitofwork.SpringTransactionManager.rollbackTransaction(SpringTransactionManager.java:91)
        at org.axonframework.spring.messaging.unitofwork.SpringTransactionManager$1.rollback(SpringTransactionManager.java:68)
        at org.axonframework.messaging.unitofwork.UnitOfWork.lambda$attachTransaction$1(UnitOfWork.java:274)
        at org.axonframework.messaging.unitofwork.MessageProcessingContext.notifyHandlers(MessageProcessingContext.java:71)
        at org.axonframework.messaging.unitofwork.BatchingUnitOfWork.lambda$notifyHandlers$2(BatchingUnitOfWork.java:155)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at org.axonframework.messaging.unitofwork.BatchingUnitOfWork.notifyHandlers(BatchingUnitOfWork.java:155)
        at org.axonframework.messaging.unitofwork.AbstractUnitOfWork.changePhase(AbstractUnitOfWork.java:222)
        at org.axonframework.messaging.unitofwork.AbstractUnitOfWork.rollback(AbstractUnitOfWork.java:128)
        at org.axonframework.messaging.unitofwork.BatchingUnitOfWork.executeWithResult(BatchingUnitOfWork.java:96)
        at org.axonframework.eventhandling.AbstractEventProcessor.processInUnitOfWork(AbstractEventProcessor.java:159)
        at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:438)
        at org.axonframework.eventhandling.TrackingEventProcessor.processingLoop(TrackingEventProcessor.java:301)
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1161)
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.run(TrackingEventProcessor.java:1276)
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1171)
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.run(TrackingEventProcessor.java:1276)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: org.hibernate.TransactionException: Unable to rollback against JDBC Connection
        at org.hibernate.resource.jdbc.internal.AbstractLogicalConnectionImplementor.rollback(AbstractLogicalConnectionImplementor.java:122)
        at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.rollback(JdbcResourceLocalTransactionCoordinatorImpl.java:294)
        at org.hibernate.engine.transaction.internal.TransactionImpl.rollback(TransactionImpl.java:145)
        at org.springframework.orm.jpa.JpaTransactionManager.doRollback(JpaTransactionManager.java:560)
        ... 20 common frames omitted
    Caused by: java.sql.SQLException: The connection is closed: The connection is closed
        at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:441)
        at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:425)
        at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:308)
        at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
        at com.sun.proxy.$Proxy204.rollback(Unknown Source)
        at org.hibernate.resource.jdbc.internal.AbstractLogicalConnectionImplementor.rollback(AbstractLogicalConnectionImplementor.java:116)
        ... 23 common frames omitted
  • Aug 26 13:18:37.613 +03:00 [EventProcessor[CardRegistrationSagaProcessor]-0-22476][,][WARN] : Releasing claim on token and preparing for retry in 1s

I've collected a thread dump, and it shows that the saga event processor thread holds several locks, which blocks the scheduler threads from acquiring the lock on the saga.

{
  "threadName": "EventProcessor[CardRegistrationSagaProcessor]-0",
  "threadId": 78,
  "blockedTime": -1,
  "blockedCount": 24,
  "waitedTime": -1,
  "waitedCount": 1579247,
  "lockOwnerId": -1,
  "inNative": false,
  "suspended": false,
  "threadState": "TIMED_WAITING",
  "stackTrace": [
    { "methodName": "sleep", "fileName": "Thread.java", "lineNumber": -2, "className": "java.lang.Thread", "nativeMethod": true },
    { "methodName": "doSleepFor", "fileName": "TrackingEventProcessor.java", "lineNumber": 764, "className": "org.axonframework.eventhandling.TrackingEventProcessor", "nativeMethod": false },
    { "methodName": "run", "fileName": "TrackingEventProcessor.java", "lineNumber": 1279, "className": "org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher", "nativeMethod": false },
    { "methodName": "run", "fileName": "TrackingEventProcessor.java", "lineNumber": 1171, "className": "org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker", "nativeMethod": false },
    { "methodName": "run", "fileName": "TrackingEventProcessor.java", "lineNumber": 1276, "className": "org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher", "nativeMethod": false },
    { "methodName": "run", "fileName": "Thread.java", "lineNumber": 748, "className": "java.lang.Thread", "nativeMethod": false }
  ],
  "lockedMonitors": [],
  "lockedSynchronizers": [
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 1291360426 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 1449345998 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 1475016286 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 336590891 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 954289854 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 1951401823 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 1946699661 },
    { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 523057031 }
  ]
},
{
  "threadName": "EventScheduler_Worker-6",
  "threadId": 36,
  "blockedTime": -1,
  "blockedCount": 1786,
  "waitedTime": -1,
  "waitedCount": 317635,
  "lockName": "java.util.concurrent.locks.ReentrantLock$NonfairSync@566343ce",
  "lockOwnerId": 78,
  "lockOwnerName": "EventProcessor[CardRegistrationSagaProcessor]-0",
  "inNative": false,
  "suspended": false,
  "threadState": "TIMED_WAITING",
  "stackTrace": [
    { "methodName": "park", "fileName": "Unsafe.java", "lineNumber": -2, "className": "sun.misc.Unsafe", "nativeMethod": true },
    { "methodName": "parkNanos", "fileName": "LockSupport.java", "lineNumber": 215, "className": "java.util.concurrent.locks.LockSupport", "nativeMethod": false },
    { "methodName": "doAcquireNanos", "fileName": "AbstractQueuedSynchronizer.java", "lineNumber": 934, "className": "java.util.concurrent.locks.AbstractQueuedSynchronizer", "nativeMethod": false },
    { "methodName": "tryAcquireNanos", "fileName": "AbstractQueuedSynchronizer.java", "lineNumber": 1247, "className": "java.util.concurrent.locks.AbstractQueuedSynchronizer", "nativeMethod": false },
    { "methodName": "tryLock", "fileName": "ReentrantLock.java", "lineNumber": 442, "className": "java.util.concurrent.locks.ReentrantLock", "nativeMethod": false },
    { "methodName": "lock", "fileName": "PessimisticLockFactory.java", "lineNumber": 272, "className": "org.axonframework.common.lock.PessimisticLockFactory$DisposableLock", "nativeMethod": false },
    { "methodName": "obtainLock", "fileName": "PessimisticLockFactory.java", "lineNumber": 122, "className": "org.axonframework.common.lock.PessimisticLockFactory", "nativeMethod": false },
    { "methodName": "lockSagaAccess", "fileName": "LockingSagaRepository.java", "lineNumber": 81, "className": "org.axonframework.modelling.saga.repository.LockingSagaRepository", "nativeMethod": false },
    { "methodName": "load", "fileName": "LockingSagaRepository.java", "lineNumber": 63, "className": "org.axonframework.modelling.saga.repository.LockingSagaRepository", "nativeMethod": false },
    { "methodName": "send", "fileName": "AbstractSagaManager.java", "lineNumber": 211, "className": "org.axonframework.modelling.saga.AbstractSagaManager", "nativeMethod": false },
    { "methodName": "lambda$executeScheduledDeadline$2", "fileName": "DeadlineJob.java", "lineNumber": 161, "className": "org.axonframework.deadline.quartz.DeadlineJob", "nativeMethod": false },
    { "methodName": "accept", "lineNumber": -1, "className": "org.axonframework.deadline.quartz.DeadlineJob$$Lambda$2089/1733249645", "nativeMethod": false },
    { "methodName": "accept", "fileName": "ForEachOps.java", "lineNumber": 184, "className": "java.util.stream.ForEachOps$ForEachOp$OfRef", "nativeMethod": false },
    { "methodName": "accept", "fileName": "ReferencePipeline.java", "lineNumber": 175, "className": "java.util.stream.ReferencePipeline$2$1", "nativeMethod": false },
    { "methodName": "forEachRemaining", "fileName": "ArrayList.java", "lineNumber": 1374, "className": "java.util.ArrayList$ArrayListSpliterator", "nativeMethod": false },
    { "methodName": "copyInto", "fileName": "AbstractPipeline.java", "lineNumber": 481, "className": "java.util.stream.AbstractPipeline", "nativeMethod": false },
    { "methodName": "wrapAndCopyInto", "fileName": "AbstractPipeline.java", "lineNumber": 471, "className": "java.util.stream.AbstractPipeline", "nativeMethod": false },
    { "methodName": "evaluateSequential", "fileName": "ForEachOps.java", "lineNumber": 151, "className": "java.util.stream.ForEachOps$ForEachOp", "nativeMethod": false },
    { "methodName": "evaluateSequential", "fileName": "ForEachOps.java", "lineNumber": 174, "className": "java.util.stream.ForEachOps$ForEachOp$OfRef", "nativeMethod": false },
    { "methodName": "evaluate", "fileName": "AbstractPipeline.java", "lineNumber": 234, "className": "java.util.stream.AbstractPipeline", "nativeMethod": false },
    { "methodName": "forEach", "fileName": "ReferencePipeline.java", "lineNumber": 418, "className": "java.util.stream.ReferencePipeline", "nativeMethod": false },
    { "methodName": "executeScheduledDeadline", "fileName": "DeadlineJob.java", "lineNumber": 159, "className": "org.axonframework.deadline.quartz.DeadlineJob", "nativeMethod": false },
    { "methodName": "lambda$execute$0", "fileName": "DeadlineJob.java", "lineNumber": 128, "className": "org.axonframework.deadline.quartz.DeadlineJob", "nativeMethod": false },
    { "methodName": "handle", "lineNumber": -1, "className": "org.axonframework.deadline.quartz.DeadlineJob$$Lambda$2081/1117718094", "nativeMethod": false },
    { "methodName": "proceed", "fileName": "DefaultInterceptorChain.java", "lineNumber": 57, "className": "org.axonframework.messaging.DefaultInterceptorChain", "nativeMethod": false },
    { "methodName": "call", "lineNumber": -1, "className": "org.axonframework.deadline.quartz.DeadlineJob$$Lambda$2082/1184930501", "nativeMethod": false },
    { "methodName": "executeWithResult", "fileName": "DefaultUnitOfWork.java", "lineNumber": 74, "className": "org.axonframework.messaging.unitofwork.DefaultUnitOfWork", "nativeMethod": false },
    { "methodName": "executeWithResult", "fileName": "UnitOfWork.java", "lineNumber": 328, "className": "org.axonframework.messaging.unitofwork.UnitOfWork", "nativeMethod": false },
    { "methodName": "execute", "fileName": "DeadlineJob.java", "lineNumber": 134, "className": "org.axonframework.deadline.quartz.DeadlineJob", "nativeMethod": false },
    { "methodName": "run", "fileName": "JobRunShell.java", "lineNumber": 202, "className": "org.quartz.core.JobRunShell", "nativeMethod": false },
    { "methodName": "run", "fileName": "SimpleThreadPool.java", "lineNumber": 573, "className": "org.quartz.simpl.SimpleThreadPool$WorkerThread", "nativeMethod": false }
  ],
  "lockedMonitors": [
    { "className": "java.lang.Object", "identityHashCode": 980210039, "lockedStackDepth": 30, "lockedStackFrame": { "methodName": "run", "fileName": "SimpleThreadPool.java", "lineNumber": 573, "className": "org.quartz.simpl.SimpleThreadPool$WorkerThread", "nativeMethod": false } }
  ],
  "lockedSynchronizers": [],
  "lockInfo": { "className": "java.util.concurrent.locks.ReentrantLock$NonfairSync", "identityHashCode": 1449345998 }
}
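
For anyone who wants to check the same thing in their own instance: the held ReentrantLocks only show up when the dump is taken with ownable synchronizers included, for example via the ThreadMXBean (a minimal sketch, not necessarily the exact tooling we used):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.Arrays;

    public class SagaLockDump {

        // Call this from inside the running application (e.g. a debug endpoint),
        // since it only sees the threads of its own JVM.
        public static void dump() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // lockedMonitors = true, lockedSynchronizers = true; without the latter,
            // the ReentrantLocks held by the event processor thread are not reported
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                String name = info.getThreadName();
                if (name.startsWith("EventProcessor") || name.startsWith("EventScheduler_Worker")) {
                    System.out.printf("%s (%s) waiting on %s owned by %s%n",
                            name, info.getThreadState(), info.getLockName(), info.getLockOwnerName());
                    System.out.println("  holds: " + Arrays.toString(info.getLockedSynchronizers()));
                }
            }
        }
    }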

Kind regards,
Alexey.

Thursday, 27 August 2020 at 17:57:15 UTC+3, Allard Buijze:

Hi Alexey,

in anticipation of your response, I have been scouting for a code path where a lock might not be properly released.

I found one, and it was slightly beyond the obvious. Apparently, it is possible that when a lifecycle handler of a Unit of Work throws an exception, some of the other handlers are never invoked. One of those handlers could be the one that releases the lock.
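
To illustrate the mechanism (a deliberately simplified sketch, not the actual Unit of Work code):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.locks.ReentrantLock;

    public class LeakedLockIllustration {

        // Simplified: handlers registered for a Unit of Work phase are invoked one after another.
        static void invokeAll(List<Runnable> handlers) {
            for (Runnable handler : handlers) {
                handler.run(); // if one handler throws, the remaining ones are never invoked
            }
        }

        public static void main(String[] args) {
            ReentrantLock sagaLock = new ReentrantLock();
            sagaLock.lock(); // acquired when the saga instance was loaded

            List<Runnable> rollbackHandlers = Arrays.asList(
                    () -> { throw new IllegalStateException("Unable to rollback against JDBC Connection"); },
                    sagaLock::unlock // the handler that would release the saga lock
            );

            try {
                invokeAll(rollbackHandlers);
            } catch (RuntimeException e) {
                System.out.println("rollback failed: " + e.getMessage());
            }

            // The lock is still held, so every later attempt to load the saga
            // (e.g. by the Quartz deadline job) times out with LockAcquisitionFailedException.
            System.out.println("saga lock still held: " + sagaLock.isLocked());
        }
    }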

The fact that you have the exception about a failing rollback is an indication that this may indeed have been the case for you, too.

We’ve also had reports of connection leaks which were very hard to debug. I suspect they may have been caused by the same underlying issue.

The fix was simple and is scheduled to be included in the 4.4.3 release. We won't wait too long before releasing it.