Failed to acquire lock for aggregate identifier

Why does this happen when sending a command?

Full exception:

```

org.axonframework.common.lock.LockAcquisitionFailedException: Failed to acquire lock for aggregate identifier(f85c6992-3f4d-4dae-afe8-53c071d063b6), maximum attempts exceeded (2147483647)
at org.axonframework.common.lock.PessimisticLockFactory$DisposableLock.lock(PessimisticLockFactory.java:266) ~[axon-messaging-4.0.jar:4.0]
at org.axonframework.common.lock.PessimisticLockFactory.obtainLock(PessimisticLockFactory.java:119) ~[axon-messaging-4.0.jar:4.0]
at org.axonframework.modelling.command.LockingRepository.doLoad(LockingRepository.java:116) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.modelling.command.LockingRepository.doLoad(LockingRepository.java:52) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.modelling.command.AbstractRepository.lambda$load$4(AbstractRepository.java:116) ~[axon-modelling-4.0.jar:4.0]
at java.util.HashMap.computeIfAbsent(HashMap.java:1127) ~[na:1.8.0_162]
at org.axonframework.modelling.command.AbstractRepository.load(AbstractRepository.java:115) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.modelling.command.AggregateAnnotationCommandHandler$AggregateCommandHandler.handle(AggregateAnnotationCommandHandler.java:364) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.modelling.command.AggregateAnnotationCommandHandler$AggregateCommandHandler.handle(AggregateAnnotationCommandHandler.java:352) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.modelling.command.AggregateAnnotationCommandHandler.handle(AggregateAnnotationCommandHandler.java:140) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.modelling.command.AggregateAnnotationCommandHandler.handle(AggregateAnnotationCommandHandler.java:53) ~[axon-modelling-4.0.jar:4.0]
at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:57) ~[axon-messaging-4.0.jar:4.0]
at org.axonframework.messaging.interceptors.CorrelationDataInterceptor.handle(CorrelationDataInterceptor.java:65) ~[axon-messaging-4.0.jar:4.0]
at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:55) ~[axon-messaging-4.0.jar:4.0]
at org.axonframework.messaging.unitofwork.DefaultUnitOfWork.executeWithResult(DefaultUnitOfWork.java:74) ~[axon-messaging-4.0.jar:4.0]
at org.axonframework.commandhandling.SimpleCommandBus.handle(SimpleCommandBus.java:176) [axon-messaging-4.0.jar:4.0]
at org.axonframework.commandhandling.SimpleCommandBus.doDispatch(SimpleCommandBus.java:141) [axon-messaging-4.0.jar:4.0]
at org.axonframework.commandhandling.SimpleCommandBus.dispatch(SimpleCommandBus.java:110) [axon-messaging-4.0.jar:4.0]
at org.axonframework.commandhandling.gateway.AbstractCommandGateway.send(AbstractCommandGateway.java:75) [axon-messaging-4.0.jar:4.0]
at org.axonframework.commandhandling.gateway.DefaultCommandGateway.send(DefaultCommandGateway.java:78) [axon-messaging-4.0.jar:4.0]
at com.aramark.ess.command.domain.SyncScheduler$SyncExecutor.run(SyncScheduler.kt:69) [classes/:na]
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.0.9.RELEASE.jar:5.0.9.RELEASE]
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93) [spring-context-5.0.9.RELEASE.jar:5.0.9.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_162]
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) [na:1.8.0_162]
at java.util.concurrent.FutureTask.run(FutureTask.java) [na:1.8.0_162]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_162]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_162]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_162]

```

This exception is raised when Axon has failed to acquire a lock for an aggregate. There might be another process that holds the lock for too long, causing the lock acquisition to time out.

By default, the LockFactory makes 100 attempts of 600ms each to acquire a lock (checking for deadlocks in between). That means that if it fails, something has been holding that lock for more than 60 seconds.
While these defaults can be changed, the problem is typically elsewhere.
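To make the retry behavior described above concrete, here is a minimal, simplified sketch of "N attempts of M milliseconds each" lock acquisition using plain `java.util.concurrent`. This is an analogue for illustration only, not Axon's actual implementation (which also tracks lock owners and detects deadlocks between attempts):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockRetryDemo {

    // Simplified analogue of per-aggregate lock acquisition:
    // try up to `attempts` times, waiting `timeoutMs` per attempt.
    static boolean acquire(ReentrantLock lock, int attempts, long timeoutMs)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
            // Axon checks for deadlocks between attempts; omitted here.
        }
        // At this point Axon would throw LockAcquisitionFailedException.
        return false;
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        // Uncontended: succeeds on the first attempt.
        System.out.println(acquire(lock, 3, 10)); // prints true

        // Another thread holds the lock for 500ms, longer than our
        // total budget of 3 attempts x 10ms = 30ms.
        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                Thread.sleep(500);
            } catch (InterruptedException ignored) {
            } finally {
                lock.unlock();
            }
        });
        holder.start();
        Thread.sleep(50); // give the holder time to grab the lock
        System.out.println(acquire(lock, 3, 10)); // prints false
        holder.join();
    }
}
```

With the real defaults mentioned above (100 attempts of 600ms), a failure therefore means another unit of work held the same aggregate's lock for over a minute.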

Cheers,

Hi Team Axon!

Today, this is exactly what happened in our production environment for the first time in 5 years. We receive a steady stream of large and small update requests from our integration partners. We got a message from GCP that our app server had just had a Java heap out-of-memory exception, but we didn’t know what caused it. We restarted the instance, only to have it immediately run out of memory again after one of our partners sent us a REST payload. We have been handling payloads of this size for a long time, so we looked deeper into what was causing the out-of-memory errors. For whatever reason, Axon was not happy today and was throwing this exception. We didn’t see the exception until we increased the app server memory, though, so I have a feeling that the code responsible for locking the aggregate carries a performance hit. We have 12 GB of RAM on each app server, and profiling our application does not show any memory leaks (that we have been able to find so far).

So for some reason, we received locking exceptions for payloads headed to a specific aggregate identifier. The server shut down after an out-of-memory error caused by the LockException mentioned in this thread. We doubled the RAM and no longer experience out-of-memory errors, but we could still see the locking exception occurring at times.

I would be interested to know how to better figure out how to avoid these exceptions. We get a steady stream of events coming in through our REST endpoints and web services. The payloads have never been an issue in 5 years, and they have not changed. The aggregate it’s occurring on has a list on it that can grow and shrink. Potentially, if it received thousands of records across multiple payloads at the same time, the CPU time needed to add or remove list entries might mean the first request hadn’t finished yet when the second one came in. I was just not able to see that in the log; there was no difference in the size of our payloads today, and our commands execute super fast anyway.

Thoughts?

To follow up on this thread with a bit more information.

This aggregate is very frequently updated. I can see one entry for this specific aggregate identifier in our snapshot repo.

Attached is the monitoring log for our MongoDb event store on GCP. I tried to attach a png, gif, or jpg, but the forum won’t let me. Dropbox links here: https://www.dropbox.com/s/w4b44rkf9ydmqu9/monitoring.png?dl=0 and here: https://www.dropbox.com/s/ljotv9l2w5jnwse/monitoring2.png?dl=0

The event store was very busy at the time the server went down (10:00 am), which might be due to the locking exceptions occurring. I'm just not sure what it was doing at that time. There are regular spikes on this chart, which are due to integration partners sending us data. Should the spikes be that high?

Any thoughts on this is appreciated :slight_smile:

Another follow-up is needed here:

The high CPU usage on the db instance was due to a backup process occurring at the same time, so I do not think it was due to anything Axon was doing. However, I do think the failure to get the lock and the app server consequently running out of memory are related. It would be interesting to have an integration test that measures how much such an exception costs in terms of memory.

Hi Bjorn

Interesting issue you’ve run into with your team there.
To be fair, it’s quite tough to deduce the exact reason why this happens in your environment without knowing all the parameters your system consists of.

My first hunches lean towards:

  1. The Aggregate being accessed consists of so much data that, whilst it’s being event sourced, you run out of memory.
  2. It’s a combination of this one Aggregate being very big and a ton of commands being dispatched towards it.

Thus, the things I am missing to deduce whether this is sensible are the Axon Framework version, whether you have configured snapshotting, the number of events (and maybe snapshots) this one aggregate consists of, and the total number of commands being dispatched in quick succession.

Additionally, are you by any chance configuring the PessimisticLockFactory yourself too?
As of Axon 3.4 it’s possible to configure fields on the LockFactory (from which the `LockAcquisitionFailedException` originates), like the number of attempts to acquire a lock, the maximum number of queued commands (this might be useful in your situation!), and the lock attempt timeout.

By the way, if you would want to configure the LockFactory you will have to configure the Aggregate Repository for the Aggregate at hand.
The Repository constructor (Axon 3) / builder (Axon 4) allows a parameter/method to set the LockFactory for a given Repository.
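As a rough sketch of that wiring for Axon 4 (the builder methods match `PessimisticLockFactory.Builder`; `MyAggregate` and `eventStore` are placeholders for your own aggregate class and EventStore bean, and the threshold values are purely illustrative):

```java
import org.axonframework.common.lock.PessimisticLockFactory;
import org.axonframework.eventsourcing.EventSourcingRepository;

PessimisticLockFactory lockFactory = PessimisticLockFactory.builder()
        .acquireAttempts(100)       // how often to retry before giving up
        .queueLengthThreshold(50)   // max commands queued on one aggregate's lock
        .lockAttemptTimeout(600)    // milliseconds to wait per attempt
        .build();

EventSourcingRepository<MyAggregate> repository =
        EventSourcingRepository.builder(MyAggregate.class)
                .eventStore(eventStore)
                .lockFactory(lockFactory)
                .build();
```

Lowering `queueLengthThreshold` makes commands fail fast instead of piling up in memory behind a contended aggregate, which is why it might be relevant to the out-of-memory behavior described above.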

Hope this gives you some handles to proceed with this Bjorn!

Cheers,
Steven