All commands for a particular aggregate result in LockAcquisitionFailedException

I’ve tried restarting the applications, and the issue persists.

Configuration:

Spring + JPA
MySQL
HikariCP
Event-Sourced Aggregates
Asynchronous Command Bus
IntervalRetryScheduler
TrackingEventProcessors

All configuration is “out-of-the-box”, with the exception of some simple logging interceptors and correlation data providers.

This is a multi-node deployment that does not distribute commands. Throughput is fairly low, so we’ve been using a RetryScheduler to handle ConcurrencyExceptions, which have been very infrequent.
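
For context, the retrying gateway is wired roughly like this (a simplified sketch; the executor, interval, and retry count here are placeholders rather than our real values):

```java
// Simplified sketch of attaching an IntervalRetryScheduler to the command gateway;
// the executor, interval, and retry count are placeholders.
@Bean
public CommandGateway commandGateway(CommandBus commandBus) {
    IntervalRetryScheduler retryScheduler = IntervalRetryScheduler.builder()
            .retryExecutor(Executors.newSingleThreadScheduledExecutor())
            .retryInterval(500)   // milliseconds between retries
            .maxRetryCount(3)     // give up (and rethrow) after this many attempts
            .build();
    return DefaultCommandGateway.builder()
            .commandBus(commandBus)
            .retryScheduler(retryScheduler)
            .build();
}
```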

This solution has been working for several months without incident. Starting 2 days ago, all commands for a certain aggregate have been throwing:

LockAcquisitionFailedException(Failed to acquire lock for aggregate identifier(SOME IDENTIFIER), maximum attempts exceeded (2147483647))

Commands for this aggregate can be emitted by 2 Sagas - both backed by tracking event processors. There are a few relevant warnings and errors coming from those Sagas:

SQL Error: 1205, SQLState: 40001

Lock wait timeout exceeded; try restarting transaction

Looking at the token_entry table, I can see that this issue is present even when those tracking event processors are on the same node.

The DeadlineManager is used extensively. We schedule certain deadlines based on timestamps provided to us in a REST controller. Many of those timestamps are exactly identical, which could trigger many events simultaneously. Each of those events is handled by a saga, which issues a command to this type of Aggregate. Since we are using the AsynchronousCommandBus, those commands are “fire and forget”, so they could operate concurrently on the same aggregate. But even when I call .join() to make the saga wait on the command execution, the issue still persists.
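
To make that concrete, the saga dispatch looks roughly like this (a simplified sketch with made-up event, command, and property names, and a commandGateway injected into the saga):

```java
// Simplified sketch only; event, command, and property names are made up.
@SagaEventHandler(associationProperty = "aggregateId")
public void on(DeadlineReachedEvent event) {
    // With the AsynchronousCommandBus, send() is normally fire-and-forget;
    // join() blocks the saga until the command has actually been handled (or failed).
    commandGateway.send(new AdjustBasketCommand(event.getAggregateId(), event.getItem()))
                  .join();
}
```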

From what I’ve read in the Google group, it is obviously better to distribute commands. But I was also under the impression that if one can live with managing some ConcurrencyExceptions, then not distributing the commands would be acceptable. Any insight into this issue would be greatly appreciated!

Hi Kramer,

That’s an interesting issue you’re encountering.
Firstly, you state that you have a mostly default configuration of an Axon application.

In any default Axon app, an Aggregate is locked before a thread invokes the command handling function.
Additionally, by default, the locking scheme is such that if a thread fails to acquire the lock within 60 seconds, that’s when you get the exception you are encountering.

This means that there must be a lock on said Aggregate, so at the very least a single command is coming through.
Sadly, however, that command (or subsequent commands) is taking a “tremendously” long time to execute.

Anyhow, can you confirm in some manner that at least a single command for the given Aggregate is handled?
You could, for example, verify this by retrieving the Event Stream for the given Aggregate and checking the timestamp on the Event with the highest sequence number.
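
Something along these lines should do it (a rough sketch; “eventStore” is your configured EventStore and the identifier is the one from the exception message):

```java
// Rough sketch: read the aggregate's event stream and inspect the last event's timestamp.
DomainEventStream stream = eventStore.readEvents("SOME IDENTIFIER");
DomainEventMessage<?> last = null;
while (stream.hasNext()) {
    last = stream.next();
}
if (last != null) {
    System.out.println("Highest sequence: " + last.getSequenceNumber() + " at " + last.getTimestamp());
}
```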

Next to this, does this Aggregate by chance have a huge number of events tied to it?
I am asking because event sourcing said Aggregate increases the time a thread holds the lock on it.
If this is the case, I suggest you set up Snapshotting for the given Aggregate implementation to overcome the issue at hand.

Let’s figure this out Kramer!
Oh, and nice to be talking to you again; hope everything is going well.

Cheers,
Steven

Hey Stephen, thank you for the rapid response!

This aggregate is quite large, both in terms of event history and state. There are about 1100 events in its history, but we do snapshot every 100 events. Conceptually, you can think of it as analogous to the Axon Banking Trade Engine. It’s a long-lived aggregate that stores a TreeSet in its state. Items can be added, and the business logic of adding an item can cause it to remove another. Operations on the TreeSet are O(log N). The size of the TreeSet hovers between 400 and 700 entries. The primary difference is that this aggregate uses a deadline to periodically (daily) remove ~50 items from the TreeSet.
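
The daily cleanup happens in a deadline handler on the aggregate, roughly like this (purely illustrative; the names and payloads are made up, not our actual code):

```java
// Purely illustrative sketch of the daily cleanup described above; names and payloads are made up.
@DeadlineHandler(deadlineName = "dailyCleanup")
public void onDailyCleanup() {
    // Pick the entries that are due for removal and record a single event for the whole batch.
    Set<String> expired = selectExpiredItems();   // hypothetical helper on the aggregate
    AggregateLifecycle.apply(new ItemsRemovedEvent(id, expired));
}
```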

We did at one point see some errors about failing to load the aggregate snapshot, but those errors were not frequent or consistent. We tried to simulate this in a lower environment by creating an aggregate of this type with >1000 events and then deleting the snapshot. It was still able to reload events from the event store. I don’t suspect a memory issue. The whole request took a few seconds, so maybe an occasional failure to load the snapshot could trigger this timeout when there are multiple commands waiting to process?

We couldn’t find any explanation for why the snapshot failed to load. We copied the XML of the snapshot to a file and the XStreamSerializer was able to deserialize it without any issue. So, we know it wasn’t due to any kind of structure-change / XML issue.

Again though, I can’t even find the snapshot-loading error in our logs, and we are riddled with the locking errors.

More config info:

This aggregate is snapshotted
tracking event processor batch size = 20

```java
@Bean
public SpringAggregateSnapshotterFactoryBean snapshotterFactory() {
    return new SpringAggregateSnapshotterFactoryBean();
}

/**
 * Snapshot definition for My Aggregate. This Aggregate lives forever, so reading its entire event
 * history is not safe.
 */
@Bean(MY_AGGREGATE_BASKET_SNAPSHOT)
public SnapshotTriggerDefinition myAggregateSnapshotTrigger(final Snapshotter snapShotter, final AxonProperties properties) {
    return new EventCountSnapshotTriggerDefinition(snapShotter, properties.getSnapshotThreshold());
}
```
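
For completeness (not shown above): the trigger definition is referenced from the aggregate by bean name, roughly like so (the aggregate class name here is a placeholder):

```java
// Illustrative only; the aggregate class name is a placeholder.
@Aggregate(snapshotTriggerDefinition = MY_AGGREGATE_BASKET_SNAPSHOT)
public class MyAggregate {
    // command handlers, event sourcing handlers, deadline handlers, ...
}
```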

Update from what we saw last night:

The event processors for the 2 Sagas that issue commands to this aggregate had completely halted. The token position hadn’t updated in hours. I had added a .join() to the calls to the commandGateway from those Sagas, so the locking exceptions must be bubbling up to the Sagas and preventing them from progressing - which I would think is a good thing from a data-quality perspective.

To test whether this was an issue with having multiple nodes running, we scaled production down to one instance for an hour, but nothing changed.

A few minutes after we scaled the cluster back to 2 instances, the locking issue disappeared, and those sagas processed their entire backlog.

Obviously we still need to find a root cause. This makes me think it is some kind of race-condition / timing-edge-case. That would explain why it has proven impossible to reproduce in lower environments.

Is there any other information about our config or use case that could be relevant?

Hi Kramer,

From what you are explaining about your Aggregate structure, the number of events, and the setup of a Snapshotter, I am pretty sure that it’s not an issue with Aggregate load times that’s causing the Lock Factory to throw the exception you are seeing.

Your added update does not make the problem at hand any clearer either, sadly.
To be honest, it only makes me lean more towards “a hanging process which owns the lock”, without a way to actually tell whether that is the case.

I don’t think we will find the problem at hand by having more information about your configuration or the implementation.
I would much rather see a thread dump of the application at the point in time where you see the lock happening again.
With that info we should be able to deduce which thread has the lock at that moment in time, hopefully bringing us closer to a resolution.

Thus, if the problem occurs again, a thread dump would be valuable to help you out.

Cheers,
Steven

That makes sense. I’ll see what I can do!

I’ve observed that this particularly large aggregate can take upwards of 500ms to process a command, even with reading a snapshot. This typically happens when that aggregate has not been read in a while.

I’d like to go back to your hypothesis that some operation is taking a long time, and that is what is causing the LockAcquisitionFailedException.

Let’s assume that 20 commands for a single aggregate arrive on the command bus in a 3s window. Would you expect 400-800ms per command to be enough delay to cause the issues mentioned in the OP?

In case it is, I’ve mitigated this scenario by going back to the SimpleCommandBus. Now, if there are 20+ sequential events that result in a Saga sending commands to this aggregate, the Saga has to wait for each command to fully process before moving on to the next event. We haven’t seen the issue since. Very interested in your thoughts on whether this plausibly explains the issue.
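
For reference, the mitigation is just swapping out the command bus bean, roughly like this (a simplified sketch rather than our exact code):

```java
// Simplified sketch: override the command bus so commands are handled on the dispatching thread.
@Bean
public CommandBus commandBus(TransactionManager transactionManager) {
    return SimpleCommandBus.builder()
            .transactionManager(transactionManager)
            .build();
}
```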

Thanks!
Kramer

Hi Kramer,

If loading is taking that long, you might consider setting up caching for your Aggregates as well.
To that end, the framework provides the CachingEventSourcingRepository, which you can build by using the EventSourcingRepository.Builder#cache(Cache) method.

When the regular EventSourcingRepository builder is given a Cache, it will build a caching version of the repository for you.
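
A minimal sketch of that wiring, using one of the Cache implementations shipped with the framework (the bean and aggregate names here are illustrative):

```java
// Minimal sketch: give the EventSourcingRepository a Cache so the aggregate is not
// re-sourced from the event store on every command. Names are illustrative.
@Bean
public Repository<MyAggregate> myAggregateRepository(EventStore eventStore,
                                                     SnapshotTriggerDefinition snapshotTrigger) {
    return EventSourcingRepository.builder(MyAggregate.class)
            .eventStore(eventStore)
            .snapshotTriggerDefinition(snapshotTrigger)
            .cache(new WeakReferenceCache())
            .build();
}
```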

I assume setting up caching will speed things up more than moving back to the SimpleCommandBus.

To your question of whether I think your description could be the root cause of the issue… I am not entirely sure.

Could be, but it could very well still be something else.
Regardless, Aggregate loading can be sped up with the aforementioned caching solution in place.
I feel this can be beneficial for large aggregates in any case.

Hope this helps!

Cheers,
Steven

Hey Steven (just realized I’ve been spelling your name wrong, sorry about that!)

Distributing commands is a prerequisite for caching, correct? In this particular case, 98% of commands sent to this aggregate will come from a Saga, which only has one segment. So, with the exception of a small handful of requests handled through the API layer (which are randomly load balanced across nodes), most commands for this aggregate will naturally land on the same instance. Is caching viable here, or is distributing commands a firm requirement?

Or can this be accomplished with a distributed cache, like Redis? If this has been done before, any links or resources would be great!

Thanks!

Hi Kramer,

No worries about my name, I knew whom you wanted to address. :slight_smile:

You are, by the way, right that we recommend using a distributed command bus to ensure consistent routing of commands for a given aggregate.
Doing so ensures your cache won’t go stale, as we describe in the Reference Guide too.

Although the framework will “guard” you in the scenario where a command is sent to a node which does not contain the cache entry, as it will throw a ConcurrencyException when the event is stored, it might not be that nice a solution after all…

I assume setting up a distributed command bus is a no-no for your application, correct?

If so, another tweakable component you could adjust to overcome this LockAcquisitionFailedException is the PessimisticLockFactory, where you can change things like the number of “acquire attempts”, the “maximum queued” threads, and the “lock attempt timeout”.
To set a custom PessimisticLockFactory, you can use the Builder pattern of the Aggregate Repository being used.
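
In code, that would look roughly like this (the values are purely illustrative, not recommendations):

```java
// Roughly: build a custom lock factory and hand it to the aggregate repository.
// The values below are purely illustrative.
PessimisticLockFactory lockFactory = PessimisticLockFactory.builder()
        .acquireAttempts(10)          // maximum number of attempts to acquire the lock
        .queueLengthThreshold(50)     // maximum number of threads allowed to queue for the same lock
        .lockAttemptTimeout(1000)     // timeout per acquire attempt, in milliseconds
        .build();

EventSourcingRepository<MyAggregate> repository =
        EventSourcingRepository.builder(MyAggregate.class)
                .eventStore(eventStore)
                .lockFactory(lockFactory)
                .build();
```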

Cheers,
Steven

New thought.

We have a batch size of 20. Let’s hypothetically say the performance of this aggregate is terrible and it takes 500ms to load. If we have a Saga backed by a TEP which needs to process 20 of these commands before committing the transaction, is it possible we are hitting the claim timeout?
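
For reference on the claim timeout I mean: with a JPA token store it is a property of the TokenStore itself (10 seconds by default, if I’m reading it right), set roughly like this (an illustrative sketch, not our actual config):

```java
// Illustrative sketch only: the claim timeout lives on the token store.
@Bean
public TokenStore tokenStore(EntityManagerProvider entityManagerProvider, Serializer serializer) {
    return JpaTokenStore.builder()
            .entityManagerProvider(entityManagerProvider)
            .serializer(serializer)
            .claimTimeout(Duration.ofSeconds(10))   // the default, shown here only for illustration
            .build();
}
```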

I’m assuming that after the claim timeout, one of the other nodes in the cluster could claim that token and would immediately begin processing the same events? Also, I wonder if the AsynchronousCommandBus with a retry scheduler could really exacerbate the problem: commands retrying on the previous node while the new TEP is attempting to interact with the same aggregate (which, again, is super slow).

Curious if this is plausible.

Thanks,
Kramer

Hi Kramer,

Although that doesn’t sound too far off, I do not think this is the issue.
Although the claim timeout would potentially be exceeded if a single Event handled by a Saga dispatches 20 commands to the slow-loading Aggregate, another node would still hit the write-lock placed on said token.

Unless, of course, the database used does not support this operation.
In that case, this hypothetical scenario would hold.

Cheers,
Steven