Aggregate throwing LockAcquisitionFailedException

Nader_Kahwaji · October 20, 2021, 9:28pm

Hello

I’m using Axon Framework with Spring Boot, with the following dependency:
org.axonframework:axon-spring-boot-starter:4.3.3 (exclude group: 'org.axonframework', module: 'axon-server-connector')

Recently I’ve started noticing certain business workflows halting suddenly, and when I checked the logs I found this exception:

Command 'com.example.MyCommand' resulted in org.axonframework.common.lock.LockAcquisitionFailedException(Failed to acquire lock for aggregate identifier(AGG_ID), maximum attempts exceeded (100))

This started occurring suddenly and it’s getting more frequent with time.

The aggregate is configured to use snapshotting:

@Bean
public SnapshotTriggerDefinition MyAggregateSnapshotTriggerDefinition(Snapshotter snapshotter) {
    return new EventCountSnapshotTriggerDefinition(snapshotter, 200);
}

This aggregate only has a few running instances and they remain alive for a very long period of time (years)

I read that this exception is thrown when one process holds the lock on an aggregate for too long while another command requests the same lock and times out waiting for it.
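
To make that failure mode concrete, here is a minimal plain-JDK sketch of a lock that gives up after a bounded number of timed attempts. It is not Axon's implementation: the class `BoundedLockSketch` and its methods are hypothetical, and the two settings merely borrow the names `acquireAttempts` and `lockAttemptTimeout` that come up later in this thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Plain-JDK model of a pessimistic lock with bounded retries.
// Hypothetical names; this is NOT Axon's PessimisticLockFactory.
class BoundedLockSketch {

    private final ReentrantLock lock = new ReentrantLock();
    private final int acquireAttempts;       // how many timed attempts to make
    private final long lockAttemptTimeoutMs; // how long each attempt may wait

    BoundedLockSketch(int acquireAttempts, long lockAttemptTimeoutMs) {
        this.acquireAttempts = acquireAttempts;
        this.lockAttemptTimeoutMs = lockAttemptTimeoutMs;
    }

    // Runs the task while holding the lock, or throws once all attempts are used up.
    void runExclusively(Runnable task) throws InterruptedException {
        for (int attempt = 0; attempt < acquireAttempts; attempt++) {
            if (lock.tryLock(lockAttemptTimeoutMs, TimeUnit.MILLISECONDS)) {
                try {
                    task.run();
                } finally {
                    lock.unlock();
                }
                return;
            }
        }
        throw new IllegalStateException(
                "Failed to acquire lock, maximum attempts exceeded (" + acquireAttempts + ")");
    }

    // Returns true if a second caller gives up while a slow holder keeps the lock.
    static boolean secondCallerTimesOut() {
        BoundedLockSketch sketch = new BoundedLockSketch(3, 10);
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // The "slow" holder: takes the lock and keeps it until told to release.
        Thread holder = new Thread(() -> {
            try {
                sketch.runExclusively(() -> {
                    holderHasLock.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        holder.start();

        try {
            holderHasLock.await();
            sketch.runExclusively(() -> { }); // 3 attempts of 10 ms each, all fail
            return false;                     // would mean the lock was unexpectedly free
        } catch (IllegalStateException expected) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();              // let the holder finish
        }
    }

    public static void main(String[] args) {
        System.out.println("second caller timed out: " + secondCallerTimesOut());
    }
}
```

In this model the second caller fails purely because the first one is slow to release the lock, which is why the duration of whatever runs under the lock matters more than the retry settings.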

The aggregate does not hold a big amount of data

@Aggregate(snapshotTriggerDefinition = "MyAggregateSnapshotTriggerDefinition")
public class MyAggregate {
   @AggregateIdentifier
   private String aggId;
   private boolean paused;
   private int pausingChangelist;
   private RequestCause pauseCause; //enum
   private Seat seat;
   private Queue<Reservation> reservationQueue;
   private boolean canPauseBuildPool;

  ...

}

NONE of the commands dispatched to this aggregate are sent in “sendAndWait” mode.

All commands have a small payload and there is no heavy computation being done in the command handler methods. They literally just check some boolean flags and raise events.

The event sourcing handlers on the other hand do some logic.
They manipulate the reservation queue by polling and inserting reservations.

@EventSourcingHandler
public void on(CertainEvent event) {
    // poll from queue if not empty
    // raise SeatReservedEvent
}

@EventSourcingHandler
public void on(SeatReservedEvent event) {
    // reserve seat
}

@EventSourcingHandler
public void on(SeatFreedEvent event) {
    //  free the seat 
    // poll from queue
    // if queue not empty -> raise SeatReservedEvent  
}

@EventSourcingHandler
public void on(SeatReservationQueuedEvent event) {
    // add to queue
}

Another weird thing: I checked other posts where this same exception is thrown, and they all seem to have the exact same error message, but mine is the only one that has a different number of attempts (100):

LockAcquisitionFailedException: Failed to acquire lock for aggregate identifier(AGG_ID), maximum attempts exceeded (2147483647)

I read the code of the PessimisticLockFactory and understood that this number (2147483647) represents the number of times a process tried to acquire the lock on an aggregate.
Why is it only 100 in my case? (NO extra config was added on my side.)

How can I solve this issue? How can I monitor the locks on the aggregate? How can I find out which process acquired the lock and wouldn’t release it?

lfgcampos · October 21, 2021, 8:38am

hey @Nader_Kahwaji,

I noticed you also opened the same question on Stack Overflow and I put my answer there!

Here is the link for others: java - Axon throwing LockAcquisitionFailedException - Stack Overflow

KR,

Nader_Kahwaji · October 21, 2021, 4:05pm

Is there a way to increase the number of acquireAttempts or the lockAttemptTimeout?

milendyankov · October 22, 2021, 11:47am

I added an answer on Stack Overflow, but let me also put it here for future reference:

It seems there was a bug in 4.3.x and earlier versions: the message shows the maximumQueued value and not the acquireAttempts value. This commit fixes it, so if you upgrade to a recent version of Axon Framework you should see the correct value.

lfgcampos · October 25, 2021, 1:51pm

hey @Nader_Kahwaji,

I’ve added this bit to Stack Overflow as well, but let me copy it here for you!

Yes, you can configure that… you need to provide your own (Pessimistic)LockFactory and use that on the EventSourcingRepository$Builder#lockFactory method.

KR,

Nader_Kahwaji · October 26, 2021, 8:43pm

Hello @lfgcampos
Is there any documentation I can follow to do this?

Nader_Kahwaji · October 29, 2021, 9:53pm

Hello @lfgcampos @milendyankov
The config suggested above is not straightforward to implement. Is there any documentation that details how to configure the timeout?

lfgcampos · November 1, 2021, 4:16pm

Hi @Nader_Kahwaji,

There is no documentation around it but I can try to give you some idea of how to do that. First of all, recently a new PR was merged which makes things easier and I am going to use that to explain it.

In this case, you need to provide a @Bean of type LockFactory with your desired values. After that, you can set this on the @Aggregate as your lockFactory, something like this:

@Bean
public LockFactory myLockFactory() {
    // your lock factory code and configs here
}

@Aggregate(lockFactory = "myLockFactory")
class MyAggregate {
    // your Aggregate
}
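
For readers who want to see the placeholder filled in, the following is a rough configuration sketch, assuming the builder on Axon's PessimisticLockFactory and the acquireAttempts and lockAttemptTimeout settings named earlier in this thread. The class name `LockFactoryConfig` and the numeric values are purely illustrative assumptions, not recommendations; check the Javadoc of your Axon version for the actual defaults and units.

```java
import org.axonframework.common.lock.LockFactory;
import org.axonframework.common.lock.PessimisticLockFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical configuration class; only the bean name has to match
// the value used in @Aggregate(lockFactory = "myLockFactory").
@Configuration
public class LockFactoryConfig {

    @Bean
    public LockFactory myLockFactory() {
        return PessimisticLockFactory.builder()
                .acquireAttempts(200)    // illustrative value, not a recommendation
                .lockAttemptTimeout(50)  // illustrative value, assumed to be milliseconds per attempt
                .build();
    }
}
```

Raising these values only lets commands wait longer for the lock. It does not address whatever is holding the lock in the first place.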

But, to mention it again, you had better check your CommandHandler and/or EventSourcingHandler methods, as they shouldn’t take that long anyway!

KR,

manishatGit · January 16, 2024, 9:51pm

I encountered the same error with high frequency. I am using GCP AlloyDB, which is essentially a fully managed Postgres (“Postgres on steroids”, as Google says). The mistake I was making was having AlloyDB in a different region than the GKE cluster where the application was running. Networking errors, or errors in the underlying connection pool used for JDBC calls, are the likely reasons behind these errors. If I get this error in the future, the first thing I will investigate is problems with JDBC connection timeouts or connection attempts.