Duplicate sequence ID generated for events against the same aggregate

chris.hern · July 23, 2015, 2:17pm

Hi,

We’re seeing a strange problem when running an application using Axon 1.3.2 to process a high volume of messages in our clustered environment. The error is as follows:

org.axonframework.repository.ConcurrencyException, Concurrent modification detected for Aggregate identifier [188bd324-e445-416c-a301-a9e01e2d2c9f], sequence: [5563]

Which I believe is saying that we are trying to write two events to our event store with the same aggregate identifier and sequence number. However, we aren’t setting the sequence number ourselves and are letting the Axon framework handle that for us.

Our event sourcing repository is defined as:

<axon:event-sourcing-repository id="companyRepository" aggregate-type="net.xxx.xxx.Company" event-bus="eventBus" event-store="companyEventStore" > <axon:snapshotter-trigger id="companySnapshotterTrigger" event-count-threshold="20" snapshotter-ref="userSnapshotter" /> </axon:event-sourcing-repository>

and the event store as:

<axon:jpa-event-store id="companyEventStore" event-serializer="eventStoreSerializer" data-source="dataSource"/

Below is an extract of our logs which shows the problem happening:

2015-07-23 08:21:04,469 INFO [quartzSchedulerStaticPollers_Worker-7] user.services.UserManagementService: Dispatching AddImportedUserCommand for user assill1175208771@everyma1l.us 2015-07-23 08:21:04,470 INFO [quartzSchedulerStaticPollers_Worker-7] command.handler.UserAdminCommandsHandler: Handling AddImportedUserCommand for user {}6d6c99ac-9dec-4df0-8f24-5ba5bfd2a17f 2015-07-23 08:21:04,471 INFO [quartzSchedulerStaticPollers_Worker-7] domain.user.User: Dispatching UserCreatedEvent for assill1175208771@everyma1l.us 2015-07-23 08:21:04,471 INFO [quartzSchedulerStaticPollers_Worker-7] domain.user.RmmUser: Called User domain super constructor for user loginId assill1175208771@everyma1l.us 2015-07-23 08:21:04,512 INFO [quartzSchedulerStaticPollers_Worker-7] user.services.UserManagementService: Dispatching AddImportedUserCommand for user hmorrison1626001654@ma1l2u.com 2015-07-23 08:21:04,514 INFO [quartzSchedulerStaticPollers_Worker-7] domain.user.User: Dispatching UserCreatedEvent for hmorrison1626001654@ma1l2u.com 2015-07-23 08:21:04,514 INFO [quartzSchedulerStaticPollers_Worker-7] domain.user.RmmUser: Called User domain super constructor for user loginId hmorrison1626001654@ma1l2u.com 2015-07-23 08:21:04,528 ERROR [quartzSchedulerStaticPollers_Worker-7] hibernate.util.JDBCExceptionReporter: Duplicate entry '188bd324-e445-416c-a301-a9e01e2d2c9f-5563-RmmCompany' for key 'aggregateIdentifier' ..... org.axonframework.repository.ConcurrencyException, Concurrent modification detected for Aggregate identifier [188bd324-e445-416c-a301-a9e01e2d2c9f], sequence: [5563]

The lines highlighted in green show the successful processing where we have a user created in our system and the bits in yellow highlight an error case where the required user is not created.

When I query our DomainEventEntry table for the aggregateIdentifier 188bd324-e445-416c-a301-a9e01e2d2c9f and sequence number 5563, I can see an event which relates to the user that was processed directly before the error case (i.e. the user highlighted in green).

So, it looks to me like the sequence number generation isn’t working correctly is generating the same sequence number twice for two events which are dispatched against the same repository in quick succession. However, looking at the code which generates the sequence number, I can’t see how this is case considering my understanding is that all of the events are placed on the event bus synchronously by one thread. Therefore it shouldn’t be possible for the second event to be applied before the processing for applying the first one has completed. This is code from the Axon Framework which generates the sequence number:

private long newSequenceNumber() { Long currentSequenceNumber = getLastSequenceNumber(); if (currentSequenceNumber == null) { return 0; } return currentSequenceNumber + 1; }

What could be the reason for this behaviour? As pointed out above, we only see this when running a high volume of traffic through our clustered environment and although the environment is clustered, all of the activity is happening on one node, so that takes out the possibility of multiple nodes accessing repositories, etc. Additionally, it doesn’t happen for all commands we dispatch. E.g. we dispatch 30000 commands to create 30000 users and between 1-10 users will fail with this problem (the number differs every time we run our tests) and the rest will work as expected.

Thanks in advance for your help.

Chris

Allard · July 24, 2015, 9:31am

Hi Chris,

the issue is most likely the Transaction Isolation level of your database (connection). Many databases (MySQL for example) use a default Isolation Level of “Repeatable Read”. When two threads arrive, they start a transaction and load the aggregate. A lock will prevent the second thread from changing the aggregate until the first thread has persisted the new events and released the lock. Then, second thread reads the aggregate, but doesn’t see the newly created events, since they were created after the transaction started. This will result in a duplicate key violation.

Solution is to configure an Isolation Level of “Read Committed” on your datasource. Given you’re using Spring, have a look at the IsolationLevelDataSourceAdapter (http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/jdbc/datasource/IsolationLevelDataSourceAdapter.html).

Cheers,

Allard

chris.hern · July 28, 2015, 9:45am

Hi Allard,

Thanks for the response.

I have tried your suggestion (I applied the setting directly on the data source as opposed to using the Spring IsolationLevelDataSourceAdapter) but we are still experiencing the same problem.

Your explanation mentioned two competing threads, however, from our logs, it doesn’t look there are two competing threads, e.g. you can see in the stack trace above that the thread ‘quartzSchedulerStaticPollers_Worker-7’ is operating on both of the competing users so maybe there is a different underlying cause.

Do you have any other thoughts as to what could be the problem?

Cheers,

Chris

Allard · July 29, 2015, 6:12pm

Hi Chris,

are you sure the logs show a single thread processing the same aggregate? The UUID mentioned in green is different than the one mentioned in the yellow. It could be that thread #7 succesfully modified/imported one aggregate, but failed when executing another. I would check for the logs to see when aggregate “188bd324-e445-416c-a301-a9e01e2d2c9f” of type “RmmCompany” was updated. The green part of the logs are about a user aggregate (?) with ID 6d6c99ac-9dec-4df0-8f24-5ba5bfd2a17f.

Cheers,

Allard

chris.hern · July 30, 2015, 7:25am

Hi Allard,

Yes, I realise that may be confusing.

The command handler, UserAdminCommandsHandler, handles a command to add a user to our system. This involves creating a new User aggregate and adding this to the User repository. After that it then uses an attribute in the handled command to go and load a Company aggregate from the Company repository and tries to add the user to that company. It is in on the Company repository where the ConcurrencyException occurs (as the process of adding the user to the company results in an event being applied form the company repository). The company ID used to load the company repository is the same for both users.

The UUID in the logs is the aggregate ID of the User that is created.

Does that answer your question?

Cheers,

Chris

Allard · July 31, 2015, 1:08pm

Hi Chris,

I didn’t realize until just now that you’re on Axon 1.3.2. That’s a release of over 2 years ago. A lot has happened and has been improved in Axon since that time. This includes some issues in the Locking mechanism, which maight be related to this issue. I would strongly suggest upgrading your codebase to Axon 2.x if you have the time, or otherwise 1.4.

The release notes of Axon 1.4 (http://www.axonframework.org/release-notes/release-notes-1-4/) contain a notification about an issue with locks in the Asynchronous Event Handlers. It is likely to be the cause of your problem.

Cheers,

Allard

chris.hern · August 3, 2015, 9:04am

Hi Allard,

We have upgraded to Axon 1.4 and are still experiencing the same problem. At the moment, we are holding off from upgrading to 2.x.

Do you have any other suggestions?

Cheers,

Chris

Allard · August 3, 2015, 1:21pm

Hi Chris,

I would need to have more details about your configuration. The devil is always in the details…

I am especially interested in knowing where you put your transaction boundaries, what datasource implementation you use, which locking strategy you use, which type of database, etc.

Cheers,

Allard

chris.hern · August 4, 2015, 1:59pm

Hi Allard,

Some of our key config is below:

`
<axon:annotation-config command-bus=“commandBus” event-bus=“eventBus” executor=“taskExecutor” />

<axon:command-bus id="commandBus"> <axon:interceptors> <bean class="org.axonframework.commandhandling.interceptors.SpringTransactionalInterceptor"> <property name="transactionManager" ref="transactionManager" /> </bean> </axon:interceptors> </axon:command-bus>

<axon:event-bus id=“eventBus” />

<task:executor id=“eventBusExecutor” pool-size=“10-25” queue-capacity=“2147483647” rejection-policy=“CALLER_RUNS” />

<axon:event-sourcing-repository id=“aRepository” aggregate-type=“aaa.bbb.ccc.ddd.Object” event-bus=“eventBus” event-store=“objectEventStore”>
<axon:snapshotter-trigger id=“objectSnapshotterTrigger” event-count-threshold=“20” snapshotter-ref=“objectSnapshotter” />
</axon:event-sourcing-repository>

<axon:jpa-event-store id=“objectEventStore” data-source=“dataSource” event-serializer=“eventStoreSerializer” />
`

For data source we are using a standard javax.sql.DataSource data source with transaction isolation set to ‘READ_COMMITTED’. This connects to a MySQL database. All of the repositories we have configured at the moment extend from AbstractAnnotatedAggregateRoot.

In terms of transaction boundaries, I’m not sure what you need to know here. We have several view denormalizer classes which have their own transactional boundaries but apart from that we are not explicitly defining any transactional boundaries ourselves.

Regards,

Chris

Allard · August 6, 2015, 7:15pm

Hi Chris,

so far, the only potential issue in the configuration I see is the lack of a transaction manager on the EventScheduler. Before diving into too much detail, let’s try that and see how it works out.

Cheers,

Allard