Can blacklisted aggregates cause duplicate events?

Hi,

Hope you can help us understand the problem we’re having…

Context
In our logs we’re seeing java.sql.SQLIntegrityConstraintViolationException related to the events being published. It occurs under continuous heavy load on the system. We have a hypothesis as to why this might be happening.

Hypothesis
When the unit of work is committed, an exception gets thrown and the aggregate becomes blacklisted. Before the exception occurred, though, we had already published the event for that command. Then, when the aggregate is processed again, we republish the same event, which causes the listeners to process it again and attempt to insert duplicate entries into the database.

Question
Is it possible that when an aggregate is blacklisted and reprocessed, the same events will be published two or more times?

Thanks,
Pawel

Blacklisting is a mechanism used by the disruptor command bus to ensure that no commands are executed against an aggregate that may be in an invalid state. An aggregate is blacklisted when a commit of a unit of work that contains the aggregate fails.
It seems more likely that the blacklisting is caused by the SQLIntegrityConstraintViolationException, rather than the other way around.
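
The mechanism described above can be sketched roughly like this. This is an illustrative model, not Axon's actual implementation; the class and method names are made up:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of blacklisting: once a unit-of-work commit fails for an
// aggregate, the command bus refuses further commands against that aggregate,
// because its in-memory state may no longer match what is in the store.
public class BlacklistModel {
    private final Set<String> blacklisted = ConcurrentHashMap.newKeySet();

    public void dispatch(String aggregateId, Runnable commit) {
        if (blacklisted.contains(aggregateId)) {
            throw new IllegalStateException("Aggregate " + aggregateId + " is blacklisted");
        }
        try {
            commit.run(); // persist the events produced by this command
        } catch (RuntimeException e) {
            blacklisted.add(aggregateId); // state may be invalid; block further commands
            throw e;
        }
    }

    public static void main(String[] args) {
        BlacklistModel bus = new BlacklistModel();
        try {
            bus.dispatch("order-1", () -> { throw new RuntimeException("commit failed"); });
        } catch (RuntimeException ignored) {
            // first command fails; the aggregate is now blacklisted
        }
        boolean blocked;
        try {
            bus.dispatch("order-1", () -> { });
            blocked = false;
        } catch (IllegalStateException e) {
            blocked = true;
        }
        System.out.println(blocked); // true: subsequent commands are rejected
    }
}
```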

The most common reason for this error is the database’s (default) transaction isolation level. Some databases use repeatable read as the default, which can cause this error. Using an isolation level of read committed will solve it.
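
If the driver default needs to be overridden explicitly, the isolation level can be set per connection through plain JDBC. The URL and credentials below are placeholders; this is a configuration sketch, not runnable as-is:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class ReadCommittedConnection {
    public static Connection open() throws Exception {
        // Placeholder URL/credentials; substitute your own data source.
        Connection conn = DriverManager.getConnection(
                "jdbc:yourdb://localhost/eventstore", "user", "password");
        // Force READ COMMITTED rather than relying on the database default.
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        return conn;
    }
}
```

When a connection pool or Spring transaction management is in use, the same setting is usually exposed as a pool or transaction attribute instead of being set by hand.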

Cheers,

Allard

Allard,

Thanks for the response. Pawel and I work on the same project. It may help to share some more context:

  1. We are running a performance/stress test with an (arguably) large number of aggregates involved (60 million)
  2. We are currently using Oracle for persistence on both the command and read side where READ_COMMITTED is the default. We aren’t overriding the default isolation level in code.
  3. Under very heavy load, we are seeing the constraint violation errors when saving our read side projections. This happens in conjunction with an AggregateBlacklistedException.
  4. When the same test was repeated with client-side throttling (requests being made at a lower rate), we saw no errors.
  5. We do not use a distributed transaction to coordinate persisting to the event store and publishing to the event bus.

This leads us to believe that events are being published more than once on the event bus when an aggregate gets blacklisted. Obviously, we expect events to be published once and only once.

Do let us know if there is something amiss.

Prem

Follow-up:

We have confirmed that events are delivered more than once after an aggregate is blacklisted. Also, Axon publishes to the event bus first and then saves to the event store. Are both of these intentional?

Thanks!
Prem

Hi Prem,

which EventBus implementation are you using? I am assuming SimpleEventBus.
Since you spoke about a KeyViolationException, I assumed it happened in the event store tables; that’s a common problem. However, since it’s in the read model, the aggregate is blacklisted because the publication of the events failed: the transaction is rolled back and the command fails. If you have a retry mechanism, the command will fire again, causing the same events to be published again.
My guess is that it’s the key violation causing the blacklisting and the extra publication of the event, not vice versa.
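
If redelivery cannot be ruled out entirely, a common mitigation is to make the read-side update idempotent: an upsert keyed by the event identifier instead of a blind insert. A minimal in-memory sketch, where the map stands in for the read-side table and its unique key (names are illustrative, not Axon API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent projection update: handling the same event twice must not
// produce a second row or a key violation.
public class IdempotentProjection {
    // Stands in for the read-side table, keyed by event identifier.
    private final Map<String, String> table = new ConcurrentHashMap<>();

    // Returns true if the row was inserted, false if it already existed.
    public boolean handle(String eventId, String payload) {
        return table.putIfAbsent(eventId, payload) == null;
    }

    public static void main(String[] args) {
        IdempotentProjection p = new IdempotentProjection();
        boolean first = p.handle("evt-1", "row data");
        boolean redelivery = p.handle("evt-1", "row data"); // same event again
        System.out.println(first + " " + redelivery); // insert once, then no-op
    }
}
```

Against Oracle, the same effect is typically achieved with a MERGE statement or by catching the unique-constraint violation and treating it as "already processed".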

Cheers,

Allard

Allard,

We are using the ClusteringEventBus. I should’ve been a bit clearer: the AggregateBlacklistedException (caused by a database timeout, since the database is extremely busy) causes the constraint violation on the read side, because the event is published again later when the aggregate is saved again.

Hi Prem,

which version of Axon do you currently use, and which Cluster implementations? 2.4.1 solves an issue with async clusters, which would publish an event before the commit of the unit of work completed. This could cause events to be handled twice. Of course, it would also be handled twice by any SimpleClusters, but they generally operate in the same transaction as the one that timed out (and rolled back).
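
The fix described above can be pictured roughly like this. This is an illustrative model, not Axon's code: events are staged inside the unit of work and handed to asynchronous handlers only after the commit succeeds, so a failed commit followed by a retry cannot lead to double handling.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of publish-after-commit: staged events are only made
// visible to (async) handlers once the unit of work commits successfully.
public class PublishAfterCommit {
    private final List<String> staged = new ArrayList<>();
    private final List<String> published = new ArrayList<>();

    public void apply(String event) {
        staged.add(event); // staged, not yet visible to event handlers
    }

    public void commit(boolean succeeds) {
        if (succeeds) {
            published.addAll(staged); // publish only after a successful commit
        }
        staged.clear(); // on failure, staged events are simply discarded
    }

    public static void main(String[] args) {
        PublishAfterCommit uow = new PublishAfterCommit();
        uow.apply("OrderCreated");
        uow.commit(false);          // commit times out and rolls back
        uow.apply("OrderCreated");  // command is retried
        uow.commit(true);
        System.out.println(uow.published); // published exactly once
    }
}
```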

Does that shed some light on the situation?

Cheers,

Allard