Hanging Axon component when cleaning gaps

I have found a possible bug in Axon, and I'm curious whether someone else has run into (something like) this too:

The situation: we have a distributed system (mostly 2 instances) of a component using Axon, which can receive commands for the same aggregate at the same time. Because of this, you can get unique constraint violations on the event sequence number, as the commands can arrive on both nodes. We need a distributed command bus to handle this correctly, but that is not yet in place. In the meantime we retry the command when this constraint violation happens.
This scenario mostly performs well, but the downside is that it creates gaps in the global index: we use Oracle with a sequence for the global index, and every rolled-back INSERT statement causes a (permanent) gap. We also have some tracking event processors.
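The retry we do on a constraint violation can be sketched roughly as below. This is a minimal illustration, not Axon API: `retryOnConflict` and `isConstraintViolation` are hypothetical names, and a real implementation would inspect SQLException error codes (e.g. ORA-00001 on Oracle) rather than match on the exception message.

```java
import java.util.concurrent.Callable;

// Sketch: retry a command dispatch when a unique-constraint violation
// signals that another node concurrently appended to the same aggregate's
// event sequence. Illustrative only; not Axon's RetryScheduler.
public class ConflictRetry {

    public static <T> T retryOnConflict(Callable<T> dispatch, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return dispatch.call();
            } catch (Exception e) {
                if (!isConstraintViolation(e) || attempt == maxAttempts) {
                    throw e;
                }
                last = e; // the other node won the race; retry reloads state
            }
        }
        throw last;
    }

    // Simplified check; a real version would look at SQLException error codes.
    static boolean isConstraintViolation(Exception e) {
        String msg = String.valueOf(e.getMessage());
        return msg.contains("unique constraint") || msg.contains("ORA-00001");
    }
}
```

Note that each failed attempt still burns a sequence number for the INSERT that was rolled back, which is exactly where the permanent gaps come from.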
Now, what happens when there is a lot of traffic and many commands are arriving is that (one of the) components can completely freeze. They are no longer reachable, event processors halt, and at a certain point the health check cannot reach them, leading to lots of restarts in the Docker Swarm we are using.

Looking at it in detail, it appears this happens (it could be coincidence) when the gap cleaning threshold (250) is reached. So it looks like the cleanGaps method in the JDBC event storage engine could be the culprit.
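For readers unfamiliar with why gaps matter here: a gap-aware tracking token remembers not just the highest global index seen, but also every lower index it never observed, so the set can grow large when rollbacks are frequent. The sketch below illustrates the idea only; it is not Axon's GapAwareTrackingToken, and the class and method names are made up.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of a gap-aware tracking token: tracks the highest global sequence
// number seen plus the set of lower indices that were skipped.
// Illustrative only; Axon's real token and cleanup logic differ.
public class GapToken {
    private long index = -1;                        // highest global index seen
    private final SortedSet<Long> gaps = new TreeSet<>();

    public void advanceTo(long newIndex) {
        for (long i = index + 1; i < newIndex; i++) {
            gaps.add(i);                            // skipped, e.g. by a rolled-back insert
        }
        gaps.remove(newIndex);                      // a gap that turned out to be filled
        if (newIndex > index) {
            index = newIndex;
        }
    }

    // Cleanup drops gaps below a cutoff. With an Oracle sequence, a gap from a
    // rolled-back insert is permanent, so without cleanup the set only grows.
    public void cleanGapsBelow(long cutoff) {
        gaps.headSet(cutoff).clear();
    }

    public int gapCount() { return gaps.size(); }
    public long index() { return index; }
}
```

This also shows why many permanent gaps are costly: the token (and its serialized form in the token entry table) carries every open gap along with it.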

Has anyone else experienced this? It was a real emergency, which we have handled for now with an emergency patch that throttles the commands. This is not a good situation, however, and we know we should look into more permanent solutions to prevent permanent gaps in the first place (e.g. a distributed command bus or AxonDB).
But I think there is also a defect in Axon to fix. We use the latest Axon version, 3.3.4 (we thought this fix would solve it, but it did not: https://github.com/AxonFramework/AxonFramework/issues/704 )


Hi Gerlo,

there is a known issue in the JDBC Event Storage engine when the number of subsequent gaps exceeds the batch size. The problem is that the way to limit the number of results from a query is vendor-specific.

What do you see halting? The event processors, or (primarily) the command handling components?
Which database do you use?
Do you have a thread dump of the (seemingly) halted JVM?



Yes, I know of that issue (we had it here at Kadaster too), but I thought that was more of a problem when there are lots of gaps ahead of the tracking token's index, so the processor cannot catch up any more.
In this situation we see the gaps in the token entry in the table, so the processor has already processed past them… Although there could be a lot of gaps ahead of the token too; in the panic of the moment I didn't do much analysis, I'm afraid :wink:
What I did in the end was stop the application, remove the gaps from the JSON tokens in the table, and start the application again. That seemed to help (that wouldn't be the case if it were that issue, would it?)
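The manual cleanup amounted to emptying the gaps list in each serialized token. A sketch of that idea is below; the JSON shape shown is an assumption, not Axon's exact token serialization, so inspect your token entry table first and only touch it with all processors stopped.

```java
// Sketch: clear the "gaps" array in a serialized tracking token.
// The token's JSON layout here is assumed for illustration; verify against
// your own token entry table before attempting anything like this.
public class TokenGapStripper {
    public static String stripGaps(String tokenJson) {
        // Replace any "gaps":[...] list with an empty list.
        return tokenJson.replaceAll("\"gaps\"\\s*:\\s*\\[[^\\]]*\\]", "\"gaps\":[]");
    }
}
```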
What we saw were stopped event processors, but also a complete slowdown of the app; in the end (after a few minutes) it was not responding at all any more.
Also, I have no thread dump, as I have not managed to reproduce this in another environment yet.
We use Oracle …

P.S. We have started a POC for using AxonDB and AxonHub for this component!

What I also noticed was that everything related to events (command handling, event reading, event processors) seemed halted, but other endpoints (info, health, etc.) responded normally. Could there be a lock on reading (all) events, and what could cause that?

I have now tried it on a local environment, pushed it to its limits with JMeter, and got a lot of gaps, but Axon cleaned them up nicely after some time. No hang-ups…
So I now think it might have nothing to do with the gaps, and that something else was going on. That said, I still think we should avoid the large number of (permanent) gaps; do you agree?

Hi Gerlo,

the problem with a gap is that Axon doesn't know it's permanent when it sees one. Also, storing timing information with each gap would make the tokens incredibly large. Our design assumes gaps are rare, and really, they should be. Axon will not send events to the event bus (/store) unless a Unit of Work is being committed. If your setup has a lot of rollbacks after dispatching these events, you may want to reconsider that design.

We have seen one situation, with another database type, where the database used page locks by default, instead of row locks. That would create a lot of lock contention on the table, as most reads and writes happen in the most recent part of the table. You might want to check those settings in your database.