I have found a possible bug in Axon, and I'm curious whether anyone else has run into something like this:
The situation: we have a distributed system (mostly 2 instances) of a component using Axon, and it can receive several commands for the same aggregate at the same time. Because the commands can arrive on both nodes, you can get unique constraint violations on the event sequence. We need a distributed command bus to handle this correctly, but this is not yet in place. In the meantime we retry the command if this constraint violation happens.
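To make the retry workaround concrete, here is a minimal sketch in plain Java. It is not our actual code and it does not use any Axon API: `CommandRetry`, `SequenceConflictException` and the backoff parameters are hypothetical names, with the exception standing in for whatever your persistence layer throws on the unique constraint violation.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

final class CommandRetry {

    /** Hypothetical stand-in for the unique constraint violation on the event sequence. */
    static final class SequenceConflictException extends RuntimeException {
        SequenceConflictException(String message) {
            super(message);
        }
    }

    /**
     * Runs the command and retries it when it fails with a sequence conflict.
     * Any other exception is propagated immediately.
     */
    static <T> T withRetry(Supplier<T> command, int maxAttempts, long baseBackoffMillis) {
        if (maxAttempts < 1 || baseBackoffMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and backoff >= 0");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return command.get();
            } catch (SequenceConflictException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last conflict
                }
                // Linear backoff plus jitter, so both nodes do not retry in lockstep.
                long jitter = ThreadLocalRandom.current().nextLong(baseBackoffMillis + 1);
                try {
                    Thread.sleep(baseBackoffMillis * attempt + jitter);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", interrupted);
                }
            }
        }
    }
}
```

Note that every failed attempt in a setup like this still performs an INSERT that gets rolled back, so the retry itself does nothing to reduce the number of sequence values that are burned.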
This scenario mostly performs well, but the downside is that it creates gaps in the global identifier: we are using Oracle with a sequence for the global identifier, and every rolled-back INSERT statement causes a (permanent) gap. We also have some tracking event processors.
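The gap mechanism can be shown with a small toy model. This is only an illustration, not database code: `SequenceGapDemo` is a made-up class in which an `AtomicLong` plays the role of the Oracle sequence. A sequence value is consumed as soon as it is drawn, and a rollback does not hand it back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class SequenceGapDemo {

    // Plays the role of the database sequence: values are never handed back.
    private final AtomicLong sequence = new AtomicLong();

    // Global indexes of the inserts that actually committed, in ascending order.
    private final List<Long> committed = new ArrayList<>();

    /** Draws the next sequence value; it only becomes visible if the insert commits. */
    void insert(boolean commit) {
        long globalIndex = sequence.incrementAndGet(); // consumed even on rollback
        if (commit) {
            committed.add(globalIndex);
        }
    }

    /** Returns the indexes missing between the first and the last committed insert. */
    List<Long> gaps() {
        List<Long> gaps = new ArrayList<>();
        long expected = 1;
        for (long index : committed) {
            for (; expected < index; expected++) {
                gaps.add(expected);
            }
            expected = index + 1;
        }
        return gaps;
    }

    public static void main(String[] args) {
        SequenceGapDemo demo = new SequenceGapDemo();
        demo.insert(true);
        demo.insert(false); // rolled back, e.g. after a constraint violation
        demo.insert(true);
        demo.insert(false); // rolled back again
        demo.insert(true);
        System.out.println(demo.gaps()); // prints [2, 4]
    }
}
```

In this model a gap never closes, which is the situation a tracking processor has to cope with: it cannot tell from the index alone whether a missing value belongs to a transaction that is still in flight or to one that was rolled back.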
Now what happens is that when there is lots of traffic and lots of commands are arriving, (one of) the components can freeze completely. It is no longer reachable, the event processors halt, and at a certain point the health check cannot reach it, leading to lots of restarts in the Docker Swarm we are using.
Looking into it in detail, this seems to happen (it could be coincidence) when the gap cleaning threshold (250) is reached. So it looks like the cleanGaps method in the JDBC event storage could be the culprit.
Has anyone else experienced this? This was a real emergency, which we have handled for now with an emergency patch that throttles the commands. This is not a good situation, however, and we know we should look into more permanent solutions to prevent permanent gaps in the first place (e.g. a distributed command bus or AxonDB).
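For anyone wondering what such a throttle could look like, below is one possible sketch. It is an assumption on my side about a reasonable shape, not the exact patch: `CommandThrottle` is a hypothetical wrapper that uses a plain `java.util.concurrent.Semaphore` to cap the number of commands in flight, and rejects a command if no permit becomes free within the timeout.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class CommandThrottle {

    private final Semaphore permits;

    CommandThrottle(int maxConcurrentCommands) {
        // Fair semaphore: waiting callers get permits in arrival order.
        this.permits = new Semaphore(maxConcurrentCommands, true);
    }

    /**
     * Runs the command once a permit is available, or fails if none
     * becomes free within the given timeout.
     */
    <T> T dispatch(Supplier<T> command, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Too many commands in flight");
        }
        try {
            return command.get();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A throttle like this only lowers the rate at which conflicts (and therefore gaps) are produced; it does not remove the underlying race between the two nodes.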
But I think there is also a defect in Axon to fix. We use the latest Axon version, 3.3.4 (we thought the fix for this issue would solve it, which was not the case: https://github.com/AxonFramework/AxonFramework/issues/704 ).