Hi Axon community!
We have been struggling with an issue for quite some time.
The problem is that some event handlers never receive published events. It seems to affect only Sagas and other classes within the same application that use the @EventHandler annotation. The aggregates have never been affected thus far. The problem goes away for some time (from a couple of hours to a few days) after we drop all Axon-related database tables (we are using PostgreSQL). It is intermittent in the sense that the set of affected subscribers varies. Right now only one of our 9 sagas is affected. The previous time we had the issue, it affected all sagas. Since the saga does not receive its starting event, it is never started.
We’re using Spring Boot v2.1.5 and Axon v4.1.1. The application runs on a GKE cluster in GCP. Worth mentioning is that we have never been able to reproduce the issue in any of our non-Kubernetes development environments. Switching from the tracking event processor to the subscribing one seemingly solves the problem, but this is not an option since our application will run as multiple instances in the cluster. However, scaling the number of instances down to 1, i.e. running the application in a single JVM in the cluster, does not solve the problem.
It seems this has something to do with how tokens are handled. However, comparing the token store in an “infected” environment with that of a healthy environment does not show anything suspicious.
One very interesting observation is that in the infected environment we keep seeing this log message (logged at DEBUG level; should it not be considered more severe?):
Unable to claim the token for segment: 0. It is owned by another process
The above is logged by the TrackingEventProcessor, in seemingly all event processor threads, not only the one that belongs to the affected Saga. The message is not observed in healthy environments. Is this inability to claim tokens something that Axon is meant to be able to recover from?
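To make the question concrete, here is how I understand the token claim rule. This is only a sketch of my own mental model, not Axon's code: the class TokenClaimSketch, the TokenRow type, the node names, and the 10-second timeout are all hypothetical and chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the claim rule; none of these types are Axon classes.
public class TokenClaimSketch {

    // Simplified stand-in for one row of the token table.
    static final class TokenRow {
        final String owner;        // null when no node holds the claim
        final Instant lastTouched; // last time the owner claimed or refreshed the token

        TokenRow(String owner, Instant lastTouched) {
            this.owner = owner;
            this.lastTouched = lastTouched;
        }
    }

    // Assumed rule: a node may claim a token if it is unowned, if the node
    // already owns it, or if the current owner's claim has timed out.
    static boolean mayClaim(TokenRow row, String nodeId, Duration claimTimeout, Instant now) {
        if (row.owner == null || row.owner.equals(nodeId)) {
            return true;
        }
        return row.lastTouched.plus(claimTimeout).isBefore(now);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-06-01T12:00:00Z");
        Duration timeout = Duration.ofSeconds(10); // assumed value, for illustration only

        TokenRow stale = new TokenRow("node-a", now.minusSeconds(60));
        TokenRow fresh = new TokenRow("node-a", now.minusSeconds(2));

        // A claim that has not been refreshed within the timeout can be taken over.
        System.out.println(mayClaim(stale, "node-b", timeout, now));
        // A recently refreshed claim is refused, which is where I would expect the message above.
        System.out.println(mayClaim(fresh, "node-b", timeout, now));
    }
}
```

If this model is roughly right, a refused claim should resolve itself once the owning process stops refreshing the token, which is why seeing the message persistently, even with a single instance, puzzles us.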
I would be thankful for some guidance in the right direction towards solving this long-lived issue!
Best regards,
Andreas