Event handlers not receiving events

Hi Axon community!

We have been struggling with an issue for quite some time.
The problem is that some event handlers never receive published events. It seems to affect only Sagas and other classes within the same application that use the @EventHandler annotation; the aggregates have never been affected thus far. The problem goes away for some time (from a couple of hours to a few days) after we drop all Axon-related database tables (we are using PostgreSQL). It is intermittent in the sense that the set of affected subscribers varies: right now only one of our 9 sagas is affected, whereas the previous time it affected all sagas. Since the saga does not receive the starting event, it is never started.

We’re using Spring Boot v2.1.5 and Axon v4.1.1. The application runs on a GKE cluster in GCP. Worth mentioning is that we have never been able to reproduce the issue in any of our non-Kubernetes development environments. Switching from the tracking event processor to the subscribing one seemingly solves the problem, but this is not an option since our application will run as multiple instances in the cluster. However, scaling the number of instances down to 1, i.e. running the application in a single JVM in the cluster, does not solve the problem.
It seems this has something to do with how tokens are handled. However, comparing the token store in an “infected” environment with that of a healthy environment does not show anything suspicious.

One very interesting observation is that in the infected environment we keep seeing this log print (printed on DEBUG level - should it not be considered more severe?):
Unable to claim the token for segment: 0. It is owned by another process

The above is printed by the TrackingEventProcessor, in seemingly all event processor threads, not only the one that belongs to the affected Saga. The log print is not observed in healthy environments. Is this inability to claim tokens something that Axon is meant to be able to recover from?

I would be thankful for some guidance in the right direction towards solving this long-lived issue!

Best regards,
Andreas

An update:
It turned out that the root cause of this problem was another instance of the application accidentally using the same database. After correcting this, the events are now received by all subscribers. However, the log message is still printed. Is this something we can ignore, or what is the deal?
Unable to claim the token for segment: 0. It is owned by another process

Best regards,
Andreas

Hi Andreas,

Happy to hear you’ve resolved the problem at hand!
From your description, I assumed that it was something along the lines of a process keeping the token claim for itself.
I’ve seen this before, and it does not seem too far off in a Kubernetes environment either.
As Kubernetes typically is able to kill and start new instances of an application, this will likely influence the token claims those applications might have too.

For your last question though, regarding the message being logged.
That’s intentional, yes.

If you have several instances of the same application, this means you have several instances of a given Tracking Event Processor.
As each Tracking Event Processor thread is required to have a claim on a token, they will periodically try to claim one to perform work.
When this is not possible, an ‘UnableToClaimTokenException’ is thrown, which is caught by the Tracking Event Processor and logged as the message you’re seeing.
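
To make the claim mechanism a bit more concrete, here is a minimal, self-contained sketch of the idea. It is a simplified model and not Axon's actual TokenStore or TrackingEventProcessor code: the class names, the in-memory map, the instance names and the 10 second claim timeout are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of token claiming; NOT Axon's real implementation. */
public class TokenClaimSketch {

    /** Hypothetical stand-in for Axon's UnableToClaimTokenException. */
    static class ClaimRefusedException extends RuntimeException {
        ClaimRefusedException(String message) {
            super(message);
        }
    }

    /** Toy token store: one claim per segment, as in a shared token table. */
    static class InMemoryTokenStore {
        private final long claimTimeoutMillis; // assumed value, see main()
        private final Map<Integer, String> owners = new HashMap<>();
        private final Map<Integer, Long> lastTouched = new HashMap<>();

        InMemoryTokenStore(long claimTimeoutMillis) {
            this.claimTimeoutMillis = claimTimeoutMillis;
        }

        /** Claims the segment, or throws if another owner holds a live claim. */
        synchronized void claim(int segment, String owner, long nowMillis) {
            String current = owners.get(segment);
            boolean expired = current != null
                    && nowMillis - lastTouched.get(segment) > claimTimeoutMillis;
            if (current != null && !current.equals(owner) && !expired) {
                throw new ClaimRefusedException("Unable to claim the token for segment: "
                        + segment + ". It is owned by another process");
            }
            // Free, already ours, or expired: (re)claim and refresh the timestamp.
            owners.put(segment, owner);
            lastTouched.put(segment, nowMillis);
        }
    }

    /** One cycle of a processor thread: try to claim, log and skip on failure. */
    static boolean tryToWork(InMemoryTokenStore store, int segment,
                             String owner, long nowMillis) {
        try {
            store.claim(segment, owner, nowMillis);
            return true; // this instance would now handle events for the segment
        } catch (ClaimRefusedException e) {
            System.out.println("DEBUG " + owner + ": " + e.getMessage());
            return false; // another instance is doing the work; retry later
        }
    }

    public static void main(String[] args) {
        InMemoryTokenStore store = new InMemoryTokenStore(10_000);
        System.out.println(tryToWork(store, 0, "instance-A", 0));      // true
        System.out.println(tryToWork(store, 0, "instance-B", 1_000));  // false
        // instance-A stopped refreshing its claim, so it eventually expires.
        System.out.println(tryToWork(store, 0, "instance-B", 20_000)); // true
    }
}
```

In this model the refused claim is the normal outcome for every instance except the one currently holding the segment, which is why the message keeps appearing as soon as more than one instance shares a token store.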

Note that this is a DEBUG level log message and thus would typically not pop up in a production environment.

Hope this helps you out Andreas!

Cheers,
Steven