Token entry with an unknown owner seems to cause a stuck processor

Hi, I am in a situation where a token_entry has an unknown owner. The table looks somewhat like this:

segment  owner
1        owner-2
2        owner-2
3        owner-2
4        owner-1
5        null
6        null
7        …

Owner “owner-1” is unknown; it might be left over from a previous JVM deployment that crashed. The app is running as a single instance. We are now facing the problem that the owner-1 entry is not being updated/processed. All entries after it have a null owner and are also not being updated/processed. All these entries belong to the same processor name. It seems that because of this situation certain events are not being processed; I think these events belong to the owner-1 and null token_entries.

My understanding of Axon is not quite there yet…

Given this situation, which seems like a problem, I have a few questions:

  • Will the entry of the owner-1 eventually be picked up by owner-2?
  • Will this then also cause the null owner entries to be picked up by owner-2?
  • If the token_entry situation recovers somehow, will past unprocessed events be processed retroactively?
  • Is there a way to monitor stuck token_entries?

Hi Rafiek,

To properly answer the questions, it’s relevant to know which event processor is used and whether they are all part of the same processing group.

If a token entry isn’t updated within the claim timeout (10 seconds by default, the claimTimeout on the token store), it can be picked up by an owner with another name. But the token is tied to a processing group, so it will only be picked up by the same processing group. And in case the Tracking Event Processor (TEP) is used, there need to be free threads so that the segment can be picked up. The Pooled Streaming Event Processor (PSEP) doesn’t have this limitation.
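
For reference, a minimal sketch of where that claim timeout can be tuned, assuming a JPA-based token store registered as a Spring bean (package names differ slightly between the javax and jakarta variants of recent Axon versions):

```java
import java.time.Duration;

import org.axonframework.common.jpa.EntityManagerProvider;
import org.axonframework.eventhandling.tokenstore.TokenStore;
import org.axonframework.eventhandling.tokenstore.jpa.JpaTokenStore;
import org.axonframework.serialization.Serializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TokenStoreConfig {

    // Overrides the auto-configured token store so the claim timeout is explicit.
    // The EntityManagerProvider and Serializer beans come from the Axon starter.
    @Bean
    public TokenStore tokenStore(EntityManagerProvider entityManagerProvider, Serializer serializer) {
        return JpaTokenStore.builder()
                            .entityManagerProvider(entityManagerProvider)
                            .serializer(serializer)
                            .claimTimeout(Duration.ofSeconds(10)) // the default; raise it if claims expire too eagerly
                            .build();
    }
}
```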

If you use Axon Server, you can see unclaimed segments, and thus stuck token_entries, in the UI.

Thanks for the quick reply, Gerard!

They all use the default event processor, the TrackingEventProcessor. They are all part of the same processing group. The configuration is in a property file:

initial-segment-count: 50

I think this means there are also 50 threads. In the log file I see that threads 0 to 9 are used; I assume threads 10 to 49 are just idle. Token entries 0 to 26 are being updated, but entry 27 with the unknown owner and entries 28 to 49 are not.

If you didn’t set a thread-count yourself and are using a TEP, it should indeed have a maximum of 50 threads. But if that were the case, all 50 segments should have owner-2, so you might want to investigate whether this really is the case. Threads being idle while there are unclaimed tokens should not happen; they should at the very least claim and extend the token, even when there are no events to be processed.

I want to add here that 50 segments and 50 threads for a given Event Processor sounds like a lot.
We typically see that for regular event handling, something around 16 segments and threads on a TrackingEventProcessor is sufficient.
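
As an illustration, a minimal sketch in Java config (“my-processor” is a placeholder; the same can be expressed through the axon.eventhandling.processors.* properties):

```java
import org.axonframework.config.ConfigurerModule;
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TepSizingConfig {

    @Bean
    public ConfigurerModule tepSizing() {
        return configurer -> configurer.eventProcessing(
                ep -> ep.registerTrackingEventProcessorConfiguration(
                        "my-processor",
                        config -> TrackingEventProcessorConfiguration
                                .forParallelProcessing(16)       // number of threads for this TEP
                                .andInitialSegmentsCount(16)));  // only applies when the tokens are first created
    }
}
```

Note that the initial segment count only takes effect when the token entries are created for the first time; existing segments have to be split or merged to change their number.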

If you want more flexibility in segment sharing between instances of your application (warranting a higher number of segments), it is recommended to switch to a PooledStreamingEventProcessor (PSEP for short) instead of a TrackingEventProcessor (TEP for short).

The PSEP does not have the one-thread-one-segment predicament at all. Hence, the number of threads per application instance can be lower.
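
For what it’s worth, a minimal sketch of moving a single processor (the name is a placeholder) over to a PSEP:

```java
import org.axonframework.config.ConfigurerModule;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class PsepConfig {

    @Bean
    public ConfigurerModule psepModule() {
        return configurer -> configurer.eventProcessing(
                // Registers "my-processor" as a PooledStreamingEventProcessor with default settings.
                // Alternatively, ep.usingPooledStreamingEventProcessors() makes the PSEP the default for all groups.
                ep -> ep.registerPooledStreamingEventProcessor("my-processor"));
    }
}
```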


Thank you for the tip. We are gaining new Axon insights every week, and our latest version of the app indeed uses a PooledStreamingEventProcessor with far fewer segments and threads. We also tried to find a good batchSize by monitoring capacity and latency on the event store and event bus. We are using a Postgres DB instead of Axon Server.

The status however is that the latest version is still in our development environment and it will take some time before we have it in production.

Yesterday we restarted the app in production and it did take ownership of the “forgotten” token entries. It also processed events that were “hanging”. Surprisingly, it also processed events that had already been processed. The events were (re-)processed by a Saga, and the sequencing policy is based on a property in the events.

  • Not sure why some events were re-processed, any idea?
  • We are not using Axon Server; is there a Micrometer way of monitoring the token entries?
  • Is there a way to monitor events that have been re-processed?

Hi @rafiek,
we are in a similar setup with PostgreSQL as event store. I cannot answer your questions, but we are monitoring our token entries for the TEPs on our own. We have a scheduled processorStatusUpdater (every 30s) which reads all token entries from the token store and then writes the results with a Micrometer MeterRegistry to AWS CloudWatch.
We are about to integrate a bit more monitoring from the reference guide (Metrics - Axon Reference Guide) into our application and will then experiment with the metrics.
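
As a rough illustration of that approach, a simplified sketch (names are made up; it assumes Spring scheduling, a JdbcTemplate pointing at the database that holds the token_entry table, a Micrometer MeterRegistry, and Axon’s default schema where the timestamp column holds ISO-8601 strings):

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ProcessorStatusUpdater {

    private final JdbcTemplate jdbcTemplate;
    private final AtomicInteger staleTokenCount = new AtomicInteger();

    public ProcessorStatusUpdater(JdbcTemplate jdbcTemplate, MeterRegistry meterRegistry) {
        this.jdbcTemplate = jdbcTemplate;
        // Exposed as a gauge; a CloudWatch-backed MeterRegistry ships it to AWS.
        meterRegistry.gauge("axon.token.entries.stale", staleTokenCount);
    }

    // Every 30s, count token entries that are unclaimed or whose claim has not been
    // extended recently, which is what a "stuck" segment looks like in the store.
    @Scheduled(fixedRate = 30_000)
    public void updateTokenMetrics() {
        Instant threshold = Instant.now().minusSeconds(30);
        List<Map<String, Object>> rows =
                jdbcTemplate.queryForList("SELECT owner, timestamp FROM token_entry");
        long stale = rows.stream()
                         .filter(row -> row.get("owner") == null
                                 || Instant.parse((String) row.get("timestamp")).isBefore(threshold))
                         .count();
        staleTokenCount.set((int) stale);
    }
}
```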

I am interested in how your PSEPs are performing, as we have also thought about switching from TEPs.
Currently we have only 8 threads configured, but need more parallel processing.

Although we can, sadly, only guess at this stage, I have two hunches in mind right now:

  1. The event handling task wasn’t part of a transaction. The token update could then fail while the task itself was processed regardless. Sagas, or event handlers performing side effects (like dispatching a command), are susceptible to this, as the tasks they perform typically cannot be rolled back.
  2. The conclusion that re-processing occurred is based on logging. Although the log statement correctly shows the handler was invoked several times, that does not necessarily mean a successfully completed task was executed again, as a log statement typically cannot be rolled back. So, just as in the previous example, the event handler may have been invoked while the token couldn’t or didn’t update, which means the task’s Unit of Work was rolled back and the event was offered again.

However, whether either of these options is the case is hard to deduce.
Perhaps if you share how you deduced that events have been handled twice, we can come to a conclusion.

In any case, it does exemplify why idempotency is recommended when it comes to message handling in general. I understand that’s easier said than done, but it would cover any error case, whether it occurs automatically or through manual intervention, so that it does not impact your system.
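
As a sketch of what that could look like in an event handler (purely illustrative: the catch-all handler, the injected message identifier, and the in-memory set standing in for a durable dedup store are all placeholders):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.axonframework.eventhandling.EventHandler;
import org.axonframework.messaging.annotation.MessageIdentifier;

public class NotificationEventHandler {

    // In a real system this would be a durable store participating in the
    // handler's transaction rather than an in-memory set.
    private final Set<String> handledMessageIds = ConcurrentHashMap.newKeySet();

    @EventHandler
    public void on(Object event, @MessageIdentifier String messageId) {
        // If this message was seen before, skip the side effect so a redelivery
        // after a rolled-back Unit of Work does not execute it twice.
        if (!handledMessageIds.add(messageId)) {
            return;
        }
        // ... perform the non-rollback-able side effect (send an email, dispatch a command, ...)
    }
}
```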

Axon Framework does not have anything to suggest that re-processing is taking place.
It uses the TrackingTokens to define (1) where it is and (2) who’s working on it.
It is the token that allows it to differentiate between what’s been handled and what’s not.
As the token only stores the position of the last handled event, it cannot tell you whether any individual event has been handled more than once.

The sole thing that may impact event processing once the tokens are set is the Sequencing Policy.
Granted, it will only do so starting from the token’s position.
Hence, any events preceding the globalIndex value of the token in question would not be impacted by that change. If one segment is behind compared to the others, switching the Sequencing Policy may cause events that previously belonged to “Segment Foo” to now belong to “Segment Bar.” Thus, depending on the positions of “Foo” and “Bar,” this may incur re-processing.
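
For completeness, the sequencing policy is registered per processing group; a minimal sketch (the group name is a placeholder, and the SequentialPerAggregatePolicy shown here is already the default):

```java
import org.axonframework.config.ConfigurerModule;
import org.axonframework.eventhandling.async.SequentialPerAggregatePolicy;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SequencingPolicyConfig {

    @Bean
    public ConfigurerModule sequencingPolicyModule() {
        return configurer -> configurer.eventProcessing(
                // Determines how events within "my-processing-group" are assigned to segments.
                ep -> ep.registerSequencingPolicy(
                        "my-processing-group",
                        config -> SequentialPerAggregatePolicy.instance()));
    }
}
```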

Further note that Saga Event Handling does not follow a configured SequencingPolicy at all.
Sagas are locked per segment based on a consistent hashing algorithm performed on the identifier of the saga.
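
To make that concrete, a hypothetical saga sketch (the event types and association property are made up): events reach a saga instance through the association value declared on the handler, while the segment that processes them is derived from the saga’s identifier rather than from a SequencingPolicy.

```java
import org.axonframework.modelling.saga.EndSaga;
import org.axonframework.modelling.saga.SagaEventHandler;
import org.axonframework.modelling.saga.StartSaga;
import org.axonframework.spring.stereotype.Saga;

@Saga
public class PaymentSaga {

    // Starts a saga instance associated with the event's paymentId value.
    @StartSaga
    @SagaEventHandler(associationProperty = "paymentId")
    public void on(PaymentRequestedEvent event) {
        // ... dispatch commands, schedule deadlines, etc.
    }

    // Routed to the same saga instance via the same association value.
    @EndSaga
    @SagaEventHandler(associationProperty = "paymentId")
    public void on(PaymentCompletedEvent event) {
        // ... final side effects
    }

    // Illustrative event payloads carrying the association property.
    public record PaymentRequestedEvent(String paymentId) {}
    public record PaymentCompletedEvent(String paymentId) {}
}
```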

Concluding

I imagine the above doesn’t give you an immediate solution, @rafiek.
I am, of course, up for further discussing the scenario you encountered.