Event processor segment broken?

Hello,

I’ve been running Axon Server SE 4.4.8 with Axon Framework for a little while now, and I’ve run into a problem with event processor segments that I just can’t figure out.

We have configured an event processor with multiple threads and multiple segments. Somehow, one of the segments seems to have gotten into an unrecoverable state. It has stalled, is no longer making progress (it doesn’t appear to be processing events at all, not the usual case of a handler repeatedly failing on a specific event), and any management operations we try to perform on the segment fail (meaning we can’t split or merge the segment).

The Axon Server admin UI shows the following error in relation to the segment:


“Processor in error state: java.lang.IllegalStateException: Unable to split the given segmentId, as the mask exceeds the max mask size.”

And if we try to merge the segment, the UI doesn’t do anything, but we get the following error in the Axon Server logs:

axon-server-0] 2022-02-08 19:43:57.974  INFO 1 --- [io-8024-exec-11] 
A.i.a.a.r.EventProcessorRestController   : [<anonymous>@default] Request to merge segment of event processor "app.txkt.api.ui.messenger.graphql.schema.conversations" in component "bff-1".
[axon-server-0] 2022-02-08 19:44:03.729  WARN 1 --- [ruction-result1] i.a.a.c.processor.EventProcessorService  : The following operation is taking to long: merge_event_processor_segment {
[axon-server-0]   processor_name: "app.txkt.api.ui.messenger.graphql.schema.conversations"
[axon-server-0]   segment_identifier: 21
[axon-server-0] }
[axon-server-0] instruction_id: "fb367b48-47ca-4c84-aa9d-e1a698ea1fd3"
[axon-server-0]

And in the application logs:

api-ui-messenger-9bcdff4c6-9x28x] 2022-02-08 20:40:07.586  INFO 1 --- [versations]-515] o.a.e.TrackingEventProcessor             : Fetched token: MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=MergedTrackingToken{lowerSegmentToken=IndexTrackingToken{globalIndex=166193232}, upperSegmentToken=IndexTrackingToken{globalIndex=166179281}}, upperSegmentToken=IndexTrackingToken{globalIndex=166260699}}, upperSegmentToken=IndexTrackingToken{globalIndex=166179800}}, upperSegmentToken=IndexTrackingToken{globalIndex=166263416}}, upperSegmentToken=IndexTrackingToken{globalIndex=166207202}}, upperSegmentToken=IndexTrackingToken{globalIndex=166295327}}, upperSegmentToken=IndexTrackingToken{globalIndex=166181120}}, upperSegmentToken=IndexTrackingToken{globalIndex=166234831}}, upperSegmentToken=IndexTrackingToken{globalIndex=166265671}}, upperSegmentToken=IndexTrackingToken{globalIndex=166267345}}, upperSegmentToken=IndexTrackingToken{globalIndex=166257993}}, upperSegmentToken=IndexTrackingToken{globalIndex=166206421}}, upperSegmentToken=IndexTrackingToken{globalIndex=166185475}}, upperSegmentToken=IndexTrackingToken{globalIndex=166294096}}, upperSegmentToken=IndexTrackingToken{globalIndex=166189327}}, upperSegmentToken=IndexTrackingToken{globalIndex=166246358}}, upperSegmentToken=IndexTrackingToken{globalIndex=166313078}}, upperSegmentToken=IndexTrackingToken{globalIndex=166208722}}, upperSegmentToken=IndexTrackingToken{globalIndex=166211296}}, upperSegmentToken=IndexTrackingToken{globalIndex=166179281}}, upperSegmentToken=IndexTrackingToken{globalIndex=166197957}}, upperSegmentToken=IndexTrackingToken{globalIndex=166181516}}, upperSegmentToken=IndexTrackingToken{globalIndex=166191010}}, upperSegmentToken=IndexTrackingToken{globalIndex=166179281}}, upperSegmentToken=IndexTrackingToken{globalIndex=166179281}}, 
upperSegmentToken=IndexTrackingToken{globalIndex=166179281}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713811}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713854}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713854}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713863}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713867}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713867}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713867}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713867}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713867}}, upperSegmentToken=IndexTrackingToken{globalIndex=168713901}}, upperSegmentToken=IndexTrackingToken{globalIndex=168714019}}, upperSegmentToken=IndexTrackingToken{globalIndex=168714099}}, upperSegmentToken=IndexTrackingToken{globalIndex=168714160}}, upperSegmentToken=IndexTrackingToken{globalIndex=168714210}}, upperSegmentToken=IndexTrackingToken{globalIndex=168714301}}, upperSegmentToken=IndexTrackingToken{globalIndex=168714345}}, upperSegmentToken=IndexTrackingToken{globalIndex=169215396}}, upperSegmentToken=IndexTrackingToken{globalIndex=169235705}}, upperSegmentToken=IndexTrackingToken{globalIndex=169252532}}, upperSegmentToken=IndexTrackingToken{globalIndex=169359423}}, upperSegmentToken=IndexTrackingToken{globalIndex=169371933}} for segment: Segment[22/1023]
[api-ui-messenger-9bcdff4c6-9x28x] 2022-02-08 20:40:07.641  WARN 1 --- [versations]-515] o.a.e.TrackingEventProcessor             : Releasing claim on token and preparing for retry in 60s
[api-ui-messenger-9bcdff4c6-9x28x] 2022-02-08 20:40:07.643  INFO 1 --- [versations]-515] o.a.e.TrackingEventProcessor             : Released claim

Other segments seem to be working properly and can be split and merged.

Hey @jcwil and welcome to the forum!

That’s a curious exception you’re getting there actually.
Can’t say I’ve seen anybody receive it in the wild.

What I spot in the screenshot, however, is that you're using an extreme number of segments.
Or, at least in my book.
As a general rule of thumb, I would typically not recommend exceeding 64 segments.
There are exceptions, of course, but regardless, I'd wager that having segments that each control 1/32768 of the stream is not what you need.
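
For reference, if you configure the processor in code rather than through properties, the thread count and initial segment count are typically set along these lines. This is only a rough sketch, assuming Axon Framework 4.x and a tracking processor; the processor name is taken from your logs, and the numbers are placeholders.

import org.axonframework.config.Configurer;
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration;

public class ProcessorSettings {

    public void configure(Configurer configurer) {
        configurer.eventProcessing()
                  .registerTrackingEventProcessorConfiguration(
                          "app.txkt.api.ui.messenger.graphql.schema.conversations",
                          config -> TrackingEventProcessorConfiguration
                                  // Threads this node dedicates to the processor (placeholder value).
                                  .forParallelProcessing(4)
                                  // Segments created on the very first start (placeholder value).
                                  .andInitialSegmentsCount(4));
    }
}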

By the way, you are receiving this exception because of a technical limitation of segments.
The Segment implementation uses bit masking based on an integer.
As the exception suggests, the maximum mask size has been reached.
As such, that segment cannot be split any further.

Granted, as I stated earlier, I don't think you need that many segments anyhow :slight_smile:
In any case, I would monitor your application and check whether the increased number of segments actually pays off performance-wise.
If it doesn't, there's no reason to increase the number.

Apart from the above, I am curious what the intent was behind having so many segments.
Would you be able to shed some light on that, @jcwil?

Hi @Steven_van_Beelen,

Yes, I agree that it is a lot of segments. Way more than we would normally have configured.

We had gotten ourselves into a situation where an event processor was failing, but the failure had slipped through our monitoring, so we ended up with a backlog of tens of millions of events for this processor that we needed to work through. The events did not have any causal dependencies, but the per-event workload was very uneven, since it involves external API calls, so we thought it safe to temporarily increase the segment and thread count while the backlog was being consumed.

We did all of the splitting through the Axon Server UI. We set the upper limit to 100 threads and started splitting. Oddly, we noticed that the splitting seemed to be very uneven. Some segments would be split down to the crazy numbers you're seeing, whereas others seemed to be "happy" at 1/32 and would not split further, at least with the automatic split management the UI "+" button provides. Trying to understand this behaviour, I noted that the documentation mentions that the "largest" segment is the one chosen for splitting. I had assumed this meant the segment with the largest share of the processing (e.g. 1/2 would be split before 1/4 is considered), but since that didn't seem to match what we observed, I wondered whether "largest" in this case meant the segment with the largest queue to work through. I didn't arrive at an answer that consistently explained the behaviour, except possibly that we were clicking "+" faster than the splits were being processed, so the same segment may have been queued for another split because it was still considered the largest.

Unfortunately, we now have what looks like a broken segment that we can't merge, which is blocking both processing and merging back down to a regular number. In trying to think through possible ways to get past this, the only options I could come up with are: switching to a Pooled Streaming processor now, if that's even possible; merging the segment some other way via an API call; or, as a last resort, manually fiddling with the token offsets for this stream to somehow kill it.

Hi @jcwil,

Looking at the Axon Server source code, we can see that it should be splitting the smallest segment, which is the segment that has been split the least. It's odd that you experienced different behaviour; I'll conduct some field tests on this today.
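
Purely to illustrate what "smallest" means under that reading (again, this is not the actual Axon Server selection code): the candidate for the next split would be the segment with the smallest mask, i.e. the one that has been split the fewest times and therefore still covers the largest share of the stream.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of "split the segment that has been split the least".
// The real Axon Server selection logic may differ; this only shows the idea.
class SplitCandidateSketch {

    record SegmentInfo(int segmentId, int mask) {}

    static Optional<SegmentInfo> nextSplitCandidate(List<SegmentInfo> segments) {
        // The smaller the mask, the fewer times the segment has been split,
        // and the larger the share of the event stream it still covers.
        return segments.stream().min(Comparator.comparingInt(SegmentInfo::mask));
    }
}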

As for the stuck token: it contains 47 tokens merged into one. These 47 tokens still track their own segments and have not yet been resolved. The MergedTrackingToken Javadoc describes it as follows:

Special Wrapped Token implementation that keeps track of two separate tokens, of which the streams have been merged into a single one. This token keeps track of the progress of the two original "halves", by advancing each individually, until both halves represent the same position.

For a MergedTrackingToken to collapse back into a normal IndexTrackingToken, its lower and upper tokens need to reach the same position. Since this one is 47 layers deep, with the smallest index at 166193232 and the largest at 169371933, there is a gap of roughly 3.2 million events to work through, which will take time. This leads me to advise one of the following actions:

  • Increase the transaction timeout. This might give the processor the time it needs to resolve the merged token.
  • Switch to a Pooled Streaming Event Processor (PooledStreamingEventProcessor). It handles tokens somewhat differently, so this might help, although I cannot guarantee it will work.
  • Delete the token and replay from a certain point (by providing an initialTrackingToken). Are the events idempotent? Judging from your statement about it calling an API, I'm guessing not? For the last two options, see the sketch after this list.
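
For those last two options, the registration could look roughly like this. This is a sketch assuming Axon Framework 4.5+ and a processor configured in code; the processor name comes from your logs, while the segment count and the replay timestamp are placeholders you would have to choose yourself.

import java.time.Instant;

import org.axonframework.config.Configuration;
import org.axonframework.config.Configurer;

public class PooledProcessorSettings {

    public void configure(Configurer configurer) {
        configurer.eventProcessing()
                  .registerPooledStreamingEventProcessor(
                          "app.txkt.api.ui.messenger.graphql.schema.conversations",
                          Configuration::eventStore,
                          (config, builder) -> builder
                                  // Back to a sane number of segments once the old tokens are gone (placeholder).
                                  .initialSegmentCount(16)
                                  // The timestamp is a placeholder for wherever the replay should start.
                                  .initialToken(source -> source.createTokenAt(
                                          Instant.parse("2022-02-01T00:00:00Z"))));
    }
}

As far as I know, the initial token is only used for segments that have no token yet, so for the replay option you would still need to remove the existing (merged) tokens from the token store first.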

Can you try these suggestions? Let me know if you have any other questions. Steven is out of the office for now, but I'll try to help you as best I can.