TrackingEventProcessor switching during batch process?

My application runs an import process that imports estimate data in a batch. The import produces thousands of aggregates and, of course, thousands of events.
We use a single-node Axon Server SE (v4.5.1) and 3 application nodes running in Kubernetes.
The tracking event processors are configured with a thread count of 3 and a batch size of 1000.
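For reference, the processor is registered roughly as in the sketch below (written from memory as an illustration, not copied from our code; the configuration class name is made up):

```java
import org.axonframework.config.ConfigurerModule;
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AxonProcessorConfig {

    // Registers "EstimateProjectionProcessor" as a tracking processor with
    // 3 threads/segments and a batch size of 1000, as described above.
    @Bean
    public ConfigurerModule estimateProcessorConfigurerModule() {
        return configurer -> configurer.eventProcessing()
                .registerTrackingEventProcessorConfiguration(
                        "EstimateProjectionProcessor",
                        config -> TrackingEventProcessorConfiguration
                                .forParallelProcessing(3)
                                .andBatchSize(1000));
    }
}
```

(The same settings can also be expressed via the axon.eventhandling.processors.* properties in Spring Boot, if I remember correctly.)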

While an import process is running, I see the following log messages, and I do not completely understand what is happening:

node1: INFO [ DefaultDispatcher-worker-1] i.f.m.i.domain.importer.ImportAdapter : Imported 2500 of 24294 estimates
node2: INFO [EstimateProjectionProcessor]-1] o.a.e.TrackingEventProcessor : Fetched token: IndexTrackingToken{globalIndex=2978881} for segment: Segment[1/1]
node1: INFO [EstimateProjectionProcessor]-2] o.a.e.TrackingEventProcessor : No Worker Launcher active. Using current thread to assign segments.
node1: INFO [EstimateProjectionProcessor]-2] o.a.e.TrackingEventProcessor : Worker for segment Segment[1/1] stopped.
node1: INFO [EstimateProjectionProcessor]-2] o.a.e.TrackingEventProcessor : Released claim
node1: INFO [EstimateProjectionProcessor]-2] o.a.e.TrackingEventProcessor : Segment is owned by another node. Releasing thread to process another segment...
node2: INFO [EstimateProjectionProcessor]-0] o.a.e.TrackingEventProcessor : Dispatching new tracking segment worker: TrackingSegmentWorker{processor=EstimateProjectionProcessor, segment=Segment[1/1]}
node2: INFO [EstimateProjectionProcessor]-0] o.a.e.TrackingEventProcessor : Worker assigned to segment Segment[1/1] for processing
node1: INFO [ DefaultDispatcher-worker-2] i.f.m.i.domain.importer.ImportAdapter : Imported 2400 of 24294 estimates
node1: INFO [ DefaultDispatcher-worker-2] i.f.m.i.domain.importer.ImportAdapter : Imported 2300 of 24294 estimates

These messages only show up during these batch imports. What's the issue? I read the logs as saying that the TEP's segment is switched from one node to another. Is this the case? Why does this happen?
Do I need to change my configuration to avoid this?

Thanks for any hints
Klaus

Hi Klaus,

What happens here, most likely, is that processing of a batch took longer than the claim timeout for a token in the TokenStore (10 seconds). In that case, the token appears to have been abandoned, so another node "steals" the claim and starts processing events. Then the original owner of the token notices it has lost the claim and steps out of processing.

Normally, while events are being processed, the row in the token store is locked for the duration of the transaction, preventing other instances from stealing a claim for which events are still being processed.
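For completeness: that claim timeout is configurable on the token store builder. A minimal sketch, assuming the JPA-based token store and Spring Boot auto-configuration (the class name and the 30-second value are just examples):

```java
import java.time.Duration;

import org.axonframework.common.jpa.EntityManagerProvider;
import org.axonframework.eventhandling.tokenstore.TokenStore;
import org.axonframework.eventhandling.tokenstore.jpa.JpaTokenStore;
import org.axonframework.serialization.Serializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TokenStoreConfig {

    // Replaces the auto-configured token store with one whose claim timeout
    // is longer than the 10-second default, so a slow batch is less likely
    // to look abandoned to the other nodes.
    @Bean
    public TokenStore tokenStore(EntityManagerProvider entityManagerProvider, Serializer serializer) {
        return JpaTokenStore.builder()
                .entityManagerProvider(entityManagerProvider)
                .serializer(serializer)
                .claimTimeout(Duration.ofSeconds(30)) // example value, not a recommendation
                .build();
    }
}
```

Whether raising the timeout is the right fix is a different question: with the row lock in place, the claim should normally not expire while a batch is still being committed.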

Which Axon Framework version do you use? Which TokenStore implementation do you use, and which backing store?

Hi Allard
We are using the latest Axon 4.5 with Axon Server 4.5.1. The backing store is Postgres on AWS RDS. I think it is the default token store implementation.
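In case it helps, I can verify which implementation is actually wired with a small startup check like the sketch below (the diagnostics class is made up for illustration):

```java
import org.axonframework.eventhandling.tokenstore.TokenStore;
import org.springframework.boot.ApplicationRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TokenStoreDiagnostics {

    // Prints the concrete TokenStore implementation Spring Boot wired up,
    // so we can see whether it really is the JPA-based default.
    @Bean
    public ApplicationRunner logTokenStore(TokenStore tokenStore) {
        return args -> System.out.println("TokenStore in use: " + tokenStore.getClass().getName());
    }
}
```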

Klaus

Maybe one more piece of information: we have multiple bounded contexts, each with its own database schema in Postgres for the projections. Currently we run all contexts in one single monolithic Spring Boot application, so there is one app accessing multiple DB schemas.
The Axon entities (e.g. tokenentry, sagaentry) are located in another DB schema, but we only have one token store for all contexts.
Could this be related to the behavior I described above?

Klaus