Axon EventHandlers in TrackingEventProcessors were not invoked

A few days ago a client mentioned us an entity couldn’t be found, although our system returned a 202, meaning the request was processed successfully.
The request was processed successfully as events caused by the command due to the request can be found in the EventStore.

What’s so weird about this issue is all the tracking processors were not invoked on the events caused by the first request.
On subsequent events the tracking processors were throwing exceptions as the entity couldn’t be found in the database.
So the eventhandlers for the events from the first command were not invoked, but for subsequent commands, events were invoked.

What could have been possibly gone wrong with the events from the first command, as the tracking processors were not invoked?

Our setup is running 2 axon instances connected to 1 PostgreSQL database.
The processors are backed by JPA and JDBC repositories.

We also have an interceptor for handling events to be able to do more research on this issue.
The interceptor logs when an event is handled, but also in the logs no evidence the eventhandler was invoked.

Below a simplified snapshot of the events in the EventStore:

payload_type timestamp aggregate_id sequence_number my conclusions
Aggregate1CreatedEventDueToRequest1 2021-02-08T16:25:33.763048105Z 1 0
Aggregate2CreatedEventDueToRequest0 2021-02-08T17:12:28.036313918Z 2 0 Not handled by TPs
Aggregate2Event2DueToRequest1 2021-02-08T17:12:28.036338611Z 2 1 Not handled by TPs, but handles by Subscribing Processor
Aggregate2Event3DueToRequest1 2021-02-08T17:12:28.03635331Z 2 2 Not handled by TPs
Aggregate2Event4DueToRequest1 2021-02-08T17:12:28.036372834Z 2 3 Not handled by TPs
Aggregate1Event2DueToAggregate2Event2DueToRequest1 2021-02-08T17:12:28.045952574Z 1 1
Aggregate1Event3DueToAggregate2Event2DueToRequest1 2021-02-08T17:12:28.04599099Z 1 2
Aggregate1Event4DueToAggregate2Event2DueToRequest1 2021-02-08T17:12:28.04608497Z 1 3
Aggregate2Event5DueToRequest2 2021-02-08T17:12:28.058377074Z 2 4 Handled by TPs, but exceptions as previous events not handled

I am guessing this might be something to do with the SequencingPolicy, although this requires some follow up questions.

Firstly, I assume you are using a TrackingEventProcessor (TEP) to update this Query Model. Added, is this TEP parallelized?

Secondly, does the Query Model span several aggregate instances? Thus is it comprised out of all the events from a single aggregate event stream or several. Whether this is about several aggregates of the same type or distinct aggregate types doesn’t really matter. As long as the Query Model handles events from several aggregate event streams.

If both are true, then it’s likely the SequencingPolicy which is the culprit. The SequencingPolicy dictates in which segment an event should be handled. Whenever you parallelize your TEP, you essentially break up the event stream in different portions. When events are being handled by the TEP, it is the SequencingPolicy which will decide if the event should be handled by that segment yes/no.

Furthermore, the default SequencingPolicy is the SequentialPerAggregatePolicy. This specifies that all the events for a single aggregate are handled sequentially. Any other aggregates, of the same type of different types, are simply not guaranteed to be handled within the same segment. Thus if you have a query model which spans several aggregate[s / types] and you are using the default SequentialPerAggregatePolicy, it could happen that the event which constructs the entity wasn’t handled yet by a given segment, whilst the events which update the same entity are being handled by the other.

Granted, this is an assumption as stated, since you haven’t provided any configuration on the Event Processors. If the above is incorrect, I would very much like to know what your Event Processor configuration is to better understand the situation. What makes me doubt my suggestion most, is that you state you have an interceptor in place which logs every event being handled by a TEP. At least either application instance should’ve hit the interceptor… So, are there any other exceptions in your logs pointed towards the event handling process? Furthermore, have you tested whether the problem is automatically resolved if you replay the TEP? I hope my first suggestion “is it”, but if not it would be helpful if you would provide an asnwer to the queries I’ve shared here.

Hi,

We indeed have a SequencingPolicy configured on 2 TEPs and those TEPs span 2 aggregate instances.
All other TEPs don’t have this SequencingPolicy and don’t span aggregate instances.
The issue was noticed on what we call TEP3 in this example.

Or does having a SequencingPolicy on any TEP affect all other TEPs?

Below a simplified extract of our configuration.

#TEP1 with SequencingPolicy AND segmenting
axon.eventhandling.processors.TEP1.batch-size=500
axon.eventhandling.processors.TEP1.initial-segment-count=2
axon.eventhandling.processors.TEP1.sequencingPolicy=sequentialPerSupplierPolicy

#TEP2 with SequencingPolicy NO segmenting
axon.eventhandling.processors.TEP2.sequencingPolicy=sequentialPerSupplierPolicy

#TEP3 NO SequencingPolicy WITH segmenting
axon.eventhandling.processors.TEP3.batch-size=500
axon.eventhandling.processors.TEP3.initial-segment-count=2

Thanks for providing your configuration @tomhier, it clarifies a great deal.

The fact that TEP3 has a one-to-one mapping of event stream to query model suggests that the default SequentialPerAggregatePolicy should be sufficient whenever you have segmented the stream. It wouldn’t hurt to make sure it’s not accidentally set to another policy, although I doubt that’s the problem.

This lands me back on two questions I’ve shared in my previous thread:

  1. Are there any other exceptions in your logs pointed towards the event handling process?
  2. Have you tested whether the problem is automatically resolved if you replay the TEP?

Typically this wouldn’t happen, but in the absence of logs pointed to when the problem occurred I’d be hard-pressed to tell you what exactly went wrong…

Maybe some additional questions might help:

  1. Is the token_entry table for TEP3 in the same data source as the query model it is trying to update?
  2. What type of database are you using to store the query model? Is it the same for both your events and your query models?
  3. How has transaction management been set up?
  4. Quote “The processors are backed by JPA and JDBC repositories.”, does this mean you are not using Axon Server?
  1. Are there any other exceptions in your logs pointed towards the event handling process?
    Not that we noticed, otherwise we would have some pointers to go with :slight_smile:

  2. Have you tested whether the problem is automatically resolved if you replay the TEP?
    Yes, problem is resolved on replay.

  3. Is the token_entry table for TEP3 in the same data source as the query model it is trying to update?
    Yes, it is.

  4. What type of database are you using to store the query model? Is it the same for both your events and your query models?
    PostgreSQL and it’s the same for models and events

  5. How has transaction management been set up?
    No special configuration here

  6. Quote “The processors are backed by JPA and JDBC repositories.”, does this mean you are not using Axon Server?
    Nope, not using Axon Server.