We are currently evaluating Axon (JPA/PostgreSQL) in a POC. After enabling the TrackingEventProcessor, many of the events in a larger batch-operation test were no longer published to other EventHandlers after being committed to the event store.
The root cause seems trivial: we have a huge number of gaps in our DomainEventEntry table. In combination with maxGapOffset in GapAwareTrackingToken.advanceTo(long, int), the events behind these gaps were simply discarded as “missing events” and never come back unless we force them manually.
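To make the effect concrete, here is a minimal sketch of how a gap-aware token with a bounded gap window can lose track of old gaps. It is a simplified model, not Axon's actual implementation: the GapTrimmingSketch and Token classes are hypothetical, and only the maxGapOffset parameter name is taken from the method mentioned above.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class GapTrimmingSketch {

    /** Simplified stand-in for a gap-aware token: highest index seen plus known gaps. */
    public static final class Token {
        public final long index;
        public final SortedSet<Long> gaps;

        public Token(long index, SortedSet<Long> gaps) {
            this.index = index;
            this.gaps = gaps;
        }

        /** Advances to newIndex, records skipped indexes as gaps, and drops gaps outside the window. */
        public Token advanceTo(long newIndex, int maxGapOffset) {
            SortedSet<Long> updated = new TreeSet<>(gaps);
            long resultIndex = index;
            if (updated.remove(newIndex)) {
                // A known gap was filled; the highest index stays the same.
            } else if (newIndex > index) {
                // Every index skipped on the way to newIndex becomes a gap.
                for (long i = index + 1; i < newIndex; i++) {
                    updated.add(i);
                }
                resultIndex = newIndex;
            } else {
                throw new IllegalArgumentException("Index " + newIndex + " is not a tracked gap");
            }
            // Assumption of this model: gaps older than (index - maxGapOffset) are given up on.
            return new Token(resultIndex, new TreeSet<>(updated.tailSet(resultIndex - maxGapOffset)));
        }
    }

    public static void main(String[] args) {
        Token token = new Token(0, new TreeSet<>());
        token = token.advanceTo(5, 3);
        System.out.println(token.gaps); // index 1 has already fallen out of the window
        token = token.advanceTo(10, 3);
        System.out.println(token.gaps);
    }
}
```

In this model, once an index has fallen out of the window it is no longer tracked, so an event that shows up later at that index is not picked up. With many gaps and a limited window, that matches the symptom we saw.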
So the rhetorical question is: Why do we have so many gaps?
Well, DomainEventEntry uses the default hibernate_sequence to generate its global index. The same sequence is used by AssociationValueEntry, and our business logic forces us to use lots of association values. Since both entities share the same sequence generator, we end up with lots of gaps that the GapAwareTrackingToken tries to keep track of. But most of these gaps aren’t “missing” events: those index values belong to AssociationValueEntry rows, so the corresponding events don’t exist and never will.
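A small simulation illustrates the arithmetic. It is only a sketch under simplifying assumptions: plain counters stand in for database sequences, every event is followed by a fixed number of association value inserts, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class SharedSequenceGaps {

    /** Counts the index values skipped between consecutive event indexes. */
    public static long countGaps(List<Long> eventIndexes) {
        long gaps = 0;
        for (int i = 1; i < eventIndexes.size(); i++) {
            gaps += eventIndexes.get(i) - eventIndexes.get(i - 1) - 1;
        }
        return gaps;
    }

    /**
     * Simulates storing events, each followed by a number of association value rows.
     * With sharedSequence = true both row types draw ids from the same counter.
     */
    public static List<Long> eventIndexes(int events, int associationsPerEvent, boolean sharedSequence) {
        AtomicLong eventSequence = new AtomicLong();
        AtomicLong associationSequence = sharedSequence ? eventSequence : new AtomicLong();
        List<Long> indexes = new ArrayList<>();
        for (int e = 0; e < events; e++) {
            indexes.add(eventSequence.incrementAndGet());
            for (int a = 0; a < associationsPerEvent; a++) {
                associationSequence.incrementAndGet(); // id consumed by an association value row
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println("shared:    " + countGaps(eventIndexes(1000, 3, true)));
        System.out.println("dedicated: " + countGaps(eventIndexes(1000, 3, false)));
    }
}
```

With three association values per event, the shared counter leaves three gaps between every pair of consecutive events, while a dedicated counter leaves none.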
We fixed this for our POC by giving DomainEventEntry its own sequence generator for the generated global index. Now all events are published after commit, even in larger batch tests, and memory usage looks much more reasonable (we had some OOM issues with 10,000 gaps and 10,000 cached events, maybe the same problem as in the last thread by Pietro Marrone?).
So two real questions:
- Is our fix correct? Or could it lead to other issues we haven’t seen yet?
- Will this be fixed in the Axon code base? Should we provide a pull request?