When debugging some strange complaints from our query-side models, I noticed that during replay there were some events missing. For example, for one particular aggregate the sequence identifiers of the replayed events would go 1, 2, 3, 4, 5, 6, 7, 135, 136, 137, 138…
(Note: we currently do not use snapshotting)
Perhaps I’m not understanding the code correctly, but it looks like the problem is in JpaEventStore.doVisitEvents(): when fetching from the EventEntryStore, it specifies a batchSize which – I think – actually works as a limit for the total number of rows returned by the underlying JPA query, not the number of rows per batch. In fact it looks like there can only be one batch: given the query it uses, there’s no way for the (JPA) DefaultEventEntryStore.BatchingIterator to get a second batch which continues from where the first batch ended.
So, if the replay should involve more events than the batchSize set on the JpaEventStore, some events will be missing from the replay. Depending on the database platform, chunks of each aggregate’s event sequence could be missing from anywhere in the sequence.
Is this actually a problem or have I missed something?
Obviously the workaround is to set the JpaEventStore batchSize to a very large number. But I don’t think it’s expected behaviour for batchSize to affect visitEvents() like this: the description for EventStoreManagement.visitEvents() says it “loads all events”.
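For reference, the workaround is just a matter of raising that limit when wiring the event store; something like the sketch below (assuming the setBatchSize(int) setter on JpaEventStore and the usual EntityManagerProvider constructor; adjust to however the store is wired in your project):

    import org.axonframework.common.jpa.EntityManagerProvider;
    import org.axonframework.eventstore.jpa.JpaEventStore;

    class EventStoreWiring {
        JpaEventStore jpaEventStore(EntityManagerProvider entityManagerProvider) {
            JpaEventStore eventStore = new JpaEventStore(entityManagerProvider);
            // Workaround: make a single batch big enough to cover every event in the store,
            // so the replay still visits everything even if only one batch is ever fetched.
            eventStore.setBatchSize(1_000_000);
            return eventStore;
        }
    }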
(Allard, if you’re reading this, many thanks for Axon, it is amazing!)
When doing replays, snapshots are never used. A snapshot is only a replacement for events when loading aggregate state (which is not a replay).
The batch size is used to prevent memory overflow when using JPA. JPA returns a result set as a list, causing everything to be loaded in memory before returning it. The batchSize is used to limit the result size in the query. When the resulting event stream is read (and fed to the visitor), it will run a new query when it reaches the end of the result set. Sorting of the events is done based on timestamp, sequence number and then aggregate identifier. The latter two are in place to ensure a consistent order across different replays.
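Roughly, the fetching works like the sketch below (a simplified illustration, not the literal source; the entity and field names are approximations): each query is capped at batchSize rows, and the next query resumes from the timestamp, sequence number and aggregate identifier of the last row of the previous batch.

    import javax.persistence.EntityManager;
    import javax.persistence.Query;
    import java.util.List;

    class BatchFetchSketch {
        // Fetch the next batch of events, continuing after the last row of the previous batch.
        List<?> fetchNextBatch(EntityManager em, String lastTimeStamp, long lastSeqNo,
                               String lastAggregateId, int batchSize) {
            Query query = em.createQuery(
                    "SELECT e FROM DomainEventEntry e"
                    + " WHERE e.timeStamp > :ts"
                    + "    OR (e.timeStamp = :ts AND e.sequenceNumber > :seq)"
                    + "    OR (e.timeStamp = :ts AND e.sequenceNumber = :seq"
                    + "        AND e.aggregateIdentifier > :id)"
                    + " ORDER BY e.timeStamp, e.sequenceNumber, e.aggregateIdentifier");
            query.setParameter("ts", lastTimeStamp);
            query.setParameter("seq", lastSeqNo);
            query.setParameter("id", lastAggregateId);
            return query.setMaxResults(batchSize).getResultList(); // batchSize caps one batch, not the whole replay
        }
    }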
The timestamp column is stored in ISO8601 date-time format. If there are timezone switches in there, it is quite possible that the order of events in a replay becomes incorrect. Could you check the timestamps for the aggregate you mentioned, especially those of the events with sequence numbers 7 and 8?
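To illustrate why that matters: the timestamps are compared as strings, and ISO8601 strings written with different offsets do not necessarily sort in instant order. A tiny example (plain Java, just to show the effect; the values are made up):

    import java.time.OffsetDateTime;

    class TimestampOrderingDemo {
        public static void main(String[] args) {
            String a = "2015-09-16T16:59:00.000+01:00"; // 15:59 UTC
            String b = "2015-09-16T16:30:00.000Z";      // 16:30 UTC, i.e. later than a

            System.out.println(a.compareTo(b) > 0); // true: as strings, a sorts after b
            System.out.println(OffsetDateTime.parse(a).toInstant()
                    .isBefore(OffsetDateTime.parse(b).toInstant())); // true: as instants, a is earlier
        }
    }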
Cheers,
Allard
PS. I’m reading this. Glad you like it! Please spread the word…
Thanks for your response. I see now how DefaultEventEntryStore.FilteredBatchingIterator gets its subsequent batches, by adding the key attributes of the previous batch’s last event to its where clause. So that seems ok.
Given that all the events in my test DB were generated on one machine in London, it seems unlikely that there were any timezone changes affecting this (but it’s not completely impossible). I can try generating a fresh set of events with the event store’s forceUtc flag set to true.
I wonder whether event timestamps should really be Joda Instant rather than Joda DateTime, especially considering that the timestamps are then stored as strings in the DB. In our team we just had a problem elsewhere with Java 8 ZonedDateTime: its equals() method only returns true if the two instances represent the same instant with exactly the same timezone ID (not just the same offset from UTC). I believe Joda DateTime behaves similarly.
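For example (Java 8; the two values below are the same instant with the same offset, but carry different zone IDs):

    import java.time.ZoneId;
    import java.time.ZonedDateTime;

    class ZonedDateTimeEqualsDemo {
        public static void main(String[] args) {
            ZonedDateTime paris  = ZonedDateTime.parse("2015-09-16T17:08:45.705+02:00[Europe/Paris]");
            ZonedDateTime berlin = paris.withZoneSameInstant(ZoneId.of("Europe/Berlin"));

            System.out.println(paris.isEqual(berlin)); // true  - same instant
            System.out.println(paris.equals(berlin));  // false - different zone IDs
        }
    }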
Our system shares events globally between several regional clusters, so in general we need timestamps to be sensibly comparable at every location. We should probably consider converting EventMessage timestamps to UTC when they arrive at each regional cluster.
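Something along these lines at the point where events enter a regional cluster would probably be enough (just a sketch; where exactly to hook this into the EventMessage handling is a separate question):

    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;

    class TimestampNormalizer {
        // Normalise an incoming event timestamp to UTC so that the string form,
        // equality and ordering behave the same in every regional cluster.
        static DateTime toUtc(DateTime timestamp) {
            return timestamp.withZone(DateTimeZone.UTC);
        }
    }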
Anyway, I’ll experiment with batchSize a bit more to see if I can pin down where the problem really was.
On replay, I see a total of 1000 events replayed (my ReplayController counts them):
2015-09-16 17:26:21,219 INFO [main: ReplayingCluster] - Replay ended. Switching back to live mode
2015-09-16 17:26:21,219 INFO [main: ReplayController] - -------- Replay complete. 1000 events replayed --------
If I set that batch-size to 1234, I see exactly 1234 events in the replay. (There are actually >13500 events in the DB)
In the debugger, I can see that the first batch of events has exactly the batch-size number of events in it, but the second batch has zero events (so the replay finishes at that point).
The first batch query has no where-clause. For the second batch query, jpa.DefaultEventEntryStore.BatchingIterator.fetchBatch() uses a where-clause whose timestamp query parameter has the (String) value “2015-09-16T17:08:45.705+01:00”. However, all the events in the DB have timestamps in UTC: the String timestamp stored in the DB for the last event of the first batch is actually “2015-09-16T16:08:45.705Z”. Note the offset: I think this is the problem.
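Since the timestamp column is a String, the comparison in that where-clause is lexicographic, which would explain the empty second batch: the stored “…Z” form of the very same instant already sorts before the “…+01:00” parameter, so events that should appear in the next batch can be filtered out. A trivial check with the two values above:

    class SecondBatchEmptyDemo {
        public static void main(String[] args) {
            String parameter = "2015-09-16T17:08:45.705+01:00"; // value bound to the where-clause
            String stored    = "2015-09-16T16:08:45.705Z";      // same instant, as stored in the DB

            // Same instant, but as strings the stored value sorts before the parameter,
            // so a condition like "timeStamp > :parameter" matches nothing after the first batch.
            System.out.println(stored.compareTo(parameter) < 0); // true
        }
    }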
Possible fix: jpa.DefaultEventEntryFactory.resolveDateTimeValue() could check the forceUtc flag, like that class’s createDomainEventEntry() method does. (I’m not sure though…)
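Something like the sketch below is what I have in mind; to be clear, this is only my guess at the shape of the fix (the method and field names are taken from the existing class, but the body is not Axon’s actual code, and the return value may not be exactly what the real method produces):

    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;

    class DefaultEventEntryFactorySketch {
        private final boolean forceUtc;

        DefaultEventEntryFactorySketch(boolean forceUtc) {
            this.forceUtc = forceUtc;
        }

        public Object resolveDateTimeValue(DateTime dateTime) {
            if (forceUtc) {
                // Mirror what createDomainEventEntry() already does, so the query parameter
                // is rendered with a 'Z' offset and matches the values stored in the DB.
                dateTime = dateTime.toDateTime(DateTimeZone.UTC);
            }
            return dateTime.toString();
        }
    }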