Some questions about the handling of gaps in JPA event store

lbilger · July 11, 2024, 12:32pm

Hi!

My colleagues and I investigated an issue with our application where events were not processed by the tracking event processors. There were no log messages pointing to any processing errors.

Looking at the domain_event_entry table, we found that the order of the timestamps was not the same as the order of the global_index value. Ordering by global index, we found a sequence of exactly 50 events with out-of-sequence timestamp values spread over a period of about an hour. Our assumption is that this was caused by one instance of the application allocating a block of 50 sequence values (the default in Hibernate), but taking some time to use them.
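
To make that assumption concrete, here is a minimal sketch of two instances that each reserve a block of 50 sequence values up front. It is a toy model, not Hibernate or Axon code: the class, the method names and the timings are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: none of these names come from Hibernate or Axon.
class SequenceBlockSketch {
    record Event(long globalIndex, long timestampSeconds) {}

    /** An application instance that hands out ids from a block it reserved up front. */
    static final class Instance {
        private long next;
        private final long lastInBlock;

        Instance(long firstInBlock, int allocationSize) {
            this.next = firstInBlock;
            this.lastInBlock = firstInBlock + allocationSize - 1;
        }

        Event write(long nowSeconds) {
            if (next > lastInBlock) {
                throw new IllegalStateException("block exhausted; a real generator would reserve a new one");
            }
            return new Event(next++, nowSeconds);
        }
    }

    /** Returns the simulated table contents ordered by global index. */
    static List<Event> simulate() {
        Instance slow = new Instance(1, 50);  // reserved ids 1..50
        Instance busy = new Instance(51, 50); // reserved ids 51..100
        List<Event> table = new ArrayList<>();
        for (long t = 0; t < 10; t++) {
            table.add(busy.write(t));         // ids 51..60 written within ten seconds
        }
        for (long t = 0; t <= 3600; t += 1200) {
            table.add(slow.write(t));         // ids 1..4 spread over an hour
        }
        table.sort(Comparator.comparingLong(Event::globalIndex));
        return table;
    }

    /** Counts neighbours (in global index order) whose timestamps go backwards. */
    static long countTimestampInversions(List<Event> byIndex) {
        long inversions = 0;
        for (int i = 1; i < byIndex.size(); i++) {
            if (byIndex.get(i).timestampSeconds() < byIndex.get(i - 1).timestampSeconds()) {
                inversions++;
            }
        }
        return inversions;
    }

    public static void main(String[] args) {
        List<Event> table = simulate();
        table.forEach(System.out::println);
        System.out.println("inversions: " + countTimestampInversions(table));
    }
}
```

In this model the slow instance's ids 1..4 carry timestamps up to an hour later than id 51, which is the kind of pattern we saw when ordering the table by global index.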

Having explained the order of events in the table, we are still not sure why some of the events were not processed and if this is even related. I read some articles about gap handling and the default parameters for the removal of gaps from the tracking tokens. After reading these, I started to wonder why we don’t see this issue more often, because from my understanding:

A gap of up to 50 sequence values is very much expected with the default allocation size of 50.
If few events are being generated by a process, it is expected that some of these gaps are closed after a long time.
Axon will clean up gaps if they are older than a minute by default. So if the process generates fewer than 50 events per minute, the gaps may have been cleaned from the token by the time the events are written to the database.
In this case, the events will never be processed and there won’t be any error messages, which is exactly what we experienced.
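
The reasoning in these four points can be sketched as a small simulation. This is only a toy model of my understanding, not Axon's actual token or storage engine code: the class `GapCleanupSketch`, its `poll` method and the rule "drop a gap once it was first noticed more than a minute ago" are assumptions made for illustration, and the real implementation may apply further conditions that this model ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of the reasoning above; it is NOT Axon's GapAwareTrackingToken or storage engine.
class GapCleanupSketch {
    static final long GAP_TIMEOUT_MS = 60_000; // the one-minute default mentioned above

    private long index = 0;                                   // highest global index seen so far
    private final TreeMap<Long, Long> gaps = new TreeMap<>(); // missing index -> time it was first noticed
    final List<Long> processed = new ArrayList<>();

    /** One polling pass: forget expired gaps, then deliver events that fill a gap or lie beyond the index. */
    void poll(List<Long> visibleIndexes, long nowMs) {
        gaps.values().removeIf(noticedAt -> nowMs - noticedAt > GAP_TIMEOUT_MS);
        for (long id : new TreeSet<>(visibleIndexes)) {
            if (gaps.remove(id) != null) {
                processed.add(id);                 // late event filling a gap that is still tracked
            } else if (id > index) {
                for (long missing = index + 1; missing < id; missing++) {
                    gaps.put(missing, nowMs);      // remember every index that was skipped
                }
                index = id;
                processed.add(id);
            }
            // Otherwise the id is at or below the index and not a tracked gap: it is silently skipped.
        }
    }

    static List<Long> runScenario() {
        GapCleanupSketch processor = new GapCleanupSketch();
        processor.poll(List.of(51L), 0);              // busy instance writes 51; gaps 1..50 are recorded
        processor.poll(List.of(1L, 51L), 30_000);     // event 1 arrives within the timeout and is delivered
        processor.poll(List.of(1L, 2L, 51L), 90_000); // event 2 arrives after its gap was cleaned up
        return processor.processed;
    }

    public static void main(String[] args) {
        System.out.println(runScenario()); // prints [51, 1]
    }
}
```

In this model event 2 is never delivered and nothing is logged, which matches what we experienced. If the real cleanup worked exactly like this, though, I would expect the problem to show up far more often.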

I’m still not sure, however, whether this is the cause, because I would expect it to happen much more often than we have seen it in the past. So there seems to be something, probably a gap in my understanding, that prevents it from happening.

We noticed that there was a rolling redeployment of the application around the time we experienced the issue. Is there a difference in the cleaning up of gaps when an instance claims the tracking token for the first time? Could this explain why we are not seeing the problem all the time?

Or is there a fault in my reasoning above and the handling and cleanup of gaps cannot cause the issues I expect?

Thanks for your help!

tomci · November 14, 2024, 10:17am

Hello, were you able to find out why this is happening? It seems to be the same issue we are encountering: Event Handler not invoked in some caes

lbilger · November 14, 2024, 9:50pm

Hi Tom, we think it was caused by the allocation size of the sequence being greater than one, combined with multiple instances generating events. This can lead to events with a lower global_index being created later than ones with a higher global_index. If that is what is happening for you, the order of the timestamps will not match the order of the global index. In that case, you could check this post for one way to fix it by changing the allocation size.
Please let me know whether you have the same issue and whether this fixes it for you.

Gabriel_Shanahan · November 21, 2024, 2:40pm

For anyone encountering this, I would recommend using the solution described here: Multitenancy: Dangerous AssociationValueEntry entity definition - #5 by Gabriel_Shanahan