Hi!
My colleagues and I investigated an issue with our application where events were not processed by the tracking event processors. There were no log messages pointing to any processing errors.
Looking at the domain_event_entry table, we found that the order of the timestamps was not the same as the order of the global_index
value. Ordering by global index, we found a sequence of exactly 50 events with out-of-sequence timestamp values spread over a period of about an hour. Our assumption is that this was caused by one instance of the application allocating a block of 50 sequence values (the default in Hibernate), but taking some time to use them.
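To illustrate that assumption, here is a minimal sketch of how block allocation could produce the pattern we saw. It is a toy model, not Hibernate or Axon code: the class names, the two instances, and the timings (one event every 72 seconds for the slow instance) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BlockAllocationSketch {

    /** One row of a simulated event table (hypothetical, not the real schema). */
    static final class StoredEvent {
        final long globalIndex;
        final long writtenAtSeconds;

        StoredEvent(long globalIndex, long writtenAtSeconds) {
            this.globalIndex = globalIndex;
            this.writtenAtSeconds = writtenAtSeconds;
        }
    }

    /** Stand-in for an application instance holding a pre-allocated block of sequence values. */
    static final class Instance {
        private long next;

        Instance(long blockStart) {
            this.next = blockStart;
        }

        StoredEvent write(long nowSeconds) {
            return new StoredEvent(next++, nowSeconds);
        }
    }

    /** Returns the simulated table, sorted by global index. */
    static List<StoredEvent> simulate() {
        Instance slow = new Instance(1);   // holds the block 1..50
        Instance busy = new Instance(51);  // holds the block 51..100
        List<StoredEvent> table = new ArrayList<>();
        // The busy instance uses up its whole block within the first 50 seconds.
        for (int i = 0; i < 50; i++) {
            table.add(busy.write(i));
        }
        // The slow instance needs about an hour to use its block.
        for (int i = 0; i < 50; i++) {
            table.add(slow.write(60 + i * 72L));
        }
        table.sort(Comparator.comparingLong(e -> e.globalIndex));
        return table;
    }

    /** True if timestamps never decrease when walking the table in global index order. */
    static boolean timestampsFollowGlobalIndex(List<StoredEvent> sortedByIndex) {
        for (int i = 1; i < sortedByIndex.size(); i++) {
            if (sortedByIndex.get(i).writtenAtSeconds < sortedByIndex.get(i - 1).writtenAtSeconds) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("timestamps follow global index: " + timestampsFollowGlobalIndex(simulate()));
    }
}
```

In this model, the rows with global index 1 to 50 carry timestamps spread over roughly an hour, while rows 51 to 100 all carry earlier timestamps, which matches what we saw in the table.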
Having explained the order of events in the table, we are still not sure why some of the events were not processed and if this is even related. I read some articles about gap handling and the default parameters for the removal of gaps from the tracking tokens. After reading these, I started to wonder why we don’t see this issue more often, because from my understanding:
- A gap of up to 50 sequence values is very much expected with the default allocation size of 50.
- If few events are being generated by a process, it is expected that some of these gaps are closed after a long time.
- Axon will clean up gaps if they are older than a minute by default. So if the process generates fewer than 50 events per minute, the gaps may have been cleaned from the token by the time the events are written to the database.
- In this case, the events will never be processed and there won’t be any error messages, which is exactly what we experienced.
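The reasoning above can be sketched as a small simulation. This is only a simplified model of my understanding, not Axon's actual implementation: the `Token` class, the rule "drop every gap older than the timeout", and the timings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class GapCleanupSketch {

    /** Toy tracking token: highest index seen plus the known gaps below it. */
    static final class Token {
        long index = 0;
        // gap index -> time (in seconds) at which the gap was first observed
        final TreeMap<Long, Long> gaps = new TreeMap<>();

        /** Returns true if the event is handled, false if the token no longer expects it. */
        boolean offer(long globalIndex, long nowSeconds) {
            if (globalIndex > index) {
                // Everything between the old position and the new event becomes a gap.
                for (long g = index + 1; g < globalIndex; g++) {
                    gaps.put(g, nowSeconds);
                }
                index = globalIndex;
                return true;
            }
            // An older index is only handled if it is still recorded as a gap.
            return gaps.remove(globalIndex) != null;
        }

        /** Assumed cleanup rule: forget every gap older than the timeout. */
        void cleanGaps(long nowSeconds, long timeoutSeconds) {
            gaps.values().removeIf(seenAt -> nowSeconds - seenAt > timeoutSeconds);
        }
    }

    /** Returns the global indices that were silently skipped. */
    static List<Long> simulate(long gapTimeoutSeconds) {
        Token token = new Token();
        List<Long> skipped = new ArrayList<>();
        // t = 0: an event with index 51 from another instance is processed first,
        // so indices 1..50 are recorded as gaps.
        token.offer(51, 0);
        // The slow instance then writes indices 1..50, one every 72 seconds.
        for (long i = 1; i <= 50; i++) {
            long now = i * 72;
            token.cleanGaps(now, gapTimeoutSeconds);
            if (!token.offer(i, now)) {
                skipped.add(i);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        System.out.println("skipped with 60s timeout: " + simulate(60).size());
        System.out.println("skipped with 2h timeout: " + simulate(7200).size());
    }
}
```

In this model, a 60-second timeout leads to all 50 late events being skipped without any error, while a timeout longer than the delay loses none. If the real cleanup worked exactly like this, I would expect the problem to show up far more often than it does.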
I’m still not sure, however, whether this is the cause, because I would expect it to happen much more often than we have seen in the past. So there seems to be something, probably a gap in my understanding, that prevents it from happening.
We noticed that there was a rolling redeployment of the application around the time we experienced the issue. Is there a difference in the cleaning up of gaps when an instance claims the tracking token for the first time? Could this explain why we are not seeing the problem all the time?
Or is there a fault in my reasoning above and the handling and cleanup of gaps cannot cause the issues I expect?
Thanks for your help!