Context
Our application runs in a multi-node environment where one node publishes events while another node processes them asynchronously. Recently, we’ve observed that the asynchronous tracking event processor is missing some events in production (no corresponding data appears in the respective table), but we cannot reproduce the issue locally. Our setup uses Axon with an Oracle database.
Analysis
After some analysis, I suspect the issue lies with the Oracle sequence used for the global index. By default, Oracle sequences are created with the CACHE option enabled (and with NOORDER). A cached sequence does not guarantee that numbers are handed out in request order across nodes.
Example: In a multi-node setup, one node may generate records with sequence values 21, 22, 23, and 24 at time x, while another node creates records with 15, 16, and 17 at a later time y (x+1).
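To make the example concrete, here is a minimal simulation in plain Java. It assumes each node draws values from its own cached block of the sequence, which is my understanding of how CACHE with NOORDER behaves across nodes. `SharedSequence`, `Node`, and the cache size of 20 are hypothetical stand-ins for illustration, not Oracle or Axon APIs.

```java
import java.util.ArrayList;
import java.util.List;

class CachedSequenceDemo {

    static final int CACHE_SIZE = 20; // assumed block size per node

    /** Hypothetical stand-in for the database sequence: hands out whole blocks. */
    static final class SharedSequence {
        private long nextBlockStart = 1;

        synchronized long allocateBlock() {
            long start = nextBlockStart;
            nextBlockStart += CACHE_SIZE;
            return start;
        }
    }

    /** Hypothetical node that serves nextVal() from its own cached block. */
    static final class Node {
        private final SharedSequence sequence;
        private long next;
        private long blockEnd; // exclusive upper bound of the cached block

        Node(SharedSequence sequence) {
            this.sequence = sequence;
        }

        long nextVal() {
            if (next >= blockEnd) { // cache empty: fetch a new block
                next = sequence.allocateBlock();
                blockEnd = next + CACHE_SIZE;
            }
            return next++;
        }
    }

    /** Returns the global_index values in the order the rows are written. */
    static List<Long> insertionOrder() {
        SharedSequence sequence = new SharedSequence();
        Node nodeB = new Node(sequence);
        Node nodeA = new Node(sequence);

        // Earlier traffic: node B caches block 1..20 and uses 1..14.
        for (int i = 0; i < 14; i++) {
            nodeB.nextVal();
        }

        List<Long> written = new ArrayList<>();
        // Time x: node A caches block 21..40 and writes four events.
        for (int i = 0; i < 4; i++) {
            written.add(nodeA.nextVal());
        }
        // Time y (later): node B writes three events from its older block.
        for (int i = 0; i < 3; i++) {
            written.add(nodeB.nextVal());
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(insertionOrder()); // [21, 22, 23, 24, 15, 16, 17]
    }
}
```

The rows written later carry lower index values than the rows written earlier, which is the ordering problem described above.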
Tracking Event Processors
My understanding of how a Tracking Event Processor fetches new events:
- The processor queries the event store for new events that have occurred since the last processed token.
- Tracking Token: Each tracking event processor maintains a tracking token, which indicates the last processed event. This token allows the processor to pick up where it left off in the event stream, so that no events are missed or processed multiple times.
Taking the example above:
If the tracking event processor runs asynchronously on a different node, it could first process events 21 to 24 and update the tracking token to 24 at time x. On the next fetch it asks for events after 24 and does not expect any index lower than 24, so events 15, 16, and 17 are skipped.
So, the unordered global_index column complicates event processing, particularly if event processors rely on the sequence order (as with the Tracking Event Processor).
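The skip can be reproduced with a small sketch. It models the tracking token as a single number (the highest index processed) and the fetch as "index greater than token". This is a deliberate simplification of my understanding above, not Axon's actual token implementation. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class NaiveTrackingDemo {

    /** Simplified fetch: every stored index strictly greater than the token. */
    static List<Long> fetchAfter(TreeSet<Long> table, long token) {
        return new ArrayList<>(table.tailSet(token, false));
    }

    /** Returns the indexes that are in the table but were never processed. */
    static List<Long> skippedEvents() {
        TreeSet<Long> table = new TreeSet<>(); // stand-in for the event table
        List<Long> processed = new ArrayList<>();
        long token = 14; // assume everything up to 14 was handled earlier

        // Time x: node A's events become visible and are processed.
        table.addAll(List.of(21L, 22L, 23L, 24L));
        for (long index : fetchAfter(table, token)) {
            processed.add(index);
            token = index; // token advances to 24
        }

        // Time y: node B's events arrive with lower indexes.
        table.addAll(List.of(15L, 16L, 17L));
        for (long index : fetchAfter(table, token)) { // nothing is above 24
            processed.add(index);
            token = index;
        }

        List<Long> skipped = new ArrayList<>(table);
        skipped.removeAll(processed);
        return skipped;
    }

    public static void main(String[] args) {
        System.out.println(skippedEvents()); // [15, 16, 17]
    }
}
```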
Action Items
- Alter global_index sequence: Alter the global_index sequence to use NOCACHE and ORDER. (This might impact performance at insertion time.)
- Implement Retry Handling: Implement retry mechanisms to ensure that events are processed correctly. (It’s an early thought, so I’m not certain if there are any potential issues.)
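One possible shape for the retry idea, sketched under the same simplified model: next to the highest index seen, the token also remembers which lower indexes have not appeared yet, and each fetch re-checks them. `GapToken` and `fetch` are hypothetical names for illustration. I have not verified how this would fit into Axon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class GapRetryDemo {

    /** Hypothetical token: highest index seen plus the lower indexes still missing. */
    static final class GapToken {
        long index;
        final TreeSet<Long> gaps = new TreeSet<>();

        GapToken(long index) {
            this.index = index;
        }
    }

    static List<Long> fetch(TreeSet<Long> table, GapToken token) {
        List<Long> batch = new ArrayList<>();
        // Retry path: look again for indexes that were missing on earlier fetches.
        for (Long gap : new ArrayList<>(token.gaps)) {
            if (table.contains(gap)) {
                batch.add(gap);
                token.gaps.remove(gap);
            }
        }
        // Normal path: read forward and record every index that was jumped over.
        for (long index : table.tailSet(token.index, false)) {
            for (long missing = token.index + 1; missing < index; missing++) {
                token.gaps.add(missing);
            }
            batch.add(index);
            token.index = index;
        }
        return batch;
    }

    /** Replays the scenario from the example and returns the processing order. */
    static List<Long> processedOrder() {
        TreeSet<Long> table = new TreeSet<>();
        GapToken token = new GapToken(14);
        List<Long> processed = new ArrayList<>();

        table.addAll(List.of(21L, 22L, 23L, 24L)); // time x
        processed.addAll(fetch(table, token));      // records 15..20 as gaps

        table.addAll(List.of(15L, 16L, 17L));       // time y
        processed.addAll(fetch(table, token));      // picks up 15, 16, 17
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processedOrder()); // [21, 22, 23, 24, 15, 16, 17]
    }
}
```

Two caveats with this sketch: the late events are handled after the ones with higher indexes, so handlers must tolerate that ordering, and gaps that never fill (18 to 20 here) would need some expiry rule.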
I’d like to know if my understanding is correct. Any additional insights would be greatly appreciated! Thank you