Event processors are sometimes not fired

I am using Axon 4.4.5 and I am seeing weird behaviour: in a few cases, after commands are sent through the command gateway to the command handler, and within the command handler I load the aggregate and fire a domain event, the corresponding event handlers are not fired. Please note that this happens intermittently and I am unable to find the root cause. We are using Oracle as our event store and Tracking Event Processors as our processors. The event handler is in a different microservice and all the event processing is handled through the tokenentry table, as per the default behaviour.

Would you be able to share how often this problem occurs @Ashwini_Kumar?
This should, obviously, not happen at all.

Added, would you be at leisure to share the following with us too:

  • Is this happening for every TrackingEventProcessor (TEP) in your system, or just for one/several?
  • Are all of these TEPs running multi-threaded?
  • Are all these TEPs receiving their events from a single source? So, no combination of AxonServer/RDBMS/Kafka/AMQP?
  • Are you using the SequenceEventStorageEngine perchance?
  • Just to reaffirm, the tokens are moving past the events which you have noticed are not being handled, correct?
  • Do you have any logging in place which confirms this problem?

In the meantime, I have one suggestion you could give a go. You say you are using Oracle, which means you are using Axon’s EmbeddedEventStore as the EventStore implementation. Quite some time ago we introduced the option to force the EmbeddedEventStore to not reuse the same channel for several consumers. You can find this behaviour as the optimizeEventConsumption boolean present in the EmbeddedEventStore.

By default, this behaviour is turned on. I would like to ask you to turn it off and see if the problem persists. To turn it off, you have a couple of options (a configuration sketch follows this list):

  • Use the builder of the EmbeddedEventStore, by invoking the EmbeddedEventStore.Builder#optimizeEventConsumption(false) method during construction.
  • You can provide a system property to toggle this behaviour. By defining a system property with key optimize-event-consumption and value false, you would achieve the same as through the builder.
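
For option 1, here is a minimal sketch, assuming a Spring Boot setup in which you define the EmbeddedEventStore bean yourself on top of an existing EventStorageEngine (the JPA engine over Oracle in your case); the class and bean names are illustrative:

import org.axonframework.eventsourcing.eventstore.EmbeddedEventStore;
import org.axonframework.eventsourcing.eventstore.EventStorageEngine;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Illustrative configuration class; the optimizeEventConsumption(false) call is the relevant part.
@Configuration
public class EventStoreConfig {

    @Bean
    public EmbeddedEventStore eventStore(EventStorageEngine storageEngine) {
        return EmbeddedEventStore.builder()
                                 .storageEngine(storageEngine)
                                 // Option 1: disable the shared-stream optimization, so every
                                 // consumer opens its own, private stream towards the event store.
                                 .optimizeEventConsumption(false)
                                 .build();
    }
}

For option 2, you would instead start the JVM with the system property, for example -Doptimize-event-consumption=false, without touching the bean definition.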

So, please give that a try and let us know whether it resolves the problem at hand. Furthermore, if you could give answers to my follow-up questions, we might be able to deduce another angle which could cause this problem.

Thank you, Steven, for your quick response. I will make the changes as per your suggestion and come back with the results.
Here are the responses to your questions:

  • Is this happening for every TrackingEventProcessor (TEP) in your system, or just for one/several? — This is happening randomly with all the processors, but the frequency is low, roughly 1 out of 150 requests.
  • Are all of these TEPs running multi-threaded? — Yes
  • Are all these TEPs receiving their events from a single source? So, no combination of AxonServer/RDBMS/Kafka/AMQP? — Yes, receiving events from a single source. We have Oracle as our event store; all the events are persisted in the domainevententry table and the TEPs process from there.
  • Are you using the SequenceEventStorageEngine perchance? – No
  • Just to reaffirm, the tokens are moving past the events which you have noticed are not being handled, correct? – Yes, the tokens are moving past them; I can see this in the tokenentry table, with the tokens getting updated to the latest globalIndex column value of the domainevententry table.

Hi Steven,

The changes you asked for have been made and testing is currently in progress. As this issue happens intermittently, I will test for a few days before confirming. In the meantime, can you let me know what the extent of the performance impact of this change will be? With this change, distinct event consumers will no longer read from the same event stream but will stay on a private stream, which means more database resources will be used.

To be honest with you, it is hard to quantify the exact impact you should expect from making this change.
What I can tell you though, is what it means from within Axon Framework.

Whenever optimizeEventConsumption is disabled, the EmbeddedEventStore will construct a separate stream per EventStore#openStream(TrackingToken) operation. The openStream operation is performed by each thread of the TrackingEventProcessor which has a claim on a TrackingToken.

So, let us assume we have a scenario of two application instances, both running the same TrackingEventProcessor (TEP) called “my-processor”. It is configured such that there is a total of 4 segments and 4 threads per TEP instance. When the “my-processor” TEP starts, it will start the four threads which try to claim tokens. Let us further assume your two application instances take an equal load of two segments each. From within the EmbeddedEventStore, with optimizeEventConsumption disabled, each instance would thus open two streams to your RDBMS to receive events. You can expect this number to increase as the number of TEPs on a given application increases.
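
As a point of reference, here is a rough sketch of how the “my-processor” TEP from this scenario could be configured with 4 segments and 4 threads per instance; the processor name and the choice of the event store as the source are illustrative and should of course match your own setup:

import org.axonframework.config.EventProcessingConfigurer;
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorConfig {

    @Autowired
    public void configureProcessors(EventProcessingConfigurer processingConfigurer) {
        processingConfigurer.registerTrackingEventProcessor(
                "my-processor",
                // Read events from the (Embedded)EventStore configured for this application.
                org.axonframework.config.Configuration::eventStore,
                conf -> TrackingEventProcessorConfiguration.forParallelProcessing(4)   // 4 threads per instance
                                                           .andInitialSegmentsCount(4) // 4 segments in total
        );
    }
}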

So, in short, you would have more connections to your RDBMS; essentially a connection per thread opening a stream. In most scenarios this does not prove to be a problem, by the way. It simply means your RDBMS is read more often. As such, I am not anticipating massive trouble on your end.

Note, by the way, that this approach is the default whenever Axon Server is in place. The reasoning behind this is that back when the TEP existed solely alongside the EmbeddedEventStore (so, pre-Axon Server), the processor threads would in most cases not share the same event stream anyhow (which is what the optimization provides). Sharing rarely happens because the sequencing policy, which decides by which thread/segment an event should be handled, does not produce a fully equal distribution. That in turn is because, by design, there is no fixed number of events per stream; there is a ballpark figure, yes, but in virtually all applications aggregates have largely differing event stream sizes. As such, the (rather complex) optimization was not employed when we introduced Axon Server as a source. Instead, we made the streaming of events as efficient as possible.

Additionally, we have quite recently concluded that such an optimization (thus opening a single stream) is far more reasonably implemented from within an EventProcessor implementation. As such, we have worked on the PooledStreamingEventProcessor (which you can find the PR for here) to do exactly that.

Concluding, I find it tough to give you an exact number here @Ashwini_Kumar, as it differs strongly per application and per RDBMS instance and configuration. What I do know is that nobody we recommended this approach to has come back to us reporting that performance was impacted in a troublesome manner. I hope this alleviates some of your concerns.

And, let us know if the problem is resolved through this!

Hi Ashwini,

What isolation level is configured on the Oracle DB system?
Preferably, Read Committed.

Thank you, Steven, for the explanation. We have at least one DB connection available for each TEP thread, so we should be good.
I will update this thread on whether the issue is resolved after a few more days of testing.


Yes, READ COMMITTED is the isolation level.


Hi @Steven_van_Beelen, I just noticed the same issue happening again, even after making the above change. One of the event handlers did not get triggered.

Thanks for coming back to us with this insight, albeit not the desired outcome, of course.
Let me ask a couple of follow-ups to try to figure out what the cause exactly is.

  1. How is your domain_event_entry table configured at the moment?
  2. What type of sequence are you using to generate the global_index on the domain_event_entry?
  3. Is this sequence used for different tables too?
  4. Is your Event Store’s data source (Oracle in this case) only used for the domain_event_entry or does it contain other tables too? If so, which?
  5. Non-database-related question → do you have logging which confirms the failure to handle the event?

Answers:-

  1. How is your domain_event_entry table configured at the moment? – Can you clarify this question a little?
  2. What type of sequence are you using to generate the global_index on the domain_event_entry? – It’s an Oracle sequence that we are using.
  3. Is this sequence used for different tables too? – No, this sequence is specifically for domainevententry
  4. Is your Event Store’s data source (Oracle in this case) only used for the domain_event_entry or does it contain other tables too? If so, which? – All Axon event-store-specific tables like tokenentry and sagaentry, along with a few app-specific tables, are also in this data source.
  5. Non-database-related question → do you have logging which confirms the failure to handle the event? – No logging as such; if you can point me to how to enable this logging, I can add it.

Let me, prior to responding to your answers, ask you some additional things to further understand the situation. I hope this doesn’t confuse the thread too much, but I feel it is important to get to the bottom of this. At any rate, here are my additional questions:

  1. How have you spotted that an event is processed, but not handled exactly? Could you elaborate on how you have spotted this problem?
  2. Do you have gaps between your events? “Gaps” in this case mean that the globalIndex field of your events doesn’t increment sequentially, but has holes in it.

Now, let me go back to your response, to respond to that in turn :wink:

  1. How is your domain_event_entry table configured at the moment? – Can you clarify this question a little?

Sure I can! I was looking to understand how your application constructs the domain_event_entry table. Do you create the table by hand? Through Spring Boot auto-configuration? Flyway? Liquibase? Any specifics on this might be helpful, so just let us know.

  1. Is your Event Store’s data source (Oracle in this case) only used for the domain_event_entry or does it contain other tables too? If so, which? – All Axon event-store-specific tables like tokenentry and sagaentry, along with a few app-specific tables, are also in this data source.

Alright, so a single database which stores all the application data you have, correct?

  1. Non-database-related question → do you have logging which confirms the failure to handle the event? – No logging as such; if you can point me to how to enable this logging, I can add it.

Axon’s MessageMonitor solution (as described here) would go a long way towards knowing how many events are being ingested. However, this would provide metrics for a specific EventProcessor over any type of event, really. What would be good to know is whether there is a certain pattern in the events which are periodically skipped in your applications.

Things like: is it always a different event? Are there specific positions in the stream which seem to fail for a TEP? You’d be best suited by using the axon-micrometer dependency. This dependency adds Micrometer as a solution for your application’s metrics. By adding the dependency to a Spring Boot application, Axon will automatically set up all the default MessageMonitors for any of your message dispatching and handling components. I’d recommend Micrometer, as it allows you to define dimensions. It is these dimensions which enable you to see which events are going through exactly, as you can add the event type to the monitored output. In any case, it will show you the number of ingested, ignored, successful and failed messages flowing through your setup.
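
To make this concrete, below is a rough sketch of how such a dimensioned monitor could be registered for your TrackingEventProcessors, assuming a Spring Boot application with the axon-micrometer dependency on the classpath; the class name is illustrative, and the exact axon-micrometer method signatures should be double-checked against the version you are on:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import org.axonframework.config.ConfigurerModule;
import org.axonframework.eventhandling.TrackingEventProcessor;
import org.axonframework.micrometer.MessageCountingMonitor;
import org.axonframework.micrometer.TagsUtil;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorMetricsConfig {

    @Bean
    public ConfigurerModule processorMetricsConfigurerModule(MeterRegistry meterRegistry) {
        return configurer -> configurer.configureMessageMonitor(
                TrackingEventProcessor.class,
                (axonConfig, componentType, componentName) -> MessageCountingMonitor.buildMonitor(
                        componentName,
                        meterRegistry,
                        // The payload type is the "dimension" that lets you see ingested, ignored,
                        // successful and failed counts per event type in your metrics backend.
                        message -> Tags.of(TagsUtil.PAYLOAD_TYPE_TAG, message.getPayloadType().getSimpleName())
                )
        );
    }
}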


Hope this will help us further @Ashwini_Kumar. Hoping to hear from you soon.

We have a corresponding event handler for all these events, and those event handlers have logger statements within them; for these cases I do not see any of those log lines getting printed.

For this particular case I see a gap between global index values: for example, a global index value of 68495 for the event fired just before this one, and a global index value of 68497 for the event whose handler did not trigger. As far as I understand, these kinds of gaps can occur with Oracle sequences; please correct me if I am wrong.

It is created in the database manually by running a DDL script.

I will check the MessageMonitor documentation you provided and also look into the axon-micrometer dependency to see how it works.

Hoping to hear back from you soon, @Steven_van_Beelen.

When it comes to the globalIndex, ideally there are no gaps at all. This is what Axon Server guarantees for you, for example, but it is slightly harder to achieve with an RDBMS, for two reasons:

  1. The commit and insert order might deviate, which could cause an event at a newer-position to be inserted earlier than an event with an older-position. This should be mitigated through the isolation level @Ivan_Dugalic talked about, by the way, which is recommended to be on READ_COMMITTED.
  2. The sequence generator is reused in other places in your database, to generate automatic identifiers on a query model, for example. This is why I asked whether your sequence is only used for the domain_event_entry table. Although you already stated it isn’t reused, it might be a good idea to double-check that nothing is “abusing” the sequence generator. Whether it is common for an Oracle sequence to contain gaps is, honestly, outside my area of expertise.

We have a corresponding event handler for all these events, and those event handlers have logger statements within them; for these cases I do not see any of those log lines getting printed.

Would you be able to share a sample logline? Maybe there is something I can recommend to add in here to further deduce the issue.

Additionally, we had a chat on another issue (for those interested, this issue), where I explained how to minimize the number of segments through the merge operation. Uncertain if I am pushing this too far, but I’d like to eliminate as many degrees of freedom as possible when figuring this out… so if you have the time, I’d like to ask you to move one of the TEPs which has shown this problem to use a single thread and segment. Let me know if your time allows this at the moment @Ashwini_Kumar.
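
For completeness, here is a rough sketch of how the merge operation could be invoked programmatically on a running TEP to reduce its segment count; the processor name “my-processor” is illustrative, and I am assuming here that you have access to the Axon Configuration to look the processor up:

import java.util.concurrent.CompletableFuture;
import org.axonframework.config.Configuration;
import org.axonframework.eventhandling.TrackingEventProcessor;

public class SegmentMerger {

    public void mergeDownToFewerSegments(Configuration configuration) {
        configuration.eventProcessingConfiguration()
                     .eventProcessor("my-processor", TrackingEventProcessor.class)
                     .ifPresent(processor -> {
                         // Merge segment 0 with its counterpart; the future resolves to true
                         // when the merge succeeded. Repeat until a single segment remains.
                         CompletableFuture<Boolean> result = processor.mergeSegment(0);
                         result.thenAccept(success -> System.out.println("Merge succeeded: " + success));
                     });
    }
}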

  1. The commit and insert order might deviate, which could cause an event at a newer-position to be inserted earlier than an event with an older-position. This should be mitigated through the isolation level @Ivan_Dugalic talked about, by the way, which is recommended to be on READ_COMMITTED. – The isolation level is READ_COMMITTED

Would you be able to share a sample logline? Maybe there is something I can recommend to add in here to further deduce the issue.
One sample log statement from my event handler:
LOG.info("event processing started");

so if you have the time, I’d like to ask you to move one of the TEPs which has shown this problem to use a single thread and segment. – Like I have commented in this post (here), can I manually delete the row containing segment 1 for this TEP, after ensuring that all events for this TEP have been processed and after stopping the app before deleting this row? If time permits, please check @Steven_van_Beelen.

I’ve just provided a response to the linked post! The short answer is “yes”, but you need to be careful in this process as well.

When it comes to your logline, I’d like to recommend some additions.
More specifically, adding specifics of the handled event and the position of that event.
To clarify how to achieve this, let me share the following snippet:

@EventHandler
public void on(YourEvent event, EventMessage<YourEvent> eventMessage, TrackingToken trackingToken) {
    LOG.info(
        "Handling {} with identifier [{}] at token position [{}].", 
        event.getClass(), eventMessage.getIdentifier(), trackingToken.position().orElse(-1L)
    );
    // perform the regular logic you have here.
}

Adjusting the logline like so would make it easier to identify which event failed and what its position is in the stream.

As stated on the referenced thread too, my apologies for the later-than-usual reply. I had a little too much on my plate with other forms of support that I had to give precedence over the Discuss channel. Nonetheless, I am confident we’re getting somewhere here. Keep us posted @Ashwini_Kumar! :muscle:

Thank you, Steven. I will make the necessary changes, test, and come back to you with the results.
Thanks a lot @Steven_van_Beelen
