Behaviour on error when processing events in batch

Laura_Winnen · February 25, 2021, 11:01am

Hello

We want to configure our event handlers for projections to process in batches to increase the replay speed, but while testing we experienced some issues when some events fail.

We have configured our eventlisteners to stop and retry on exception:
public ListenerInvocationErrorHandler listenerInvocationErrorHandler()
{
    return PropagatingErrorHandler.INSTANCE;
}

If we combine these:

If event 200 of a batch of 500 fails, will the first 199 get processed? How is this done?
How can we easily find out which event actually failed from a batch?

Steven_van_Beelen · February 26, 2021, 10:02am

The Event Processor in Axon will regard the entire batch as a single transaction. It does so by using a special type of UnitOfWork, namely the BatchingUnitOfWork.

This means that if handling of one of your events fails exceptionally, the entire batch is rolled back. Granted, whether this means the previous 199 are processed depends on what’s done in the Event Handler. If your Event Handlers for example perform operations that cannot be rolled back, like sending an email, then those previous 199 events should be regarded as processed. Simply put, if the operation performed in the event handler cannot be rolled back, then it would simply be handled.
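To make the rollback behaviour concrete, here is a minimal plain-Java sketch. It is a toy model, not Axon's BatchingUnitOfWork: the class BatchRollbackDemo, its staged list and the "email" list are all hypothetical names chosen for illustration. It assumes database writes are staged and only committed when the whole batch succeeds, while emails go out immediately.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "one batch, one transaction"; this is NOT Axon's BatchingUnitOfWork.
class BatchRollbackDemo {
    final List<Integer> committedRows = new ArrayList<>(); // transactional store
    final List<Integer> sentEmails = new ArrayList<>();    // side effect outside the transaction

    // Handles every event in the batch; staged rows are committed only if all events succeed.
    boolean processBatch(List<Integer> batch, int failingEvent) {
        List<Integer> staged = new ArrayList<>();
        try {
            for (int event : batch) {
                if (event == failingEvent) {
                    throw new IllegalStateException("Event " + event + " failed");
                }
                staged.add(event);     // discarded again if the batch fails
                sentEmails.add(event); // cannot be taken back once done
            }
            committedRows.addAll(staged); // commit the whole batch at once
            return true;
        } catch (IllegalStateException e) {
            return false; // rollback: the staged rows are simply dropped
        }
    }
}
```

With a batch of events 1 to 500 and event 200 failing, this model commits no rows at all, yet the 199 emails for the earlier events have already been sent.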

To easily find out which event failed inside the batch, it would be smart to adjust the ListenerInvocationErrorHandler to not only propagate the exception (which causes the rollback) but to also log it. Thus a custom implementation would be required; essentially, a version that combines both the PropagatingErrorHandler and the LoggingErrorHandler should suffice.
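A rough sketch of such a combined handler could look like the following. To keep it compilable without Axon on the classpath, it declares minimal stand-ins for EventMessage, EventMessageHandler and ListenerInvocationErrorHandler that only mirror the onError signature used in Axon; in a real project you would implement Axon's own interface instead. The class name LoggingAndPropagatingErrorHandler and the in-memory log list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins so the sketch compiles without Axon; use Axon's own types in a real project.
interface EventMessage<T> {
    String getIdentifier();
    T getPayload();
}

interface EventMessageHandler {
    Object handle(EventMessage<?> event) throws Exception;
}

interface ListenerInvocationErrorHandler {
    void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception;
}

// Hypothetical handler: records which event failed, then rethrows so the batch still rolls back.
class LoggingAndPropagatingErrorHandler implements ListenerInvocationErrorHandler {
    // Kept in memory for illustration only; a real handler would write to a logger.
    final List<String> logged = new ArrayList<>();

    @Override
    public void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception {
        logged.add("Event " + event.getIdentifier() + " failed: " + exception.getMessage());
        throw exception; // propagating is what triggers the rollback and the retry
    }
}
```

Because the event identifier ends up in the log before the exception is rethrown, the failing event can be found afterwards even though the whole batch was rolled back.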

Hope this clarifies your options @Laura_Winnen. Make sure to reach out if you have a follow-up question!

Laura_Winnen · February 26, 2021, 11:25am

In the example where the 200th event of batch of 500 is invalid and we would like to skip this (and only this, not the preceding 199), how would we do this?

Steven_van_Beelen · February 26, 2021, 11:34am

If skipping is fine, that means the ListenerInvocationErrorHandler should “be smart” to regard an exception for that event to not be a problem.

This means you will have to construct a custom implementation of the ListenerInvocationErrorHandler which handles the exceptions. It would then check whether the given exception and event combination should cause the batch to fail or not. If it should, you simply rethrow the exception (which is what the PropagatingErrorHandler does). If it should not fail, just logging should be sufficient (which is what the LoggingErrorHandler does).

Laura_Winnen · February 26, 2021, 12:02pm

Skipping is most of the time not fine; we only use it in exceptional situations where an event failed that should never fail (mostly related to a bug somewhere).

When we run event listeners without batches and the situation occurs, we advance the token in the database manually. Is there no easy solution like this when running in batches?

Koen_Verwimp · February 28, 2021, 3:13pm

I have a similar situation where I have a batch of 500 messages posting to a graph database in 1 batch. At commit time there seems to be an error with one of the messages inside the batch. This ended up being an event that got into the event stream by mistake (due to a bug).

Questions:

It happens at commit time, so how do we know which event is responsible for the problem situation? It can be any of the 500.
What happens with the batch size when an error occurs? Does Axon retry with the same batch size of 500, or does Axon decrease the batch size in order to get through it?
Is there any way we could mark this problem event to be skipped? (I don't think the ListenerInvocationErrorHandler is really a solution for us.)

Steven_van_Beelen · March 1, 2021, 9:16am

@Laura_Winnen, well, if skipping is not fine when something fails, why did you ask how you would skip a failed event? Maybe I am not following your use case here…some elaboration would be helpful to better understand your situation, either from you or from @Koen_Verwimp.

On any note, you can make the ListenerInvocationErrorHandler smart, as I said earlier. If you check the method you’d implement when creating a custom ListenerInvocationErrorHandler, it gives you the following inputs to react on:

public interface ListenerInvocationErrorHandler {

    void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception;

}

You get the exception, so that you can validate what you want to do on the given exception.
Furthermore, you can check the exact event which was handled that caused that exception. I believe this answers your first question too, @Koen_Verwimp. You’d simply know what event failed, as the ListenerInvocationErrorHandler provides it to you.
You even get the entire EventMessageHandler which was used during invocation. This could, essentially, allow a retry of handling the event if required.

I hope this clarifies the position of the ListenerInvocationErrorHandler. Granted, the default implementations provided don’t sound like they’re suited for your situation. Then again, the situation begins to sound very specific, which wouldn’t allow a generic-framework implementation anyhow.

Now, there are still two questions left from your end too @Koen_Verwimp (assuming 1 has been answered through my explanation of the parameters on the ListenerInvocationErrorHandler):

The TrackingEventProcessor will simply proceed with the defined batch size whenever you’ve pushed it into error mode. “Pushing it into error mode” means you have propagated the error up through all levels, thus on the ListenerInvocationErrorHandler and the ErrorHandler. The TEP thread which failed will move in an incremental back-off retry loop, but the chances are pretty high another TEP thread will simply pick up the token and proceed as usual. Thus, even if the failed thread would go for lower batch size, it wouldn’t really cause any impact. This is arguably one of the reasons why we’re working on another type of Event Processor which provides more flexibility.
I am uncertain why you feel the ListenerInvocationErrorHandler (LIEH) isn’t suited here. The LIEH is the place which is first invoked when your event handler fails. This allows you to choose what you want to do when that exception occurs. Do you want it to fail hard? Rethrow the exception. Do you want to retry? Invoke the given eventHandler. Do you want to ignore or skip the event? Simply do nothing.

To conclude, and as I feel it’s the most important gist out of the above: if you want an event handling exception to not cause your batch of events to rollback, simply catch the exception in the ListenerInvocationErrorHandler and ignore that it happened.
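The three reactions described above (rethrow, retry, ignore) could be sketched as follows. Again the Axon types are replaced by minimal stand-ins that only mirror the onError signature, so the snippet compiles on its own. The Reaction enum, the PolicyErrorHandler class and the policy function are hypothetical; how the policy decides is entirely up to your system.

```java
import java.util.function.BiFunction;

// Stand-ins so the sketch compiles without Axon; use Axon's own types in a real project.
interface EventMessage<T> {
    String getIdentifier();
    T getPayload();
}

interface EventMessageHandler {
    Object handle(EventMessage<?> event) throws Exception;
}

interface ListenerInvocationErrorHandler {
    void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception;
}

// Hypothetical outcome of inspecting an exception/event combination.
enum Reaction { FAIL, RETRY, SKIP }

// Hypothetical handler that delegates the decision to a caller-supplied policy.
class PolicyErrorHandler implements ListenerInvocationErrorHandler {
    private final BiFunction<Exception, EventMessage<?>, Reaction> policy;

    PolicyErrorHandler(BiFunction<Exception, EventMessage<?>, Reaction> policy) {
        this.policy = policy;
    }

    @Override
    public void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception {
        switch (policy.apply(exception, event)) {
            case RETRY:
                eventHandler.handle(event); // one more attempt; a second failure propagates
                break;
            case SKIP:
                break; // swallow the exception: the batch continues with the next event
            default:
                throw exception; // fail hard: the batch rolls back
        }
    }
}
```

A policy could, for example, map a transient I/O exception to RETRY and everything else to FAIL, so that skipping stays a deliberate exception rather than the default.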

Laura_Winnen · May 18, 2021, 9:16am

@Steven_van_Beelen
Thanks for your explanation.

The difference in our situation is that we do not want to skip events with the exception in general. Most of the time we want to have our event handler stop and retry. It is really an exceptional situation where, e.g. due to a bug in the past, the event has incorrect data and will trigger an exception. Since we cannot change this event, we need to find a way to skip this (and only this) event when (re)playing event handlers.

When we used to run them in batches of 1, we just updated the token in the database to the next one, but when processing in batches, it is unclear how we could fix it.

Steven_van_Beelen · May 18, 2021, 10:09am

Sure thing @Laura_Winnen, glad to help.

That means the ListenerInvocationErrorHandler should be smart enough to know about these incorrect events to be able to skip these (by simply logging the exception/message and proceeding). Furthermore, it should stick to the more common approach where a retry should occur, as that’s the default approach.

How to get these smarts is dependent on the system itself really. Maybe it’s a fixed set of specific events where somebody put in the wrong data. Or maybe it’s an old format of event you no longer comply with, but cannot upcast correctly. Or maybe you’re dealing with events you’d want to be filtered out entirely. Whatever it is, that might change how you keep the knowledge of which events should be skipped. In short, this is a very long “it depends” when it comes to how you know which events to skip.
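For the simplest of those cases, a fixed set of known-bad events, a sketch might look like this. It assumes the bad events can be recognised by their event identifier and that the set of identifiers is supplied from configuration; both are assumptions, as is the class name SkipListedEventsErrorHandler. The Axon types are once more replaced by minimal stand-ins so the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-ins so the sketch compiles without Axon; use Axon's own types in a real project.
interface EventMessage<T> {
    String getIdentifier();
    T getPayload();
}

interface EventMessageHandler {
    Object handle(EventMessage<?> event) throws Exception;
}

interface ListenerInvocationErrorHandler {
    void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception;
}

// Hypothetical handler: skips only explicitly listed events, propagates every other failure.
class SkipListedEventsErrorHandler implements ListenerInvocationErrorHandler {
    private final Set<String> knownBadEventIds;
    // Kept in memory for illustration only; a real handler would write to a logger.
    final List<String> skipped = new ArrayList<>();

    SkipListedEventsErrorHandler(Set<String> knownBadEventIds) {
        this.knownBadEventIds = new HashSet<>(knownBadEventIds);
    }

    @Override
    public void onError(Exception exception, EventMessage<?> event, EventMessageHandler eventHandler) throws Exception {
        if (knownBadEventIds.contains(event.getIdentifier())) {
            skipped.add(event.getIdentifier()); // record the skip and let the batch continue
            return;
        }
        throw exception; // default behaviour: stop and retry
    }
}
```

This keeps stop-and-retry as the default and plays a role similar to manually advancing the token past a single event, without touching the token store.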

Unsure whether it helps, but you can figure out whether a replay is underway when you’re in the ListenerInvocationErrorHandler. The API of the ListenerInvocationErrorHandler provides the EventMessage that failed. With the EventMessage in hand, you can validate whether it is a replayed event by invoking ReplayToken#isReplay(Message).

Hope this provides some guidance!