Event handling with tracking event processors and exceptions

Marco_Dubacher · October 29, 2018, 10:30pm

Hi

I recently switched from subscribing event processors to tracking event processors. As far as I know, I have two possibilities for handling exceptions during event handling.

Possibility 1
I can log the exceptions and continue as if nothing happened.

Problem: My event handlers write data to read model tables for querying purposes. Just ignoring the event causes an inconsistency between the state of the domain and the read models.

Solution: Fix the issue that caused the log entry and replay all events afterwards. Until the fix is in place, the system is in an inconsistent state. Furthermore, I need some sort of alerting that informs me about such incidents.

Possibility 2
Throw an exception, which causes the event processor to submit the same event again and again until it can be handled without an exception.

Problem: Subsequent events of the same tracking event processor group won't be handled until the event causing the exception can be handled successfully.

Solution: Implement fine-grained tracking event processor groups, so that in case of such an incident only parts of the system are affected.
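
To make the trade-off concrete, here is a minimal sketch of both possibilities in plain Java. It does not use Axon's API: `TrackingProcessorSketch`, its `OnError` modes and the integer token are hypothetical stand-ins for a tracking processor and its tracking token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a tracking event processor; not Axon's API.
class TrackingProcessorSketch {
    /** What to do when the event handler throws. */
    enum OnError { LOG_AND_CONTINUE, PROPAGATE }

    private final OnError onError;
    private final Consumer<String> handler;
    final List<String> skipped = new ArrayList<>(); // events that were "logged" and skipped
    private int token = 0;                          // index of the next event to handle

    TrackingProcessorSketch(OnError onError, Consumer<String> handler) {
        this.onError = onError;
        this.handler = handler;
    }

    /** Processes events from the current token onwards and returns the new token. */
    int processOnce(List<String> stream) {
        while (token < stream.size()) {
            String event = stream.get(token);
            try {
                handler.accept(event);
            } catch (RuntimeException e) {
                if (onError == OnError.PROPAGATE) {
                    // Possibility 2: token is not advanced, so the same event
                    // is retried on the next pass and later events wait.
                    return token;
                }
                // Possibility 1: record and skip; the read model now lags the domain.
                skipped.add(event);
            }
            token++;
        }
        return token;
    }
}
```

With `LOG_AND_CONTINUE` the processor reaches the end of the stream but leaves a gap in the read model; with `PROPAGATE` it stays stuck on the failing event until the handler succeeds.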

As the possibilities described above are not convincing to me, I wonder whether there are other possibilities or solutions. Maybe I missed something completely.

Thanks for any help.

Cheers
Marco

locorider · November 2, 2018, 12:21am

Hi Marco

It’s not really what you want but:

We follow the fine-grained processing-groups approach, so that in case of errors or changes we don’t have to replay that much.
Furthermore, we use Bugsnag for error reporting and register a default “ListenerInvocationErrorHandler”. Whenever an exception occurs we get notified more or less instantly. After that, if necessary, we just replay the event handler from the point in time before the incident happened.

I guess it very much depends on the case. How critical is it when something is inconsistent? Can the state be fixed with further events? Can it auto-heal? (e.g. your exception is thrown because some other service is temporarily down)
If it is the latter, you could go with the retry approach; otherwise you could use those ErrorHandlers to shut down the processor in combination with an alarm. (Honestly, I have never tried it. Hope it works.)
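
The decision such an error handler would have to make can be sketched in plain Java. Everything here is an assumption for illustration: the class `ErrorPolicySketch`, the `Decision` enum, and the rule that I/O failures count as "temporarily down". Wiring it into Axon's `ListenerInvocationErrorHandler` is not shown.

```java
import java.io.IOException;

// Hypothetical policy: retry on transient failures, stop and alert on everything else.
class ErrorPolicySketch {
    enum Decision { RETRY, STOP_AND_ALERT }

    /**
     * Assumption: an IOException anywhere in the cause chain means a dependency
     * is temporarily unavailable, so the event may succeed on a later attempt.
     */
    static Decision decide(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof IOException) {
                return Decision.RETRY;
            }
        }
        // Anything else is treated as a bug that needs a human before continuing.
        return Decision.STOP_AND_ALERT;
    }
}
```

A real handler would rethrow on `RETRY` so the processor retries, and notify someone (Bugsnag, in our case) before stopping on `STOP_AND_ALERT`.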

I hope I could help at least a bit.

Grüsse
José

Marinko_Babic · November 2, 2018, 6:16am

Here a new generic concept is needed. Something like an error queue. When the retry fails several times the message should be moved to the error queue. The upcoming messages will be processed normally.

The developer can now replay the failed messages from the error queue once the issue is fixed. Once a message has been processed successfully, it is removed automatically from the error queue.

This mechanism should be part of the Axon framework.
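
A rough sketch of what I mean, in plain Java. This is not an existing Axon feature; `ErrorQueueSketch`, `maxAttempts` and the method names are made up for illustration, and events are plain strings to keep it short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical error queue: park events that keep failing, keep processing the rest.
class ErrorQueueSketch {
    private final int maxAttempts;
    private final Consumer<String> handler;
    private final Deque<String> errorQueue = new ArrayDeque<>();

    ErrorQueueSketch(int maxAttempts, Consumer<String> handler) {
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    /** Handles each event with retries; an event that still fails is parked. */
    void process(List<String> events) {
        for (String event : events) {
            if (!tryHandle(event)) {
                errorQueue.addLast(event);
            }
        }
    }

    /** Re-runs parked events; successes leave the queue. Returns how many remain. */
    int replayFailed() {
        int pending = errorQueue.size();
        for (int i = 0; i < pending; i++) {
            String event = errorQueue.removeFirst();
            if (!tryHandle(event)) {
                errorQueue.addLast(event);
            }
        }
        return errorQueue.size();
    }

    /** Snapshot of the currently parked events. */
    List<String> parked() {
        return new ArrayList<>(errorQueue);
    }

    private boolean tryHandle(String event) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(event);
                return true;
            } catch (RuntimeException e) {
                // swallow and retry until maxAttempts is reached
            }
        }
        return false;
    }
}
```

Note that a parked event is applied after the events that followed it, so this only suits handlers that tolerate that reordering.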

Thanks
Marinko

Marco_Dubacher · November 4, 2018, 11:52am

Hi

Thanks for the comments.

The approach with the error queue is very interesting. I wonder whether this is on the roadmap of Axon.

Cheers
Marco

Steven_van_Beelen · November 7, 2018, 9:53am

Hi Marinko, Marco,

I can guarantee it is not on the roadmap at this point, but I also agree this is a very interesting idea worth discussing.

I do view this as a potential danger though, as the changes you make to your query model aren’t always as simple as ‘just replay the failed events’.

That’s a guarantee we cannot provide from a framework perspective, as that’s in the hands of the users.
Nonetheless, making such a thing configurable for users who are certain that is the solution to their problem, would be worthwhile.

Marinko, as your stance is quite clear on this point, would you be up for creating an issue describing the problem?

This opens up a discussion forum for the developers over at AxonIQ, an easier spot to debate this than the usergroup.

Thanks in advance for your time on this of course!

Cheers,

Steven

Gerlo_Hesselink · November 9, 2018, 9:25pm

We use a variation of your first solution:

When an exception is thrown, we write the exception message to the DB along with the read model instance, so it is marked unusable.
When the cause of the exception is fixed, we can replay only that aggregate on the read model (with a temporary listener), so it is consistent and usable again.

Of course things become more complex if the read model involves more than one aggregate.
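
Roughly, for the single-aggregate case, it looks like the sketch below. This is simplified plain Java, not our actual code: `ReadModelSketch`, `Row`, `Event` and the `negativeAllowed` switch (standing in for "the bug is fixed") are invented for illustration, and skipping further events on a marked row is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical read model with one row per aggregate and an error marker per row.
class ReadModelSketch {
    /** A running total plus the error that made the row unusable, if any. */
    static final class Row {
        int total;
        String error; // non-null means "do not serve this row"
        boolean usable() { return error == null; }
    }

    /** Hypothetical event: an amount applied to one aggregate. */
    static final class Event {
        final String aggregateId;
        final int amount;
        Event(String aggregateId, int amount) {
            this.aggregateId = aggregateId;
            this.amount = amount;
        }
    }

    final Map<String, Row> rows = new HashMap<>();
    boolean negativeAllowed = false; // flip to true to simulate deploying the fix

    /** Normal handling: on failure, mark the row instead of dropping the event silently. */
    void on(Event e) {
        Row row = rows.computeIfAbsent(e.aggregateId, id -> new Row());
        if (!row.usable()) {
            return; // assumption: the row will be rebuilt by a replay anyway
        }
        try {
            apply(row, e);
        } catch (RuntimeException ex) {
            row.error = ex.getMessage();
        }
    }

    /** Rebuilds a single aggregate's row from the full event log. */
    void replayAggregate(String aggregateId, List<Event> log) {
        Row fresh = new Row();
        for (Event e : log) {
            if (e.aggregateId.equals(aggregateId)) {
                apply(fresh, e);
            }
        }
        rows.put(aggregateId, fresh);
    }

    private void apply(Row row, Event e) {
        if (e.amount < 0 && !negativeAllowed) {
            throw new IllegalArgumentException("negative amount");
        }
        row.total += e.amount;
    }
}
```

Other aggregates stay usable the whole time; only the marked row is out of service until it has been replayed.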
Greetings,
Gerlo

Gerlo_Hesselink · November 9, 2018, 9:35pm

The way we implemented it was by using our own ErrorHandler, which sends a special (not persisted) exception event message to the listener at hand. Any listener can handle this event to write the exception to the DB, log it, etc.

Maybe this is something Axon could give us: an annotation like @ExceptionHandler on an event listener method, which takes the exception and possibly the original event message (the one causing the exception) as parameters.
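
To illustrate the idea, here is a small plain-Java sketch of how such an annotation could be dispatched via reflection. The `@ExceptionHandler` annotation, the `dispatch` method and the `OrderProjection` listener are all defined here for illustration only; this is a proposal, not how Axon works today.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class ExceptionHandlerSketch {
    /** Hypothetical annotation marking a method that handles listener failures. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface ExceptionHandler {}

    /**
     * Routes a failure to the first annotated two-argument method whose
     * parameter types match the exception and the original event.
     * Returns true if a handler method was found and invoked.
     */
    static boolean dispatch(Object listener, Exception error, Object event) {
        for (Method m : listener.getClass().getDeclaredMethods()) {
            Class<?>[] params = m.getParameterTypes();
            if (m.isAnnotationPresent(ExceptionHandler.class)
                    && params.length == 2
                    && params[0].isInstance(error)
                    && params[1].isInstance(event)) {
                try {
                    m.setAccessible(true);
                    m.invoke(listener, error, event);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("exception handler invocation failed", e);
                }
                return true;
            }
        }
        return false;
    }

    /** Example listener that records the failure; it could equally write it to the DB. */
    static class OrderProjection {
        String lastFailure;

        @ExceptionHandler
        void onFailure(IllegalStateException e, String event) {
            lastFailure = event + ": " + e.getMessage();
        }
    }
}
```

An error handler would call `dispatch` with the listener that failed; if it returns false, the exception falls through to whatever default behaviour is configured.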

allardbz · November 16, 2018, 9:10am

Hi Gerlo,

that’s actually an interesting idea. Axon’s handler mechanism is already pretty generic, so it shouldn’t be too hard to implement something like this.

Cheers,

Allard

Steven_van_Beelen · November 19, 2018, 11:10am

Hi all,

Just created this issue to mark this idea. Liking it too!

Cheers,
Steven