Handling Poison Events - do we have some shared best practices?

Slawomir_Siudek · August 14, 2020, 8:22am

Hello everyone

I just went through a case of a never-ending failure loop in a projection. The projection logic wasn’t able to progress because of two things:

Processing the event throws an exception in the event handler, so the projection in the database can’t be updated, and
the same event is delivered to the projection again in an infinite loop.

As you can guess, it was of course the most important projection in my system. From the users’ perspective, ‘the system does not work’, because they do not see changes in the user interface.

I can imagine some solutions or workarounds, but maybe we already have some guidance on how to handle poison events with Axon Framework.

Could you share some articles or links related to this problem, please? Or maybe it is a good proposal for a new subpage on the Axon documentation website.

Cheers Slawek

Bert_Laverman · August 17, 2020, 7:50am

Slawek,
Poison pills are a pretty generic problem, and dealing with them can take a lot of skull-sweat. With message queues that use transactional processing, the typical scenario involves: (1) start transaction, (2) consume next message, (3) process, (4) produce a result, (5) commit consumption and production. Given an uncaught exception in processing, the transaction is never committed, so the same message is delivered again and the consumer will not proceed beyond the poison pill. The root cause is the uncaught exception. If you’re unable to prevent it, the strategy to follow depends on whether or not you can skip the pill. If you can, skipping is usually combined with a dead-letter queue or something like it. If you can’t, you really have to stop and shout for help.
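To make the "skip the pill and park it" option concrete, here is a minimal sketch in plain Java. It is not Axon Framework API: `PoisonPillSketch`, `DeadLetter`, `process` and the `maxAttempts` parameter are hypothetical names made up for illustration, and the in-memory list only stands in for a real dead-letter queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class PoisonPillSketch {

    /** Hypothetical holder: an event that could not be processed, plus the last failure. */
    static final class DeadLetter<E> {
        final E event;
        final RuntimeException cause;

        DeadLetter(E event, RuntimeException cause) {
            this.event = event;
            this.cause = cause;
        }
    }

    /**
     * Feeds the events to the handler in order. An event that still fails after
     * maxAttempts tries is parked in the returned list, and processing moves on
     * to the next event instead of retrying the same one forever.
     */
    static <E> List<DeadLetter<E>> process(List<E> events, Consumer<E> handler, int maxAttempts) {
        List<DeadLetter<E>> deadLetters = new ArrayList<>();
        for (E event : events) {
            RuntimeException lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    handler.accept(event); // stands in for steps (3) to (5)
                    lastFailure = null;
                    break;
                } catch (RuntimeException e) {
                    lastFailure = e; // remember the failure and retry
                }
            }
            if (lastFailure != null) {
                deadLetters.add(new DeadLetter<>(event, lastFailure));
            }
        }
        return deadLetters;
    }

    public static void main(String[] args) {
        List<String> projected = new ArrayList<>();
        List<DeadLetter<String>> dead = process(
                List.of("created", "poison", "renamed"),
                e -> {
                    if (e.equals("poison")) {
                        throw new IllegalStateException("cannot apply " + e);
                    }
                    projected.add(e);
                },
                3);
        System.out.println("projected=" + projected + ", dead letters=" + dead.size());
    }
}
```

Note that skipping is only safe when later events do not depend on the skipped one. If they do, the parked event has to block everything that follows it for the same entity, or you are back at "stop and shout for help".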

Having a poison pill at all is generally a big problem in an event-based architecture. In Java, the problem commonly originates in the use of unchecked exceptions, which are not declared in the method’s signature and thus pass through “undetected”. Having to specify and handle exceptions tends to be a complex job. It is further complicated by streams and lambdas: the standard functional interfaces declare no checked exceptions at all, so everyone flocks to unchecked ones, and you’ll get exceptions you never saw coming.
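A small illustration of that last point, as a sketch only (`UncheckedSurprise` and `parseAll` are made-up names): nothing in the method signature warns the caller, yet one bad element makes the whole stream pipeline throw.

```java
import java.util.List;
import java.util.stream.Collectors;

class UncheckedSurprise {

    // Nothing in this signature hints that a NumberFormatException can escape:
    // Function.apply declares no exceptions, and Integer.parseInt throws an unchecked one.
    static List<Integer> parseAll(List<String> raw) {
        return raw.stream()
                .map(Integer::parseInt)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        try {
            parseAll(List.of("1", "two", "3"));
        } catch (NumberFormatException e) {
            // The compiler never forced us to think about this case.
            System.out.println("unchecked exception surfaced: " + e.getMessage());
        }
    }
}
```

Put the same kind of call inside an event handler without the `try`/`catch`, and you have a poison pill.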

In the context of a CQRS projection, you’ll probably have to treat this as a very serious error that needs to be added to your test set, because the read model has to be correct. As an approach, depending on the requirements on the data, you should flag the model (or, if possible, part of the model) as invalid, fix the code, and regenerate.
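The flag, fix, regenerate cycle could look roughly like the sketch below. It is plain Java with an in-memory map standing in for the projection table, and it is not Axon Framework API. `RebuildableReadModel`, `apply`, `rebuild` and the `"account:amount"` event format are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

class RebuildableReadModel {
    private final Map<String, Integer> totals = new HashMap<>();
    private boolean valid = true;

    boolean isValid() {
        return valid;
    }

    Map<String, Integer> snapshot() {
        return new HashMap<>(totals);
    }

    /** Applies one event; a failing handler flags the model as invalid instead of looping. */
    void apply(String event, BiConsumer<Map<String, Integer>, String> handler) {
        try {
            handler.accept(totals, event);
        } catch (RuntimeException e) {
            valid = false; // the UI can now show "data may be stale" rather than wrong data
        }
    }

    /** Throws away the current state and replays the full event log with the given handler. */
    void rebuild(List<String> eventLog, BiConsumer<Map<String, Integer>, String> handler) {
        totals.clear();
        valid = true;
        for (String event : eventLog) {
            apply(event, handler);
        }
    }

    public static void main(String[] args) {
        List<String> log = List.of("alice:5", "alice:2", "bob:1");

        // Buggy handler: assumes the account already exists, so a new key causes an NPE.
        BiConsumer<Map<String, Integer>, String> buggy = (state, e) -> {
            String[] p = e.split(":");
            state.put(p[0], state.get(p[0]) + Integer.parseInt(p[1]));
        };
        // Fixed handler: merge copes with accounts that are not in the map yet.
        BiConsumer<Map<String, Integer>, String> fixed = (state, e) -> {
            String[] p = e.split(":");
            state.merge(p[0], Integer.parseInt(p[1]), Integer::sum);
        };

        RebuildableReadModel model = new RebuildableReadModel();
        log.forEach(e -> model.apply(e, buggy));
        System.out.println("valid after buggy run: " + model.isValid());

        model.rebuild(log, fixed);
        System.out.println("valid after rebuild: " + model.isValid() + ", totals=" + model.snapshot());
    }
}
```

The event that broke the buggy handler is a natural candidate for the regression test mentioned above.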

Bert Laverman