Some fundamental questions concerning RetryPolicy.retry in the asynchronous error handling architecture

Hi Allard,

We’re using Axon in 2 projects here, and we ran into a few architectural problems that we’d like to solve.

I’ll try to summarize our questions/observations as concisely as possible; please correct me if any of the following statements is based on false assumptions or a wrong understanding of the Axon framework.

"In RetryPolicy.retry is apparently useless and even dangerous in a real production-context, as events that are scheduled for retry, won’t be automatically retried after the server is restarted (non-durable rescheduling)." (in case an exception is being handled asynchronously, eg by Axon’s DefaultErrorHandler)

The reason one would want a transaction (that led to a transient exception) to be rescheduled is the hope that the system will heal itself. A resource might have been temporarily overloaded, but a few moments later it is functioning correctly again.
The (default) ErrorHandler logs the presumably transient exception (as WARN): nobody needs to have a look at it (yet); we rely on the system, which will retry (‘till the end of time’) until the underlying problem is fixed and the event(s) get processed correctly. The user is safe: his intention has been recorded by the system.

But in reality, and probably by design, the system won’t retry till the end of time: a server restart is enough for the event not to be rescheduled again. So we cannot be sure that the system auto-heals. Consequently we have to ERROR-log the exception, and:

  • somebody (the infra team, a developer) needs to look at it right away anyway, as we are not sure the event will ever be processed, and the system might be left in an inconsistent state with nobody really knowing about it (the exception occurred asynchronously, so even the user doesn’t know about it, unless we ‘push’ the exception to him/her - which, btw, we do using websocket/sockjs). Consequently, we risk events never being processed, and a system that ends up inconsistent without anybody realizing it until it is too late.
  • when that person inspects the error log and looks at the exception, there is no simple way for him to determine whether the ‘erroneous’ event got processed in the meantime (i.e. the system auto-healed). Also, when the underlying source of the problem is solved, there is no easy way to check which of the ‘erroneous’ events got resolved and which didn’t.

Note that in our custom ErrorHandler implementation, we look at the difference between the current timestamp and the timestamp of the original erroneous event to determine whether an error is recent (the first minutes after its original appearance) or not recent (e.g. an hour of retrying didn’t solve the problem). In the latter case, we alert the infra team: “please have a look at this event/exception and fix the underlying problem if you can”.

We also introduced this custom ErrorHandler to give bugs a chance to get fixed. The user has successfully recorded his intention. It might be that, due to a bug in a saga, the system doesn’t become consistent. So we mark the event to be rescheduled, fix the bug and release a new version of the application. The event that was rescheduled is picked up again, and the system becomes eventually consistent. Everybody happy.

"A RetryPolicy.retry that potentially never ends is undesirable anyway, as the queue of rescheduled events can become very large and pointless/wrong from a functional perspective"

It would be much better to retry a configurable number of times and/or for a configurable amount of time, and then give up on an event.
Actually, that is what we tried to do in our initial custom ErrorHandler implementation (sketched below): when a problem doesn’t get solved within a week, we abandon it (RetryPolicy.proceed()) and notify a few people in the organisation, so that they can look into the problem and fix the underlying bug (by that time, the problem can most probably be categorised as a bug).
One could also ask oneself: what’s the point in rescheduling an event that originally happened a week or a month ago? Is this functionally acceptable? Will the problematic event (e.g. the deletion of something) not be overruled by all the other events coming after it (dealing with the so-called deleted entity)?
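To make this concrete, here is a rough sketch of the decision logic we have in mind. It only relies on RetryPolicy’s factory methods and the (Joda-Time) timestamp of the EventMessage; AlertService and the thresholds are made up for this example, and the surrounding class would implement the ErrorHandler interface of the asynchronous cluster (whose exact method signature depends on the Axon version):

    import java.util.concurrent.TimeUnit;

    import org.axonframework.domain.EventMessage;
    import org.axonframework.eventhandling.async.RetryPolicy;

    /**
     * Sketch: keep retrying recent failures, alert the infra team once a failure
     * is no longer "recent", and give up (proceed) after a week. The real class
     * would implement Axon's ErrorHandler and delegate to decide();
     * AlertService is a hypothetical helper, not part of Axon.
     */
    public class AgeBasedRetryDecision {

        private static final long ALERT_AFTER_MILLIS = TimeUnit.HOURS.toMillis(1);
        private static final long GIVE_UP_AFTER_MILLIS = TimeUnit.DAYS.toMillis(7);

        private final AlertService alertService; // hypothetical notification component

        public AgeBasedRetryDecision(AlertService alertService) {
            this.alertService = alertService;
        }

        public RetryPolicy decide(Throwable failure, EventMessage<?> event) {
            long ageMillis = System.currentTimeMillis() - event.getTimestamp().getMillis();

            if (ageMillis > GIVE_UP_AFTER_MILLIS) {
                // a week of retrying didn't help: abandon the event, but make sure somebody looks into it
                alertService.alert(event, failure);
                return RetryPolicy.proceed();
            }
            if (ageMillis > ALERT_AFTER_MILLIS) {
                // no longer "recent": keep retrying, but ask a human to have a look
                alertService.alert(event, failure);
            }
            return RetryPolicy.retryAfter(10, TimeUnit.SECONDS);
        }

        /** Hypothetical alerting interface; stands in for mail/chat/monitoring. */
        public interface AlertService {
            void alert(EventMessage<?> event, Throwable failure);
        }
    }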

"There is no simple way to determine, monitor and fix the set of problematic events. Replaying also seems so ‘heavy’ or ‘unhandy’"

See above. Do you expect the infra team/developer to visually inspect the EventStore? And what should he do when he finally finds the problematic event? He shouldn’t/can’t replay it on a running system, can he? There is also no simple interface (GUI or so) for him to do this kind of replay operation, unless we build it ourselves…

We thought of:

  • switching off the retry policy and rolling back erroneous asynchronous transactions immediately (RetryPolicy.skip())
  • a kind of error queue where problematic events are parked in a durable way (e.g. RabbitMQ), as sketched below
  • rescheduling the event via the mechanics of the queue - our problematic saga would be subscribed to that error queue too, and process events coming from that queue
  • letting the infra team/dev team inspect this queue and manage the set of problematic events from there

But then again: is this the way to go? It starts to smell like an ‘unmanageable’ or ‘expensive’ solution.
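To sketch the queue part mentioned above: the ErrorHandler would return RetryPolicy.skip() and hand the failed event to something like the class below. It uses Spring AMQP’s RabbitTemplate; the exchange/routing-key names are invented, and serialisation of the event is left to whatever event serializer is already in place:

    import org.springframework.amqp.rabbit.core.RabbitTemplate;

    /**
     * Sketch: park a failed event on a durable RabbitMQ "error queue" so it
     * survives restarts and can be inspected or re-delivered later.
     * Exchange/routing-key names are invented for this example.
     */
    public class ErrorQueuePublisher {

        private final RabbitTemplate rabbitTemplate;

        public ErrorQueuePublisher(RabbitTemplate rabbitTemplate) {
            this.rabbitTemplate = rabbitTemplate;
        }

        public void park(byte[] serializedEvent) {
            // convertAndSend(exchange, routingKey, payload)
            rabbitTemplate.convertAndSend("saga.errors.exchange", "saga.errors", serializedEvent);
        }
    }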

"A RetryPolicy.retry in combination with a SequentialPolicy should be avoided, as it will block the processing of all events (at least, in the cluster with the sequential policy)"

At the moment we have a default setup: one cluster with a SequentialPolicy() and the default RetryPolicy.retry. We have seen that this is a very dangerous setup. One of our sagas asynchronously threw a transient exception. After that, not a single event was processed, yet from the client/command-handler perspective everything seemed fine. The resulting events are simply added to a wait queue, waiting for the problematic event to eventually succeed (after retrying an indefinite number of times).
This makes the system completely unusable. Hence, again, my first statement: RetryPolicy.retry is dangerous.

We thought of:

  • using SequentialPerAggregatePolicy --> only a small part of the events gets held up; for the rest, the show can go on
  • using multiple clusters (a configuration sketch follows below):
      • a cluster for the query-model updaters
      • a cluster for the sagas

=> when the saga fails, at least the query model still gets updated
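A rough configuration sketch of that idea (the AsynchronousCluster constructor arguments and the routing rule are indicative of Axon 2.x, not copied from the documentation - please correct me if the wiring should look different):

    import java.util.concurrent.Executors;

    import org.axonframework.eventhandling.Cluster;
    import org.axonframework.eventhandling.ClusterSelector;
    import org.axonframework.eventhandling.ClusteringEventBus;
    import org.axonframework.eventhandling.EventListener;
    import org.axonframework.eventhandling.async.AsynchronousCluster;
    import org.axonframework.eventhandling.async.SequentialPerAggregatePolicy;

    /**
     * Sketch: two asynchronous clusters, each with per-aggregate sequencing,
     * so a failing saga no longer stalls the query-model updaters and a
     * failing event only blocks events of the same aggregate.
     * NOTE: constructor arguments are indicative of Axon 2.x (the constructor
     * may also take a TransactionManager and an ErrorHandler); the routing
     * rule below (class-name check) is just an example.
     */
    public class ClusterConfiguration {

        public ClusteringEventBus eventBus() {
            final Cluster queryModelCluster = new AsynchronousCluster(
                    "queryModelCluster", Executors.newSingleThreadExecutor(),
                    new SequentialPerAggregatePolicy());
            final Cluster sagaCluster = new AsynchronousCluster(
                    "sagaCluster", Executors.newSingleThreadExecutor(),
                    new SequentialPerAggregatePolicy());

            ClusterSelector selector = new ClusterSelector() {
                @Override
                public Cluster selectCluster(EventListener eventListener) {
                    // route saga managers to their own cluster,
                    // everything else to the query-model cluster
                    return eventListener.getClass().getName().contains("Saga")
                            ? sagaCluster : queryModelCluster;
                }
            };
            return new ClusteringEventBus(selector);
        }
    }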

Hi Christian,

Most of your statements about retry policies are completely correct. Retrying should be done sparingly and carefully. Personally, I generally give up after at most 5 tries. If an exception is not explicitly transient, I don’t retry at all. Retrying for more than a few seconds/minutes is something I would avoid as much as possible.
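As an illustration, such a rule could look roughly like this (the transient-exception check and the in-memory attempt counter are just one possible way to do it, not something Axon provides):

    import java.sql.SQLTransientException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.TimeUnit;

    import org.axonframework.domain.EventMessage;
    import org.axonframework.eventhandling.async.RetryPolicy;
    import org.springframework.dao.TransientDataAccessException;

    /**
     * Sketch: retry only failures that are explicitly transient, and give up
     * after a handful of short retries. The attempt counter is a plain
     * in-memory map keyed by event identifier (illustrative; it is lost on
     * restart, which is acceptable if you give up quickly anyway).
     */
    public class BoundedRetryDecision {

        private static final int MAX_ATTEMPTS = 5;

        private final ConcurrentMap<String, Integer> attempts = new ConcurrentHashMap<String, Integer>();

        public RetryPolicy decide(Throwable failure, EventMessage<?> event) {
            boolean transientFailure = failure instanceof TransientDataAccessException
                    || failure instanceof SQLTransientException;

            Integer previous = attempts.get(event.getIdentifier());
            int attempt = previous == null ? 1 : previous + 1;
            attempts.put(event.getIdentifier(), attempt);

            if (!transientFailure || attempt > MAX_ATTEMPTS) {
                attempts.remove(event.getIdentifier());
                return RetryPolicy.skip();                      // don't keep hammering a lost cause
            }
            return RetryPolicy.retryAfter(2, TimeUnit.SECONDS); // short, bounded backoff
        }
    }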

If durability is important (which it most often is), I would strongly recommend against using an asynchronous cluster with an in-memory queue. You’re best off using an AMQP message broker and reading your events from there. Most of these message brokers can automatically move failed events to an error queue. If your infra team monitors that queue, they know exactly when to act. When possible, they can resend failed objects to the queue they came from.
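With RabbitMQ, for example, that error queue can be set up with a dead-letter exchange: a message that is rejected without requeue (or that expires, or that overflows the queue) is moved there automatically. A rough Spring AMQP sketch, with invented names:

    import org.springframework.amqp.core.Binding;
    import org.springframework.amqp.core.BindingBuilder;
    import org.springframework.amqp.core.FanoutExchange;
    import org.springframework.amqp.core.Queue;
    import org.springframework.amqp.core.QueueBuilder;

    /**
     * Sketch: RabbitMQ routes rejected/expired deliveries from the event queue
     * to an error queue via a dead-letter exchange. Names are invented.
     */
    public class ErrorQueueTopology {

        Queue eventQueue() {
            return QueueBuilder.durable("axon.events")
                    .withArgument("x-dead-letter-exchange", "axon.events.dlx")
                    .withArgument("x-max-length", 100000) // optional: cap the queue size (see next paragraph)
                    .build();
        }

        FanoutExchange deadLetterExchange() {
            // fanout, so the original routing key of the dead-lettered message doesn't matter
            return new FanoutExchange("axon.events.dlx");
        }

        Queue errorQueue() {
            return QueueBuilder.durable("axon.events.error").build();
        }

        Binding errorBinding() {
            return BindingBuilder.bind(errorQueue()).to(deadLetterExchange());
        }
    }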

If you use any form of asynchronous processing, make sure to monitor queue sizes. Alternatively, limit the number of items that can be put in the queue - and even then, still monitor the queue’s size.
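A minimal sketch of such a check, assuming Spring AMQP’s RabbitAdmin (queue name and threshold are made up, and the exact property key may differ between versions):

    import java.util.Properties;

    import org.springframework.amqp.rabbit.core.RabbitAdmin;

    /**
     * Sketch: a periodic queue-depth check; hook it into whatever scheduling
     * and monitoring you already have. Queue name and threshold are made up.
     */
    public class QueueDepthMonitor {

        private final RabbitAdmin rabbitAdmin;

        public QueueDepthMonitor(RabbitAdmin rabbitAdmin) {
            this.rabbitAdmin = rabbitAdmin;
        }

        public void check() {
            Properties props = rabbitAdmin.getQueueProperties("axon.events");
            if (props == null) {
                return; // queue does not exist (yet)
            }
            int depth = ((Number) props.get(RabbitAdmin.QUEUE_MESSAGE_COUNT)).intValue();
            if (depth > 10000) {
                // alert: consumers are falling behind or events keep failing
            }
        }
    }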

If you go for the AMQP solution, you’ll find that you can easily do so with the ClusteringEventBus and the AmqpConnector. Each cluster will read from its own queue. You can have multiple clusters watch the same queue, either as competing consumers or as each other’s backup.
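The broker-side topology for that could look roughly like this (names invented; the Axon-side terminal/connector wiring itself depends on your version):

    import org.springframework.amqp.core.Binding;
    import org.springframework.amqp.core.BindingBuilder;
    import org.springframework.amqp.core.FanoutExchange;
    import org.springframework.amqp.core.Queue;

    /**
     * Sketch: events are published to one exchange, and each cluster reads
     * from its own queue bound to that exchange. Multiple consumers on the
     * same queue compete for its messages; giving each cluster its own queue
     * keeps their processing independent. Names are invented.
     */
    public class EventBusTopology {

        FanoutExchange eventsExchange() {
            return new FanoutExchange("axon.events");
        }

        Queue queryModelQueue() {
            return new Queue("axon.events.querymodel", true); // durable
        }

        Queue sagaQueue() {
            return new Queue("axon.events.saga", true); // durable
        }

        Binding queryModelBinding() {
            return BindingBuilder.bind(queryModelQueue()).to(eventsExchange());
        }

        Binding sagaBinding() {
            return BindingBuilder.bind(sagaQueue()).to(eventsExchange());
        }
    }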

Hope this helps.
Cheers,

Allard