Configurability issues with AbstractRetryScheduler

dmurat · August 13, 2021, 9:43am

As stated in the documentation, RetryScheduler will not schedule a retry for explicitly non-transient exceptions. With the current implementation, checked exceptions are considered non-transient, but also AxonNonTransientException.

In command handlers, I’m using CommandExecutionExceptions for communicating business violations to the dispatching side. Unfortunately, with the current implementation of the AbstractRetryScheduler.isExplicitlyNonTransient() method, all my CommandExecutionExceptions will be treated as transient, and commands will be retried.

I can handle the situation by implementing my custom extensions of AbstractRetryScheduler. However, I found this unfortunate, as I’m trying to deal with a pretty common scenario, I believe.

I think the problem can be avoided by adding something like the nonTransientFailures property (list of exception classes) to the AbstractRetryScheduler and its builder. The method isExplicitlyNonTransient() can then be modified to check if any configured non-transient exception classes are assignable from the actual failure.

In addition, for more complex cases, we can also add the nonTransientFailurePredicate property and leave non-transient checking logic to the user.

It looks to me, with the changes above, we will significantly reduce the need for custom implementations of AbstractRetryScheduler. WDYT?

Tnx

Steven_van_Beelen · August 13, 2021, 11:47am

I believe great minds think alike since we have an issue with this exact idea!
You can find it as #1723 on our GitHub repository, by the way.

It isn’t prioritized right now though.
If you’d like to provide (another) pull request, that would be great of course!

Don’t feel obliged though.
We will get to the issue ourselves eventually.

dmurat · August 13, 2021, 12:35pm

Oh, I forgot to check issues

Well, as I already created a custom extension of AbstractRetryScheduler, I might create another PR. Let me see if I can come up with something over the weekend

While we are at AbstractRetryScheduler, I noticed another minor issue related to the logging in scheduleRetry() method. I think logging levels are a bit off.

The first logging statement containing “Will retry {} more time(s)” is at the info level and that looks appropriate. One would probably want to know something is happening.

Regarding the second logging statement containing “Giving up permanently.”, I think it should be at the warn level. When multiple retries fail, it is pretty significant. Stacktrace logging is also appropriate for that case.

The third logging statement containing “and will not be retried” should be at the debug level, I think. Otherwise, if RetryScheduler's logic concludes that exception is non-transient and an exception should not be dealt with at all, it will still log the exception anyway. This information might be useful during development and testing, but not really in production (especially a stacktrace log). Hence, it looks to be appropriate to decrease a logging level for that case.

If I’ll work on PR, is it ok to tackle those logging levels too?

Tnx

Steven_van_Beelen · August 13, 2021, 1:05pm

I’d very much like to dive into @allardbz’s mind in 2012 (when he added the RetryScheduler) to understand why he chose these levels. But I am obviously not able to.

From my end, I understand your train of thought and think it makes sense.
So, sure, make that logging adjustment whilst you’re adding the configuration option.

dmurat · August 14, 2021, 1:01pm

Here is a PR: Add configurable options for checking failure transiency by dmurat · Pull Request #1910 · AxonFramework/AxonFramework ·

allardbz · August 24, 2021, 9:55am

The reasoning behind these log levels is that when “giving up”, the retry scheduler notices the callback method of the failed command. That callback gets a reference to the exception. If any logging (on higher levels) should be done, it’s in that callback method.