In our application we use Axon Framework (4.7.4) and Axon Server (4.6.11) in a microservice environment. We have many Sagas that control the consistency of our distributed transactions. One of the most important aspects is reliable command dispatching, specifically for commands sent from Sagas. Command processing may throw exceptions for transient or non-transient reasons. If the exception is transient, Axon should retry sending the command until it succeeds.
In our configuration we set up a custom retry scheduler that extends org.axonframework.commandhandling.gateway.IntervalRetryScheduler and overrides isExplicitlyNonTransient(Throwable failure). The problem we face is that every failure passed to it is a CommandExecutionException that wraps an AxonServerRemoteCommandHandlingException. This is always the case, whether the exception comes from the database connection pool (clearly transient) or from some custom validation (clearly non-transient).
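To make the observation above easier to reproduce, here is a minimal diagnostic sketch in plain Java. It is not our actual scheduler code, and the class name CauseChain is hypothetical. It only lists the simple class names found along the cause chain of a failure, so it could be called from inside isExplicitlyNonTransient to log what actually arrives there.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Axon: lists the exception types
// found along the cause chain of a failure, outermost first.
final class CauseChain {
    // Upper bound on the walk, so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    private CauseChain() {
    }

    static List<String> typeNames(Throwable failure) {
        List<String> names = new ArrayList<>();
        Throwable current = failure;
        while (current != null && names.size() < MAX_DEPTH) {
            names.add(current.getClass().getSimpleName());
            current = current.getCause();
        }
        return names;
    }
}
```

In our case such a listing would contain only CommandExecutionException followed by AxonServerRemoteCommandHandlingException, regardless of what the command handler originally threw.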
This is an extremely important topic for us, because when sending a command fails in the middle of a Saga, we end up with an inconsistent transaction that will never finish, and zombie Sagas are left behind.
For the time being, as a workaround, we use failure events instead of exceptions in command handlers and treat all command dispatching exceptions as transient. However, there are rare cases when a fatal exception is thrown, which results in a very large number of retries.
Could you please suggest whether it is possible to get more control over, and more context about, the exact type of the thrown exception in isExplicitlyNonTransient, especially for exceptions that occur in the same JVM from which the command is dispatched? The most common case is database connection pool exceptions, which should clearly be treated as transient.
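To show the kind of decision we would like to make, here is a rough sketch in plain Java. It assumes the original exception is reachable through the cause chain, which is exactly what we do not get today. The class FailureClassifier is hypothetical, java.sql.SQLTransientException stands in for connection pool failures, and IllegalArgumentException stands in for our custom validation exception.

```java
import java.sql.SQLTransientException;

// Hypothetical classifier, not part of Axon. An override of
// isExplicitlyNonTransient(Throwable) could delegate to it.
final class FailureClassifier {
    // Upper bound on the walk, so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    private FailureClassifier() {
    }

    // True if the failure itself or any of its causes is an instance of the given type.
    static boolean chainContains(Throwable failure, Class<? extends Throwable> type) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (type.isInstance(current)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    static boolean isExplicitlyNonTransient(Throwable failure) {
        // Assumption: a transient SQL exception anywhere in the chain means "retry".
        if (chainContains(failure, SQLTransientException.class)) {
            return false;
        }
        // Assumption: IllegalArgumentException stands in for a validation failure.
        // Anything unrecognised is still retried, as in our current setup.
        return chainContains(failure, IllegalArgumentException.class);
    }
}
```

With this sketch, unknown failures keep the current behaviour (they are retried), and only failures we explicitly recognise as validation errors would stop the retries.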