Reliable commands dispatching

paco12 · June 26, 2023, 11:28am

Hello,

In our application we use Axon Framework (4.7.4) and Axon Server (4.6.11) in a microservice environment. In such an environment we have many Sagas to control the consistency of our distributed transactions. One of the most important aspects is reliable command dispatching, and more specifically the dispatching of commands from Sagas. There are cases where command processing throws exceptions that reflect transient or non-transient causes. If the exception is transient, Axon should keep resending the command until it succeeds.

In our configuration we set up a custom retry scheduler, which extends org.axonframework.commandhandling.gateway.IntervalRetryScheduler and overrides isExplicitlyNonTransient(Throwable failure). The problem we face is that every Throwable failure is a CommandExecutionException that wraps another failure of type AxonServerRemoteCommandHandlingException. This is always the case, no matter whether the exception relates to the db connection pool (clearly transient) or to some type of custom validation (clearly non-transient).

This is an extremely important topic, because when sending a command fails in the middle of a Saga, we end up with an inconsistent transaction that will never finish and zombie Sagas left behind.

For the time being, to work around this issue, we use failure events instead of exceptions in command handlers and treat all command dispatching exceptions as transient. But there are some rare cases when fatal exceptions are thrown, which results in a great many retries.

Could you please suggest whether it is possible to have more control over, and context for, the exact type of exception passed to isExplicitlyNonTransient, especially for exceptions that occur in the same JVM where the command is dispatched? The most common case is database connection pool exceptions, which clearly should be treated as transient.

Thank you

Steven_van_Beelen · July 17, 2023, 12:03pm

I am confident I can shed some light on your predicament, @paco12!

Axon Framework will never serialize and deserialize any Exception.
For one, the stack trace is not always something that’s easily de-/serialized.

However, more importantly, in a distributed application landscape, there is no guarantee that the Exception is present on both applications’ classpaths.
As such, Axon Framework decided to wrap any exception into a CommandExecutionException to ensure you are not faced with an unusable deserialization exception whilst, in reality, you may face problematic command handling issues.

The downside of this is the problem you describe.
You expect behavior as if you’re still in the same application instance, while in essence you receive a generic communication exception. The fact you’re dealing with a distributed system communicating through Messages thus becomes clearer.

Axon Framework does have a way for you to specify the exceptions in clearer terms, though.
The CommandExecutionException has an Object details field.
You can add any form of data in this details object, thus allowing you to react accordingly to specific exceptions.
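
As a rough illustration of how such a details object might be used on the dispatching side, here is a minimal sketch in plain Java with no Axon dependency. FailureDetails and RetryDecision are hypothetical names, not Axon types. In a real setup the Optional would presumably be filled from the details carried by the CommandExecutionException, and the details class would have to be serializable by your configured serializer and available to both applications.

```java
import java.util.Optional;

// Hypothetical details payload, not an Axon type.
final class FailureDetails {
    private final String code;
    private final boolean transientFailure;

    FailureDetails(String code, boolean transientFailure) {
        this.code = code;
        this.transientFailure = transientFailure;
    }

    String getCode() {
        return code;
    }

    boolean isTransientFailure() {
        return transientFailure;
    }
}

// Hypothetical helper that a custom retry scheduler could delegate to.
final class RetryDecision {

    private RetryDecision() {
    }

    // Returns true when the failure should NOT be retried.
    // Assumption of this sketch: missing or unrecognized details count as transient.
    static boolean isExplicitlyNonTransient(Optional<?> details) {
        return details
                .filter(FailureDetails.class::isInstance)
                .map(FailureDetails.class::cast)
                .map(d -> !d.isTransientFailure())
                .orElse(false);
    }
}
```

Defaulting to "transient" when no details are present is a design choice of this sketch. If endless retries are the bigger risk for you, flipping the default to non-transient may be the safer option.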

Off the top of my head, I can think of two samples we have that show how to deal with distributed exception handling:

AxonIQ’s code samples repository has a distributed exceptions module.
A presentation I’ve done several years back, wherein I introduce Axon Framework’s @ExceptionHandler to add details to the CommandExecutionException. You can find this presentation here and the sample code here (step 6 of the project adds the @ExceptionHandler solution shown in the presentation).

By taking either route, you can thus add the required details to your exceptions to react accordingly in your IntervalRetryScheduler implementation.

Concluding, I hope all this will help you further, @paco12!
If questions remain, be sure to reach out!

paco12 · July 20, 2023, 11:25am

Hi Steven,

I absolutely agree with all that you wrote. But probably I didn’t make myself clear, so I’ll try to explain again.

The core of our problem is how to guarantee that when a Saga (for example) sends a command, that command is reliably dispatched ONLY once! My understanding is that sending a message is a transactional operation that happens inside the current JVM process. If an exception occurs while sending a command, this should be a local exception that does not need to be wrapped in a CommandExecutionException.

What I mean is that once I say ‘send’ command, I want it to be sent. When the command is dispatched and handled by the remote aggregate, it may fail there with an exception or a failure event, but that is another story. We are having trouble with reliable command sending: commands are sent, but not dispatched. And I believe this is a problem with Axon and should be handled there.

Meanwhile we are trying a lot of tricks, but they are not working well. We are still seeing commands that are not dispatched at all, and others that are dispatched several times. I can assure you that all this is starting to become a disaster, and we need to find a good solution for it.

I hope you understand my problem. I’ll wait for your answer or further questions.

Thank you

Steven_van_Beelen · July 20, 2023, 3:35pm

To be honest with you, if dispatching were the problem, I would expect an AxonServerCommandDispatchException to be thrown, not an AxonServerRemoteCommandHandlingException.

So, to figure out what is actually happening, can you perhaps share stack traces of the predicament you’re facing, @paco12? I would like to understand why you’re convinced a dispatching/sending exception is occurring instead of a handling exception.

Secondly, I assume you’re using the Standard Functionality of Axon Server?
If, instead, you’re using Enterprise Functionality, I am confident our Axon Server support team would love to come into contact with you.

If your business is relying on it heavily, know that setting up a call with AxonIQ’s engineers is 10 times faster than asking questions on our public forum. Although we aim to provide guidance as often as possible, all AxonIQ engineers reply here on a best-effort basis.

paco12 · July 21, 2023, 8:13am

Hi Steven,

Thank you for your response. Right now I’m in the process of debugging what exactly happens. The last time this problem happened, I had evidence that it was caused by an SQL connection error. I was then able to reproduce the situation and fix it as I described in my first post. Now it is happening again, but this time it is not clear exactly which exception triggered it. It happens in production, and I’m trying to reproduce it in the staging environment, so far without success. Most probably, as you said, another type of exception has been thrown that breaks the sending. What I’m thinking is to treat all exceptions as transient: just return false in IntervalRetryScheduler.isExplicitlyNonTransient. As I wrote, in our command model we don’t use exceptions; all failures are represented by failure events, and all commands are idempotent. I’ll spend a little more time debugging and trying to reproduce it. After that I’ll try to evaluate the risks of treating all exceptions as transient.
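
A minimal sketch of that idea, in plain Java with no Axon types: every failure is retried up to a cap, and the handler tolerates duplicates. The names RetryAllSketch, IdempotentHandler, and sendWithRetry are made up for illustration, and the wait between attempts is left out. In an Axon setup this role would be played by the retry scheduler and the real command handlers.

```java
import java.util.HashSet;
import java.util.Set;

final class RetryAllSketch {

    private RetryAllSketch() {
    }

    // Hypothetical idempotent handler: remembers processed command ids and ignores repeats.
    static final class IdempotentHandler {
        private final Set<String> processed = new HashSet<>();
        int applied = 0;

        void handle(String commandId) {
            // Set.add returns false for an id that was already processed.
            if (processed.add(commandId)) {
                applied++;
            }
        }
    }

    // Treats every RuntimeException as transient, but gives up after maxAttempts.
    // Returns true if one attempt completed without throwing.
    static boolean sendWithRetry(Runnable send, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                send.run();
                return true;
            } catch (RuntimeException e) {
                // Retried regardless of the exception type (no interval in this sketch).
            }
        }
        return false;
    }
}
```

The cap is what keeps a truly fatal exception from being retried forever, and the duplicate check is what makes a retry safe when the first attempt was actually handled but the reply got lost.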

I’ll be happy to contact and talk with your support team. I’m asking here because for me this is a major and fundamental topic that I expect Axon to take care of. I thought I was missing something… In any case, I love Axon very much regardless of all the problems I face.

Best regards

Steven_van_Beelen · August 10, 2023, 1:21pm

Thanks for the kind words, @paco12.
Know that the team here appreciates this kind of praise a lot.

Now, back to your scenario:

Although I agree exceptions like this can be clearly transient, as stated in my earlier post, Axon Framework cannot differentiate between those at this stage. It would require the CommandGateway, the RetryScheduler, or the command handling logic that wraps exceptions in a CommandExecutionException to have a mapper for N different types of exceptions, coming from N different dependencies.

As much as we’d like to provide this support out of the box, we need to make a decision about where the Framework’s “work” ends and the user’s work begins. As you may have guessed, we decided not to enter the complete space of possible transient exceptions.

Hence, the mapping of (your application’s) exceptions lies with you and the team.
It is this mapping where my original post comes in.

You can (1) add an @ExceptionHandler to your Saga to capture any Exception or (2) add a MessageHandlerInterceptor that captures any exception from your Saga.
In either approach, you would catch all issues from within your Saga, providing you a platform to map the predicament to something useful for the dispatching side.

You could, for example, have this mapping function make the decision for you whether the exception is transient, yes/no, and add this boolean as the details of the CommandExecutionException. By doing so, your RetryScheduler will be able to check whether it is dealing with a transient CommandExecutionException or not.
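
To make the mapping idea a bit more concrete, here is a minimal sketch of such a function in plain Java. TransientFailureMapper is a hypothetical name, and the list of exception types is only an assumption about what counts as transient in your application. The resulting boolean is what you would then pass along as the details of the CommandExecutionException.

```java
import java.net.ConnectException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

// Hypothetical mapper for the handling side; not part of Axon Framework.
final class TransientFailureMapper {

    // Assumed set of retry-worthy exception types; adjust to your own dependencies.
    private static final Class<?>[] TRANSIENT_TYPES = {
            SQLTransientException.class,   // includes SQLTransientConnectionException
            SQLRecoverableException.class,
            ConnectException.class
    };

    // Upper bound on the cause chain walk, to guard against cyclic causes.
    private static final int MAX_DEPTH = 20;

    private TransientFailureMapper() {
    }

    // Returns true if the failure or any of its causes is one of the assumed transient types.
    static boolean isTransient(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            for (Class<?> type : TRANSIENT_TYPES) {
                if (type.isInstance(current)) {
                    return true;
                }
            }
            current = current.getCause();
        }
        return false;
    }
}
```

Because this runs where the original exception is still available, it can inspect the real types and cause chain, which is exactly the information the dispatching side no longer has.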

Let me know whether this sounds like a feasible way forward, @paco12! I am here to ensure the journey throughout Axon Framework will be a pleasant one.
And, again, if you want an answer sooner than (in this case) 20 days, be sure to reach out.