Sagas and Commands failing with exceptions

Dominic_Heutelbeck · October 19, 2016, 2:25pm

Hello.

I was wondering if anybody has an idea on how to trigger a Saga if the command it previously dispatched fails with an exception.

Is there a way to do that? Or would one rather use a command gateway in the saga and wait for the result before proceeding into the waiting state? This appears to somewhat defeat the purpose of the saga.

Thank you.

Dominic

Dominic_Heutelbeck · October 20, 2016, 10:00pm

So overall, I concluded that it is usually a bad idea to have command handlers that may result in exceptions (due to the Saga problem stated above). So I try to make matching failure events available where possible.

However, there is one scenario where I do not know how to avoid the exception: the edge case where we end up with a version conflict in the event stream, due to a race condition or some buggy code that creates aggregates with the same id.

I cannot catch that exception within a CommandHandler, or can I? This mechanism is executed after I exit my command handler.

How do I catch an EventStoreException to be able to generate an event which would allow me to end a Saga which is waiting for the conclusion of the command to take the next step? Currently I may end up with a bunch of “Zombie Sagas” in this case.

I am really curious here.

Thank you.

Dominic

Steven_Grimm · October 20, 2016, 10:14pm

One way to solve this is with a command handler callback whose onFailure() method publishes an event. The saga can have a handler for the event and do whatever it needs to do to clean up after the failure. I use that exact approach in my application in cases where a command handler could throw an exception. You will probably want to start a UnitOfWork explicitly in your callback.
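To make the callback idea concrete, here is a minimal plain-Java sketch. It deliberately does not use Axon's own types: Callback, RecordingEventBus, CommandFailedEvent, dispatch and publishOnFailure are hypothetical stand-ins invented for illustration, and dispatch runs the handler synchronously only to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the idea; none of these types are Axon classes.
class FailureCallbackSketch {

    /** Hypothetical callback contract, loosely modelled on a command callback. */
    interface Callback {
        void onSuccess(Object result);
        void onFailure(Throwable cause);
    }

    /** Hypothetical event that the dispatching saga could subscribe to. */
    static final class CommandFailedEvent {
        final String sagaId;
        final String reason;

        CommandFailedEvent(String sagaId, String reason) {
            this.sagaId = sagaId;
            this.reason = reason;
        }
    }

    /** Very small stand-in for an event bus: it only records published events. */
    static final class RecordingEventBus {
        final List<Object> published = new ArrayList<>();

        void publish(Object event) {
            published.add(event);
        }
    }

    /** Stand-in for a command bus: runs the handler and reports the outcome to the callback. */
    static void dispatch(Runnable handler, Callback callback) {
        try {
            handler.run();
            callback.onSuccess(null);
        } catch (RuntimeException e) {
            callback.onFailure(e);
        }
    }

    /** Builds a callback that turns a failure into an event addressed to one saga. */
    static Callback publishOnFailure(String sagaId, RecordingEventBus bus) {
        return new Callback() {
            @Override
            public void onSuccess(Object result) {
                // Nothing to do: the saga learns about success through regular events.
            }

            @Override
            public void onFailure(Throwable cause) {
                bus.publish(new CommandFailedEvent(sagaId, cause.getMessage()));
            }
        };
    }

    public static void main(String[] args) {
        RecordingEventBus bus = new RecordingEventBus();
        dispatch(() -> { throw new IllegalStateException("version conflict"); },
                 publishOnFailure("saga-1", bus));
        CommandFailedEvent event = (CommandFailedEvent) bus.published.get(0);
        System.out.println(event.sagaId + " <- " + event.reason);
    }
}
```

The callback only captures the saga identifier, not the saga instance, so the failure travels back to the saga as an ordinary event.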

-Steve

Dominic_Heutelbeck · October 20, 2016, 11:27pm

Hello, Steve.

Ok, I see the point. Let me just theorize …

What I understand is that you are adding the callback to each dispatch of the command, in the class which is triggering the command.
So you need to put knowledge about the inner workings of the aggregate into its client to make sure to get an event back. How do you specify the callback?
As an anonymous class or a lambda within the Saga itself?

This will solve the issue in specific situations where you know that an exception may occur. And this probably makes sense when there is a very limited number of failure cases.

What I am currently doing is following a convention to never throw exceptions, but to always generate a command-specific FailedEvent if something goes wrong (published directly to the eventBus if aggregate construction fails, or applied to the aggregate for any command during the aggregate's lifetime).

But in theory every single command may be subject to concurrency issues, and to be 100% sure that I will never create “zombie sagas” I would have to register a callback for every single command (because I cannot detect such issues within the handler). Also, I would have to define one callback per command to be able to generate a command-specific FailureEvent (or have some other logic for that).

That does not sound feasible.

I just have the feeling, that I am missing some crucial point of the whole picture, when my goal is to be absolutely sure that any saga will cleanly terminate whatever goes wrong, or whatever load the system is under (raising EventStore version issue probability).

Why would you want to start a UnitOfWork, when all you do in the callback is emitting an event?

Thanks, Steve.

Best regards,
Dominic

Rene_de_Waele1 · October 21, 2016, 3:08am

Hi Dominic,

There appear to be three situations you’re trying to cover:

1. Your command fails because your aggregate does not permit it (anymore)
2. Your command fails because of ‘buggy code’ in the aggregate
3. Your command fails because of a version conflict (aggregate needs to be reloaded or command should be routed to a different machine)

The first situation is ‘expected’ behavior. The aggregate may throw an explicit (checked) exception, or sometimes publish an event. It’s rare that these situations need to be handled explicitly since this mostly occurs when the aggregate state has very recently been changed by another command and these changes have not yet reached the saga. Once the saga has been synced (assuming the saga still exists) then it will most likely not matter that the saga command failed.

If the saga was supposed to end after sending the command (i.e. it’s not interested in other changes from the aggregate) you don’t need to end up with ‘zombie sagas’. If you use an asynchronous command bus and send the command without waiting for a result, the saga will be safely shut down ‘no matter what’:

@SagaEventHandler(…)
@EndSaga
void on(TriggerEvent e) {
    // fire and forget: the saga ends whether or not the command succeeds
    commandGateway.send(new SomeCommand());
}

Only if the saga is expressly interested in the reason for the command failure is it useful to handle the exception explicitly (either through sendAndWait, a dedicated event from the aggregate or a command callback). Again, this should be rare and is possibly a sign of a design issue in your application.

In the second situation (buggy code) it’s preferable to simply let the command fail, i.e. don’t handle this situation explicitly. If the saga was supposed to end after sending the command then that should still happen (again by using an async command bus). You can usually recover from this situation by retrying the command after you have fixed the bug. It therefore pays to always log the contents of the failed command so it can be easily retried later on. If the saga is still active it will continue like nothing happened.

The last situation is the only one where the exception is ‘transient’ and so it may be that the application can recover by sending the same command again (after a few seconds). Simply register a command retry scheduler on the command gateway to automatically recover from these situations. If the command keeps failing you’re probably dealing with (severe) infrastructural issues. Like in situation 2 you may then need to manually retry sending the command after resolving the issues.
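The retry behaviour for transient failures can be pictured with a small plain-Java sketch. This is not Axon's retry scheduler: sendWithRetry, TransientException, the attempt limit and the pause length are all assumptions made for illustration. The point it shows is that only failures classified as transient are retried, while anything else (for example a bug) propagates immediately.

```java
import java.util.function.Supplier;

// Plain-Java illustration of "retry only transient failures"; not an Axon API.
class RetrySketch {

    /** Hypothetical marker for failures worth retrying, e.g. a version conflict. */
    static class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /**
     * Runs the command up to maxAttempts times, pausing between attempts.
     * Only TransientException triggers a retry; any other exception propagates at once.
     */
    static <R> R sendWithRetry(Supplier<R> command, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TransientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return command.get();
            } catch (TransientException e) {
                last = e;
                if (attempt < maxAttempts) {
                    pause(pauseMillis);
                }
            }
        }
        // Every attempt failed transiently: give up and report the last failure.
        throw last;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = sendWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientException("version conflict");
            }
            return "handled";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

If the attempts are exhausted the last exception still surfaces, which is where logging the failed command for a later manual retry comes in.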

In summary, no: you will hardly ever need to handle exceptions explicitly. You mentioned that you avoid throwing exceptions from command handlers. That view is wrong: throwing an exception is nearly always the right thing to do (even if you publish ‘failure events’ as well). The dispatcher of the command either doesn’t care if the command fails (again: therefore ‘always’ use an async command bus) or explicitly cares about the command result (including exceptions). In both cases an exception works better. In no case do you end up with ‘zombie sagas’.

Best,
Rene

Dominic_Heutelbeck · October 21, 2016, 10:38am

Hello, René.

Thank you for your insightful answer. This was very helpful.

The cases of “buggy code” and “retry” are 100% clear. The only question is where a good place would be to hook in a generic logging component for failed Commands.

The fact that you treat exception/failure as expected behaviour has some clear implications for the approach to validation. It is expected that validation (e.g., some value object contains user input and must be sanitized) will fail. This means that validation of this kind of constraint should happen outside of the command handlers. Doing it with interceptors would only lead to the same issue for the dispatching component. Thus my conclusion is that validation of user input has to be done before or at command construction time.

This is, in my mind at least, connected with the decision of how to design command objects and events. On the one hand, one should avoid leaking domain logic into objects which turn into messages (commands/events). On the other hand, it is most effective to have the validation in the construction of the commands.

In all the examples I have seen, some go for 100% raw Java primitives in the Commands/Events, some use value objects. The value objects are where I tend to put the simple user input validation logic. Thus, let’s take a simple example:

a) RenameCommand(AggregateId id, Name newName)

vs.

b) RenameCommand(String id, String newName)

a) would have the “is name a string without HTML and script tags” logic in the class Name. Thus, once the command is dispatched, there will be no expected behaviour in the form of a “the command failed because the name was invalid” exception/event.

b) would need to either validate name in an interceptor or in the command handler. This would make the failure (exception/event) a necessary expected behaviour. And this is now obviously a bad idea.

So a) would be preferred. The Axon examples are a bit inconsistent in how this is handled, which creates doubts about what good practice would be. (But I understand that “it all depends”.) I would prefer it if the framework were a bit more opinionated here.
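A minimal sketch of variant a) in plain Java could look as follows. The 64-character limit and the crude angle-bracket check are assumptions standing in for whatever structural rules apply, and the class names simply mirror the ones used in the example above. The point is that an invalid name can never make it into a RenameCommand, so the handler never sees it.

```java
import java.util.Objects;

// Illustrative only: the length limit and the markup check are assumptions.
class ValueObjectCommandSketch {

    /** Value object that validates its own structure once, at construction time. */
    static final class Name {
        private final String value;

        Name(String value) {
            Objects.requireNonNull(value, "name must not be null");
            String trimmed = value.trim();
            if (trimmed.isEmpty() || trimmed.length() > 64) {
                throw new IllegalArgumentException("name must have 1 to 64 characters");
            }
            // Crude stand-in for "no HTML or script tags"; real sanitising needs more care.
            if (trimmed.indexOf('<') >= 0 || trimmed.indexOf('>') >= 0) {
                throw new IllegalArgumentException("name must not contain markup");
            }
            this.value = trimmed;
        }

        String value() {
            return value;
        }
    }

    static final class AggregateId {
        private final String value;

        AggregateId(String value) {
            this.value = Objects.requireNonNull(value, "id must not be null");
        }

        String value() {
            return value;
        }
    }

    /** Variant a): the command can only be built from already-valid parts. */
    static final class RenameCommand {
        final AggregateId id;
        final Name newName;

        RenameCommand(AggregateId id, Name newName) {
            this.id = Objects.requireNonNull(id, "id must not be null");
            this.newName = Objects.requireNonNull(newName, "newName must not be null");
        }
    }

    public static void main(String[] args) {
        RenameCommand ok = new RenameCommand(new AggregateId("42"), new Name("  Alice "));
        System.out.println(ok.newName.value());
        try {
            new Name("<script>alert(1)</script>");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected before dispatch: " + e.getMessage());
        }
    }
}
```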

Now an example for where failure is expected is ChangeUsernameCommand where some index for claimed usernames is consulted and a Failure Event is created if the username is previously claimed. Now you have the following choices:

a) Throw an exception

b) Create a failure event (rather than the UsernameChangedEvent)

c) Return a FAILED as return value of the command handler

d) Any combination of the above.

Something that did not work for me was to publish an event and then throw an exception, because the exception causes the unit of work to fail and the event is never propagated. Maybe with some manual UoW work this could be done.

I currently have two use-cases in mind where this “expected behaviour failure” would be of interest:

In a REST API or in a UI.
In a Saga

In the case of a REST API I would actually prefer a) or c). I can either handle the exception or the return value via a command gateway, and return the matching HTTP status code and a matching error document.

In the case of the Saga, I need to be able to continue when the command succeeds or fails. This is expected behaviour of the domain model, and to me this feels like it should be event-based and not exception-based. Using the command gateway to send and wait feels wrong to me in the saga. I do not want any step in a saga to “wait”. It is: get event … process … act … exit … potentially persisted … reload on next event. I do not want to wait for anything in an event handler. Thus, having the command generate events for failures for expected behaviour sounds most natural to me. Maybe I am making false base assumptions here.

So to support both use-cases, expected failures would either do “exception and event” or “return value and event”. The first of the two is problematic due to the unit of work and exceptions issue. Thus I would tend to do “return value and event”

Conclusion:

Try not to create failures due to simple validation. Validate before dispatching commands if possible.
Minimize the number of expected failures.
If you have one … do “return value + event”.

Does that make sense?

Thanks again.

Best regards,

Dominic

Rene_de_Waele1 · October 21, 2016, 10:03pm

Hi Dominic,

I’ll go through your email bit by bit below as there’s a lot to unpack ;).

The cases of “buggy code” and “retry” are 100% clear. The only question is where a good place would be to hook in a generic logging component for failed Commands.

Perhaps the easiest way would be to register a custom CommandCallback at the CommandGatewayFactory that logs each failed command (using e.g. XStream) as well as the stack trace of the exception. Ideally also log some context like the aggregate identifier and sequence number of the aggregate at the time of invalidation. In Axon 2 you can also use an AuditLogger for this (if you’re on Axon 3 things are even easier – in that case let me know).

The fact that you treat exception/failure as expected behaviour has some clear implications for the approach to validation. It is expected that validation (e.g., some value object contains user input and must be sanitized) will fail. This means that validation of this kind of constraint should happen outside of the command handlers. Doing it with interceptors would only lead to the same issue for the dispatching component. Thus my conclusion is that validation of user input has to be done before or at command construction time.

Ideally the aggregate does not need to concern itself with structural validation of commands. It should only make sure that a given command is valid given the state of the aggregate.

I therefore do not consider a structurally faulty command to be a case where a command handler exception is to be expected. I’m not against structural validation using a dispatch interceptor (i.e. after command construction). Consider the following:

Object command = new RenameCommand(…);
commandGateway.send(command);

If the command is structurally invalid (e.g. the newName is too long) I don’t care if an exception is raised in line 1 or line 2 as long as the exception is raised before the command is handled by the aggregate. A dispatch interceptor will throw its exception before the command is dispatched to the command handler, i.e. in the thread that sends the command, so this would work as well.
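The ordering argument can be illustrated with a toy gateway in plain Java. Gateway, registerDispatchInterceptor and validateRename are hypothetical names for this sketch and not Axon's interceptor API, and the 64-character limit is an assumption. What it demonstrates is that the interceptor runs in the caller's thread and throws before the handler is ever invoked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

// Toy model of dispatch-time validation; not Axon's interceptor API.
class DispatchInterceptorSketch {

    static final class RenameCommand {
        final String id;
        final String newName;

        RenameCommand(String id, String newName) {
            this.id = id;
            this.newName = newName;
        }
    }

    /** Minimal gateway: interceptors run in the sender's thread before the handler sees the command. */
    static final class Gateway {
        private final List<UnaryOperator<Object>> interceptors = new ArrayList<>();
        private final Consumer<Object> handler;

        Gateway(Consumer<Object> handler) {
            this.handler = handler;
        }

        void registerDispatchInterceptor(UnaryOperator<Object> interceptor) {
            interceptors.add(interceptor);
        }

        void send(Object command) {
            Object current = command;
            for (UnaryOperator<Object> interceptor : interceptors) {
                current = interceptor.apply(current); // may throw: handler is then never called
            }
            handler.accept(current);
        }
    }

    /** Structural validation only; state-dependent checks stay in the aggregate. */
    static Object validateRename(Object command) {
        if (command instanceof RenameCommand) {
            RenameCommand rename = (RenameCommand) command;
            if (rename.newName == null || rename.newName.isEmpty() || rename.newName.length() > 64) {
                throw new IllegalArgumentException("newName must have 1 to 64 characters");
            }
        }
        return command;
    }

    public static void main(String[] args) {
        List<Object> handled = new ArrayList<>();
        Gateway gateway = new Gateway(handled::add);
        gateway.registerDispatchInterceptor(DispatchInterceptorSketch::validateRename);

        gateway.send(new RenameCommand("42", "Alice"));
        try {
            gateway.send(new RenameCommand("42", ""));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected in sending thread: " + e.getMessage());
        }
        System.out.println("commands that reached the handler: " + handled.size());
    }
}
```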

a) RenameCommand(AggregateId id, Name newName)

vs.

b) RenameCommand(String id, String newName)

I also agree that a) is preferred, but mostly because it leads to more readable code and less chance to mix up the order of constructor parameters.

Something that did not work for me was to publish an event and then throw an exception, because the exception causes the unit of work to fail and the event is never propagated. Maybe with some manual UoW work this could be done.

By default Axon does not roll back the UoW if the exception is a checked exception.
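The effect of that rule can be modelled in a few lines of plain Java. This is only a model of the behaviour described here, not Axon's unit of work: execute, Handler, Outcome and rollbackOn are hypothetical names, and "publishing" is reduced to returning the staged list.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the rollback rule; these are not Axon classes.
class RollbackRuleSketch {

    /** Checked exception: an expected, domain-level rejection. */
    static class UsernameTakenException extends Exception {
        UsernameTakenException(String message) {
            super(message);
        }
    }

    /** A command handler that stages events and may throw. */
    interface Handler {
        void handle(List<Object> stagedEvents) throws Exception;
    }

    /** What the caller observes: the events that got published and the failure, if any. */
    static final class Outcome {
        final List<Object> publishedEvents;
        final Throwable failure; // null when the handler succeeded

        Outcome(List<Object> publishedEvents, Throwable failure) {
            this.publishedEvents = publishedEvents;
            this.failure = failure;
        }
    }

    /** The rule under discussion: only unchecked exceptions and errors roll back. */
    static boolean rollbackOn(Throwable t) {
        return t instanceof RuntimeException || t instanceof Error;
    }

    static Outcome execute(Handler handler) {
        List<Object> staged = new ArrayList<>();
        try {
            handler.handle(staged);
            return new Outcome(staged, null);
        } catch (Throwable t) {
            // On rollback the staged events are discarded; otherwise they are still published.
            List<Object> published = rollbackOn(t) ? new ArrayList<Object>() : staged;
            return new Outcome(published, t);
        }
    }

    public static void main(String[] args) {
        Outcome checked = execute(events -> {
            events.add("UsernameChangeRejected");
            throw new UsernameTakenException("alice");
        });
        Outcome unchecked = execute(events -> {
            events.add("UsernameChangeRejected");
            throw new IllegalArgumentException("alice");
        });
        System.out.println("checked:   " + checked.publishedEvents);
        System.out.println("unchecked: " + unchecked.publishedEvents);
    }
}
```

Under this rule, "publish an event and then throw" only keeps the event when the thrown exception is a checked one.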

In the case of the Saga, I need to be able to continue when the command succeeds or fails. This is expected behaviour of the domain model, and to me this feels like it should be event-based and not exception-based. Using the command gateway to send and wait feels wrong to me in the saga. I do not want any step in a saga to “wait”. It is: get event … process … act … exit … potentially persisted … reload on next event. I do not want to wait for anything in an event handler. Thus, having the command generate events for failures for expected behaviour sounds most natural to me. Maybe I am making false base assumptions here.

Once the command has been structurally validated (outside the aggregate) I don’t think this situation is very common as I wrote in my last email. However, in those rare cases that the saga needs to know why its command was rejected it may seem preferable to use a failure event instead of an exception. What I don’t like about failure events from the aggregate however, is that the aggregate is then forced to publish events it doesn’t care about. In fact the only component that cares is the dispatching saga. An exception seems cleaner, especially as these are such rare cases. If you don’t want to block the saga I liked Steve’s suggestion to dispatch the command and collect the result in a CommandCallback. If the command fails publish an event meant for the saga (or load the saga and invoke a recovery method on the saga). If you’re using Axon 2 publish the event using the EventTemplate.

So to support both use-cases, expected failures would either do “exception and event” or “return value and event”. The first of the two is problematic due to the unit of work and exceptions issue. Thus I would tend to do “return value and event”

As we’ve seen, simply throwing an exception covers all cases. Failure events are not required at the aggregate level.

Conclusion:

Try not to create failures due to simple validation. Validate before dispatching commands if possible.

minimize the amount of expected failures

if you have one … do “return value + event”

So I totally agree, except for the last point. An exception is usually (if not always) better.

Now an example for where failure is expected is ChangeUsernameCommand where some index for claimed usernames is consulted and a Failure Event is created if the username is previously claimed.

By the way, this specific example may be one where the aggregate should not validate at all. Before dispatching the command check if the username is taken, if so throw an exception before dispatching the command to the handler. If two users still select the same username (simultaneously) the application should be able to roll back the second command. A unique index on the username column in the query side may be enough to initiate the rollback. The advantage of this approach is that the User entity handling the command is unaware of other User entities, which is probably for the best.

Hope this helped and didn’t make things more confusing :).

Best,
Rene

Dominic_Heutelbeck · October 22, 2016, 2:26pm

Hello, René.

Thank you very much, this is not making things more confusing. This is clearing up a lot of details. If you allow, I would like to continue here. I hope this discussion is relevant to other new Axon users as well.

Perhaps the easiest way would be to register a custom CommandCallback at the CommandGatewayFactory that logs each failed command (using e.g. XStream) as well as the stack trace of the exception. Ideally also log some context like the aggregate identifier and sequence number of the aggregate at the time of invalidation. In Axon 2 you can also use an AuditLogger for this (if you’re on Axon 3 things are even easier – in that case let me know).

We are at the early stages of a project I am running with students. So we are not in production, go for the bleeding edge, and use the 3.0 builds as they become available. So basically my questions currently proxy for the 15 students I am taking with me on the Axon adventure.

So yes, please elaborate on the 3.0 solution to this.

Ideally the aggregate does not need to concern itself with structural validation of commands. It should only make sure that a given command is valid given the state of the aggregate.

I completely agree with this view. And I now know why the experiments failed. We did not consider the difference in handling checked vs unchecked exceptions and basically turned structural validation into IllegalArgumentExceptions. Thus the UoWs were rolled back as you describe below. Makes sense now.

I therefore do not consider a structurally faulty command to be a case where a command handler exception is to be expected. I’m not against structural validation using a dispatch interceptor (i.e. after command construction). Consider the following:

Object command = new RenameCommand(…);

commandGateway.send(command);

If the command is structurally invalid (e.g. the newName is too long) I don’t care if an exception is raised in line 1 or line 2 as long as the exception is raised before the command is handled by the aggregate. A dispatch interceptor will throw its exception before the command is dispatched to the command handler, i.e. in the thread that sends the command, so this would work as well.

Very good. My “not taking into consideration checked vs unchecked” made me worry about a few later steps I had planned with regard to authorization. Now that I know that I can cleanly handle these cases with custom “AccessDeniedExceptions” from an authorization command interceptor I am feeling much better.

a) RenameCommand(AggregateId id, Name newName)

vs.

b) RenameCommand(String id, String newName)

I also agree that a) is preferred, but mostly because it leads to more readable code and less chance to mix up the order of constructor parameters.

Yes. I always felt this way too. It is also much more in line with DDD principles, making the ubiquitous language the way to express commands, instead of primitives where you have to translate the ubiquitous language into technical concepts instead of using it directly.

There were two factors that led me to try out “let’s see what happens if we use primitives only”:

Recent Axon code examples were only using primitives.
I went through some online discussions on Value objects, Events, Commands, CQRS. And there is a (from my point of view correct) concern that using Value objects may result in leaking domain logic into the commands/events.

Based on these recent experiences I would argue that you should use domain-specific value objects in Commands/Events, as they make code easier to read and less error-prone. BUT you should take care not to use objects that are more than just dumb data containers with simple structural validation, so as not to leak domain logic into the events/commands zipping around the buses and into the event store.

By default Axon does not roll back the UoW if the exception is a checked exception.

facepalm

Once the command has been structurally validated (outside the aggregate) I don’t think this situation is very common as I wrote in my last email. However, in those rare cases that the saga needs to know why its command was rejected it may seem preferable to use a failure event instead of an exception. What I don’t like about failure events from the aggregate however, is that the aggregate is then forced to publish events it doesn’t care about.

I have to agree. It felt bad to define all these new events to be able to disambiguate between the different failures. It clogs up the Aggregate API and the command handlers, and hides the ubiquitous language which is actually to be expressed by the commands.

In fact the only component that cares is the dispatching saga. An exception seems cleaner, especially as these are such rare cases. If you don’t want to block the saga I liked Steve’s suggestion to dispatch the command and collect the result in a CommandCallback. If the command fails publish an event meant for the saga (or load the saga and invoke a recovery method on the saga). If you’re using Axon 2 publish the event using the EventTemplate.

What I am doing now is injecting the EventBus into the respective handler and dispatching the failure event there. And I think the notion that “only the one Saga cares about this” is actually a good indicator for using an exception in this case. Only use an event if there is actually another component which is interested.
The reason why I feel a bit uneasy about the callback is that I do not yet have a 100% grip on the saga behaviour in this case. I was wondering whether it may be that I cannot make any assumption about when the command will be handled. And I also do not yet understand 100% when a Saga is put at rest to be persisted. So I was asking myself what happens if I register an anonymous callback class and for some reason the infrastructure decides to persist the saga. My assumption was that the way to be sure that the saga actually survives odd situations may be to always use events.

As we’ve seen, simply throwing an exception covers all cases. Failure events are not required at the aggregate level.

Will probably make the code much cleaner. The event dispatching also created a lot of boilerplate in the command handlers.

Now an example for where failure is expected is ChangeUsernameCommand where some index for claimed usernames is consulted and a Failure Event is created if the username is previously claimed.

By the way, this specific example may be one where the aggregate should not validate at all. Before dispatching the command check if the username is taken, if so throw an exception before dispatching the command to the handler. If two users still select the same username (simultaneously) the application should be able to roll back the second command. A unique index on the username column in the query side may be enough to initiate the rollback. The advantage of this approach is that the User entity handling the command is unaware of other User entities, which is probably for the best.

For simple applications the approach you describe is sufficient. I am actually working on a domain model where authentication/authorization is the core bounded context and not a simple generic support context. So for me this has to be 100% tight.

After a lot of experimentation I came to the following conclusions:

At no point may any query model change a value object which has a “must be unique across a set of aggregates” constraint (such as the username). In the case of a security domain, I cannot consider the attempt to change it to an existing one “a rare exception”. I consider this a deliberate attack scenario. Any moment where one of the query models becomes inconsistent with regard to such constraints may be a vulnerability. It is just fine to have eventual consistency overall, and that it takes some time for updates to propagate.
Example: I may have two query models for different views in the application where I have a column with “username”. While it is possible to enforce the “unique” constraint on each of the models, it becomes at least a little complex to reason about potential race conditions, and the aggregate itself may be in an inconsistent state with regard to username uniqueness until the view models have detected the violation upon update. Also, you suddenly have to deal with the rollback. And in an attack scenario this will most likely not be a rare occasion, but a forced occurrence.
I think: the aggregate may only change when the uniqueness is guaranteed. There may never be inconsistent change events that propagate to the query models.

Solution:

Introduce an additional simple index in the domain model itself.
The command handlers for creating the aggregate and for changing the value with the uniqueness constraint reside in a service, not within the aggregate, in order to have access to the index.
The command handler tries to claim the value before instantiating the aggregate or calling the “changeValue” method of the aggregate.
Result: The aggregate will only ever create a “UsernameChangedEvent” when there is no conflict.
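The claim-before-change steps can be sketched in plain Java as follows. UsernameIndex, User, handleChangeUsername and UsernameTakenException are hypothetical names for this illustration, and the in-memory map is an assumption standing in for a durable index. The sketch relies on ConcurrentHashMap.putIfAbsent being atomic, so two concurrent claims for the same name cannot both succeed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustration of "claim the unique value first, then change the aggregate".
class UsernameIndexSketch {

    /** In-memory stand-in for the index; a real one would have to be durable. */
    static final class UsernameIndex {
        private final ConcurrentMap<String, String> ownerByUsername = new ConcurrentHashMap<>();

        /** Atomically claims the username; true if userId owns it afterwards. */
        boolean claim(String username, String userId) {
            String existing = ownerByUsername.putIfAbsent(username, userId);
            return existing == null || existing.equals(userId);
        }

        /** Releases the username, but only if userId is still its owner. */
        void release(String username, String userId) {
            ownerByUsername.remove(username, userId);
        }
    }

    /** Heavily simplified aggregate: it is unaware of other users and of the index. */
    static final class User {
        final String id;
        String username;

        User(String id) {
            this.id = id;
        }

        void changeUsername(String newUsername) {
            // In an event-sourced aggregate this is where the change event would be applied.
            this.username = newUsername;
        }
    }

    /** Checked, so that it signals an expected rejection rather than a bug. */
    static class UsernameTakenException extends Exception {
        UsernameTakenException(String username) {
            super("username already claimed: " + username);
        }
    }

    /** Command-handling service: claim first, only then let the aggregate change. */
    static void handleChangeUsername(UsernameIndex index, User user, String newUsername)
            throws UsernameTakenException {
        if (!index.claim(newUsername, user.id)) {
            throw new UsernameTakenException(newUsername);
        }
        String previous = user.username;
        user.changeUsername(newUsername);
        if (previous != null && !previous.equals(newUsername)) {
            index.release(previous, user.id);
        }
    }

    public static void main(String[] args) throws UsernameTakenException {
        UsernameIndex index = new UsernameIndex();
        User alice = new User("u1");
        User bob = new User("u2");
        handleChangeUsername(index, alice, "neo");
        try {
            handleChangeUsername(index, bob, "neo");
        } catch (UsernameTakenException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this arrangement the aggregate never has to emit a change it later needs to take back, which is the property argued for above.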

This does not mean that I do not first check if the username is already claimed before dispatching the command. It is just that this a) is the paranoid implementation, b) keeps the logic within the aggregate clean, c) avoids messy rollback protocols, and d) ensures the query models only ever get consistent updates.

I also consider it bad form to have a critical part of the domain consistency logic rely on query models. It is better to enforce this within the domain logic part of the code, before it propagates to the query models. I think I have actually implemented most of the possible versions of this and tried them. And this is my current stance on the use-case.

Hope this helped and didn’t make things more confusing :).

Again, super helpful. Thanks a lot for taking the time to answer.

Best regards,

Dominic