Question about DistributedCommandBus & SpringIntegrationEventBus deployed on clustered application server

Hi,

In our project we are using the Axon Framework. Everything works as expected on our local dev machines. However, when we deploy to our dev environment, things start to break.
Let me explain why:

We deploy to a clustered JBoss, and the command handling and event handling code is included in our WAR, so the same handlers are deployed twice. We are using the DistributedCommandBus with JGroups and the SpringIntegrationEventBus (on a JMS queue). The clustered JBoss is not our choice; the infrastructure team gave it to us whether we wanted it or not. Negotiations to alter this config will start next week. Unfortunately, that's how it goes in big corporations. So, I would like to know my options in case we have to stick with the clustered config.

At first, one might think the DistributedCommandBus is a decent solution to make sure all commands on the same AggregateRoot (AR) instance are executed on the same JBoss instance. This works as expected with our own dispatching policy. However, what happens when a JBoss instance goes down while handling a command? The user has already been told that the command will be processed, but because the JBoss instance went down, the command is lost: the unit of work could not commit the events to the event store and event bus.
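
To make the routing idea concrete, here is a minimal sketch of a dispatching policy that sends every command for the same aggregate identifier to the same node. It is plain Java and does not use the DistributedCommandBus API; the class name `AggregateRouter`, its methods, and the node names are hypothetical. It only covers routing and does nothing about commands lost when a node dies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical sketch: maps an aggregate identifier to one cluster node. */
final class AggregateRouter {
    private static final int VIRTUAL_POINTS = 64; // points per node on the hash ring
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String nodeName) {
        for (int i = 0; i < VIRTUAL_POINTS; i++) {
            ring.put(hash(nodeName + "#" + i), nodeName);
        }
    }

    void removeNode(String nodeName) {
        ring.values().removeIf(nodeName::equals);
    }

    /** Returns the node that should handle every command for this aggregate. */
    String nodeFor(String aggregateId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes available");
        }
        // First ring point at or after the hash, wrapping around to the start.
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(aggregateId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

With a ring like this, removing one node only reassigns the aggregates that node owned; aggregates owned by the surviving node keep their owner.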

To resolve this, I could create a SpringIntegrationCommandBus (on a persistent JMS queue, with transacted ACK). But in that case I might still be forced to run a single command handler instance (to avoid duplicate deliveries and ordering issues). Or I could create an interceptor that logs each command to a database before it is placed on the command bus (and perhaps cleans it up after the events have been committed). In that case I could keep my two command handler instances, one on each JBoss. But I want to make sure I don't reinvent the wheel here, as I doubt I'm the first one trying to solve this: is there a command bus in the Axon library that already does this and that I missed? Or are you planning such classes in the near future?
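
The command-log idea can be sketched roughly as follows. This is plain Java rather than an Axon interceptor, and the names (`JournalingDispatcher`, `unfinished`, `retryUnfinished`) are hypothetical. The in-memory map stands in for a database table; a real journal would have to be durable and shared by both nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of a command log wrapped around dispatch; not an Axon class. */
final class JournalingDispatcher<C> {
    // Stand-in for a database table; a real journal would have to be durable.
    private final Map<Long, C> pending = new LinkedHashMap<>();
    private final Consumer<C> handler;
    private long nextId = 1;

    JournalingDispatcher(Consumer<C> handler) {
        this.handler = handler;
    }

    /** Logs the command, runs the handler, and clears the entry only on success. */
    synchronized void dispatch(C command) {
        long id = nextId++;
        pending.put(id, command);   // 1. record before handling
        handler.accept(command);    // 2. handle; an exception leaves the entry in place
        pending.remove(id);         // 3. clean up once the work has completed
    }

    /** Commands that were logged but never completed, oldest first. */
    synchronized List<C> unfinished() {
        return new ArrayList<>(pending.values());
    }

    /** Re-runs unfinished commands, e.g. after a restart. Handlers must tolerate replays. */
    synchronized void retryUnfinished() {
        for (Long id : new ArrayList<>(pending.keySet())) {
            handler.accept(pending.get(id));
            pending.remove(id);
        }
    }
}
```

Note the gap between steps 2 and 3: if the node dies after the events are committed but before the log entry is removed, a retry replays a command that already succeeded, so handlers need to be idempotent or the cleanup has to share the transaction with the event commit.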

On the event bus and event listeners, we have similar issues. We're using a persistent JMS queue. Events are dispatched to the SpringIntegrationEventBus. With two JBoss instances we will have two listeners, and we might start getting duplicate entries in our read model. I've been trying to implement the sequencing strategy I found in Jonathan Oliver's blog post (http://blog.jonathanoliver.com/cqrs-out-of-sequence-messages-and-read-models/), but without luck, as I want the dispatched events to be handled by the AsynchronousCluster. I found it very hard to write an adapted UnitOfWork based on InheritableThreadLocal (as the child threads will create their own UOW but have to be part of the overall UOW), and I'm not even sure it can be done at this point, as we don't know exactly how many child UOWs will have to be committed before the outer UOW can be committed.
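
One way to make a read model tolerate duplicate and out-of-order deliveries is to track the last applied sequence number per aggregate. The sketch below illustrates that general idea only; it does not reproduce the blog post's strategy and does not involve the AsynchronousCluster or UnitOfWork. The class name `SequencedProjector` is hypothetical, and it assumes each event carries a per-aggregate sequence number starting at 0 with no gaps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Hypothetical sketch: applies events per aggregate once each, in sequence order. */
final class SequencedProjector<E> {
    // In a cluster these two maps would have to live in the shared read-model database.
    private final Map<String, Long> lastApplied = new HashMap<>();
    private final Map<String, TreeMap<Long, E>> parked = new HashMap<>();
    private final BiConsumer<String, E> readModelUpdater;

    SequencedProjector(BiConsumer<String, E> readModelUpdater) {
        this.readModelUpdater = readModelUpdater;
    }

    /** Assumes sequence numbers start at 0 and increase by one per aggregate. */
    synchronized void onEvent(String aggregateId, long sequence, E event) {
        long expected = lastApplied.getOrDefault(aggregateId, -1L) + 1;
        if (sequence < expected) {
            return; // duplicate delivery: already applied
        }
        TreeMap<Long, E> waiting = parked.computeIfAbsent(aggregateId, id -> new TreeMap<>());
        waiting.put(sequence, event); // park it; a repeated early arrival just overwrites itself
        // Apply everything that is now contiguous with what was applied before.
        while (!waiting.isEmpty() && waiting.firstKey() == expected) {
            readModelUpdater.accept(aggregateId, waiting.pollFirstEntry().getValue());
            lastApplied.put(aggregateId, expected);
            expected++;
        }
    }
}
```

For this to hold across two JBoss instances, the sequence check and the read-model update would need to happen in the same database transaction, for example by storing the last applied sequence number in the read-model row itself.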

And somehow I have the feeling I'm making everything too difficult in trying to solve this JBoss clustering issue. So I'm wondering: aren't there any simpler solutions out there?

Any feedback would be much appreciated.

Kr,
Steven

Hi Steven,

My team has put off clustering since we've been wondering about the same issues. We also have the same issues of corporations and standardized deployments, but fortunately being forced to cluster wasn't one of them :) We'll have to deal with this in the future though, as we want HA/DR.

We've been experimenting with your idea of using an interceptor to write commands to a log (and delete the log entry on tx commit), with some success. We use this for AsynchronousCommandBus -- it's the same problem whether clustering or merely async, on the command side at least. If something goes wrong, we can "retry" the command later. (So far the retry is not automatic.) Part of the problem is the assumption that the server can just "go down" at any moment. When shutting down a server, can you bleed off users and flush in-progress commands? If so, then maybe a "crash" becomes something you can detect and use to trigger a retry from the command log.

If you've already informed your user their command "will" be processed, well, is that part of the problem too? Commands are asynchronous, yes, but can you wait for a command success callback before updating the UI? If so, and if you can mostly rely on shutting down cleanly, then maybe the whole issue is moot. Worth asking yourself.
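
As a rough illustration of waiting for the outcome before telling the user anything, here is a sketch using only `java.util.concurrent`. It is not Axon's callback API; `ConfirmedDispatch`, `send`, and `submitAndDescribe` are hypothetical names, and the handler is just a function run on another thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical sketch: the UI message is chosen only after the command outcome is known. */
final class ConfirmedDispatch {

    /** Runs the handler on another thread and reports the outcome through a future. */
    static <C, R> CompletableFuture<R> send(C command, Function<C, R> handler,
                                            ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> handler.apply(command), executor);
    }

    /** Waits for success, failure, or timeout, then picks the text to show the user. */
    static <C, R> String submitAndDescribe(C command, Function<C, R> handler,
                                           ExecutorService executor, long timeoutMillis) {
        try {
            return send(command, handler, executor)
                    .thenApply(result -> "Saved")
                    .exceptionally(error -> "Not saved, please retry")
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "Still processing"; // outcome unknown: do not claim success
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "Still processing";
        } catch (ExecutionException e) {
            return "Not saved, please retry";
        }
    }
}
```

The point of the third state is that a timeout is not reported as success, so a node that dies mid-command never leaves the user believing the change was saved.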

As for your questions about duplicate event handling, I'm not sure I understand how you could handle an event twice if you're using a JMS queue. A queue delivers each message to only one consumer. That said, why do you need to cluster event handling at all? It seems like there are two ways you can design it -- and yours is not one of them. (Granted, I have not personally tried either of these yet, but my team has put some thought into it.)
1) Use the JGroups command bus (as you are), and since you have the same WAR deployed everywhere, do NOT distribute event handling using Spring Integration. Events will be handled on the same instance that handled the command, which is also a good thing for caching, locking etc., since the command bus will already route commands for the same aggregate to the same instance. (Distributing events over a JMS queue will kill your caching and expose you to out-of-order delivery and other race conditions, as each event will be delivered to a random cluster member.) This also reduces load on JMS, and avoids the need for transaction synchronization between DB and JMS (are you using XA/2PC? Do you trust it?)
2) Use pub/sub (JMS topics) for event distribution and designate each server to run a mutually exclusive subset of event handlers. For example, one server generates the UI views, and the other generates reports. They're both connected to the same database of course, so either server could serve UI requests. I don't think this makes sense in your case but I hope it illustrates that the reason you'd want to distribute events over JMS doesn't seem to jibe with your scenario.
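
Option 2 can be sketched as follows: every node receives every event from the topic, but only runs the handler groups assigned to it. This is plain Java with no JMS or Axon types; `NodeEventRouter` and the group names "ui-views" and "reports" are hypothetical, and how a node learns its assigned groups (for example from per-server configuration) is left open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch: same handlers deployed everywhere, but each group runs on one node only. */
final class NodeEventRouter<E> {
    private final Set<String> groupsOnThisNode;
    private final Map<String, List<Consumer<E>>> handlersByGroup = new LinkedHashMap<>();

    NodeEventRouter(Set<String> groupsOnThisNode) {
        this.groupsOnThisNode = groupsOnThisNode;
    }

    /** Registers a handler under a group; the same WAR registers all groups on every node. */
    void register(String group, Consumer<E> handler) {
        handlersByGroup.computeIfAbsent(group, g -> new ArrayList<>()).add(handler);
    }

    /** Called for each event received from the topic; inactive groups are skipped. */
    void onTopicMessage(E event) {
        for (Map.Entry<String, List<Consumer<E>>> entry : handlersByGroup.entrySet()) {
            if (groupsOnThisNode.contains(entry.getKey())) {
                for (Consumer<E> handler : entry.getValue()) {
                    handler.accept(event);
                }
            }
        }
    }

    /** Groups registered on this node that are not active here. */
    List<String> inactiveGroups() {
        List<String> inactive = new ArrayList<>(handlersByGroup.keySet());
        inactive.removeAll(groupsOnThisNode);
        return Collections.unmodifiableList(inactive);
    }
}
```

As long as the group assignments across nodes are mutually exclusive and together cover all groups, each handler sees each event exactly once cluster-wide, without any coordination between the nodes at delivery time.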

There's another question, how to distribute Sagas, that neither of us has addressed. Are you using sagas?

Hope that helps. I'm *very* interested to hear if you're successful in clustering!
-Peter