A few things I’ve been bitten by when getting zero-downtime deployment working on my 2.x app in a cluster:
A node can still have commands queued up locally at the moment you set its load factor to 0. If one of those commands operates on a busy aggregate, you can get ConcurrencyExceptions, because another node will start processing commands for that aggregate at the same time. I solved this adequately (there’s still a time window, but it’s very small) by wrapping the local SimpleCommandBus in a helper class that checks a flag, which gets set as soon as we send the JoinMessage with the zero load factor. If the flag is set, the command is bounced back to the DistributedCommandBus for delivery to the node that now owns the aggregate.
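To make the wrapper idea concrete, here is a minimal sketch of it. It is not my actual helper class and it doesn't use Axon's types: `Dispatcher` is a hypothetical one-method stand-in for the dispatch side of a command bus, and `DrainAwareBus` and `markLeaving()` are names invented for the illustration. In a real app the two delegates would be the local SimpleCommandBus and the DistributedCommandBus.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the dispatch side of a command bus (not Axon's interface).
interface Dispatcher {
    void dispatch(Object command);
}

// Wraps the local bus; once the node has announced a zero load factor,
// commands that reach it are handed back to the distributed bus instead.
final class DrainAwareBus implements Dispatcher {
    private final Dispatcher local;        // assumed: the node-local bus
    private final Dispatcher distributed;  // assumed: the cluster-wide bus
    private final AtomicBoolean leaving = new AtomicBoolean(false);

    DrainAwareBus(Dispatcher local, Dispatcher distributed) {
        this.local = local;
        this.distributed = distributed;
    }

    // Call this right after sending the zero-load-factor join message.
    void markLeaving() {
        leaving.set(true);
    }

    @Override
    public void dispatch(Object command) {
        if (leaving.get()) {
            // This node no longer owns any aggregates: bounce the command.
            distributed.dispatch(command);
        } else {
            local.dispatch(command);
        }
    }
}
```

The remaining time window is between sending the message and setting the flag, plus any command that has already passed the check when the flag flips.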
A similar problem can occur when a new node is added to the cluster while there are commands waiting to be executed on an existing node for an aggregate that the new node will take over. I haven’t yet come up with a solution I like for this case (anyone else?). But the outline of a solution might be something like keeping track of the time the new node announced its arrival, and bouncing any commands that were queued before then back to the DistributedCommandBus, as in the first case.
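Purely as a sketch of that untested outline (not something I've run in production): stamp each command when it is queued, remember when the last node announced itself, and re-route anything stamped at or before that moment. Every name here (`CommandSink`, `StampedCommand`, `MembershipAwareExecutor`, `onNodeJoined()`) is hypothetical, and the clock is injected so the behaviour can be checked without real time passing.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical stand-in for "something that accepts a command".
interface CommandSink {
    void accept(Object command);
}

// A command plus the time it was queued on this node.
final class StampedCommand {
    final Object payload;
    final long enqueuedAt;

    StampedCommand(Object payload, long enqueuedAt) {
        this.payload = payload;
        this.enqueuedAt = enqueuedAt;
    }
}

final class MembershipAwareExecutor {
    private final CommandSink localHandler;  // assumed: executes on this node
    private final CommandSink redistribute;  // assumed: hands back to the distributed bus
    private final LongSupplier clock;
    // Long.MIN_VALUE means no node has joined since startup.
    private final AtomicLong lastJoin = new AtomicLong(Long.MIN_VALUE);

    MembershipAwareExecutor(CommandSink localHandler, CommandSink redistribute, LongSupplier clock) {
        this.localHandler = localHandler;
        this.redistribute = redistribute;
        this.clock = clock;
    }

    // Stamp a command as it is queued locally.
    StampedCommand stamp(Object payload) {
        return new StampedCommand(payload, clock.getAsLong());
    }

    // Call when a new node announces its arrival.
    void onNodeJoined() {
        lastJoin.set(clock.getAsLong());
    }

    // Commands queued at or before the last join may now belong elsewhere,
    // so they are re-routed; ties are treated conservatively.
    void execute(StampedCommand command) {
        if (command.enqueuedAt <= lastJoin.get()) {
            redistribute.accept(command.payload);
        } else {
            localHandler.accept(command.payload);
        }
    }
}
```

This re-routes every older command, including ones this node still owns, so it trades an extra hop for not having to work out which aggregates actually moved.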
In a Spring environment, the AnnotationCommandHandlerBeanPostProcessor will try to unsubscribe all the command handlers at shutdown time. If you’re already explicitly sending out a JoinMessage with a zero load factor and an empty set of commands to unsubscribe everything in one go, this can produce harmless but noisy error messages. Luckily, Axon anticipates this, and you can just set that bean’s unsubscribeOnShutdown property to false.
I should add that the first problem has only happened in stress tests where I’ve brought nodes up and down while pounding the application with lots of requests; normally there won’t be a backlog of commands that would cause trouble. But if it can happen in a stress test, it can happen in real life, so it’s worth addressing unless you know your application will never get bursts of usage while you’re adding or removing nodes from your cluster.
-Steve