Zero downtime deployment

Prem_C1 · December 9, 2015, 11:36am

Hello All,

We are working on deployments, i.e. upgrading our applications (not necessarily Axon itself, although that might happen sometime in the future), with minimal (ideally zero) downtime. Axon’s use of JGroups makes this slightly more challenging than for your run-of-the-mill stateless service, where maintaining backwards compatibility with the older, running version of the application suffices.

We are looking at a solution where we push configuration changes (at runtime without restarting) to an enhanced JGroupsConnector such that it is able to selectively route commands to a chosen set of nodes before and during upgrades. This is to enable us to do zero downtime deployments. Are any of you working on similar solutions? Any pitfalls/advice you all have encountered?

Allard · December 12, 2015, 10:57am

Hi,

One of our clients is working on rolling upgrades. I will ask them how they’re doing it as soon as I’m there. My gut feeling would be to expose the “joinCluster” method of the JGroupsConnector through some monitoring & management interface, so that you can change the “load factor” of a node. This way, you can start up new nodes with a load factor of 0 (meaning they won’t receive commands). When a new node is ready, you give it a higher load factor. After that, you start reducing the load factor of the old nodes.

Cheers,

Allard

Olaf_Molenveld · January 5, 2016, 12:26pm

Hi Prem,

We are building Vamp (www.vamp.io), which is a canary testing and releasing framework for Docker and Mesos/Marathon container schedulers. Because Mesos also supports deployables other than Docker containers (like JARs), it might be interesting to investigate whether Vamp can work with Axon and provide zero-downtime deployments (and more). If you are interested in discussing this, please contact me at olaf@magnetic.io

cheers, Olaf

Jorg_Heymans · May 8, 2017, 11:59am

Hi,

We are also looking into zero-downtime deployment. What is the best way to take a node out of the Axon cluster “gracefully”, meaning it finishes all ongoing command executions while no longer accepting new commands? Could this not be accomplished by just closing the underlying channel of the JGroupsConnector, and is that a safe thing to do? It’s not clear to me from the JGroupsConnector how one would adjust the load factor of a running node without disconnecting and reconnecting it.

Thanks,
Jorg

Jorg_Heymans · May 10, 2017, 2:28pm

Well, it turns out that you can just send a new JoinMessage with the new load factor, and the cluster state is updated correctly. That was easier than anticipated.

Jorg

Allard · May 11, 2017, 12:33pm

Hi Jorg,

To disable a node without switching it off (or to switch it on without enabling it yet), you can use a load factor of 0. That means the node can participate in a distributed command bus, but it will not receive any commands. Then increase the load factor on the new node and decrease it on the old one to “migrate” commands to the new node instead.
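
To make the effect of the load factor concrete, here is a minimal sketch of load-factor-weighted consistent hashing. It is an illustration only and not Axon's actual implementation: the class LoadFactorRing, its methods, and the node names are hypothetical, and it assumes that each node gets as many points on a hash ring as its load factor. Under that assumption, a node with load factor 0 owns no points and therefore never receives a command.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Axon.
final class LoadFactorRing {
    // Ring position -> name of the node that owns that position.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    /** Builds a ring in which every node owns as many points as its load factor. */
    LoadFactorRing(Map<String, Integer> loadFactors) {
        for (Map.Entry<String, Integer> node : loadFactors.entrySet()) {
            for (int i = 0; i < node.getValue(); i++) {
                ring.put(hash(node.getKey() + "#" + i), node.getKey());
            }
        }
    }

    /** Returns the node that owns the routing key, or null if no node has a load factor above 0. */
    String nodeFor(String routingKey) {
        if (ring.isEmpty()) {
            return null;
        }
        // First point at or after the key's hash; wrap around to the start of the ring if none.
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(hash(routingKey));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    private static BigInteger hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

With load factors of 100 for the old node and 0 for the new one, every routing key resolves to the old node. Rebuilding the ring after raising the new node's load factor and lowering the old node's moves a growing share of keys to the new node.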

Hope this helps.
Cheers,

Allard

Steven_Grimm · May 13, 2017, 11:18pm

A few things I’ve been bitten by when getting zero-downtime deployment working on my 2.x app in a cluster:

It’s possible for there to be locally queued-up commands on a node at the time you set its load factor to 0. This can potentially lead to ConcurrencyExceptions if one of the commands operates on a busy aggregate, since another node will start processing commands for that aggregate at the same time. I solved this adequately (as in there’s still a time window but it’s very tiny) by wrapping the local SimpleCommandBus in a helper class that looks at a flag that gets set as soon as we send the JoinMessage with the zero load factor. If the flag is set, the command is bounced back to the DistributedCommandBus for delivery to the newly-correct node.
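
A rough sketch of that wrapper idea, for illustration only. The interface CommandDispatcher and the class DrainAwareCommandBus are hypothetical stand-ins rather than Axon types; in a real application the two delegates would be the local SimpleCommandBus and the DistributedCommandBus, whose exact signatures are not reproduced here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for "something that can accept a command".
interface CommandDispatcher {
    void dispatch(Object command);
}

// Hypothetical wrapper around the local bus; not an Axon class.
final class DrainAwareCommandBus implements CommandDispatcher {
    private final CommandDispatcher localBus;
    private final CommandDispatcher distributedBus;
    private final AtomicBoolean draining = new AtomicBoolean(false);

    DrainAwareCommandBus(CommandDispatcher localBus, CommandDispatcher distributedBus) {
        this.localBus = localBus;
        this.distributedBus = distributedBus;
    }

    /** Call this right after the zero load factor has been announced to the cluster. */
    void startDraining() {
        draining.set(true);
    }

    @Override
    public void dispatch(Object command) {
        if (draining.get()) {
            // This node no longer owns any aggregates: hand the command back for re-routing.
            distributedBus.dispatch(command);
        } else {
            localBus.dispatch(command);
        }
    }
}
```

A command that passes the flag check just before startDraining() is called still runs locally, which is the small remaining time window mentioned above.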

A similar problem can occur when a new node is added to the cluster while there are commands waiting to be executed on an existing node for an aggregate that the new node will take over. I haven’t yet come up with a solution I like for this case (anyone else?). But the outline of a solution might be something like keeping track of the time the new node announced its arrival, and taking a similar approach to the above for any commands that arrived before then.
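
One way that outline could look in code, purely as an untested idea and not something taken from a working system. All names here (QueuedCommand, CommandSink, MembershipAwareExecutor) are hypothetical, and the sketch assumes that each locally queued command carries the timestamp at which it was enqueued, taken from the same clock that is used to record membership changes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical: a command plus the time at which it was put on the local queue.
final class QueuedCommand {
    final Object payload;
    final long enqueuedAt;

    QueuedCommand(Object payload, long enqueuedAt) {
        this.payload = payload;
        this.enqueuedAt = enqueuedAt;
    }
}

// Hypothetical stand-in for the local bus or the distributed bus.
interface CommandSink {
    void accept(Object command);
}

// Hypothetical helper; not an Axon class.
final class MembershipAwareExecutor {
    private final CommandSink local;
    private final CommandSink distributed;
    // Timestamp of the most recent node arrival; nothing counts as stale before the first one.
    private final AtomicLong lastNodeJoinedAt = new AtomicLong(Long.MIN_VALUE);

    MembershipAwareExecutor(CommandSink local, CommandSink distributed) {
        this.local = local;
        this.distributed = distributed;
    }

    /** Record the moment a new node announced its arrival. */
    void onNodeJoined(long timestamp) {
        lastNodeJoinedAt.set(timestamp);
    }

    /** Runs a queued command locally unless it was routed before the latest node arrival. */
    void execute(QueuedCommand command) {
        if (command.enqueuedAt < lastNodeJoinedAt.get()) {
            // Routed using the old cluster layout: send it back so it is routed again.
            distributed.accept(command.payload);
        } else {
            local.accept(command.payload);
        }
    }
}
```

If a re-routed command ends up on the same node again, it is enqueued with a fresh timestamp and runs locally the second time, so the cost is one extra hop rather than a loop.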

In a Spring environment, the AnnotationCommandHandlerBeanPostProcessor will try to unsubscribe all the command handlers at shutdown time. If you’re already explicitly sending out a JoinMessage with a zero load factor and an empty set of commands to unsubscribe everything in one go, this can produce harmless but noisy error messages. Luckily, Axon anticipates this, and you can just set that bean’s unsubscribeOnShutdown property to false.

I should add that the first problem has only happened in stress tests where I’ve brought nodes up and down while pounding the application with lots of requests; normally there won’t be a backlog of commands that would cause trouble. But if it can happen in a stress test, it can happen in real life, so it’s worth addressing unless you know your application will never get bursts of usage while you’re adding or removing nodes from your cluster.

-Steve

Jorg_Heymans · May 15, 2017, 7:45am

Thanks for sharing this Steven.

“A few things I’ve been bitten by when getting zero-downtime deployment working on my 2.x app in a cluster:”

Cool, we’re also still on 2.x.

“It’s possible for there to be locally queued-up commands on a node at the time you set its load factor to 0. This can potentially lead to ConcurrencyExceptions if one of the commands operates on a busy aggregate, since another node will start processing commands for that aggregate at the same time. I solved this adequately (as in there’s still a time window but it’s very tiny) by wrapping the local SimpleCommandBus in a helper class that looks at a flag that gets set as soon as we send the JoinMessage with the zero load factor. If the flag is set, the command is bounced back to the DistributedCommandBus for delivery to the newly-correct node.”

Our estimated load looks to be nowhere near what you are anticipating, and we have not experienced this ConcurrencyException yet.

“A similar problem can occur when a new node is added to the cluster while there are commands waiting to be executed on an existing node for an aggregate that the new node will take over. I haven’t yet come up with a solution I like for this case (anyone else?). But the outline of a solution might be something like keeping track of the time the new node announced its arrival, and taking a similar approach to the above for any commands that arrived before then.”

“In a Spring environment, the AnnotationCommandHandlerBeanPostProcessor will try to unsubscribe all the command handlers at shutdown time. If you’re already explicitly sending out a JoinMessage with a zero load factor and an empty set of commands to unsubscribe everything in one go, this can produce harmless but noisy error messages. Luckily, Axon anticipates this, and you can just set that bean’s unsubscribeOnShutdown property to false.”

In my case I just set the load factor to 0. Is it really necessary to also unsubscribe from all the commands then? Semantically it should do the same thing.

Thanks,
Jorg

Steven_Grimm · May 29, 2017, 5:45pm

Yes, that should be equivalent. JoinMessage includes the list of command names as well as the load factor, and in my code it was easier to send an empty list than to figure out the list of commands just so I could keep the command list the same and only set the load factor. But if you have the list of command names handy, there should be no harm in sending it as long as the load factor is 0.
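
To spell out why the two variants come to the same thing, here is a small model of how a cluster member could track these announcements. It is a simplification with hypothetical names (Announcement, ClusterView) and is not the real JoinMessage class. It assumes that a later announcement from a node replaces its earlier one, and that a node is a candidate for a command only if its load factor is above 0 and it listed that command name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the contents of a join announcement.
final class Announcement {
    final int loadFactor;
    final Set<String> commandNames;

    Announcement(int loadFactor, Set<String> commandNames) {
        this.loadFactor = loadFactor;
        this.commandNames = new HashSet<>(commandNames); // defensive copy
    }
}

// Hypothetical view of the cluster as seen by one member; not an Axon class.
final class ClusterView {
    private final Map<String, Announcement> members = new HashMap<>();

    /** A new announcement from a node replaces whatever it announced before. */
    void onJoin(String node, int loadFactor, Set<String> commandNames) {
        members.put(node, new Announcement(loadFactor, commandNames));
    }

    /** A node can receive a command only with a positive load factor and a matching command name. */
    boolean isEligible(String node, String commandName) {
        Announcement announcement = members.get(node);
        return announcement != null
                && announcement.loadFactor > 0
                && announcement.commandNames.contains(commandName);
    }
}
```

In this model, announcing load factor 0 with the full command list and announcing load factor 0 with an empty list both make isEligible return false for every command on that node.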

-Steve

Vilmos_Rajcsanyi · January 10, 2019, 8:27am

Hi,

And how about AxonServer? Is there any support for rolling updates? Something similar to this, such as not letting a projection receive anything other than events until its replay is finished.

Thanks,
Regards

allardbz · January 14, 2019, 2:05pm

Hi,

AxonServer doesn’t have anything specific to support rolling upgrades, yet. We do have some plans to implement it, for example by not routing queries to nodes that are still catching up.

Cheers,

Allard