Running two instances of the same axon server for redundancy

I’m planning on trying to run AxonFramework on Heroku. For high availability I want to run two instances of the same command handlers. How would I go about making sure that they both keep up with the work, but only one set of events comes out in the end?

What I want to accomplish is two command handlers where one can die and be restarted (Heroku dynos are restarted at least once per day to keep them healthy) without the flow of commands being paused.

It will look something like this (--> == over the network):

Client --> Commands over AMQP --> [ Axon Command handler server 1 (JPA EventStore) ] --> Events over AMQP --> View model handler
                                  [ Axon Command handler server 2 (JPA EventStore) ]

Given the same set of commands, two command handlers running the same code should produce the same set of events (not considering external inputs). One thing I’ve been considering is to let them have their own independent event stores. Then I would let them each get a copy of every command from the queue and do their stuff at the same time. On the other side I would need some kind of event deduplicator to only get one set of events for the view handlers.
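The deduplicator part of this idea could be sketched roughly like this — assuming every event carries a globally unique identifier (here a hypothetical `eventId` string), the consuming side could track identifiers it has already seen and forward each event only once:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Sketch of an event deduplicator: forwards each event to the view model
// handlers only the first time its identifier is seen. Assumes every event
// carries a unique identifier.
public class EventDeduplicator {

    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Consumer<String> downstream;

    public EventDeduplicator(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    // Returns true if the event was forwarded, false if it was a duplicate.
    public boolean onEvent(String eventId) {
        if (seen.add(eventId)) {       // add() is atomic: only the first caller wins
            downstream.accept(eventId);
            return true;
        }
        return false;
    }
}
```

In practice the seen-set would have to be bounded or persisted somewhere, or a restarted deduplicator would re-deliver old events.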

Is this a silly thing to do? How would you set this up?

/Per

How about an exclusive connection to the command queue(s)? That way you can have both instances running, but only one of them will succeed in getting a connection to the queue. The other one will keep retrying until it succeeds, at which point the first command handler will have gone down and the second will now own the exclusive connection.
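The standby node’s behaviour could be sketched like this. The attempt itself is left abstract here; with the RabbitMQ Java client it would be a `channel.basicConsume(...)` call with the exclusive flag set, which fails while another node holds the exclusive consumer:

```java
import java.util.concurrent.Callable;

// Sketch of the standby node: keep attempting to open an exclusive consumer
// on the command queue, backing off between attempts. The Callable stands in
// for the actual broker call (e.g. an exclusive basicConsume).
public final class ExclusiveAcquirer {

    public static <T> T acquire(Callable<T> attempt, long retryIntervalMillis, int maxAttempts) {
        Exception lastFailure = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.call();            // succeeded: this node now owns the queue
            } catch (Exception refused) {         // still held by the other node: wait and retry
                lastFailure = refused;
                try {
                    Thread.sleep(retryIntervalMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new IllegalStateException("could not acquire exclusive consumer", lastFailure);
    }
}
```

The retry interval is what bounds the failover time: the standby can only take over on its next attempt after the active node disappears.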

And this would work with a shared EventStore, since only one handler will be accessing the store at a time?

I guess any cached aggregate in a handler that becomes active would be stale and would have to be reloaded from the event store, but if they don’t switch active / passive too often that would be ok. Hopefully it will be faster than waiting for a single handler to come back up again.

Shared event store, yes. And I assumed the only reason for failing over was that one of the nodes was restarted. If you switch back and forth without restarting in between, you are bound to end up in a weird state ; )

The switch is definitely a lot faster than a cold boot (in the order of 10-100ms?), with a slight (depending on your app) performance decrease after the switch before the caches are warmed up.

Thanks for the info. I will try this. This also answered my question of why I should use AMQP (RabbitMQ) instead of something simpler like IronMQ.

You could also consider using a shared database (one provided as a service by Heroku, for example) and use the DistributedCommandBus with the JGroups connector to spread the load between active instances. That way you’ll have two instances using their CPU, instead of one being stand-by all the time.

The DistributedCommandBus is more work to set up, compared to Sebastian’s suggestion.

Note that a failover can take up to 2 seconds. This is the interval at which the AMQP connector will retry the connection after one is refused. You can reduce the interval if faster failover is required. The JGroups failover is a lot faster.

Cheers,

Allard

This sounds like a nice way of utilizing the resources. The problem would be how the members of the JGroups cluster talk to each other. Heroku dynos can’t listen on any external ports (other than the supplied HTTP port, but that is not very interesting here).

Hi Per,

I wasn’t aware of that limitation, but now that you mention it, it does ring a bell. It’s theoretically possible to use JGroups with any type of protocol you want (including HTTP), but that will definitely add some complexity to the whole project. My gut feeling is that this is not really the intent of your trial…

There is one more approach you could have a look at. You could have each node receive each command. Each node will then evaluate the command and decide for itself whether it should execute it. Simply put: if (aggregateIdentifier % 2 == 0) { execute it }. That will spread the load, but during a restart it will also stall half of your commands for a brief moment.
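That selection rule could look something like this — a sketch, with `nodeIndex` and `nodeCount` as hypothetical configuration values, and the aggregate identifier hashed to get a stable shard number:

```java
// Sketch of the "each node sees every command, but only one executes it" idea:
// the command is executed by the node whose index matches the aggregate's shard.
public final class ShardFilter {

    private final int nodeIndex;   // 0 or 1 in a two-node setup
    private final int nodeCount;   // total number of nodes

    public ShardFilter(int nodeIndex, int nodeCount) {
        this.nodeIndex = nodeIndex;
        this.nodeCount = nodeCount;
    }

    // Decide whether this node should execute the command for the given aggregate.
    public boolean shouldExecute(String aggregateIdentifier) {
        int shard = Math.floorMod(aggregateIdentifier.hashCode(), nodeCount);
        return shard == nodeIndex;
    }
}
```

As long as both nodes agree on `nodeCount`, every command is executed exactly once; when one node restarts, its share of the commands simply waits in the queue until it comes back, which is the brief stall mentioned above.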

Sounds like there could be good use for a HTTP based load balancer for commands…

Cheers,

Allard

> Hi Per,
>
> I wasn’t aware of that limitation, but now that you mention it, it does ring a bell. It’s theoretically possible to use JGroups with any type of protocol you want (including HTTP), but that will definitely add some complexity to the whole project. My gut feeling is that this is not really the intent of your trial…

No, for now I think I will go with a single node and see what kind of downtime I get on restarts. All commands will be waiting in the (managed) AMQP queue, so it’s only a matter of a short delay.

> Sounds like there could be good use for a HTTP based load balancer for commands…

Yeah, but then I would have to manage that one as well…