HA setup

Hello list,

As we are nearing our public release it’s time to think a bit harder about high availability.

We have developed some HTTP microservices running Spring and Axon with the AMQP event bus terminal and JpaRepository/HybridJpaRepository for persistency.

I would like to have the HTTP load balancer dispatch request to multiple independent copies of the same service, but I am not sure about the consistency guarantees of this approach.

Is the use of a relational database enough to handle concurrency with pessimistic locking? Any caveat I might not be aware of?

If this is not safe, what are my options?



it’s important to have some concurrency detection in your aggregates. Hibernate optimistic locking will do the trick. Note that the AbstractAggregateRoot has an @Version annotated field “version” for this purpose.
Axon’s locking approach only works within the JVM, so when the same aggregate is accessed on 2 different machines, you need an additional mechanism.

That said, it is also good practice to route commands for the same aggregate to the same machine, as much as possible. Only when topology changes, you want to modify routing. This makes handling of commands much more efficient.




Our system has tightly-clustered pairs of servers in each regional location, and looser communication between regional clusters.

Within a clustered pair, work is allocated by partitioning the set of aggregate IDs (actually the hashcodes of their UUIDs) across the current members of the cluster, and rebalanced when the cluster membership changes. To avoid crossover during rebalances we also use distributed locks, but these are taken and released only when the membership changes, not every time a command is handled (we have a few hundred commands firing per second: if we try to take a distributed lock for every command it makes the cluster grind to a halt).

Commands can originate from any member of a cluster but are routed to whichever member currently ‘has’ the target aggregate, based on the current affinity of the target aggregate ID. The member which handles a command is responsible for persisting any generated events in the event store DB, which is shared by all members of the cluster. It also pushes those events to the other cluster member(s) via messaging, and they publish the events on their local event buses. So the querymodels and UI-viewmodels in all instances are eventually consistent.

If one instance goes down then the other instances will automatically take on its work when they rebalance based on the cluster topology change. When the dead instance comes back, it replays from the shared event store as normal, backlogging any events being published to it by the other instance(s). We don’t rebalance the work across the cluster until the restarted instance is fully replayed and ready to do work.

This is quite a complex setup and to avoid excessive inter-node communication it relies on every piece of work having affinity to one node at any given time. It needs very careful synchronisation of event replays/catch-ups and affinity changes. But it achieves our goals of distributing work around the world in a scalable way, with failover and rolling upgrades etc, while also pushing results to every UI more-or-less immediately.

The fact that this approach is even possible really says a lot about the excellent separation of responsibilities in CQRS/DDD in general and Axon in particular. Axon has proved itself to be amazingly powerful and flexible.

In summary, I guess the key pattern is to route commands consistently so that you don’t have to do lots of expensive locking or moving aggregates around (except during failover). And then publish the resulting events to the cluster so that all the query-side models are eventually (but rapidly) consistent.

Hope this helps,