Axon Server load balancing

turnerjasonuk · June 30, 2022, 12:00pm

I found somewhere a comment that Axon Server provides some load balancing support/capabilities, but I am unable to find anything detailed in the documentation. I assume I am looking in the wrong place - could someone please give me some links?

I am looking at using Axon Server and Axon Framework along with Kafka as an inter-service message bus, with high throughput for some channels and services, so this is an area of definite interest to me. I am currently using Kafka Streams, which is handy for managing the scaling out of processors against well-partitioned topics. However, Kafka and Kafka Streams are not a good match for event sourcing, so I would like to use Axon, but I need to see what level of support versus dev work there would be for us around managing scale-out.

Gerard · July 1, 2022, 8:21am

Hi Jason,

I have used Kafka before at several clients, so I know a thing or two about it. One of the great things about Kafka is that it’s pretty dumb, and therefore good at scaling. It’s up to the clients to send messages to the correct broker instance. For consuming messages there is some help in the form of consumer groups and offset management, which makes it easy to scale out consuming applications.

The message routing of Axon Server is a bit smarter. For example, it will send commands for the same aggregate to the same instance, as part of handling concurrent commands properly. It’s also a great help on the query side, where you can either take the fastest result from several applications that can answer the query, or combine the results from several applications.
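To make the "same aggregate, same instance" idea concrete, here is a minimal sketch of routing by aggregate identifier over a hash ring. It is only an illustration of the general technique, not Axon Server's actual code; the class name `RoutingSketch`, the instance names, and the choice of MD5 as the hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Axon Server code.
final class RoutingSketch {
    // Ring position -> name of the application instance that owns it.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    RoutingSketch(List<String> instances, int pointsPerInstance) {
        // Several points per instance spread the load more evenly.
        for (String instance : instances) {
            for (int i = 0; i < pointsPerInstance; i++) {
                ring.put(hash(instance + "#" + i), instance);
            }
        }
    }

    // Commands carrying the same aggregate id always map to the same instance.
    String instanceFor(String aggregateId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no instances registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(aggregateId));
        // Wrap around to the start of the ring when we fall off the end.
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of an MD5 digest, used as a position on the ring.
    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a ring like this, adding or removing an instance only moves the aggregates whose ring positions are affected, so most aggregates keep going to the instance they were already using.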

With the Kafka extension it’s pretty easy to stream events from Axon Server to Kafka, for example for other teams or applications to integrate with. For event sourcing, Kafka lacks a good way to prevent two instances from creating two events for the same aggregate, which might lead to an invalid state when both are applied. There is also no way to efficiently rebuild an aggregate, as you can’t get messages by key. So you are right that Kafka is not a good match for event sourcing.

We have planned a piece specifically to compare message routing with Axon Server, but I’m not sure when it will be done. In the meantime, feel free to ask any follow-up questions.

turnerjasonuk · July 1, 2022, 10:36am

Thanks Gerard.

I am currently reading through the docs. Can I just ask you a few questions? I am sure they are covered, but you will clearly be able to answer more quickly, or point me to the right info.

I should note by way of context that my domain (finance) and challenge (data integrity) demand both high volume and throughput (e.g. 1.5 bn short-lived new entities a day) and strong consistency inside services (hence partitioning and caring about message order).

My commands (many of which will be transformed published events from other services) will be partitioned. What guarantees on message delivery and ordering are available when reading from a Kafka source topic partition and sending through to an aggregate in Axon, and what is required to achieve them?

Gerard · July 1, 2022, 11:48am

Currently the Kafka extension can only be used for events, so I don’t think it will help if you want to transform the events from Kafka into commands on the Axon side. You could read the events using the Kafka client and process each batch asynchronously to get decent throughput, while keeping an at-least-once guarantee. There might be duplicates, so the aggregate should be able to handle those, so that you don’t create two events on the Axon side for the same mapped command. You could, for example, send the partition and offset with the command, and keep those in the aggregate. (If you are 100% sure that events for the same aggregate always come from the same partition, you could do with only the offset, and you only need to keep the latest offset in the aggregate.)
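As a rough sketch of that deduplication idea, the state kept per aggregate could look like the class below. It is plain Java with no Axon or Kafka dependencies; the name `OffsetDeduplicator` and the method `shouldApply` are made up for the example, and it assumes that commands derived from one partition reach the aggregate in offset order. In an event-sourced aggregate this state would have to be rebuilt from the stored events rather than held only in memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-aggregate state; names are illustrative, not an Axon API.
final class OffsetDeduplicator {
    // Highest Kafka offset already applied, per source partition.
    private final Map<Integer, Long> lastAppliedOffset = new HashMap<>();

    // Returns true when the command is new and records its offset;
    // returns false for a redelivery that was already applied.
    boolean shouldApply(int partition, long offset) {
        Long last = lastAppliedOffset.get(partition);
        if (last != null && offset <= last) {
            return false; // duplicate from an at-least-once redelivery
        }
        lastAppliedOffset.put(partition, offset);
        return true;
    }
}
```

A command handler would call `shouldApply` with the partition and offset carried on the command, and skip creating an event when it returns false. If all records for an aggregate are known to come from a single partition, the map collapses to a single `long`.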

I am confident this will fulfill the data integrity demand, but I’m not sure about the volume.

I could ask people who do know, but I want to be sure I have the right question for them. Those 1.5 bn entities, is that 1.5 * 10^9 entities? And for each entity, is there one command or multiple? Also, are they evenly spread throughout the day, or are there peak times when the throughput is much higher? Is this the main thing the Axon cluster should handle, or is there a lot of processing based on the events?

turnerjasonuk · July 1, 2022, 3:32pm

Yes - the data is around trades or transactions, and daily volumes keep going up - especially on bad days which is when the system is most needed.

So that is 1,500,000,000 per day. There are peaks during the day, so the peak rate is about 75,000,000 per hour.

They are normally active for a few days, although sometimes there may be a later correction - hence the 10-day and 200-day periods. TBH, there are typically only a tiny handful of commands/events per entity. The same applies to other logic which is separate but related, so I have a number of services which will have that kind of volume for their particular data.

The actual processing is not complex there. The main complexity we have moved outside to stateless engines which do various integrity validation processing - those accept data and ultimately generate commands. The Axon part would be ensuring we have the correct state wrt validation once it has been done.
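For load-test planning it can help to express the volumes quoted above as per-second rates. The small calculation below does only that conversion; the class name `ThroughputEstimate` is made up for the example, and the figures count new entities, so the command and event rates would be a small multiple of these.

```java
// Converts the quoted daily and peak-hour volumes into per-second rates.
final class ThroughputEstimate {
    static final long ENTITIES_PER_DAY = 1_500_000_000L;
    static final long PEAK_ENTITIES_PER_HOUR = 75_000_000L;

    // Average rate if the daily volume were spread evenly over 24 hours.
    static double averagePerSecond() {
        return ENTITIES_PER_DAY / 86_400.0;
    }

    // Rate during the busiest hour.
    static double peakPerSecond() {
        return PEAK_ENTITIES_PER_HOUR / 3_600.0;
    }

    public static void main(String[] args) {
        System.out.printf("average: %.0f entities/s%n", averagePerSecond()); // average: 17361 entities/s
        System.out.printf("peak:    %.0f entities/s%n", peakPerSecond());    // peak:    20833 entities/s
    }
}
```

So the system would need to sustain roughly 21,000 new entities per second at peak, which is about 1.2 times the daily average.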

Gerard · July 4, 2022, 8:15am

It depends for a large part on the hardware; the most important factor is the kind of storage used. But reading this, it should certainly be possible. There is also some info in the pdf on how to optimize throughput.

I guess a good next step would be to build a PoC and do a bit of load testing with it. Please let us know if you run into any kind of problem.