I have been running load tests on our Axon-based platform over the past few weeks. Under certain load, the platform's performance degrades a lot. I am investigating possible causes, but my main focus right now is on our Event Store.
I have noticed that everything goes smoothly up to and including the execution of the @EventSourcingHandlers: loading the aggregate, executing the command logic, and executing the @EventSourcingHandlers themselves (plus whatever else Axon does under the hood). Under heavy load, everything that happens after the @EventSourcingHandlers starts taking a while. The lag could be caused by inserting the events into the Event Store, or by generating and storing Snapshots in the database (or, again, by whatever Axon does under the hood). According to the metrics from our AWS Aurora database, the query executed most often is:
select next_val as id_val from hibernate_sequence for update
This is the sequence used when inserting events into the event store. It is executed far more often than any other query, update or insert.
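To make the suspicion concrete, here is a minimal Java sketch (plain JDK, not Axon or Hibernate code) of a single-row sequence table that hands out ids under a lock. All class and method names are hypothetical. It assumes that every event insert performs one locked read on that row, which is what the query above suggests but which I have not confirmed. The `blockSize` parameter is only there to show how the number of locked reads would change if ids were reserved in blocks instead of one at a time.

```java
import java.util.concurrent.locks.ReentrantLock;

class SequenceContentionSketch {

    /** Hypothetical stand-in for the single-row hibernate_sequence table. */
    static final class SequenceTable {
        private final ReentrantLock rowLock = new ReentrantLock(); // models the "for update" row lock
        private long nextVal = 1;
        private long lockedReads = 0;

        /** Reserves blockSize consecutive ids and returns the first one. */
        long reserve(int blockSize) {
            rowLock.lock();
            try {
                lockedReads++;
                long first = nextVal;
                nextVal += blockSize;
                return first;
            } finally {
                rowLock.unlock();
            }
        }

        long lockedReads() {
            rowLock.lock();
            try {
                return lockedReads;
            } finally {
                rowLock.unlock();
            }
        }
    }

    /** Hypothetical per-writer id generator that goes back to the table once per block. */
    static final class IdGenerator {
        private final SequenceTable table;
        private final int blockSize;
        private long next;
        private int remaining = 0;

        IdGenerator(SequenceTable table, int blockSize) {
            this.table = table;
            this.blockSize = blockSize;
        }

        long nextId() {
            if (remaining == 0) {
                next = table.reserve(blockSize); // one locked read per block
                remaining = blockSize;
            }
            remaining--;
            return next++;
        }
    }

    /** Counts how many locked reads are needed to assign ids to the given number of events. */
    static long lockedReadsFor(int events, int blockSize) {
        SequenceTable table = new SequenceTable();
        IdGenerator generator = new IdGenerator(table, blockSize);
        for (int i = 0; i < events; i++) {
            generator.nextId();
        }
        return table.lockedReads();
    }

    public static void main(String[] args) {
        System.out.println(lockedReadsFor(1000, 1));  // prints 1000: one locked read per event
        System.out.println(lockedReadsFor(1000, 50)); // prints 20: one locked read per block of 50
    }
}
```

With a block size of 1, every event costs one trip through the lock, so all writers queue on the same row. Whether that queueing is what actually slows down our inserts is exactly what I cannot tell yet.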
Our Event Store lives in an Aurora MySQL database on AWS. At the time of writing, in the environment we are load testing, it contains around 88,000,000 events taking up about 300 GiB of disk space (approximate numbers). Only one service contains the Aggregate logic, but two instances of it are running. We are not using the DistributedCommandBus; instead, we use a distributed cache for the Snapshots, so whichever instance receives a command retrieves the latest version of the snapshot.
So I have more or less run out of ideas about what may be causing these lags. My current concern is that the number of concurrent writes to the Event Store is the problem, and that under high load the service waits a long time to insert the events into the database. Could that be the case? Do you have any ideas about which direction I could take? I know this question is vague and a long shot, but maybe you, with a deeper knowledge of Axon than mine, can think of an action plan.
If that were the case and I could prove it, I would consider moving to Axon Server since, from what I have read and heard, it could help. But I obviously don’t want to make such a big change (with the time and money it would cost) without being sure that my current bottleneck is our MySQL Event Store.
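One way I could gather that proof is to time each phase of command handling separately and compare percentiles under load. Below is a minimal sketch using only the JDK. The class name, the phase labels and the idea of wrapping the insert step are all assumptions on my part; where exactly to hook this into Axon is something I would still need to work out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class PhaseTimerSketch {

    // Recorded durations in nanoseconds, grouped by a free-form phase label.
    private final Map<String, List<Long>> samples = new ConcurrentHashMap<>();

    /** Stores one duration sample for the given phase. */
    void record(String phase, long nanos) {
        samples.computeIfAbsent(phase, k -> Collections.synchronizedList(new ArrayList<>()))
               .add(nanos);
    }

    /** Runs the work, records how long it took under the given phase, and returns its result. */
    <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            record(phase, System.nanoTime() - start);
        }
    }

    /** Nearest-rank percentile (p in the range 0 to 100) of the samples for a phase. */
    long percentile(String phase, double p) {
        List<Long> recorded = samples.get(phase);
        if (recorded == null) {
            throw new IllegalArgumentException("no samples for phase " + phase);
        }
        List<Long> sorted;
        synchronized (recorded) { // copy under the list's own lock before sorting
            sorted = new ArrayList<>(recorded);
        }
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        PhaseTimerSketch timer = new PhaseTimerSketch();
        // Hypothetical usage: wrap whatever call ends up inserting the events.
        String result = timer.time("event-insert", () -> "stored");
        System.out.println(result + " in " + timer.percentile("event-insert", 99) + " ns (p99)");
    }
}
```

If the p99 of the insert phase grows with load while the aggregate loading and handler phases stay flat, that would point at the Event Store; if it does not, the bottleneck is somewhere else.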
Any help would be really appreciated.