Load tests and performance degradation

Armando_Fernandez · June 10, 2020, 5:14pm

Hi all,

I have been running some load tests on our Axon-based platform over the past few weeks. Under certain load, the platform performance degrades… a lot… I am investigating possible causes, but my main focus right now is on our Event Store.

I have noticed that everything goes smoothly up to and including the execution of the @EventSourcingHandlers. That would be… loading the aggregate, executing the logic of the command, and executing the actual @EventSourcingHandlers (and whatever else Axon does under the hood). Under heavy load, everything that happens after the @EventSourcingHandlers starts taking quite a while… The lag could come from things like inserting the actual events into the Event Store or generating and storing Snapshots in the database (or again, whatever Axon does under the hood). According to the metrics our AWS Aurora database gives us, the query executed most often is:

select next_val as id_val from hibernate_sequence for update

This is the sequence used for event insertion in the event store. It is executed far more often than any other query, update or insert; the difference is just huge.

Our Event Store is in an Aurora MySQL database on AWS and at the time of writing this message contains around 88.000.000 events taking a total of 300GiB of disk space in the environment we are load testing (approximate numbers). The service that contains the Aggregate logic is just one, but there are two instances of it running. We are not using the DistributedCommandBus, but a distributed cache for the Snapshots, meaning that when we receive a command, given that the snapshot lives in a distributed cache, no matter who receives it, it will retrieve the latest version of it.

So, I have kind of run out of ideas on what may be causing these lags. At the moment I am concerned that so many concurrent writes to the Event Store may be an issue, and that under high load the service waits a long time to actually insert the events in the database. Can that be the case? Do you have any ideas on what direction I can take here? I know this whole question is too vague and a long shot, but maybe you guys, with a deeper knowledge of Axon than I have, can think of some action plan.

If that was the case and I could prove it, I would consider using Axon Server since, from what I have read and heard, it could help, but I obviously don’t want to implement such a big change (with the time and money costs it would involve) without knowing 100% that my current bottleneck is our MySQL Event Store.

Any help would really be appreciated.

Regards,
Armando

allardbz · June 11, 2020, 7:33am

Hi Armando,

in general, relational databases will suffer from performance loss when inserting data into already large tables. We have done numerous performance tests in the past which showed this decrease. There are especially big “bumps” at the moments where the data and/or indices don’t fit in memory anymore.

The above is exactly the reason why we started building the Event Storage engine in AxonServer. It has been designed from the ground up to address the issues related to event sourcing, with insert performance independent of the number of entries already available, efficient lookup of events for an Aggregate, and fast streaming of events for downstream processors.

I’m sure there could be some improvements on the Aurora configuration to optimize throughput (maybe look for an alternative for the write lock on a sequence), but that takes time and effort that is most likely better spent elsewhere, and it will keep degrading regardless, at best a bit more slowly.
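
As a rough illustration of what an alternative to the per-insert write lock could look like: every `select next_val ... for update` is a locked round trip to a single row, so reserving identifiers in blocks reduces how often that row is touched. The sketch below is plain Java, not Axon or Hibernate API. `SharedSequence` and `BlockAllocator` are hypothetical names that simulate the sequence row and a service instance reserving ids in blocks. JPA's `allocationSize` on `@SequenceGenerator` and `@TableGenerator` follows a similar principle, but whether and how it can be applied to the event entry mapping in this setup is an assumption that would need checking.

```java
import java.util.concurrent.atomic.AtomicLong;

class SequenceBlockDemo {

    /** Hypothetical stand-in for the single sequence row; each reserve() models one locked round trip. */
    static final class SharedSequence {
        private long nextVal = 1;
        private final AtomicLong roundTrips = new AtomicLong();

        synchronized long reserve(int blockSize) {
            roundTrips.incrementAndGet();
            long first = nextVal;
            nextVal += blockSize;   // the whole block now belongs to the caller
            return first;
        }

        long roundTrips() {
            return roundTrips.get();
        }
    }

    /** Hands out ids locally and only returns to the shared sequence when its block is used up. */
    static final class BlockAllocator {
        private final SharedSequence sequence;
        private final int blockSize;
        private long next;          // next id to hand out
        private long endExclusive;  // first id that is NOT part of the current block

        BlockAllocator(SharedSequence sequence, int blockSize) {
            this.sequence = sequence;
            this.blockSize = blockSize;
        }

        synchronized long nextId() {
            if (next == endExclusive) {
                next = sequence.reserve(blockSize);
                endExclusive = next + blockSize;
            }
            return next++;
        }
    }

    /** Counts how many times the shared sequence is hit for a given number of inserts. */
    static long roundTripsFor(int inserts, int blockSize) {
        SharedSequence sequence = new SharedSequence();
        BlockAllocator allocator = new BlockAllocator(sequence, blockSize);
        for (int i = 0; i < inserts; i++) {
            allocator.nextId();
        }
        return sequence.roundTrips();
    }

    public static void main(String[] args) {
        System.out.println("block size 1:  " + roundTripsFor(10_000, 1) + " round trips");
        System.out.println("block size 50: " + roundTripsFor(10_000, 50) + " round trips");
    }
}
```

For 10,000 inserts this makes 10,000 round trips with a block size of 1 and 200 with a block size of 50. The trade-off is that block allocation leaves gaps when an instance stops with an unused block, and with two instances the ids are no longer handed out in commit order, so anything that reads events by that id has to tolerate both.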

It shouldn’t take much time to spin up an AxonServer node (especially on Amazon) and to test drive that.

Cheers,

Armando_Fernandez · June 11, 2020, 8:34am

Hello Allard,

Thanks for the prompt response. I am indeed considering giving Axon Server a go, to be honest.

Another alternative I had in mind was to scrap the event store: have state-stored aggregates and still use the events, but instead of using an Event Store to store and publish them, use the axon-kafka module for my tracking processors. In our system (a gaming system where the aggregate is the player), I’d say we wouldn’t have more than 1.000.000 different instances of the aggregate (players), so that would be the number of rows in the aggregate table (not that many really). So my concern here would be the number of updates to the database when applying changes to an aggregate. Have you guys ever load tested something like that? And also, do you see any obvious problems with this approach? (state-stored aggregates + axon-kafka for tracking processors).

Thanks,
Armando.

allardbz · June 11, 2020, 9:02am

Hi Armando,

state-stored aggregates will have challenges of their own. You’ll either have to serialize them entirely and store them as “blobs”, which means large blob updates (which is far from ideal in databases), or you’ll need to manage the aggregate-to-entity mapping for JPA. My personal experience is that event sourcing puts significantly fewer restrictions on an aggregate’s structure.

Another challenge will be guaranteed publishing and storage, especially in combination. There will be a window of opportunity where state changes are persisted, but events haven’t been published, or vice-versa.
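
The window described above can be made concrete with a small simulation. This is plain Java with hypothetical names (`PlayerService`, `Publisher`, `credit`), not Axon or Kafka API, and it assumes the state write and the publish do not share a transaction: state is stored first, the event is published second, and a failure between the two leaves a stored change with no corresponding event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DualWriteDemo {

    /** Hypothetical stand-in for a message broker that can be told to fail once. */
    static final class Publisher {
        final List<String> published = new ArrayList<>();
        boolean failNext;

        void publish(String event) {
            if (failNext) {
                failNext = false;
                throw new IllegalStateException("broker unavailable");
            }
            published.add(event);
        }
    }

    /** Hypothetical state-stored aggregate handling: two separate writes, no shared transaction. */
    static final class PlayerService {
        final Map<String, Integer> balances = new HashMap<>();
        final Publisher publisher = new Publisher();

        void credit(String playerId, int amount) {
            // Write 1: the state change is stored and immediately visible.
            balances.merge(playerId, amount, Integer::sum);
            // Write 2: if this throws, write 1 has already happened and no event describes it.
            publisher.publish("Credited:" + playerId + ":" + amount);
        }
    }

    public static void main(String[] args) {
        PlayerService service = new PlayerService();
        service.credit("player-1", 10);

        service.publisher.failNext = true;
        try {
            service.credit("player-1", 5);
        } catch (IllegalStateException e) {
            System.out.println("publish failed: " + e.getMessage());
        }

        System.out.println("stored balance:   " + service.balances.get("player-1"));
        System.out.println("published events: " + service.publisher.published);
    }
}
```

After the second call the stored balance is 15 while only the first credit was ever published. Swapping the order of the two writes gives the opposite failure: an event published for a state change that was never stored.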

The last challenge I see is to guarantee that the event stream is complete and correct. If you only want to do a one-off processing of the events in a single target view, then that won’t be a problem. However, if you intend to store the events for later use to generate other views or do analytics, it is important to ensure the correctness and completeness of the stream.

Obviously, I am completely biased. However, that bias does come from over 10 years of experience applying event sourcing in various environments. I won’t say event sourcing is completely without challenges, but in the course of these years I’ve seen tooling and practices emerge to help mitigate many of those.

Just my 2 cents.

Armando_Fernandez · June 11, 2020, 9:39am

Hi Allard,

Thank you very much, this does help actually.

I will consider using Axon Server. We are using Axon 3.4, so I guess giving Axon Server a go will be a three-step process…

1. Migrate to Axon 4.x.
2. Migrate my MySQL event store to Axon Server.
3. Plug Axon Server in.

On the second step, I recall finding a tool to perform such a migration. Has it been tested with big Event Stores? Our Production one is around 800GiB, and it would probably be bigger when/if we perform the migration, say 1TB. I would need to be sure that the migration is feasible.

Apart from all the above, are there any other obvious things I should consider before making the decision to migrate to Axon Server?

Thanks again Allard,
Armando.