Lags when storing events

Armando_Fernandez · June 18, 2020, 2:35pm

Hello again everyone,

I have been writing some messages lately because we are struggling a lot with our Axon based application performance under heavy loads. Here’s another (more general) question I had last week (for reference): https://groups.google.com/forum/#!topic/axonframework/nDz5_fM6ueU

This time however I have a more specific question… So, while running load tests on our application we experience big lags in the overall response times of certain requests. I managed to locate an example of where we are getting such a lag in the Event Store:

Event1 2020-06-18T13:07:40.107093Z 2294 Event2 2020-06-18T13:07:45.010161Z 2295

There are three values in each row. Event name, timestamp, and sequence number. As can be seen above there is a difference of ~5 seconds between the two persisted events. However, the logic that generates that event is something like:

@CommandHandler fun handle(command: Command1) { // Perform some checks with the aggregate data (super cheap operation) AggregateLifecycle.apply(Event1()) // Make HTTPCall to a different service. This calls takes miliseconds, 200ms top, but normally way less. I have checked than in this particular case, the time for this request was < 200ms AggregateLifecycle.apply(Event2()) }

Considering all the above information I have a few questions. When does Axon persists the events in the database? Does it try to persist all events applied during the processing of a command in a batch when the command processing is completed? Does it persist the events right when they are applied (AggregateLifecycle.apply())? How does Axon work internally in that respect? Also, given that both events are being applied in the same command handler and there are no expensive operations between applying the two events… what may be causing that huge difference in the timestamps? Again, this only happens during high load periods… There are no obvious error logs around (no error logs at all for that matter) and the logic ends up executing properly. The aggregate ends up in the correct state and there are no errors whatsoever.

Some of my wild guesses are that the database is not able to cope with that many inserts and it’s blocked somehow. Maybe it’s not the database but the DB connection pool configuration in our service? I don’t really know. In any case, any help/ideas or piece of advice would really be appreciated.

Our Event Store is in a MySQL database, we use snapshotting, we use caching… Please do let me know if there’s any other information that would be useful to understand this issue.

Thanks,
Armando.

Armando_Fernandez · June 19, 2020, 2:37pm

Just one small remark that I just realized about the code snipped above. It is actually slightly different from what I commented above and that may be the cause of the issue, let’s see what do you guys think. The corrected snipped would be:

`
@CommandHandler
fun handle(command: Command1) {
// Perform some checks with the aggregate data (super cheap operation)
AggregateLifecycle.apply(Event1())
// Make HTTPCall to a different service. This calls takes miliseconds, 200ms top, but normally way less. I have checked than in this particular case, the time for this request was < 200ms
val command: Command1 = buildCommandFromHttpResponseData()
someClass.handle(command)
}

class SomeClass {

@CommandHandler
fun handle(command: Command1) {
AggregateLifecycle.apply(Event2())
}
}

`

Ok, so the above code does not seem right (and it’s not), but there is an explanation. Command1 used to be sent from a different service but we changed the integration and we replaced it with a simple HTTP call (see snipped). It seems whoever did the change was a bit lazy about changing the rest of the flow and he just builds the command from the Http reponse and calls the old handler. That old handler is no longer is a real Axon @CommandHandler, it’s just a simple method in a separate class. That command (Command1) is not sent to the command bus in any part of the application. That annotation should have been removed and we should have also probably had to rename the Command1 class to something different to avoid confusions between what’s an Axon command and what is not…

I am wondering… The fact that that method is being called directly and that it is still annotated with @CommandHandler could be causing the difference in the times when the events are persisted in the event store?
Armando.

Giulio_Pulina · June 19, 2020, 11:05pm

Hi Armando,

I’m not an Axon Framework expert (I’m trying to learn it) so I can’t help much on that, but, in general, my suggestion would be to execute a profiler (like JProfiler) while running your load test, otherwise it would be very hard to spot the bottleneck(s) (for example, it could be caused by a code performance issue, a long GC pause, DB pool exhaustion…).

Thanks
Giulio