Cannot explain OutOfMemory error in query store

Hi,

Every time we do a full read on our event store we encounter an OutOfMemory error in our query store.

Our event store is a Postgres database with ~19 million events. Many events contain zipped data; a few of them (~100 events) contain more than 2 MB of zipped data. The biggest event contains ~18 MB of data, the second biggest ~9 MB.
Our query store is (for debugging purposes) a very simple Spring Boot query store with one handler that only counts the number of handled events.
We use a tracking event processor with 32 threads, and a batch size of 2000.

When the query store is halfway through reading the event store (~9 million events) it throws an OutOfMemory error which we cannot explain, especially because the query store does not do anything special except count events. If we analyse the heap dump we see a lot of threads keeping the same event alive: either the 9 MB event or the 18 MB event.

If we set the batch size to 1000, all events are processed without an OutOfMemory error. We really would like to understand why increasing the batch size results in an OutOfMemory error, especially halfway through.

Any clues?

If more info is needed, we're happy to share it.

Thanks!

Hi Peter,

So, to what number did you set the batch size for your event-counting-tool?

The batch size effectively means that you have a UnitOfWork with that many event messages contained in it.

Thus, if your batch size is, for example, the ~19 million events you're talking about, that means you'll keep references to all those 19 million events, as the BatchingUnitOfWork is only finished and garbage collected after all the events have been dealt with.
Anyhow, this is an assumption of mine. If you haven't adjusted the batch size at all, then we need to look in a different direction.
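
To make the point about batches concrete, here is a minimal plain-Java sketch. It is not Axon code: `BatchRetention` and `peakRetainedBytes` are made-up names, and the model simply assumes that every event in a batch stays reachable until the whole batch has been handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchRetention {

    // Hypothetical model: returns the largest number of payload bytes that is
    // reachable at once when events are handled in batches of batchSize.
    // Assumption: a batch releases its events only when the batch is complete.
    static long peakRetainedBytes(long[] payloadSizes, int batchSize) {
        List<Long> batch = new ArrayList<>(); // stands in for the events held by one unit of work
        long retained = 0;
        long peak = 0;
        for (long size : payloadSizes) {
            batch.add(size);
            retained += size;
            if (batch.size() == batchSize) {
                peak = Math.max(peak, retained);
                batch.clear(); // batch finished: its events become collectable
                retained = 0;
            }
        }
        // Account for a final, partially filled batch.
        return Math.max(peak, retained);
    }

    public static void main(String[] args) {
        // Ten small events of 1,000 bytes, two of them replaced by big ones.
        long[] sizes = new long[10];
        Arrays.fill(sizes, 1_000L);
        sizes[3] = 18_000_000L;
        sizes[7] = 9_000_000L;

        System.out.println(peakRetainedBytes(sizes, 5));  // prints 18004000
        System.out.println(peakRetainedBytes(sizes, 10)); // prints 27008000
    }
}
```

With the smaller batch the two big events never share a batch, so the peak is roughly the size of the biggest event. With the larger batch both are held at once. This is per thread, so the effect multiplies with the number of threads.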

Cheers,
Steven

Hi Steven,

The batch size was 2000.

But I think we know why we're running out of memory. If we read the Axon code correctly, each tracking processor thread does a full load on the event store (in batches). In this case, this means that all 32 threads eventually encounter the 18 MB event. The greater the batch, the higher the chance that multiple threads load the 18 MB event at the same time.

The heap size was set to 512 MB. If all threads load the 18 MB event at the same time, the query store will run out of memory (32 * 18 MB = 576 MB > 512 MB). I do assume the chance of that happening is very small, unless the threads sync with each other and all load the same batch at the same time. I presume the tracking processor threads are not synced? That would mean the speed at which they process events differs, resulting in a different event pointer within the event store for each thread.

Nevertheless, if we increase the batch size, the chance increases that multiple threads load a batch with the same (big) event, possibly resulting in an OutOfMemory error.

Are we correct?

Thanks!
Peter

Hi Peter,

There is indeed no explicit synchronization. However, larger events are typically slower to process, so it's very likely that the threads will hit the same barrier simultaneously.

Note that having 32 threads sounds like an awful lot, unless you have a machine with 32 cores. A batch size of 2000 is also very big. Generally, batch sizes are in the dozens to lower hundreds and the number of threads is around 4 or 8.

Besides 32 * 18 MB already being more than the heap size you set, also keep in mind that the events need to be deserialized and that other processes use some heap space as well.
Having larger batch sizes makes it much more likely that all nodes try to load the same events, especially when the system is low on memory and needs to spend a lot of time GC-ing. This simply sets everything in motion to flood the heap completely.
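
As a rough back-of-the-envelope sketch of that arithmetic (plain Java, not Axon code): `HeapBudget`, `worstCaseMb`, `maxThreads` and the `copiesPerEvent` factor are all hypothetical. A factor of 2 is only an assumption, standing for the serialized bytes plus the deserialized object being alive at the same time. The real overhead may well differ.

```java
public class HeapBudget {

    // Worst-case megabytes needed if every thread holds the largest event at once.
    // copiesPerEvent is an assumed factor (e.g. 2 = serialized + deserialized form).
    static long worstCaseMb(int threads, long largestEventMb, int copiesPerEvent) {
        return (long) threads * largestEventMb * copiesPerEvent;
    }

    // Largest thread count whose worst case still fits in heapMb (integer division),
    // ignoring everything else that lives on the heap.
    static long maxThreads(long heapMb, long largestEventMb, int copiesPerEvent) {
        return heapMb / (largestEventMb * copiesPerEvent);
    }

    public static void main(String[] args) {
        System.out.println(worstCaseMb(32, 18, 1)); // prints 576, already above a 512 MB heap
        System.out.println(worstCaseMb(32, 18, 2)); // prints 1152
        System.out.println(worstCaseMb(8, 18, 2));  // prints 288
        System.out.println(maxThreads(512, 18, 2)); // prints 14
    }
}
```

Under these assumptions, 8 threads leave headroom in a 512 MB heap for the 18 MB event, while 32 threads do not, even before counting the rest of each batch.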

Hope this clarifies things.
Cheers,

Allard

Hi Allard,

Thanks for the quick reply and explanation!

32 threads is a lot, but this was just for testing purposes. Normally our query stores run on 8 threads with a batch size of 1000. We'll experiment with a smaller batch size.

Anyway, we do experience some other OutOfMemory issues. I'll write a new post for that.

Thanks!
Peter

For cross-reference purposes, see https://groups.google.com/forum/#!topic/axonframework/x3IeAbMqHB0 for the follow-up.

Cheers,
Peter