Replay quietly dying

Hi,

after about 10 months of using Axon (version 2.0.8), I now need to use the replay feature to build a demo system for future customers. On the data side, the events (~20M) and the views are both stored in a single MongoDB instance (version 2.4.9), with indexes to speed up most of our queries plus the indexes recommended in the replay performance tuning section of the docs (section 10.1.2).

The problem is that even after many attempts at getting a replay to work, I keep running into a couple of issues.

First, and most importantly, I couldn’t get the replay to process our entire event store. In all of our attempts, the replay stopped without reporting any error. I checked whether there was a correlation with the event or event type replayed just before it stopped, but couldn’t find any similarity between runs.
Second, the performance of the replay was very poor. Even with a rather small event store, and after following the performance tips, most of our runs stopped after 2 days, having replayed only about half of the events.

Do you have any recommendations to address these issues?

Thanks,
Matthieu Bulté

Hi Matthieu,

2 days for processing 10M events is very bad performance indeed. How did you create the index described in section 10.1.2? Did you create a single index with 2 fields (compound index)?
http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/#create-compound-indexes-to-support-several-different-queries
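
For reference, creating such a compound index from the Java driver would look roughly like this. It is only a sketch: the database, collection and field names below are placeholders, section 10.1.2 lists the exact fields to use.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;

    public class CreateReplayIndex {
        public static void main(String[] args) throws Exception {
            // placeholder names: use the database/collection your event store is configured with
            MongoClient client = new MongoClient("localhost");
            DBCollection events = client.getDB("axonframework").getCollection("domainevents");

            // a single compound index covering both fields (order matters),
            // not two separate single-field indexes
            BasicDBObject keys = new BasicDBObject("timeStamp", 1).append("sequenceNumber", 1);
            events.ensureIndex(keys);

            client.close();
        }
    }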

Regarding the failing replay: do you use an Executor to process the replay, or do you perform the replay in the calling thread? I noticed that when an error occurs, the exception is rethrown. When running in an executor, this will cause the running thread to “crash”. Such errors are often written to std-err (System.err); the logs there might provide better insight into why the replay failed.
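
A simple way to make such failures visible is to wait on the Future returned by the executor, something along these lines (just a sketch; the Runnable you pass in would wrap whatever call you use to start the replay):

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ReplayRunner {

        // runs the replay task on a separate thread, but surfaces any exception
        // in the calling thread instead of letting it die silently in the executor
        public static void runAndWait(Runnable replayTask) throws InterruptedException {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            try {
                Future<?> result = executor.submit(replayTask);
                result.get();  // rethrows (wrapped) whatever killed the replay thread
            } catch (ExecutionException e) {
                // this is where a "quietly dying" replay becomes visible
                e.getCause().printStackTrace();
            } finally {
                executor.shutdown();
            }
        }
    }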

As a side note: Axon should log errors when they occur during a replay. I will fix that in the next version.

Cheers,

Allard

Hi Allard,

Yes, I’ve created a compound index on the two fields specified in section 10.1.2.
As for the replay, I’m using a task executor and wrapping the execution of the replay in a try/catch block so that any exception is logged as well, but it still doesn’t catch anything.

In each run I’ve observed that approximately the first 5 million events are replayed at a reasonable speed of about 5ms per event; then the replay gets slower and slower, up to 1-2s per event, and eventually just stops working. By logging some more information about the events being replayed, I could see that the slowdown didn’t only come from the increasing complexity of the events (many of the first events replayed were, of course, just creating the aggregates) but from another factor that I can’t identify.
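
(For reference, this kind of per-event logging can be done with a small decorator around the event listener, along the lines of the sketch below; it assumes Axon 2’s EventListener/EventMessage signatures, and the threshold is arbitrary.)

    import org.axonframework.domain.EventMessage;
    import org.axonframework.eventhandling.EventListener;

    // wraps an existing listener and logs events that are slow to handle
    public class TimingListener implements EventListener {

        private final EventListener delegate;

        public TimingListener(EventListener delegate) {
            this.delegate = delegate;
        }

        public void handle(EventMessage event) {
            long start = System.currentTimeMillis();
            delegate.handle(event);
            long took = System.currentTimeMillis() - start;
            if (took > 100) {  // only log the outliers
                System.out.println(took + "ms for " + event.getPayloadType().getSimpleName());
            }
        }
    }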

Even though it obviously depends a lot on the kind of event being replayed, could you share some of the metrics you have observed on replay duration?

Thanks,
Matthieu

Hi Matthieu,

the numbers are usually in the several hundreds up to a few thousand events per second, depending on the serializer, the type of event and the number of listeners. That is in line with the 5ms per event you’re talking about.
The fact that it slows down considerably is alarming. This might indicate a memory leak somewhere. Did you check the heap size and GC times? The slowdown could be caused by the garbage collector being unable to release resources. Theoretically, it could also be on the Mongo server side, although I doubt that.

Do you have any figures regarding memory consumption during the test? (A memory graph from JVisualVM would be helpful.)
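
For the GC times, the standard HotSpot options can write GC activity to a log file that you can correlate with the moment the slowdown starts, for example (the jar name is just a placeholder):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar your-replay-app.jar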

Cheers,

Allard

Hi,

sorry for the delay, but I’m back with more information.

I let the replay run on Friday with as much monitoring as possible, which gave some interesting results.
JVisualVM didn’t show any memory leak or other anomaly.
However, the performance issues seem to come from MongoDB when replaying events that access and modify several large aggregates. That leads to many IO operations, probably because the collections holding these large aggregates can’t be kept in memory together, resulting in up to 2s per event being replayed.
All the other events are replayed at good speed (again, <5ms).

It also happened that an exception occurred after 15M events, which for some reason was not logged in previous runs.

But I’m still quite puzzled by this performance issue on the MongoDB side. Have you ever seen such a situation?

Thanks for answering my questions so far.

Cheers,
Matthieu

Hi Matthieu,

I am not a Mongo expert. Hiccups are quite normal when streaming large volumes of data this way, as page faults can always occur. However, the delays here do seem a bit too long.

One thing I was wondering about: why do you send commands while you’re replaying events? I don’t think you’d want to execute logic (again) based on events from the past. Normally, you only want to rebuild the current query model state. Maybe your replay triggers too much activity, which could also be the cause of the massive delays?
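
To illustrate what I mean: the query model update should always happen, while side effects can be suppressed during a replay. A rough sketch (class and method names are made up, and I am quoting the ReplayAware interface from memory):

    import org.axonframework.domain.EventMessage;
    import org.axonframework.eventhandling.EventListener;
    import org.axonframework.eventhandling.replay.ReplayAware;

    public class OrderViewUpdater implements EventListener, ReplayAware {

        private volatile boolean replaying = false;

        public void beforeReplay() { replaying = true; }

        public void afterReplay() { replaying = false; }

        public void onReplayFailed(Throwable cause) { replaying = false; }

        public void handle(EventMessage event) {
            updateViewDocument(event);       // always rebuild the query model
            if (!replaying) {
                notifyOtherSystems(event);   // side effects only for "live" events
            }
        }

        private void updateViewDocument(EventMessage event) {
            // update the Mongo view document here
        }

        private void notifyOtherSystems(EventMessage event) {
            // e.g. sending commands or emails; skipped during a replay
        }
    }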

Cheers,

Allard

Hi Allard,

it seems that these delays come from MongoDB updates, especially on large aggregates once most of the inserts have been performed. At that point, almost all aggregates are in the database, and the updates performed on them change their size (adding/removing value objects), which leads to storage allocation issues, mainly finding a new place to save the aggregate.

As for sending commands: I don’t think I ever mentioned this, but in any case no commands are being sent during replays, so the only processing done on the Java side is building and updating the read side.

Cheers,
Matthieu

Hi Matthieu,

so if I understand correctly, you’re storing information in documents on the query side, and these documents are getting large? It might be that MongoDB first needs to allocate a new data file (a 2GB file, pre-filled with zeros) before it can store the new information. It might be worthwhile to see whether there is a way to change the design of the query side to make it a bit more flexible.
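
One way to check whether the view documents are indeed outgrowing their allocated space is to look at the collection stats of your query-side collections (database and collection names below are placeholders). On MongoDB 2.4, the paddingFactor and the gap between size and storageSize give a hint about documents being relocated:

    import com.mongodb.CommandResult;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;

    public class ViewCollectionStats {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost");
            // placeholder database/collection names
            DBCollection views = client.getDB("queryside").getCollection("orderView");

            CommandResult stats = views.getStats();
            // a paddingFactor well above 1 suggests documents regularly outgrow
            // their slot and have to be moved on disk
            System.out.println("paddingFactor: " + stats.get("paddingFactor"));
            System.out.println("size:          " + stats.get("size"));
            System.out.println("storageSize:   " + stats.get("storageSize"));

            client.close();
        }
    }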

But I’m glad to hear the problem isn’t on the EventStore side of things. That would have made me nervous :wink:

Cheers,

Allard