We had a change to our query model that required replaying around 100,000 of our events from our event log. Not a huge volume of data, but when we tried running the migration, we were dismayed to find that it was taking hours to run. The Java process was mostly idle but PostgreSQL was pegging a CPU.
The problem turned out to be that PostgreSQL was doing a poor job of planning the execution of the queries against the domainevententry table. For each batch of events, even though there was an appropriate index, it was doing a full table scan and a sort before returning any results, which took around 15 seconds. The default JDBC event store has a batch size of 100, so we would have ended up doing 1000 of these slow queries to finish the replay.
Our workaround: increase the batch size on the JDBC event store to 5000 during the replay. This doesn't cause PostgreSQL to execute the queries more efficiently, but it causes Axon to issue about 2% as many of them, so our replay finishes in minutes instead of hours.
Hopefully that workaround will be helpful to other PostgreSQL users.
Now for more details.
There are two issues here, a big PostgreSQL one and a small Axon one. We are filtering by aggregate type for our replay, so the generated SQL looks like
SELECT eventIdentifier, aggregateIdentifier, sequenceNumber, timeStamp, payloadType, payloadRevision, payload, metaData
FROM DomainEventEntry e
WHERE (
(e.timeStamp > ?)
OR (e.timeStamp = ? AND e.sequenceNumber > ?)
OR (e.timeStamp = ? AND e.sequenceNumber = ? AND e.aggregateIdentifier > ?))
AND (type = ?)
ORDER BY e.timeStamp ASC, e.sequenceNumber ASC, e.aggregateIdentifier ASC;
There's an index on the timeStamp column which does end up getting used by this query when the starting timestamp is close to the end of the time range in the event log, but is ignored during the vast majority of the replay. However, if you remove the two OR clauses, it *does* use the index, and adding a LIMIT clause has no effect on the execution plan, so it doesn't seem to be purely a "cost of reading index is too high for the expected row count" situation. The query also gets fast if you transform it into three SELECTS with UNION ALL instead of one SELECT with OR. So PostgreSQL is definitely not doing the optimal thing here.
That said, Axon should ideally be including a LIMIT clause on that query to tell the database that it only plans to read a limited number of rows. That doesn't end up helping in this particular case, at least in my ad-hoc testing, but it seems worthwhile to give the DB the opportunity to optimize for returning the first N rows rather than the entire result set.
The query could also be made simpler by using a row value expression:
SELECT eventIdentifier, aggregateIdentifier, sequenceNumber, timeStamp, payloadType, payloadRevision, payload, metaData
FROM DomainEventEntry e
WHERE (e.timeStamp, e.sequenceNumber, e.aggregateIdentifier) > (?, ?, ?)
AND (type = ?)
ORDER BY e.timeStamp ASC, e.sequenceNumber ASC, e.aggregateIdentifier ASC
LIMIT ?;
Not sure how that'd affect compatibility with all the databases Axon supports, but row value expressions are valid SQL-92 so maybe it'd be okay. Although this is the same meaning as the OR version, PostgreSQL seems to be more willing to use an index for it. Not sure if there'd be a performance impact on other DBs but I suppose it's easy to imagine that it's more optimizer-friendly than the version with the OR clauses.
-Steve