Event Processors Hanging in Axon Framework v4.9.0

alainelkhoury · April 5, 2024, 8:24am

We have encountered a situation that causes the Axon GapAwareTrackingToken to fail to advance and process new events. It’s worth noting that we did not face this problem when using Axon Framework version 4.5.15, Spring Boot 2.7, and XStream as serialization.

Application Information:

Axon Framework version: 4.9.0
Spring Boot version: 3.2.1
JDK Version: OpenJDK 17
Event Store: PostgreSQL version 13
Number of instances: 4
Number of stored events: 88000
Serialization: Jackson
Configuration: No special configuration

Token Entry Table

Expected Behavior:
The tracking processors should advance as usual and process new events.

Actual Behavior:
New events are not being processed.

Workaround:
Restarting the services resolves the issue.

Questions

Are there any differences in the behavior of GapAwareTrackingToken between Axon Framework v4.5.15 and v4.9.0 that could explain the issue?
Are there any known issues or bugs in Axon Framework v4.9.0 related to event processing?
Could there be any compatibility issues between Axon Framework v4.9.0 and other components in the system, such as Spring Boot 3.2.1 or PostgreSQL version 13?
How does the use of Jackson serialization compare to XStream serialization in terms of compatibility and performance with Axon Framework v4.9.0?
Are there any specific error messages or logs indicating why the new events are not being processed?

Additionally, we found the following GitHub issue, and we believe it may be related to our problem:

github.com/AxonFramework/AxonFramework

Weird state causing tracking processors to never advance

opened 01:01PM - 10 Jan 24 UTC

closed 01:33PM - 23 Feb 24 UTC

haraldk

Type: Bug Status: Information Required Priority 1: Must

This is not necessarily a bug, but I want to try to understand what has happened… and why (and ultimately find a way to prevent this from happening again). We have come across a situation that causes Axon (`GapAwareTrackingToken`) to enter a weird state where the `TrackingEventProcessor` will *not* enter error state, but still fail to advance or catch up. During the period the problematic events were written, the system was under heavy load (due to a full reindexing of the data to Elastic). This may have caused some delay in committing transactions for Axon, but I am unsure whether this contributed/caused the problem or if this just happened at the same time. ### Basic information * Axon Framework version: 4.9.1 * JDK version: OpenJDK 21 (Temurin) ``` openjdk 21 2023-09-19 LTS OpenJDK Runtime Environment Temurin-21+35 (build 21+35-LTS) OpenJDK 64-Bit Server VM Temurin-21+35 (build 21+35-LTS, mixed mode, sharing) ``` ### Steps to reproduce The `DomainEventEntry` table has some interesting events, here's an extract, showing the problematic index, token index and indexes in gaps: | id | timestamp | eventIdentifier | |----------|----------------------------|--------------------------------------| | 10711007 | 2024-01-09 13:25:50.872000 | 81a8045f-642b-49d5-ae9d-43f18213bb4a | | 10711006 | 2024-01-09 13:25:50.442000 | 22d2c1c6-536f-43dd-8e37-82867222329e | | 10711005 | 2024-01-09 13:26:05.321000 | 99341edd-4380-4afb-9812-df3b5ccb7de9 | | 10711004 | 2024-01-09 13:26:05.309000 | 790ae785-41ae-4e44-8915-2c2c807fc28a | | 10711003 | 2024-01-09 13:25:47.921000 | 118a4b2f-c3b6-4423-9ced-691700493f2e | | 10711002 | 2024-01-09 13:25:47.915000 | b4148e48-e256-4af7-8759-c8744302f847 | | 10711001 | 2024-01-09 13:25:19.924000 | bfb1441c-4499-4d95-90b3-d1aaac237240 | | 10711000 | 2024-01-09 13:25:19.923000 | 25781f93-70a1-4286-af15-19e483585899 | | 10710999 | 2024-01-09 13:25:19.708000 | 6f7d72d7-454d-4699-b474-a3bef2fe90a0 | | 10710998 | 2024-01-09 13:25:19.685000 | 36bb414f-fc3b-4add-937b-0b0f700ebe87 | | 10710997 | 2024-01-09 13:24:21.404000 | be12da7e-299c-40fa-af14-17f568ba6780 | The timestamps for the ids 10711003, 10711004, 10711005, and 10711006 seem to indicate that events were stored slightly out of order, but I'm not sure why this happens or if it is a problem. This may have been caused by the heavy load, as mentioned above. `TokenEntry` table contains the following information (for all persistent tokens): `{"index":10711005,"gaps":[10710998,10710999,10711000,10711001]}`. ### Expected behaviour The tracking processors should just advance as normal and finally catch up. ### Actual behaviour The error message `"The given index [10710999] should be larger than the token index [10711005] or be one of the token's gaps [[]]"` is repeated over and over. However, the `TokenEntry` table shows the following information: `{"index":10711005,"gaps":[10710998,10710999,10711000,10711001]}`. This means that either the log message is incorrect showing an empty list, or the gaps considered by the `GapAwareTrackingToken.advanceTo` is inconsistent with the actual gaps stored in the database... I'm not sure what the `maxGapOffset` passed to the method is (perhaps it should be included in the log message too, for easier debugging?), but I believe it should be `1000` based on our configuration. I tried to follow the logic in the `GapAwareTrackingToken.advanceTo` method, but I don't understand why the `gaps` set is empty. Full stack trace: ``` java.lang.IllegalArgumentException: The given index [10710999] should be larger than the token index [10711005] or be one of the token's gaps [[]] at org.axonframework.eventhandling.GapAwareTrackingToken.advanceTo(GapAwareTrackingToken.java:133) at org.axonframework.eventsourcing.eventstore.jdbc.JdbcEventStorageEngine.getTrackedEventData(JdbcEventStorageEngine.java:690) at org.axonframework.eventsourcing.eventstore.jdbc.JdbcEventStorageEngine.lambda$executeEventDataQuery$34(JdbcEventStorageEngine.java:550) at org.axonframework.common.jdbc.JdbcUtils.executeQuery(JdbcUtils.java:85) at org.axonframework.common.jdbc.JdbcUtils.executeQuery(JdbcUtils.java:55) at org.axonframework.eventsourcing.eventstore.jdbc.JdbcEventStorageEngine.executeEventDataQuery(JdbcEventStorageEngine.java:543) at org.axonframework.eventsourcing.eventstore.jdbc.JdbcEventStorageEngine.lambda$fetchTrackedEvents$32(JdbcEventStorageEngine.java:523) at org.axonframework.common.transaction.TransactionManager.fetchInTransaction(TransactionManager.java:70) at org.axonframework.eventsourcing.eventstore.jdbc.JdbcEventStorageEngine.fetchTrackedEvents(JdbcEventStorageEngine.java:514) at org.axonframework.eventsourcing.eventstore.BatchingEventStorageEngine.lambda$readEventData$1(BatchingEventStorageEngine.java:148) at org.axonframework.eventsourcing.eventstore.BatchingEventStorageEngine$EventStreamSpliterator.tryAdvance(BatchingEventStorageEngine.java:289) at java.base/java.util.Spliterator.forEachRemaining(Spliterator.java:332) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) at org.axonframework.eventsourcing.eventstore.EmbeddedEventStore$EventProducer.fetchData(EmbeddedEventStore.java:262) at org.axonframework.eventsourcing.eventstore.EmbeddedEventStore$EventProducer.run(EmbeddedEventStore.java:228) at org.axonframework.eventsourcing.eventstore.EmbeddedEventStore$EventProducer.access$2800(EmbeddedEventStore.java:205) at org.axonframework.eventsourcing.eventstore.EmbeddedEventStore.lambda$ensureProducerStarted$0(EmbeddedEventStore.java:149) at java.base/java.lang.Thread.run(Thread.java:1583) ``` ### Workaround Stopping the tracking event processors, resetting the index to `10711006` (the index after the one that failed) before restarting caused the processors to continue processing and eventually catch up (possibly losing processing of the "gap" events). Restarting the application (and thus resetting any in-memory tokens) also removed the problem for some processors.

Thank you

Gerard · April 5, 2024, 8:30am

Can you maybe try is version 4.9.3 of framework and/or bom version 4.9.4 solves the problem? Recent changes were making it need less memory, which introduced a bug, which has been fixed in the latest.

alainelkhoury · April 5, 2024, 8:42am

Thank you for your reply. We actually first encountered the problem with Axon v4.9.3, and then based on the feedback provided on the GitHub issue, we downgraded to version 4.9.0.

We’re not using the BOM, we’re depending directly on the spring boot axon starter dependency.

Gerard · April 5, 2024, 9:37am

It’s hard to say in that case. Maybe the load has increased and/or because there are already a lot of events, things are being slowed down to a point the gaps become too big?

alainelkhoury · April 5, 2024, 11:14am

I don’t think our system is under a heavy load, I believe ~90,000 events is a small number, and we haven’t changed anything in our infrastructure recently.

We’ve enabled pg_stat_statements for PostgreSQL to gain deeper insights into our application’s behavior; we found out that select queries on the token entry table are time-consuming, averaging ~200 minutes for 519 calls.

What are your thoughts on the cause of this issue? Do you think it could be related to the bug you mentioned in Axon v4.9.x?

Do you think downgrading to Axon v4.8.3 would resolve the issue?

It’s worth mentioning that we’re utilizing the same PostgreSQL instance that we previously used with Axon v4.5.13.

Gerard · April 5, 2024, 11:24am

I would advise to keep 4.9.3, as that also solved the issue for the screenshotted mmaask. Only they used the bom, so where actually first using 4.9.2 instead of 4.9.3. If the issue persists, please open an issue on GitHub. What is a large number also depends on the hardware, and the size of the events. With so many changes it’s hard to pin it down.

Maybe it’s another problem entirely? Do you see any errors from the event processors that stopped?

alainelkhoury · April 5, 2024, 12:04pm

Yes, I agree; you’re correct about the load on the system, but I don’t think that’s the case because before updating the service we reached millions of events in the database and the service was working normally.

We enabled debug logs for Axon, no errors have been detected from the event processors. Today for example, we attempted to launch a process (we call it a dry run job). The command intended to create the aggregate is CreateDryRunJobCmd. This command is expected to publish a DryRunJobCreatedEvent, which was indeed persisted in the domain event entry. However, this event was not handled by any event handlers (saga or projection).

I’ve attached the logs from when this command was dispatched.

Gerard · April 5, 2024, 12:40pm

So the question remains why the event wasn’t handled?

alainelkhoury · April 5, 2024, 12:41pm

yes exactly that’s my question

Gerard · April 6, 2024, 6:32am

I would expect some logging from the event processor that should point in a direction. It’s unlikely to just have skipped an event.

alainelkhoury · April 6, 2024, 9:56am

The issue isn’t that an event was skipped; rather, the problem is that all event processors are not handling any events, leading to the application freezing.

Gerard · April 6, 2024, 10:16am

But there isn’t any logging that it’s stopped processing events?

alainelkhoury · April 8, 2024, 6:24am

Unfortunately no, and there are no error logs either

alainelkhoury · April 8, 2024, 6:25am

Can you please tell me in which version these changes were introduced?

Gerard · April 8, 2024, 7:47am

4.9.1, Release Axon Framework v4.9.1 · AxonFramework/AxonFramework · GitHub, Resolved apparent memory leak in GapAwareTrackingToken by abuijze · Pull Request #2936 · AxonFramework/AxonFramework · GitHub