Event Processors Hanging in Axon Framework v4.9.0

We have encountered a situation in which the Axon GapAwareTrackingToken fails to advance, so new events are not processed. It’s worth noting that we did not face this problem when using Axon Framework version 4.5.15, Spring Boot 2.7, and XStream for serialization.

Application Information:

  • Axon Framework version: 4.9.0
  • Spring Boot version: 3.2.1
  • JDK Version: OpenJDK 17
  • Event Store: PostgreSQL version 13
  • Number of instances: 4
  • Number of stored events: 88000
  • Serialization: Jackson
  • Configuration: No special configuration

Token Entry Table

Expected Behavior:
The tracking processors should advance as usual and process new events.

Actual Behavior:
New events are not being processed.

Workaround:
Restarting the services resolves the issue.

Questions

  1. Are there any differences in the behavior of GapAwareTrackingToken between Axon Framework v4.5.15 and v4.9.0 that could explain the issue?
  2. Are there any known issues or bugs in Axon Framework v4.9.0 related to event processing?
  3. Could there be any compatibility issues between Axon Framework v4.9.0 and other components in the system, such as Spring Boot 3.2.1 or PostgreSQL version 13?
  4. How does the use of Jackson serialization compare to XStream serialization in terms of compatibility and performance with Axon Framework v4.9.0?
  5. Are there any specific error messages or logs indicating why the new events are not being processed?

Additionally, we found the following GitHub issue, and we believe it may be related to our problem:

Thank you

Can you check whether version 4.9.3 of the framework and/or BOM version 4.9.4 solves the problem? Recent changes reduced the memory the token needs, which introduced a bug that has been fixed in the latest release.

Thank you for your reply. We actually first encountered the problem with Axon v4.9.3, and then based on the feedback provided on the GitHub issue, we downgraded to version 4.9.0.

We’re not using the BOM; we depend directly on the Axon Spring Boot starter.

It’s hard to say in that case. Maybe the load has increased, and/or because there are already a lot of events, things are being slowed down to the point where the gaps become too big?
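To make the "gaps become too big" concern concrete, here is a minimal sketch of how a gap-aware position can be modeled. It is a toy model and not Axon's GapAwareTrackingToken implementation: the class GapTokenSketch and its methods are hypothetical names invented for illustration. The idea it shows is that the position is a highest-seen sequence number plus a set of lower numbers that have not been seen yet, and that this set grows whenever events become visible out of order and are never filled in or cleaned up.

```java
import java.util.TreeSet;

/** Toy model of a gap-aware position. NOT Axon's GapAwareTrackingToken. */
final class GapTokenSketch {
    // Highest sequence number read so far; -1 means nothing has been read.
    private long index = -1;
    // Sequence numbers below index that have not been read yet.
    private final TreeSet<Long> gaps = new TreeSet<>();

    /** Records that the event with the given sequence number was read. */
    void advanceTo(long seq) {
        if (seq > index) {
            // Every number skipped on the way to seq becomes a gap.
            for (long missing = index + 1; missing < seq; missing++) {
                gaps.add(missing);
            }
            index = seq;
        } else {
            // A previously missing event showed up: the gap is filled.
            gaps.remove(seq);
        }
    }

    /** Forgets all gaps strictly below the given sequence number. */
    void dropGapsBelow(long threshold) {
        gaps.headSet(threshold).clear();
    }

    long index() {
        return index;
    }

    int gapCount() {
        return gaps.size();
    }

    public static void main(String[] args) {
        GapTokenSketch token = new GapTokenSketch();
        token.advanceTo(0);
        token.advanceTo(1);
        token.advanceTo(4); // 2 and 3 are not visible yet, so they become gaps
        System.out.println("index=" + token.index() + " gaps=" + token.gapCount());
        token.advanceTo(2); // fills one of the gaps
        System.out.println("index=" + token.index() + " gaps=" + token.gapCount());
        token.dropGapsBelow(4); // gives up on the remaining old gap
        System.out.println("index=" + token.index() + " gaps=" + token.gapCount());
    }
}
```

In this model, the cost of carrying the position around grows with the number of open gaps, which is why gaps that are never filled or dropped can slow things down over time. Whether that is what is happening here is only a guess.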

I don’t think our system is under heavy load. I believe ~90,000 events is a small number, and we haven’t changed anything in our infrastructure recently.

We’ve enabled pg_stat_statements for PostgreSQL to gain deeper insight into our application’s behavior. We found that SELECT queries on the token entry table are time-consuming, averaging ~200 minutes for 519 calls.

What are your thoughts on the cause of this issue? Do you think it could be related to the bug you mentioned in Axon v4.9.x?

Do you think downgrading to Axon v4.8.3 would resolve the issue?

It’s worth mentioning that we’re utilizing the same PostgreSQL instance that we previously used with Axon v4.5.13.

I would advise staying on 4.9.3, as that also solved the issue for mmaask, the user in the screenshot. However, they used the BOM, so they were actually using 4.9.2 at first instead of 4.9.3. If the issue persists, please open an issue on GitHub. What counts as a large number also depends on the hardware and the size of the events. With so many changes, it’s hard to pin it down.

Maybe it’s another problem entirely? Do you see any errors from the event processors that stopped?

Yes, I agree; you’re correct about the load on the system, but I don’t think that’s the case because before updating the service we reached millions of events in the database and the service was working normally.

We enabled debug logging for Axon, and no errors were reported by the event processors. Today, for example, we attempted to launch a process (we call it a dry run job). The command intended to create the aggregate is CreateDryRunJobCmd. This command is expected to publish a DryRunJobCreatedEvent, which was indeed persisted in the domain event entry table. However, this event was not handled by any event handlers (saga or projection).

I’ve attached the logs from when this command was dispatched.

So the question remains why the event wasn’t handled?

Yes, exactly; that’s my question.

I would expect some logging from the event processor that should point in a direction. It’s unlikely to just have skipped an event.

The issue isn’t that an event was skipped; rather, none of the event processors are handling any events, which causes the application to freeze.

But there isn’t any logging indicating that it has stopped processing events?

Unfortunately not, and there are no error logs either.

Can you please tell me in which version these changes were introduced?

4.9.1. See the Axon Framework v4.9.1 release notes and pull request #2936, "Resolved apparent memory leak in GapAwareTrackingToken" by abuijze, both in the AxonFramework/AxonFramework repository on GitHub.
