Confused About Event Replay Behavior in Axon – Need Help Understanding

Hi all,

I’m relatively new to Axon and attempting to understand how event replay actually occurs in the real world. I’ve been doing some testing around replaying events to reconstruct projections, and it appears to work okay in a local environment. But I’m a bit confused about what actually occurs under the hood when you’re replaying events in a production environment.

For instance, does Axon handle duplicate processing automatically, or do we need to implement idempotent logic ourselves? What’s the most efficient way of initiating a replay for a given projection without impacting the entire system? I don’t want any downtime or interference with the live processing.

I’ve gone through the official docs and some tutorials, but I’d love to hear from experienced users or contributors here who have dealt with replays at scale. Any gotchas, best practices, or insights would be really appreciated.

Thanks in advance! Just trying to get better at designing robust event-driven systems using Axon.


Hi,

I am not an Axon developer, but we do have some experience with using Axon in a production environment, so I will try and give some insights.

Axon does not handle duplicate processing for you, so you have to make sure yourself that the handlers are idempotent. This becomes even more important when you use dead letter queues, because Axon then no longer guarantees one-time delivery; it becomes at-least-once delivery, even without a replay. This is usually easier said than done, so we have implemented idempotency for the dead letter queues in only very few places. However, if possible, you should design at least your command handlers to be idempotent right from the start.
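To make "idempotent handler" concrete, here is a minimal plain-Java sketch, not Axon API. It assumes every event carries a unique identifier; the `AmountDeposited` event, its `eventId` field, and the `BalanceProjection` class are hypothetical names used only for illustration. The handler remembers which identifiers it has already applied and skips repeats.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical event type; eventId stands in for a unique event identifier.
class AmountDeposited {
    final String eventId;
    final String accountId;
    final long amount;

    AmountDeposited(String eventId, String accountId, long amount) {
        this.eventId = eventId;
        this.accountId = accountId;
        this.amount = amount;
    }
}

// Hypothetical projection that tolerates the same event being delivered twice.
class BalanceProjection {
    private final Set<String> processedEventIds = new HashSet<>();
    private final Map<String, Long> balances = new HashMap<>();

    // Returns true if the event was applied, false if it was a duplicate.
    boolean on(AmountDeposited event) {
        if (!processedEventIds.add(event.eventId)) {
            return false; // already handled once, so skip it
        }
        balances.merge(event.accountId, event.amount, Long::sum);
        return true;
    }

    long balanceOf(String accountId) {
        return balances.getOrDefault(accountId, 0L);
    }
}
```

In a real projection the set of processed identifiers would presumably live in the same database, and be written in the same transaction, as the projection itself; the in-memory set here only shows the idea.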

So maybe let’s put the issue with idempotency aside. When we handle events, we usually have handlers for the projections (think JPA-based entities) and handlers with side effects (sending mails, for instance) in the same processing group. The side-effect handlers usually rely on the projections and thus have a higher order, so they are executed after the projection handlers. When we decide that we want to replay the events (for instance, because some data is missing in the projections), we reset the processing group. The handlers for the projections have a reset handler which deletes the whole projection. The handlers with side effects are annotated with @DisallowReplay, which means that those side effects are not executed a second time.
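The flow described above can be modeled in a few lines of plain Java. This is only a toy model of the pattern, not Axon code: `UserRegistered`, `UserProjection`, `WelcomeMailer`, and `ProcessingGroupModel` are hypothetical names, `reset()` plays the role of the reset handler, and the `replaying` flag plays the role that @DisallowReplay has in Axon.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event type for the model.
class UserRegistered {
    final String userId;
    final String email;

    UserRegistered(String userId, String email) {
        this.userId = userId;
        this.email = email;
    }
}

// Projection handler: rebuilt from scratch on every replay.
class UserProjection {
    final Map<String, String> emailsByUser = new HashMap<>();

    void on(UserRegistered event) {
        emailsByUser.put(event.userId, event.email);
    }

    // Plays the role of a reset handler: drop the whole projection.
    void reset() {
        emailsByUser.clear();
    }
}

// Side-effect handler: must not fire again while replaying.
class WelcomeMailer {
    final List<String> sentTo = new ArrayList<>();

    void on(UserRegistered event, boolean replaying) {
        if (replaying) {
            return; // the mail was already sent during live processing
        }
        sentTo.add(event.email);
    }
}

// Stand-in for a processing group with an ordered pair of handlers.
class ProcessingGroupModel {
    final List<UserRegistered> eventLog = new ArrayList<>();
    final UserProjection projection = new UserProjection();
    final WelcomeMailer mailer = new WelcomeMailer();

    void publish(UserRegistered event) {
        eventLog.add(event);
        dispatch(event, false);
    }

    void resetAndReplay() {
        projection.reset();
        for (UserRegistered event : eventLog) {
            dispatch(event, true);
        }
    }

    private void dispatch(UserRegistered event, boolean replaying) {
        projection.on(event);        // lower order: projection first
        mailer.on(event, replaying); // higher order: side effect second
    }
}
```

After `resetAndReplay()` the projection contains every user again, while the list of sent mails is unchanged.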

However, we do have some situations in which we want to execute some of those side effects during a replay (for instance, if we have to send some missing data via commands to another system). In this case we use a custom replay context (which can be added when resetting the processing group) and use it to decide which side effects should be executed during the replay and which should not. We developed our own replay context containing so-called tags, and a variant of the @DisallowReplay annotation that recognizes those tags (so we can say: the replay is disallowed for this handler except for some specific tags).
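The decision rule behind such a tag-aware annotation could look roughly like the sketch below. This is a guess at the logic, not the actual implementation and not Axon API: `TaggedReplayContext`, `ReplayGate`, and the tag names are hypothetical, and a `null` context is assumed to mean "live processing, no replay".

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical replay context carrying the tags chosen when the replay was started.
class TaggedReplayContext {
    final Set<String> tags;

    TaggedReplayContext(Set<String> tags) {
        this.tags = Set.copyOf(tags);
    }
}

// Hypothetical gate: should a side-effect handler run for this delivery?
class ReplayGate {
    // allowedTags are the tags for which the handler may still run during a replay.
    static boolean shouldRun(TaggedReplayContext context, Set<String> allowedTags) {
        if (context == null) {
            return true; // live processing: always run
        }
        // During a replay, run only if the replay carries one of the handler's tags.
        return !Collections.disjoint(context.tags, allowedTags);
    }
}
```

A handler that resends data to another system might declare a tag such as `"resend-to-crm"`, so a replay started with that tag executes it, while a mail handler with no allowed tags stays silent during every replay.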

Now about downtime. We reset the processing groups as part of our database migration (we actually use Liquibase to mark certain processing groups to be reset after start) and thus as part of the installation of a new version. Those installations usually come with downtime at the weekend anyway and we don’t expect users to work with the application right away. So we accept that some projections are not up to date when the application has just been restarted. However, we are well aware that the replay takes some time when there are a lot of events in the system already.

I hope that answers at least some of your questions.

Best regards

Nils


I have nothing to add to Nils's reply here! Pretty sure that’ll help you quite a bit, @mathew_25.

However, I do want to respond to this part of your post:

I’d wager you were missing some parts in the documentation if you came here. Sure, that’s what the forum is for, but I am always on the lookout to improve the documentation we have. The most extensive description we currently have is likely within our course material, specifically this course. Nonetheless, if you noticed any specific gaps where you expected more, @mathew_25, I am all ears.