Record and Replay – realtime playback of system state

Alexey_Pakseykin · May 17, 2016, 9:22am

I’m studying the possibility of using Axon to implement a feature where changes in the (distributed) system state can be recorded and then replayed on the original time scale.

“Original time scale” is meant in the sense that frames recorded on a video tape are replayed at the speed at which they were recorded (hence the “frame” analogy, see below). This feature can be used for investigation (of previous incidents), simulation or training purposes. For example, think of a multiplayer game with some backend system and frontends (which are not necessarily dummy views and can also constitute part of the distributed system state).

“Frame”

The first obvious issue is the conceptual meaning of an event in CQRS/ES. For the sake of clarity, I’ll use a different term, “frame”, by analogy with frames in video playback (as in “frames per second”). The term “event” can then be used strictly in its CQRS/ES meaning.

In the context of CQRS/ES, the event log is only used for restoring the state of aggregates from the associated event store. My case is clearly different in that respect: instead of restoring the state of aggregates immediately (as soon as possible) from their event log, I want to use a sequence of frames from a frame store (analogous to the term “event store”) which resides on a playback server, and apply them progressively to aggregates in chronological order on the original time scale.

I’m not saying that “frames” cannot be implemented by events (and mean the same thing). All I’m trying to do is to separate the meanings. Although events seem like the immediate candidate to model “frames”, I already have (in my head) possible implementation scenarios where “frames” may well be some special cases (special by additional metadata or completely different classes wrapping the original events).

External systems = outside question

I’m aware of the problems related to replaying anything on a production system and accidentally exchanging messages with real external systems. Let’s assume the playback is done on another (isolated) instance of the system with all Gateways closed for outgoing messages, and that incoming messages from external systems are also simulated via this playback.

Processing order = outside question

Another problem is the frame processing order in such a distributed environment. Depending on the implementation of the system, the order may be affected by differences in performance, latencies and thread scheduling, which are generally nondeterministic. As a result, the playback server may record frames in one order while each individual node may have processed them in a different one. When such a recording is used as the source of “frames” in an isolated system, the order reflects the recorded order rather than the original real-time order.

Any suggestions?

All I want for now is a few ideas and the right directions in which to dig deeper.

Is there existing infrastructure in Axon for this case?

What I can see are classes like ReplayingCluster:

http://www.axonframework.org/apidocs/2.0/org/axonframework/eventhandling/replay/ReplayingCluster.html

However, ReplayingCluster is only intended to be used to rebuild view models (not state of aggregates):

https://groups.google.com/d/msg/axonframework/Azkao_xY0hE/ap-DFfwwaVEJ

“One thing to take special care with is to never replay your events on the EventBus. You’re very likely to have handlers there that don’t support replying, such as saga’s. Replaying them would cause commands to be generating, changing your application’s state, instead of rebuilding it.”

“In my design, a single Event Handler is responsible for updating one or more related tables. If I want to rebuild these tables, I clear them and replay all events from the event store on that single handler.”

In my case I am thinking of a central playback server which feeds “frames” to consumers at the right (delayed) time rather than immediately.
So, despite all the warnings, some sort of “frame bus” through which the playback is fed to its consumers sounds like the right solution.
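
To make the “delayed right time” idea concrete, here is a minimal sketch in plain Java. It uses no Axon classes; `Frame`, `FramePlayer` and the `speed` parameter are hypothetical names introduced only for illustration. It assumes frames are already sorted by recording time and schedules each one at its original offset from the first frame.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical "frame": a recorded message plus the time it was recorded (Java 16+ record).
record Frame(Instant recordedAt, Object payload) {}

final class FramePlayer {

    // Offset of a frame from the first frame, scaled by a speed factor (1.0 = original time scale).
    static Duration delayFor(Frame first, Frame frame, double speed) {
        long originalMillis = Duration.between(first.recordedAt(), frame.recordedAt()).toMillis();
        return Duration.ofMillis(Math.round(originalMillis / speed));
    }

    // Schedules every frame for delivery at its scaled offset. Frames must be sorted by recordedAt.
    static ScheduledExecutorService play(List<Frame> frames, double speed, Consumer<Frame> consumer) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        if (!frames.isEmpty()) {
            Frame first = frames.get(0);
            for (Frame frame : frames) {
                long delay = delayFor(first, frame, speed).toMillis();
                scheduler.schedule(() -> consumer.accept(frame), delay, TimeUnit.MILLISECONDS);
            }
        }
        // Tasks that are already scheduled still run; no new tasks are accepted.
        scheduler.shutdown();
        return scheduler;
    }
}
```

The consumer stands in for whatever the “frame bus” delivers to. A single scheduler thread keeps delivery in recorded order, which matches the caveat about processing order above: the playback reflects the recorded order, not necessarily the order each node originally saw.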

Alexey_Pakseykin · May 18, 2016, 3:15pm

I think I just figured out two major mutually exclusive options.

Has anyone out there built a similar record-and-replay server before?

“Simulation” versus “State Replay”

Without understanding that these use cases are mutually exclusive, an infinite loop may be created in your head. It took me a while to break out of it – thanks to ExhaustiveThinkingException.

One has to understand that Simulation and State Replay are mutually exclusive use cases (at least in CQRS/ES):

State Replay is replaying Events and forcing the system into exactly the same old state which it had at a specific point in time in the past.

Simulation is replaying Commands and letting the system recompute its new state based on the same sequence of input commands.

The crucial difference is that the resulting states in these two use cases can be different.

Simulated and Replayed states may diverge due to command processing order, fixed bugs, any other system differences between the time of recording and the time of replaying…
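
A toy illustration of that divergence, as a sketch in plain Java (no Axon classes). `Scoreboard`, `AddPoints`, `PointsAdded` and the “rule” are all hypothetical; the rule stands in for whatever business logic may have changed between recording and replaying.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical command and event for a toy aggregate (Java 16+ records).
record AddPoints(int requested) {}
record PointsAdded(int granted) {}

final class Scoreboard {
    private final IntUnaryOperator rule; // business rule: how many points a command actually grants
    private int total;

    Scoreboard(IntUnaryOperator rule) { this.rule = rule; }

    // Command side: runs the decision logic, then applies the resulting event.
    PointsAdded handle(AddPoints command) {
        PointsAdded event = new PointsAdded(rule.applyAsInt(command.requested()));
        apply(event);
        return event;
    }

    // Event side: no decisions, only a state change.
    void apply(PointsAdded event) { total += event.granted(); }

    int total() { return total; }
}

final class ReplayDemo {
    // "State Replay": feed recorded events; the current rule is never consulted.
    static int stateReplay(List<PointsAdded> recordedEvents) {
        Scoreboard board = new Scoreboard(n -> n);
        recordedEvents.forEach(board::apply);
        return board.total();
    }

    // "Simulation": feed recorded commands through whatever rule is deployed now.
    static int simulate(List<AddPoints> recordedCommands, IntUnaryOperator currentRule) {
        Scoreboard board = new Scoreboard(currentRule);
        recordedCommands.forEach(board::handle);
        return board.total();
    }
}
```

If the commands were recorded while the rule capped each grant at 10 and the cap has since been removed, `stateReplay` reproduces the capped total while `simulate` yields the uncapped one.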

The nature of separation

I’ll emphasise.

Simulation and State Replay differ from each other in the same way that Commands and Events differ in CQRS/ES.

There is a one-to-one correspondence (individually or in pairs):

“Simulation” ~ “Command”

“State Replay” ~ “Events”

(“Simulation”, “State Replay”) ~ (“Command”, “Events”)

To avoid confusion from now on:

For clarity, the terms “Command Replay” and “Event Replay” will mostly be used, since they belong to the CQRS/ES ubiquitous language. They describe exactly what is going on in the implementation.

The terms “Simulation” and “State Replay” will only be used to reference the two distinct use cases. It just so happens that one use case replays Commands and the other replays Events if CQRS/ES is used.

Simultaneous “Command Replay” and “Event Replay”

It is impossible to replay both at the same time.

Why?

It’s obvious.

It will create a mess.

Replayed commands will generate new events (similar to old events which are already part of event store) as a result of replaying the same commands.

How would replaying old events fit together with the new events (generated by replayed commands)?

It does not make any sense even if it can be done technically somehow.

Command suppression in “Event Replay” mode

For the same reason that Commands and Events cannot be replayed together, Commands must be suppressed when Events are replayed.

Any Command sent in response to Event processing creates the same mess in replay mode as described before: the Command generates new Events in addition to the similar old Events being replayed.

In other words, during Event Replay, an aggregate's state must be fed and rebuilt from the event store only. This is why the following warning keeps coming up (it applies unless commands are suppressed during event replay):

https://groups.google.com/d/msg/axonframework/Azkao_xY0hE/ap-DFfwwaVEJ

“One thing to take special care with is to never replay your events on the EventBus. You’re very likely to have handlers there that don’t support replying, such as saga’s. Replaying them would cause commands to be generating, changing your application’s state, instead of rebuilding it.”
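
One way to picture command suppression is a gate in front of the command dispatcher. This is only a single-threaded sketch in plain Java; `CommandGate` and `ShippingSaga` are hypothetical names and are not part of Axon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical gate in front of the command dispatcher; not an Axon class.
final class CommandGate {
    private final Consumer<Object> dispatcher;
    private final List<Object> suppressed = new ArrayList<>();
    private boolean replaying;

    CommandGate(Consumer<Object> dispatcher) { this.dispatcher = dispatcher; }

    void setReplaying(boolean replaying) { this.replaying = replaying; }

    // Forwards the command in live mode; drops (but remembers) it during event replay.
    void send(Object command) {
        if (replaying) {
            suppressed.add(command);
        } else {
            dispatcher.accept(command);
        }
    }

    // Commands that were dropped during replay, kept only for inspection.
    List<Object> suppressed() { return List.copyOf(suppressed); }
}

// Hypothetical saga-like handler: reacts to an event by sending a command.
final class ShippingSaga {
    private final CommandGate gate;

    ShippingSaga(CommandGate gate) { this.gate = gate; }

    void on(String orderPaidEvent) { gate.send("ShipOrder:" + orderPaidEvent); }
}
```

With the gate in replay mode, the saga still sees every replayed event, but nothing it sends reaches an aggregate, so no new events are produced alongside the old ones.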

Hybrid solution

Wait a minute!

I’ve just repeatedly said that Commands and Events cannot be replayed at the same time, haven’t I?

Yes. In the simple case of a single aggregate, it obviously makes no sense.

However, when the system is broken down into pieces carefully, things can become more flexible:

Some aggregates can use “State Replay” via replaying their Events.

Some aggregates can be under “Simulation” via replaying their Commands.

Some aggregates may even process live (not simulated) Commands from user or external system.

It may all become complex to debug and reason about, but the point is that conceptually it is not prohibited and these modes can co-exist.

The two processes are incompatible by nature. However, by selectively replaying Commands or Events depending on the specific subsystem (a Bounded Context with its own set of Aggregates?), a hybrid solution is possible.
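
The hybrid idea can be sketched as a per-subsystem routing table. This is plain Java with hypothetical names (`ReplayMode`, `MessageKind`, `ReplayPlan`); it assumes each subsystem is identified by a simple string and accepts exactly one kind of input.

```java
import java.util.HashMap;
import java.util.Map;

// How a subsystem is driven during playback (hypothetical names).
enum ReplayMode { EVENT_REPLAY, COMMAND_REPLAY, LIVE }

// What kind of message is being offered to a subsystem.
enum MessageKind { RECORDED_EVENT, RECORDED_COMMAND, LIVE_COMMAND }

final class ReplayPlan {
    private final Map<String, ReplayMode> modeByContext = new HashMap<>();

    ReplayPlan assign(String context, ReplayMode mode) {
        modeByContext.put(context, mode);
        return this;
    }

    // A context accepts exactly one kind of input; unknown contexts accept nothing.
    boolean accepts(String context, MessageKind kind) {
        ReplayMode mode = modeByContext.get(context);
        if (mode == null) {
            return false;
        }
        switch (mode) {
            case EVENT_REPLAY:
                return kind == MessageKind.RECORDED_EVENT;
            case COMMAND_REPLAY:
                return kind == MessageKind.RECORDED_COMMAND;
            default:
                return kind == MessageKind.LIVE_COMMAND;
        }
    }
}
```

Because every context has exactly one mode, recorded Commands and recorded Events never meet inside the same subsystem, which is the condition the hybrid solution depends on.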

Abandon term “frames”

I was always thinking about the hybrid solution while being unable to separate these use cases. This forced me to avoid calling messages by their exact name, because I sensed problems. So, instead, I introduced the term “frames” to hide the real type of the replayed messages (see the first post).

I was stuck and confused. Not so anymore.

“Command store”

This term probably sounds insane to any CQRS/ES veteran, but look – again, I carefully surrounded it with quotes.

There is a need to store commands if they are supposed to be replayed for “Simulation” later.

Now, recording commands is a problem because they are always addressed to a specific aggregate (by its id):

There is always only a single destination per command – an aggregate.

And the aggregate is busy with domain logic. The aggregate is not supposed to be the server that records commands.

However, I can think of a way to wrap every command received by this aggregate inside a special event (“Command X was captured”).

There are just nuances like:

Should we (somehow) intercept commands on the way to the aggregate and call the special events “Command X was sent”?

Or should we record commands when the aggregate has already received them and call the special events “Command X was received”?

I like the first, “was sent”, type of event for capturing the command, because we can replay (“simulate”) the fact of sending the command even if it never reached the aggregate.

But these are not important details for now.

The point is that any command can be wrapped and stored inside the event store.

A special “command store” is not required.
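
A minimal sketch of the “Command X was sent” variant, in plain Java. `CommandSent` and `RecordingCommandSender` are hypothetical names, and the `List` merely stands in for the event store; nothing here is Axon API.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical wrapper event: "Command X was sent" (Java 16+ record).
record CommandSent(Instant sentAt, String targetAggregateId, Object command) {}

final class RecordingCommandSender {
    private final List<CommandSent> eventLog;            // stand-in for the event store
    private final Clock clock;
    private final BiConsumer<String, Object> dispatcher; // delivers the command to its aggregate

    RecordingCommandSender(List<CommandSent> eventLog, Clock clock, BiConsumer<String, Object> dispatcher) {
        this.eventLog = eventLog;
        this.clock = clock;
        this.dispatcher = dispatcher;
    }

    // Records the command before dispatching it, so the fact of sending
    // is captured even if delivery to the aggregate fails.
    void send(String aggregateId, Object command) {
        eventLog.add(new CommandSent(clock.instant(), aggregateId, command));
        dispatcher.accept(aggregateId, command);
    }
}
```

Recording before dispatch is what distinguishes “was sent” from “was received”: the wrapper event exists whether or not the aggregate ever handles the command.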

Regards,
Alexey

Allard · May 19, 2016, 8:37am

Hi Alexey,

Event Replay isn’t the process of passing all events to an aggregate, but the process of reconstructing a view model (a.k.a. projection) based on the entire history of the application. This process should be seen as a last resort when changing the structure of a projection that cannot be fixed with a simple migration process.
Replaying (of events) can also be used to create new projections.

If you have a requirement to be able to replay (sorry, overloaded term) in the frontend, then basically that’s just a new projection. It is essentially a projection that you query that contains the relevant “frames”, as you called them, with their timestamps. Once you have that, it’s relatively easy to build the UI components for it, including a slider to pass time.
However, this query model might be very inefficient in storage, compared to the number of times it will be used. So you’d want to optimize it. One way to do that (and that’s the way I have done it on several occasions before) is to build the projections on demand.

If your projection matches the aggregate boundaries on the command side, it is very easy to create an on-demand projection. When a query comes in, you simply read the aggregate’s event stream from the event store, and pass the events to an event handler instance that you have instantiated specifically for that query. The event handler updates to a specific moment in time and returns its state. Alternatively, you can create all frames in memory and return them to the UI together, depending on how much data that is. You can also stream the frames, similar to streaming large movies.
Note that the event replay mechanism by Axon is not used for this solution.
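
The on-demand projection described above can be sketched in plain Java as a fold over one aggregate’s event stream up to a point in time. `StoredEvent` and `OnDemandProjection` are hypothetical names; reading the stream from an actual event store is left out, and the stream is assumed to be in chronological order.

```java
import java.time.Instant;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical stored event: payload plus its timestamp (Java 16+ record).
record StoredEvent(Instant timestamp, Object payload) {}

final class OnDemandProjection {

    // Builds the projection state as of pointInTime by applying events in order
    // and stopping at the first event that is later than the requested moment.
    static <S> S stateAt(List<StoredEvent> aggregateStream,
                         Instant pointInTime,
                         S initialState,
                         BiFunction<S, Object, S> handler) {
        S state = initialState;
        for (StoredEvent event : aggregateStream) {
            if (event.timestamp().isAfter(pointInTime)) {
                break;
            }
            state = handler.apply(state, event.payload());
        }
        return state;
    }
}
```

The handler is created per query and thrown away afterwards, so nothing is persisted for a view that is rarely used.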

If your aggregate boundaries on the command side do not match the scope of the query, you can use the visitEvents method on the EventStoreManagement interface (which all production-ready Axon provided Event Stores implement) to read all events from the event store. You probably want to use Criteria to limit the number of events read to those that are (likely to be) significant for the query.

I hope this helps.

Cheers,

Allard

Alexey_Pakseykin · May 25, 2016, 3:29am

Allard, thanks for the detailed answer!

I see. I should not stretch the meaning of “Event Replay” (and somehow hope that it can be adjusted for my use case).
I already sensed it before, but it was important to rule it out here explicitly.

“If you have a requirement to be able to replay (sorry, overloaded term) in the frontend, then basically that just a new projection.”

Yes. I also thought about this…

By analogy with a jungle:

Why do I need to regrow all the trees and animals in the jungle every time the movie is replayed, if all people care about is that flat rectangular TV screen showing the documentary?

Even though I could satisfy the “Record and Replay” requirement using a projection, I kept (unintentionally) mixing in another one, simulation, which in the jungle analogy is:

All the trees and animals are actually being regrown in a controlled environment under the influence of some fake sun and some fake atmosphere.

Now that I have separated these two use cases, I don’t see the need to explore a “generic record and replay of everything”.

However, I will still need to implement Simulation (replaying of Commands) which is simply going to be a different/separate and more straightforward solution now.

“It is essentially a projection that you query that contains the relevant “frames”, as you called them, with their timestamps. Once you have that, it’s relatively easy to build the UI components for it, including a slider to pass time.”

This! This clarification is actually a new point to me.

Although I did think about pushing only replayed events (not commands) and only to some sort of projection (without feeding anything else within the system), I still amateurishly imagined that I would implement some control on top of the Event Store to release one “frame” at a time. It never occurred to me that I could simply feed all events to the projection in one shot and then implement control of that projection for the replay of “frames”.

Just to feel the difference:

Messing with some wrapper around the Event Store to implement the additional responsibility of sending “frames” one by one at their right time of replay.

Just using the framework as it expects to be used: push all “frames” to another service in one shot, then enhance the frontend with replay controls that integrate with the new service (the projection).
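
The second option might look roughly like this: the projection holds all frames keyed by timestamp, and the frontend slider only asks for “the frame at time T”. A sketch in plain Java; `FrameTimeline` is a hypothetical name and the snapshot type is left generic.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical projection: every frame is stored once, keyed by its timestamp.
final class FrameTimeline<S> {
    private final TreeMap<Instant, S> frames = new TreeMap<>();

    // Called while all events are fed to the projection in one shot.
    void record(Instant at, S snapshot) {
        frames.put(at, snapshot);
    }

    // Called by the replay controls: latest frame at or before the slider position.
    Optional<S> frameAt(Instant sliderPosition) {
        Map.Entry<Instant, S> entry = frames.floorEntry(sliderPosition);
        return Optional.ofNullable(entry).map(Map.Entry::getValue);
    }
}
```

All timing logic lives in the frontend: playing back on the original time scale is just advancing the slider position with the wall clock and querying `frameAt` again.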

For the rest, what I can say is that my case is the second, more generic one: “aggregate boundaries on the command side do not match the scope of the query”.