Axon Server Passive Backup nodes

pals · February 20, 2023, 2:58pm

In a configuration spanning two AWS regions, with 3 primary nodes in Region-A and 1 Passive Backup node in Region-B: after an AMI refresh of the Passive Backup node (i.e., it was offline for a short period), will it automatically “catch up” with the leader in Region-A once it is up and running again?

Also, is there an observability metric for how far the Passive Backup node lags behind the leader node, in terms of the number of events replicated?

Bert_Laverman · February 21, 2023, 9:25am

Hi!

To start with your first question: yes, the node will get any missing events from the current leader of the Replication Group for which it is a Passive Backup. However, I wonder what your worry is when discussing “lagging.” In general, all nodes get a continuous stream of event transactions. You can expect the new events to be on a Backup node (in a different region) within a reasonable number of milliseconds. Typical networking delays across the pond, for example, are between 100 and 200 ms unless your Cloud provider happens to have an exceptionally fast link.

That said, the Backup node will never be fully up-to-date because new transactions are continuously being committed. This is what we call “Eventual Consistency”: the data is consistent, but it may take a moment before the latest changes have been applied. You could retrieve Raft-related metrics from the “/internal/raft/status” endpoints on the Axon Server nodes, which reveal the latest committed event and the last applied event. Still, the value of such a measurement is mainly on a “comforting” level. If the gap is large, you should have other (networking) metrics to help you point out possible reasons.
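
As a rough illustration of how such a “comfort” check could look, here is a minimal Python sketch. It assumes each node's “/internal/raft/status” response can be reduced to a JSON object holding two integer indices. The field names `commitIndex` and `lastAppliedIndex`, the base URL, and the absence of authentication are all assumptions made for this example, not the documented response format, so check the actual payload of your Axon Server version and adjust the keys.

```python
import json
import urllib.request


def fetch_raft_status(base_url: str, timeout: float = 5.0):
    """GET <base_url>/internal/raft/status and decode the JSON body.

    base_url is hypothetical (e.g. "http://backup-node:8024"); add an
    access token header here if your cluster has access control enabled.
    """
    url = f"{base_url}/internal/raft/status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def apply_gap(status: dict,
              committed_key: str = "commitIndex",        # assumed field name
              applied_key: str = "lastAppliedIndex") -> int:  # assumed field name
    """Entries this node knows are committed but has not applied yet."""
    return max(int(status[committed_key]) - int(status[applied_key]), 0)


def lag_behind_leader(leader_status: dict,
                      backup_status: dict,
                      committed_key: str = "commitIndex",
                      applied_key: str = "lastAppliedIndex") -> int:
    """Leader's latest committed index minus the backup's last applied index."""
    gap = int(leader_status[committed_key]) - int(backup_status[applied_key])
    return max(gap, 0)


if __name__ == "__main__":
    # Hand-written sample values standing in for two real responses.
    leader = {"commitIndex": 1500, "lastAppliedIndex": 1500}
    backup = {"commitIndex": 1480, "lastAppliedIndex": 1475}
    print("backup apply gap:", apply_gap(backup))
    print("backup lag behind leader:", lag_behind_leader(leader, backup))
```

Because the two nodes are polled at slightly different moments, a small non-zero gap is normal; only a gap that keeps growing is worth investigating.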

If you are worried that a potential disaster will cause unacceptable data loss, you should employ (a pair of) Active Backups, as the transactions require commits from those and a majority of the Primary nodes.

Cheers,
Bert Laverman

pals · February 22, 2023, 5:48pm

Many thanks, Bert. Yes, I agree with you and accept “eventual consistency” and network latency. My concern was from a disaster recovery perspective. I will try out the observability metric as a comfort factor during the initial period after go-live.