Axon Server was used in a single-instance configuration inside a container, writing its event store data to a mounted file system.
After updating the Docker container from version 2024.1.4 to 2024.2.1, there was a fatal replication error:
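For reference, the setup looked roughly like this; a minimal sketch assuming the standard directory layout of the axoniq/axonserver image, with only the event store directory on a persistent volume (the host path is a placeholder):

```
# Single-instance Axon Server; only the event store survives a container rebuild.
# /axonserver/events as the event store location is assumed from the image layout.
docker run -d --name axonserver \
  -p 8024:8024 -p 8124:8124 \
  -v /mnt/axon/events:/axonserver/events \
  axoniq/axonserver:2024.1.4
```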
```
2025-01-31T01:07:16.616Z INFO 1 --- [Axon Server] [ault-raftNode-0] i.a.a.l.file.WritableFileStorageTier : default-SNAPSHOT: Segments initialized
2025-01-31T01:07:17.201Z INFO 1 --- [Axon Server] [ault-raftNode-0] i.a.a.logging.ClusterEventsLogger : default: context default created, my role PRIMARY, min event token 0, min snapshot token 0
2025-01-31T01:07:17.226Z ERROR 1 --- [Axon Server] [ault-raftNode-0] i.a.a.cluster.LogEntryProcessor : default: Apply failed FATALLY last applied : 3, commitIndex: 13
io.axoniq.axonserver.cluster.exception.FatalApplyException: [AXONIQ-2200] default: Replicated EVENT transaction 0 does not match stored transaction
    at io.axoniq.axonserver.enterprise.replication.logconsumer.EventLogEntryConsumer.consumeLogEntry(iaa:141) ~[!/:na]
```
After restarting the instance once more, it started normally.
Some questions:
The log output of the previous-version instance did not indicate any error on shutdown, so can it be assumed that the data is consistent?
A lot of clients immediately connected to the updated server instance that failed. Can this disrupt the startup process? (I assume not: as the server was never elected, clients should not have been able to send any queries or commands?)
At the moment, all the data for the container is 'fresh' and only the event store is reused. Assuming the event store is consistent, is there any issue with this approach? For example, would it be better to persist all the data, as this is probably the way the server is generally tested?
So far this was a single incident of this kind; the event store currently holds just short of 4 million events in a single context.
If the shutdown of Axon Server was clean, then I would not expect any data loss.
If you want to be 100% certain that something like this does not cause data loss, I would suggest using a distributed setup of Axon Server. An easier step-in model for this could be AxonIQ Console.
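A minimal sketch of what such a distributed setup could look like, assuming the documented autocluster properties (node names and hostnames are placeholders; each of the other nodes would use the same configuration with its own name and hostnames):

```
# axonserver.properties on the first node (values are placeholders)
axoniq.axonserver.name=axonserver-1
axoniq.axonserver.hostname=axonserver-1.example.internal
axoniq.axonserver.internal-hostname=axonserver-1.example.internal
# every node, including this one, points at the same first node
axoniq.axonserver.autocluster.first=axonserver-1.example.internal
axoniq.axonserver.autocluster.contexts=_admin,default
```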
I would not expect issues when your (I assume) Axon Framework applications immediately start hammering Axon Server with more messages on reconnect; Axon Server is smart enough to pause that flow. Furthermore, the exception you are receiving does not hint at that area to me either.
Does this mean you throw out the controlDb/configDb (the latter is the current name, the former the old one) and the logs when upgrading? That should not be necessary, and it could very well be the reason you got the aforementioned issue. If you are curious about what those contain, I would refer you to this part of our documentation.
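To make this concrete: besides the event store, Axon Server keeps the controlDb/configDb and the replication logs on disk, and all of these locations should survive an upgrade. A sketch of the relevant properties (property names as documented; the paths are placeholders, assuming each location is mounted explicitly):

```
# axonserver.properties — keep all of these locations on persistent storage
axoniq.axonserver.event.storage=/axonserver/events
axoniq.axonserver.snapshot.storage=/axonserver/events
axoniq.axonserver.controldb-path=/axonserver/data
axoniq.axonserver.replication.log-storage-folder=/axonserver/log
```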
Yes, this was the case; since then, everything in the container has been moved to a persistent store.
As the network and hostname were not constant in the container before, this required some extra Axon Server configuration so the internal identifiers won't change and the controlDb/configDb is recognized as the node's own.
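For completeness, a sketch of that extra configuration, assuming the standard properties for pinning the node identity rather than deriving it from the container hostname (values are placeholders):

```
# axonserver.properties — fix the node name and hostnames so the identity
# stays stable across container recreations (values are placeholders)
axoniq.axonserver.name=axonserver
axoniq.axonserver.hostname=axon.example.internal
axoniq.axonserver.internal-hostname=axon.example.internal
```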