Error on InstallSnapshot stream: CANCELLED

Hello, a little bit of context, I was using Axonserver v4.4.8.1 and we planned to update to version 4.6.8 a few days ago.

After the upgrade, I started to see the following logs:

  • This installation is running on EKS
  • We weren’t using snapshots in version 4.4.8.1
  • After the upgrade process I had to restart the Axonserver nodes (and we already had applications connected to it), perhaps this caused the issue.

logs from the node leader

2023-05-09 18:42:35.279  INFO 1 --- [uest-executor-1] i.a.a.cluster.grpc.GrpcRaftPeer          : Error on InstallSnapshot stream: CANCELLED: Failed to read message.
2023-05-09 18:42:35.280  INFO 1 --- [uest-executor-1] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: Install snapshot message failed. Reason: CANCELLED: Failed to read message..
2023-05-09 18:42:35.280  INFO 1 --- [uest-executor-1] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: Changing state from Install Snapshot Replicator Peer State to Append Entries Replicator Peer State.
2023-05-09 18:42:35.281  INFO 1 --- [uest-executor-2] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: Sending append entry to axonserver-2-0-0206a7a3-cb10-458b-9c9a-85ba5e9cfd7d failed. Reason: default in term 249: previous term/index missing 249/3533 last log 3533. Last applied index: 0, match index: 0.
2023-05-09 18:42:35.281  INFO 1 --- [uest-executor-2] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: follower axonserver-2-0-0206a7a3-cb10-458b-9c9a-85ba5e9cfd7d is far behind the log entry. Follower's next index: 1.
2023-05-09 18:42:35.281  INFO 1 --- [uest-executor-2] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: Changing state from Append Entries Replicator Peer State to Install Snapshot Replicator Peer State.
2023-05-09 18:42:35.281  INFO 1 --- [uest-executor-2] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: start snapshot installation
2023-05-09 18:42:35.303  INFO 1 --- [uest-executor-2] i.a.axonserver.cluster.ReplicatorPeer    : default in term 249: Install snapshot config done received: responseHeader {
  requestId: "bd148c23-93a0-4dfc-ac49-22069346da60"
  responseId: "3dee79d6-8f1e-4f08-beb9-4640ca9e24ee"
  nodeId: "axonserver-2-0-0206a7a3-cb10-458b-9c9a-85ba5e9abefabd"
}
groupId: "default"
term: 249
requestIncrementalData {
  lastReceivedOffset: 1
  lastEventTokenPerContext {
    key: "prod"
    value: 725039
  }
  lastSnapshotTokenPerContext {
    key: "prod"
    value: -1
  }
}
, matchIndex 0, nextIndex 1, lastEvents: {prod=725039}

and logs from the node that is having issues

2023-05-09 18:44:54.163  INFO 1 --- [ster-executor-4] i.a.axonserver.cluster.FollowerState     : default in term 249: previous term/index missing 249/3533 last log 3533
2023-05-09 18:44:54.169  INFO 1 --- [ster-executor-4] i.a.a.e.logging.ClusterEventsLogger      : default in term 249: Leader is axonserver-0-0
2023-05-09 18:44:54.170  INFO 1 --- [ster-executor-4] i.a.axonserver.cluster.FollowerState     : default in term 249: Install snapshot started.
2023-05-09 18:44:54.170  INFO 1 --- [ster-executor-4] i.a.a.c.r.file.PrimaryLogEntryStore      : default: Clearing log entries, setting last index to 3533
2023-05-09 18:44:54.170  INFO 1 --- [ster-executor-4] i.a.a.c.r.file.PrimaryLogEntryStore      : default: Removing 3534 segment
2023-05-09 18:44:54.171  INFO 1 --- [ster-executor-4] i.a.a.c.r.file.PrimaryLogEntryStore      : default: Initializing log
2023-05-09 18:44:54.171  INFO 1 --- [ster-executor-4] i.a.a.c.r.file.PrimaryLogEntryStore      : default: File for segment 3534 does not exist. Creating new file with size of 16777216.
2023-05-09 18:44:54.202  INFO 1 --- [ster-executor-1] i.a.axonserver.cluster.FollowerState     : default in term 249: previous term/index missing 249/3533 last log 3533
2023-05-09 18:44:54.207  INFO 1 --- [ster-executor-1] i.a.a.e.logging.ClusterEventsLogger      : default in term 249: Leader is axonserver-0-0
2023-05-09 18:44:54.207  INFO 1 --- [ster-executor-1] i.a.axonserver.cluster.FollowerState     : default in term 249: Install snapshot started.
2023-05-09 18:44:54.207  INFO 1 --- [ster-executor-1] i.a.a.c.r.file.PrimaryLogEntryStore      : default: Clearing log entries, setting last index to 3533
2023-05-09 18:44:54.208  INFO 1 --- [ster-executor-1] i.a.a.c.r.file.PrimaryLogEntryStore      : default: Removing 3534 segment
2023-05-09 18:44:54.217  INFO 1 --- [ster-executor-1] i.a.a.c.r.file.PrimaryLogEntryStore      : default: Initializing log
2023-05-09 18:44:54.217  INFO 1 --- [ster-executor-1] i.a.a.c.r.file.PrimaryLogEntryStore      : default: File for segment 3534 does not exist. Creating new file with size of 16777216.

Is there anything I can do to fix this issue? It looks like the node isn’t able to restore itself. What can I do to debug more?

Hi Vitor,
Are there any warning messages logged on one of the Axon Server nodes? It appears that the leader gets an error on the stream from the new node, and this restarts the install snapshot process. Do all nodes have the same value for the max-message-size?

Marc