Axon Server EE: Platform integration causes Raft pre-vote loop on fresh PVC initialization

Summary

When Axon Server Enterprise 2025.2.4 starts with fresh PVCs (no prior data) and standalone-dcb=true, having the Axoniq Platform connection active causes the _admin Raft context to get stuck in an infinite pre-vote loop. The default DCB context is never created, and all client connections fail with AXONIQ-1302: default: not found in any replication group.

Environment

  • Axon Server Enterprise: 2025.2.4
  • Mode: Single-node, standalone-dcb=true
  • Deployment: Kubernetes StatefulSet (GKE Autopilot)
  • Platform: Axoniq Platform Community (license valid, 1 node)

Steps to Reproduce

  1. Delete all Axon Server PVCs (data, events, log)
  2. Start Axon Server in single-node mode with standalone-dcb=true and the AXONIQ_PLATFORM_AUTHENTICATION env var set (valid Platform token)
  3. Observe _admin context stuck in pre-vote loop indefinitely

Expected Behavior

Axon Server should:

  1. Initialize the _admin Raft context and elect itself leader (single-node, no peers)
  2. Create the default DCB context
  3. Connect to Axoniq Platform

Actual Behavior

The _admin Raft context cycles between FollowerState and PreVoteState every ~1-2 seconds and never transitions to CandidateState or LeaderState:

_admin in term 0: Timeout in follower state: 1403 ms.
_admin in term 0: Updating state from FollowerState to PreVoteState
_admin: Starting pre-vote from axon-server-xxx in term 0
_admin in term 0: Updating state from PreVoteState to FollowerState (received pre-vote with term (1 >= 0))
_admin in term 0: Pre-vote granted for axon-server-xxx.
_admin in term 0: Request for pre-vote received ... voted true (handled as follower).
[repeats indefinitely]
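
For anyone who wants to detect this state automatically (for example in a CI job), here is a minimal sketch that scans log lines for the loop shown above. It assumes the message text is exactly as in the excerpt and that a healthy node eventually logs a line of the form "_admin in term N: Leader". The function name and the threshold of 3 cycles are my own choices, not anything provided by Axon Server.

```python
import re


def stuck_in_pre_vote(lines, context="_admin", threshold=3):
    """Return True if `context` entered PreVoteState at least `threshold`
    times in `lines` and never logged that it became leader.

    The patterns use search(), so a timestamp or log-level prefix in front
    of the message does not matter.
    """
    prefix = r"(?<!\S)" + re.escape(context) + r" in term \d+: "
    pre_vote = re.compile(prefix + r"Updating state from FollowerState to PreVoteState")
    leader = re.compile(prefix + r"Leader\b")

    pre_votes = 0
    for line in lines:
        if leader.search(line):
            # The context elected a leader, so it is not stuck.
            return False
        if pre_vote.search(line):
            pre_votes += 1
    return pre_votes >= threshold
```

Feeding it the output of the pod's log stream and failing the job when it returns True makes the problem visible without someone watching the logs.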

The Platform connection succeeds and triggers initialization:

Received new license from Axoniq Platform: [active=true, plan=Axoniq Platform Community]
Initialization of this node with following contexts: [default]
_admin: init replication group
_admin in term 0: Starting the node...

But then the _admin context falls back into the pre-vote loop. The error logged is:

ERROR: Failed to apply Axoniq Platform license to the cluster: [AXONIQ-2100] No leader for _admin

The /v1/public/me endpoint shows adminNode: false, contextNames: [], storageContextNames: [].
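
The same check can be done against the endpoint instead of the logs. This is a rough sketch that treats the three fields above as the signal for an uninitialized node. The base URL (localhost:8024) is an assumption about the HTTP port in my setup, and both function names are made up for illustration.

```python
import json
import urllib.request


def looks_uninitialized(me):
    """True when /v1/public/me reports no admin role and no contexts,
    which is the state observed while _admin is stuck in the pre-vote loop."""
    return (
        not me.get("adminNode", False)
        and not me.get("contextNames")
        and not me.get("storageContextNames")
    )


def fetch_me(base_url="http://localhost:8024", timeout=5.0):
    """Fetch and decode /v1/public/me. The base URL is an assumption;
    adjust it to wherever the Axon Server HTTP port is exposed."""
    with urllib.request.urlopen(f"{base_url}/v1/public/me", timeout=timeout) as resp:
        return json.load(resp)
```

Usage would be something like `looks_uninitialized(fetch_me())` from a readiness script, after port-forwarding to the pod.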

Root Cause Analysis

The Platform’s AxoniqConsoleCoordinationService fires ReplicationGroupChangesHandler from a second thread before InitClusterTask completes the Raft leader election for _admin. This appears to cause a ConcurrentMembershipStateModificationException (observed in prior occurrences) that corrupts the Raft state machine, preventing leader election from completing.

The pre-vote succeeds (the node votes for itself) but never transitions to a full election (CandidateState), suggesting the Raft state machine is in an inconsistent state after the concurrent modification.

Workaround

Temporarily disable the Platform integration during fresh initialization:

  1. Remove/comment out the AXONIQ_PLATFORM_AUTHENTICATION env var
  2. Start Axon Server - it initializes cleanly within seconds:
    _admin in term 1: Leader
    default: init replication group
    default in term 1: Leader
    Creating DCB context: default
    
  3. Re-enable the Platform integration - Axon Server reconnects successfully
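
To avoid re-enabling the Platform integration too early, step 3 can be gated on the log lines from step 2. The sketch below only checks that the three messages shown above have appeared; it assumes their wording is stable, and the function name is hypothetical.

```python
import re


def init_complete(lines, dcb_context="default"):
    """Return True once the logs show _admin as leader, the DCB context as
    leader, and the DCB context being created (the messages from step 2)."""
    ctx = re.escape(dcb_context)
    required = {
        "admin_leader": re.compile(r"(?<!\S)_admin in term \d+: Leader\b"),
        "context_leader": re.compile(r"(?<!\S)" + ctx + r" in term \d+: Leader\b"),
        "dcb_created": re.compile(r"Creating DCB context: " + ctx + r"(?!\S)"),
    }
    seen = set()
    for line in lines:
        for name, pattern in required.items():
            if pattern.search(line):
                seen.add(name)
    # Complete only when every required message was seen at least once.
    return seen == set(required)
```

A provisioning script could poll the pod logs until this returns True and only then restore the AXONIQ_PLATFORM_AUTHENTICATION env var.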

Impact

  • Every fresh PVC initialization requires manual intervention (disable Platform, init, re-enable)
  • This affects disaster recovery, environment provisioning, and CI/CD pipelines
  • We have encountered this issue on 2 separate occasions (Jan 2026, Feb 2026)

Suggested Fix

The Platform integration should defer ReplicationGroupChangesHandler until after InitClusterTask has completed and the _admin Raft context has an elected leader. Alternatively, the initialization should be atomic and resilient to concurrent Platform callbacks.
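
To illustrate the first option, here is a generic sketch of the deferral pattern. It is not Axon Server code and says nothing about how InitClusterTask or ReplicationGroupChangesHandler are actually implemented; the class and method names are invented. Callbacks submitted before initialization finishes are queued, then run in order once the initializing thread signals that a leader exists.

```python
import threading


class DeferredUntilReady:
    """Queue callbacks until mark_ready() is called, then run them in
    submission order. Callbacks submitted afterwards run immediately."""

    def __init__(self):
        # RLock so a callback may itself call submit() while the queue drains.
        self._lock = threading.RLock()
        self._ready = False
        self._pending = []

    def submit(self, callback):
        with self._lock:
            if not self._ready:
                self._pending.append(callback)
                return
        callback()

    def mark_ready(self):
        # Drain under the lock so a concurrent submit() cannot overtake
        # callbacks that were queued earlier.
        with self._lock:
            self._ready = True
            pending, self._pending = self._pending, []
            for callback in pending:
                callback()
```

In this picture the Platform callback would go through submit(), and the initialization path would call mark_ready() only after _admin has an elected leader.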

Hi Petr,
Thanks for the detailed report! We’ve logged this as a bug internally and our team will look into it. In the meantime, the workaround you described should help.
Best!
