Event Store Corruption Issues

Hello,
it seems that our event store is corrupted, and I have no idea how to analyze or fix it.

Just for context:
At the beginning of September we performed a restore of the event store while migrating from a standalone AxonServer to an AxonServer cluster. The restore was required because we had to change the underlying disks. The restore procedure had been tested multiple times beforehand, even with production backups. The migration itself looked successful, and the AxonServer nodes and applications appeared to work fine afterward.

Unfortunately, we later discovered errors that suggest there are “holes” or corrupted data in the event store.

Observations:

“Invalid sequence number for aggregate” errors
We see a number of “Invalid sequence number for aggregate” warnings in the AxonServer logs, followed by a CommandExecutionException on the client side:

AxonServer:

"logger": "io.axoniq.axonserver.message.event.SequenceValidationStreamObserver",
"message": "Invalid sequence number for aggregate  in context default. Received: 0, expected: 1",

Clients:

org.axonframework.commandhandling.CommandExecutionException: An exception has occurred during command execution
  Caused by: java.util.NoSuchElementException: No value present
  at java.base/java.util.Optional.get(Optional.java:143)
  at org.axonframework.eventsourcing.EventStreamUtils.lambda$upcastAndDeserializeDomainEvents$1(EventStreamUtils.java:88)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
  at io.axoniq.axonserver.connector.event.AggregateEventStream$1.tryAdvance(AggregateEventStream.java:75)
  at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:292)
...

Missing aggregates or events
We also found a large number (>1000) of aggregate IDs that exist in our projection tables but have no corresponding events in AxonServer, which is very confusing.

Additionally, we see aggregates in the AxonServer UI that appear incomplete. For example, an aggregate may show sequence numbers 2 to 5, but sequence numbers 0 and 1 are missing.

Since the projection tables contain data for these aggregates, we know they should exist, but they cannot be found in AxonServer. Naturally, sending commands for such aggregates results in errors.

  • Is there any way to analyze the event store to understand what is going on?
  • Could the index or bloom filter files be out of sync with the event files?
  • Are there any tools available that could help us?

We are running AxonServer v2025.1.4, with an event store size of approximately 107 GB.

Thanks in advance for your help.
Klaus

Hello Klaus,

Let’s tackle the Invalid sequence number for aggregate part first.
The logs state Received: 0, expected: 1, which indicates that an append was attempted with sequence number 0, but AxonServer already has an event for that aggregate ID in its store.
This can happen if two command handlers attempt to append events at the same time (interleaved) or shortly after each other, given the eventually consistent nature of the sourcing operation.
If this happens from time to time, it is nothing to be concerned about.
If it happens permanently, for every append, something is wrong and we need to dig a bit deeper.

The second part, Missing aggregates or events, is much more severe.
There might be two reasons for this, both of which you have identified: either the index is incomplete/broken, or the actual data is missing.

In such a case we would normally recommend sending us the diagnostics package, but that would be a bit cumbersome since you don’t have a dedicated support channel. Instead, you can manually verify that the files on the three nodes are identical; this rules out any issues introduced while copying the files during the restore. Please gather the checksums (e.g. sha1) of all .events files for the affected context. These have to be identical on all machines, with the exception of the “active segment”: the one with the highest sequence number.
If all of them match, you can proceed to reindexing. If they do not match, something went wrong during the restore and we can continue from there.
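A minimal sketch of the checksum gathering (the directory path is illustrative; point it at the affected context’s storage directory on each node):

```shell
# Sketch: collect SHA-1 checksums of all .events files for one context, sorted
# by filename so the outputs from the three nodes can be diffed directly.
# The directory argument is an assumption -- use the context's storage
# directory (e.g. <event.storage>/default) on each node.
checksum_events() {
  (cd "$1" && sha1sum ./*.events | sort -k2)
}

# Demo on a throwaway directory standing in for the real context directory:
demo=$(mktemp -d)
printf 'abc' > "$demo/00000000000000000000.events"
result=$(checksum_events "$demo")
echo "$result"
```

Redirect each node’s output to a file, copy the three files to one machine, and diff them pairwise; only the active (highest-numbered) segment may differ.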

Before we analyze the event store segments for defects, let’s first try with re-indexing. This is much easier. The procedure is as follows:

  • Stop one of the nodes
  • For the affected context, remove the .index, .bloom, .nindex and .xref files (not all of these may exist, depending on your setup).
  • Restart the node. It should now rebuild the index. This may take several hours and cause plenty of I/O.
  • Wait until each .events file has a corresponding .index or .nindex file, except the active one (see above).
  • On the restarted node, search for the affected aggregates. This information does not propagate via replication, so make sure you query the correct node.
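The file-removal step can be sketched like this, demonstrated on a throwaway directory (the real context directory path depends on your setup, and the node must be stopped first):

```shell
# The .events segments hold the actual data; the .index/.bloom/.nindex/.xref
# files are derived artifacts that AxonServer rebuilds, so only those are
# deleted here. Run this only while the node is stopped.
remove_index_files() {
  find "$1" -maxdepth 1 -type f \
    \( -name '*.index' -o -name '*.bloom' -o -name '*.nindex' -o -name '*.xref' \) \
    -delete
}

# Demo on a throwaway directory:
demo=$(mktemp -d)
touch "$demo/00000000000000000000.events" \
      "$demo/00000000000000000000.index" \
      "$demo/00000000000000000000.bloom"
remove_index_files "$demo"
ls "$demo"    # only the .events segment remains
```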

If the above worked fine, you can repeat the procedure for the remaining nodes.
Should the aggregates still be problematic, this requires a much more involved recovery procedure, for which we should have a call first.

Kind regards,
Marco

Hello Marco,
thanks for your reply.
I will try to walk through your procedures.

But one question: if I start the index rebuild on one of the cluster nodes, will that node be active in the cluster during the reindex? Our apps list all three cluster nodes in the axon.axonserver.servers property. Will the apps connect to that node and send commands/queries while the reindex is running?

Klaus

Hi Klaus,

That node will be partially active.

It will be able to forward messages, as well as operate normally for replication groups that are not reindexing.

It will not participate in Raft for the affected replication group, though. This means the other two nodes will complain about it being unavailable, but they can still form a healthy majority.

Kind regards,
Marco

Hello,
I have finally managed (after some other hurdles) to reindex the event store on all AxonServer nodes. The reindexing helped somewhat; at least some of the incomplete aggregate sequences seem to be fixed.

More problematic, however, is that there really does appear to be an event gap in my event store. The last token before the gap and the first token after it are:

66625400, 2025-09-01T15:13:57.726Z UTC
66679562, 2025-09-01T17:18:40.625Z UTC

This means 54161 events from a roughly 2-hour period are missing. I can easily display this via the AxonServer UI using the query “token>=66625399 and token<=66679565”.
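For reference, the size of the gap follows from the two surviving tokens:

```shell
# Tokens 66625400 and 66679562 are still present; everything strictly
# between them is gone.
first_present=66625400
next_present=66679562
missing=$((next_present - first_present - 1))
echo "$missing missing events"   # 54161 missing events
```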

It is of course critical that events have been lost “somehow”, but even more critical is that replays currently no longer work.

During replay, the following exceptions now occur:

java.util.NoSuchElementException: No value present
    at java.base/java.util.Optional.get(Optional.java:143)
    at org.axonframework.eventsourcing.EventStreamUtils.lambda$upcastAndDeserializeDomainEvents$1(EventStreamUtils.java:88)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at io.axoniq.axonserver.connector.event.AggregateEventStream$1.tryAdvance(AggregateEventStream.java:75)
    at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:292)
    at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
    at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:169)
    at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:298)
    at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
    at org.axonframework.eventsourcing.eventstore.IteratorBackedDomainEventStream.hasNext(IteratorBackedDomainEventStream.java:54)
    at org.axonframework.eventsourcing.eventstore.ConcatenatingDomainEventStream.hasNext(ConcatenatingDomainEventStream.java:76)
...

So far, I have found no way to circumvent this exception. Replays always seem to “get stuck” at the same point: the event processor aborts with the exception and is then usually restarted in another instance of the application, where the same thing happens, resulting in a wild back-and-forth of event processor instances, all stuck at the same point. The only workaround so far is to set the replay to a point in time after the event gap.

This raises the following questions:

  1. How can the replay problem be fixed?
  2. How could such an event gap have occurred?
  3. Can the event gap be repaired?

I could probably live with the events being lost, and some of the data can probably be replayed. But the gap breaking replays is a problem.

@Marco_Amann, @Marc_Gathier any help is appreciated.

Klaus

Are the same events missing on all nodes?
It looks like one “events” file is missing, or the last file before 00066679562.events is truncated. Can you check on the other nodes, and possibly compare the md5 hash of the last events file before 66679562 across them?

Hi Marc,
the events files are the same on all nodes (checked with sha1sum). That makes sense to me, though, because on September 1st we were still running a single-node AxonServer; we only migrated to the cluster two days later.

In the events directory I have the following files around the gap:

-rw-rw-r--. 1 root       axonserver 268435456 Aug 30 14:55 00000000000066359278.events
-rw-rw-r--. 1 root       axonserver 268435456 Sep  1 15:15 00000000000066528320.events
-rw-rw-r--. 1 root       axonserver 268435456 Sep  1 17:50 00000000000066679562.events
-rw-rw-r--. 1 root       axonserver 268435456 Sep  2 02:00 00000000000066863568.events

I guess the token gap is in the 00000000000066679562.events file, right?
Do you have any tools to look into the events file for deeper analysis?

Klaus

I must admit that the stack trace I mentioned above (from EventStreamUtils) was probably not the right one: that exception only occurs when commands are handled, i.e., when event sourcing is performed.
Currently I cannot find any exception that occurs during replay, but replays definitely get stuck because of the event gap.

Hello again,
I managed to reproduce the replay issue in a lab setup with a copy of our event store.
When I perform a replay with a single event processor, these are the exceptions we see once the replay reaches the token position of the gap mentioned before:

2025-10-07 12:50:54  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
	at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:177)
	at org.axonframework.axonserver.connector.event.axon.EventBuffer.hasNextAvailable(EventBuffer.java:143)
	at org.axonframework.common.stream.BlockingStream.hasNextAvailable(BlockingStream.java:45)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.lambda$scheduleCoordinationTask$23(Coordinator.java:1068)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
2025-10-07 12:50:54  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] is releasing claims and scheduling a new coordination task in 500ms   
2025-10-07 12:50:55  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 0.   
2025-10-07 12:50:55  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 1.   
2025-10-07 12:50:55  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 2.   
2025-10-07 12:50:55  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 3.   
2025-10-07 12:50:55  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed 4 new segments for processing   
2025-10-07 12:50:57  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
	at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:177)
	at org.axonframework.axonserver.connector.event.axon.EventBuffer.hasNextAvailable(EventBuffer.java:143)
	at org.axonframework.common.stream.BlockingStream.hasNextAvailable(BlockingStream.java:45)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.lambda$scheduleDelayedCoordinationTask$24(Coordinator.java:1079)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.axoniq.axonserver.connector.AxonServerException: [AXONIQ-0001] Stream closed on client request
	at io.axoniq.axonserver.connector.impl.FlowControlledBuffer.close(FlowControlledBuffer.java:83)
	at io.axoniq.axonserver.connector.impl.AbstractBufferedStream.close(AbstractBufferedStream.java:125)
	at org.axonframework.axonserver.connector.event.axon.EventBuffer.close(EventBuffer.java:194)
	at org.axonframework.common.io.IOUtils.closeQuietly(IOUtils.java:47)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.abortAndScheduleRetry(Coordinator.java:1105)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:843)
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.lambda$scheduleCoordinationTask$23(Coordinator.java:1068)
	... 7 common frames omitted
2025-10-07 12:50:57  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] is releasing claims and scheduling a new coordination task in 500ms   
2025-10-07 12:50:57  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 0.   
2025-10-07 12:50:57  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 1.   
2025-10-07 12:50:57  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 2.   
2025-10-07 12:50:57  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 3.   
2025-10-07 12:50:57  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed 4 new segments for processing   
2025-10-07 12:50:58  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
java.lang.NullPointerException: Cannot invoke "org.axonframework.common.stream.BlockingStream.hasNextAvailable()" because "this.eventStream" is null
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
2025-10-07 12:50:58  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] is releasing claims and scheduling a new coordination task in 500ms   
2025-10-07 12:50:58  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 0.   
2025-10-07 12:50:58  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 1.   
2025-10-07 12:50:58  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 2.   
2025-10-07 12:50:58  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 3.   
2025-10-07 12:50:58  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed 4 new segments for processing   
2025-10-07 12:50:59  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
java.lang.NullPointerException: Cannot invoke "org.axonframework.common.stream.BlockingStream.hasNextAvailable()" because "this.eventStream" is null
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
2025-10-07 12:50:59  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] is releasing claims and scheduling a new coordination task in 500ms   
2025-10-07 12:50:59  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 0.   
2025-10-07 12:50:59  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 1.   
2025-10-07 12:51:00  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 2.   
2025-10-07 12:51:00  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 3.   
2025-10-07 12:51:00  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed 4 new segments for processing   
2025-10-07 12:51:00  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
java.lang.NullPointerException: Cannot invoke "org.axonframework.common.stream.BlockingStream.hasNextAvailable()" because "this.eventStream" is null
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Questions:

  • Why is the event stream closed, as the exception states?
  • After the first exception we see an NPE, and the processor keeps looping forever:
2025-10-07 12:58:36  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
java.lang.NullPointerException: Cannot invoke "org.axonframework.common.stream.BlockingStream.hasNextAvailable()" because "this.eventStream" is null
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
2025-10-07 12:58:36  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] is releasing claims and scheduling a new coordination task in 500ms   
2025-10-07 12:58:36  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 0.   
2025-10-07 12:58:36  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 1.   
2025-10-07 12:58:36  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed 2 new segments for processing   
2025-10-07 12:58:36  WARN [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Exception occurred while Processor [InvoiceProjectionProcessorV2] was coordinating the work packages.   
java.lang.NullPointerException: Cannot invoke "org.axonframework.common.stream.BlockingStream.hasNextAvailable()" because "this.eventStream" is null
	at org.axonframework.eventhandling.pooled.Coordinator$CoordinationTask.run(Coordinator.java:815)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
2025-10-07 12:58:36  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] is releasing claims and scheduling a new coordination task in 500ms   
2025-10-07 12:58:36  INFO [nvoiceProjectionProcessorV2]-0] o.a.eventhandling.pooled.Coordinator               : Processor [InvoiceProjectionProcessorV2] claimed the token for segment 0.   

...

Any ideas on how to fix the replay for our obviously broken event store?

Klaus

Hi Klaus,

Let’s start with the gap itself:
It seems that, as you described above, tokens 66625401 to 66679561 (both inclusive) are missing. Since the tokens continue after the gap, AxonServer kept running. Based on the directory listing you shared, I suspect the gap is the (missing) end of the 00000000000066528320 file, as 00000000000066679562 is the next file and it contains the first event after the gap.
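A quick way to cross-check which segment should hold a given token (a sketch; it assumes, as your listing suggests, that each segment’s filename encodes the global token of its first event):

```shell
# The segment holding a token is the one with the largest starting token that
# is <= the token. Starting tokens taken from the directory listing shared
# above, in ascending order.
segment_for_token() {
  token="$1"; shift
  best=0
  for start in "$@"; do
    [ "$start" -le "$token" ] && best="$start"
  done
  printf '%020d.events\n' "$best"
}

# The first missing token, 66625401, falls in the 66528320 segment:
segment_for_token 66625401 66359278 66528320 66679562 66863568
# -> 00000000000066528320.events
```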

We had a very similar case once, where a customer copied the AxonServer data while it was running, capturing the currently written-to segment and then not overwriting that file on the next copy. (They may even have restored from that partial backup at some point, overwriting the healthy files; I am not 100% sure anymore.)

As this has happened only once before, we don’t have great tooling ready. We do have a tool that can identify and patch gaps, but it is quite fragile, as it was written for that one specific case.

Could you please check whether, by any chance (as was the case back then), you have a file 00000000000066528320.events.1 or something comparable, created by the backup tool?

The gap-fixing tool can insert empty events so that the event store is complete again. However, the original events are then lost for sure (unless you are lucky with the .1 file mentioned above); there is no way for us to know what was inside them. With a lot of manual work and a bit of luck, one could parse the index files, which may have been copied completely, and reconstruct which aggregates are affected, but that goes well beyond what we can offer as community support.

We can provide you with the gap-fixing tool via DM, along with detailed instructions, so that you can close the gap itself. For now, let’s wait and see whether the .1 file saves the day!

Kind regards,
Marco

Hello Marco, hello Marc,

thanks for your hint. We were lucky!

We actually still had an old backup of the previous standalone AxonServer, and its 00000000000066528320.events file was indeed different (per sha1sum) from the one in our PROD cluster. We replaced that single events file (and reindexed), and, hooray, the events were back!
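For anyone hitting the same problem, the swap itself boiled down to something like this (illustrative paths, demonstrated on throwaway files; we stopped AxonServer first and reindexed afterwards):

```shell
# Keep the damaged segment for later analysis, then copy in the verified
# backup segment. File names and paths here are illustrative.
swap_segment() {
  backup="$1"; live="$2"
  cp -p "$live" "$live.broken"
  cp -p "$backup" "$live"
}

# Demo on throwaway files standing in for 00000000000066528320.events:
demo=$(mktemp -d)
printf 'events-from-backup' > "$demo/backup.events"
printf 'truncated-segment'  > "$demo/live.events"
swap_segment "$demo/backup.events" "$demo/live.events"
cat "$demo/live.events"
```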

It must have been caused by an error during backup/restore.

For the future, I would wish for better tooling from AxonServer for such problems:

  • Easier reindexing: couldn’t this be triggered via a CLI?

  • Consistency check of the events files: we simply didn’t know where the problem could be, so a tool for checking the events files would be great. It could run in the background for a while, but afterwards you could at least verify whether, for example, a restore or a migration of an AxonServer was successful.

Many thanks for the help

Klaus

Hello Klaus,
happy to hear you were able to restore the event file from the backup!

Easier reindexing via the CLI or the web UI/API would be awesome; we already have this on our to-do list.

AS does a consistency check of the last few segment files during startup, but I fully agree with you: a full consistency check would be good. We have discussed it in the past; however, depending on the size of the event store, verifying consistency can be a huge task. We have customers with double-digit terabytes of events, and just reading the data to compute a checksum would take quite some time. We do have such a tool: it verifies the CRCs of the events and checks for gaps (it is the one I mentioned above). However, one needs to be very careful with it, as it has no I/O limiting, so it would use all the I/O bandwidth it can get, potentially disrupting more important reads.
Having this built into AS would be great; I agree with you there.

Enjoy your weekend!
Cheers,
Marco