Axon Framework GapAwareTrackingToken is removing gaps, causing events to never be processed

We use Axon Framework 4.12.1

We deploy a small service into K8s with 2 replicas using a shared MySQL database.
(This deployment processes about 400 aggregates a day, but the load is quite bursty: roughly 99% of it arrives in the 10 minutes after midnight.)

The issue is that it creates gaps, then the gaps are closed, and then events are allocated to the indexes of those gaps (and are never processed).

For reference, our logging shows that we have gaps, which are then narrowed at 23:40:42:

2025-11-24 23:40:42.795	
org.axonframework.eventhandling.tokenstore.jpa.TokenEntry{owner=1@merchant-settlement-disbursement-d76cd7fc5-xzt74, tokenType=org.axonframework.eventhandling.GapAwareTrackingToken, timestamp=2025-11-24T12:40:42.556Z, token=[123, 34, 105, 110, 100, 101, 120, 34, 58, 52, 54, 53, 52, 51, 44, 34, 103, 97, 112, 115, 34, 58, 91, 52, 54, 52, 55, 54, 44, 52, 54, 52, 55, 55, 44, 52, 54, 52, 55, 56, 44, 52, 54, 52, 55, 57, 44, 52, 54, 52, 56, 48, 44, 52, 54, 52, 56, 49, 44, 52, 54, 52, 56, 50, 44, 52, 54, 52, 56, 51, 44, 52, 54, 52, 56, 52, 44, 52, 54, 52, 56, 53, 44, 52, 54, 52, 56, 54, 44, 52, 54, 52, 56, 55, 44, 52, 54, 52, 56, 56, 44, 52, 54, 52, 56, 57, 44, 52, 54, 52, 57, 48, 44, 52, 54, 52, 57, 49, 44, 52, 54, 52, 57, 50, 44, 52, 54, 52, 57, 51, 44, 52, 54, 52, 57, 52, 44, 52, 54, 52, 57, 53, 44, 52, 54, 52, 57, 54, 44, 52, 54, 52, 57, 55, 44, 52, 54, 52, 57, 56, 44, 52, 54, 52, 57, 57, 44, 52, 54, 53, 48, 48, 44, 52, 54, 53, 48, 49, 93, 125]}
	which decodes to
	{"index":46543,"gaps":[46476,46477,46478,46479,46480,46481,46482,46483,46484,46485,46486,46487,46488,46489,46490,46491,46492,46493,46494,46495,46496,46497,46498,46499,46500,46501]}

2025-11-24 23:40:47.817	
org.axonframework.eventhandling.tokenstore.jpa.TokenEntry{owner=1@merchant-settlement-disbursement-d76cd7fc5-xzt74, tokenType=org.axonframework.eventhandling.GapAwareTrackingToken, timestamp=2025-11-24T12:40:46.923Z, token=[123, 34, 105, 110, 100, 101, 120, 34, 58, 52, 54, 53, 53, 50, 44, 34, 103, 97, 112, 115, 34, 58, 91, 52, 54, 53, 52, 52, 44, 52, 54, 53, 52, 53, 44, 52, 54, 53, 52, 54, 44, 52, 54, 53, 52, 55, 44, 52, 54, 53, 52, 56, 44, 52, 54, 53, 52, 57, 44, 52, 54, 53, 53, 48, 44, 52, 54, 53, 53, 49, 93, 125]}
	which decodes to
	{"index":46552,"gaps":[46544,46545,46546,46547,46548,46549,46550,46551]}

which means the following 26 indexes were in the first but not the second: [46476,46477,46478,46479,46480,46481,46482,46483,46484,46485,46486,46487,46488,46489,46490,46491,46492,46493,46494,46495,46496,46497,46498,46499,46500,46501]
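
In case it helps, the token column is simply the UTF-8 bytes of the JSON shown above; this is roughly how we decode it (a small sketch, the function name is ours):

    // Decodes a GapAwareTrackingToken dumped as a byte array: the bytes are just
    // the UTF-8 encoding of the token's JSON representation.
    fun decodeToken(tokenBytes: IntArray): String =
        String(ByteArray(tokenBytes.size) { i -> tokenBytes[i].toByte() }, Charsets.UTF_8)

    // decodeToken(intArrayOf(123, 34, 105, 110, 100, 101, 120, 34, 58, ...))
    //   -> {"index":...,"gaps":[...]}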

Then at 2025-11-24 23:46:11.653 (6 minutes later) we got 2 inputs into our system. They went onto the CommandBus and created Events at 2 of the removed indexes, 46476 and 46477.

These events effectively “stall” since they’ll never be picked up.

Then at 2025-11-25 00:17:11.647 (31 minutes later) we get another 24+ inputs into our system. As above, they go onto the CommandBus and create Events that are stored in the Event Store at the other 24 indexes that were skipped.
And, as above, these events are never handled.

We see no errors in our logs, and we don’t see any logging from our @EventHandlers, which is how I know they are never handled.

This is the key information from our Spring AxonConfiguration:

  • A single Axon processing group, global
  • A gap timeout of 5 minutes (in case the default was causing the issue)
  • A SimpleCommandBus with the Spring PlatformTransactionManager
  • Currently the JpaTokenStore, but we are thinking of switching to the JdbcTokenStore soon for more control (and to start using a proper sequence instead of XXX_seq tables)
  • The standard Axon tables, created with Liquibase on MySQL 8.0

Here’s some of the code:

    @Bean
    fun commandBus(platformTransactionManager: PlatformTransactionManager?) = SimpleCommandBus.builder()
        .transactionManager(SpringTransactionManager(platformTransactionManager))
        .build()

    @Bean
    fun entityManagerProvider(): EntityManagerProvider = ContainerManagedEntityManagerProvider()

    @Bean
    fun tokenStore(entityManager: EntityManager, serializer: Serializer): JpaTokenStore = JpaTokenStore.builder()
        .serializer(serializer)
        .entityManagerProvider { entityManager }
        .build()

    @Bean
    fun eventStorageEngine(
        defaultSerializer: Serializer,
        persistenceExceptionResolver: PersistenceExceptionResolver,
        @Qualifier("eventSerializer") eventSerializer: Serializer,
        configuration: org.axonframework.config.Configuration,
        entityManager: EntityManager,
        transactionManager: TransactionManager,
    ): EventStorageEngine = JpaEventStorageEngine.builder()
        .snapshotSerializer(defaultSerializer)
        .upcasterChain(configuration.upcasterChain())
        .persistenceExceptionResolver(persistenceExceptionResolver)
        .eventSerializer(eventSerializer)
        .snapshotFilter(configuration.snapshotFilter())
        .entityManagerProvider { entityManager }
        .transactionManager(transactionManager)
        .gapTimeout(GAP_TIMEOUT) // 5 minutes, per the configuration note above
        .build()

    @Bean
    fun eventStore(eventStorageEngine: EventStorageEngine): EmbeddedEventStore = EmbeddedEventStore.builder()
        .storageEngine(eventStorageEngine)
        .build()

I think everything else is pretty standard.
Any suggestions on how to prevent the “gaps” from being closed and then reallocated would be welcome.

Preventing the gaps from occurring in the first place would be your safest bet, @Fish_Tyro! Although you gave a lot of info, one thing I cannot completely deduce is whether you are reusing the sequence generator of the domain_event_entry table for other tables. We have a write-up on the scenario where reuse of the sequence generator is the reason gaps occur, which you can read here.

What it comes down to is that we recommend your domain_event_entry table has a sequence generator that is not used by any other table. That way, you would not get any artificial gaps in your event stream.
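
If you want a feeling for whether the gaps are artificial, you could compare the highest global_index with the number of rows in domain_event_entry; a difference that keeps growing suggests sequence numbers are being consumed by something other than committed events. A rough Kotlin/JDBC sketch (table and column names assume the default Axon schema):

    // Rough diagnostic: counts "holes" in domain_event_entry's global_index.
    // With a dedicated, gap-free generator this stays near zero (apart from
    // in-flight transactions); a steadily growing number hints at a shared
    // generator or rolled-back inserts.
    fun countIndexHoles(dataSource: javax.sql.DataSource): Long =
        dataSource.connection.use { connection ->
            connection.createStatement().use { statement ->
                statement.executeQuery(
                    "SELECT MAX(global_index) - MIN(global_index) + 1 - COUNT(*) AS holes " +
                        "FROM domain_event_entry"
                ).use { rs -> if (rs.next()) rs.getLong("holes") else 0L }
            }
        }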


However, as stated at the start, I am not entirely certain whether the above applies to you. So there’s another route you can take, which is configuring the gap clean-up behaviour. The settings for this are on the JpaEventStorageEngine, just like the gap timeout you already adjusted. What I think you’re missing there is a customization of the gap cleaning threshold, which defaults to allowing 250 gaps: once that number is hit, they are cleared out.
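
To make that concrete, here is a minimal sketch of where those knobs sit, inside the eventStorageEngine bean you already have (the threshold value is only a placeholder to tune, not a recommendation):

    // Sketch: the gap clean-up knobs sit on the same builder as the gap timeout.
    // gapTimeout is in milliseconds; gapCleaningThreshold is the number of gaps
    // a token may hold before clean-up of timed-out gaps kicks in (default 250).
    JpaEventStorageEngine.builder()
        // ... your other builder settings (serializers, upcaster chain, etc.) ...
        .entityManagerProvider { entityManager }
        .transactionManager(transactionManager)
        .gapTimeout(GAP_TIMEOUT)         // your existing 5-minute timeout
        .gapCleaningThreshold(1_000)     // placeholder value; size it to your midnight burst
        .build()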

Although I don’t know your domain as well as you do, in most scenarios it does help to use a distributed command bus. This stems from the consistent routing a distributed command bus employs, ensuring that commands for a given aggregate are always handled by the same application instance. Again, whether this solves concurrency issues in your environment depends on those “burst-y” scenarios: if they are only creates, I don’t think it will solve anything; if they contain multiple commands to the same aggregates, it will streamline your process.
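
If you go that route, the wiring could look roughly like this; the CommandRouter and CommandBusConnector implementations come from one of the extensions (e.g. Spring Cloud or JGroups), and their configuration is omitted here:

    // Sketch: wrap the local command bus setup in a DistributedCommandBus so
    // commands for the same aggregate are consistently routed to one instance.
    // The CommandRouter and CommandBusConnector beans are provided by an Axon
    // extension; their setup is not shown.
    @Bean
    fun distributedCommandBus(
        commandRouter: CommandRouter,
        connector: CommandBusConnector,
    ): DistributedCommandBus = DistributedCommandBus.builder()
        .commandRouter(commandRouter)
        .connector(connector)
        .build()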