Slow Event Processing Rate causing huge backlog on the processors

Hello,

We are using Axon Framework 4.10.1.
Our application contains three saga processors that are lagging behind by almost 110 million events. The latency metric shows that they are almost 6 days behind the current timestamp.
I have tried increasing the batch size, tuning the configuration properties of the pooled streaming event processors, and splitting and merging segments, but nothing seems to reduce the lag.

The application is running on a Kubernetes cluster with 20 replicas.

The configurations for the Sagas are as follows.

    // Coordinator executor per processor: one virtual thread (Thread.ofVirtual() needs Java 21+).
    Function<String, ScheduledExecutorService> coordinatorExecutorBuilder =
        name -> Executors.newScheduledThreadPool(
            1,
            Thread.ofVirtual().name("[PSP] Coordinator - " + name, 0).factory()
        );

    // Worker executor per processor: core pool of 100 virtual threads.
    Function<String, ScheduledExecutorService> workerExecutorBuilderExtended =
        name -> Executors.newScheduledThreadPool(
            100,
            Thread.ofVirtual().name("[PSP] Worker extended - " + name, 0).factory()
        );

Saga #1

    EventProcessingConfigurer.PooledStreamingProcessorConfiguration pspConfigExtended =
        (config, builder) -> builder
            .coordinatorExecutor(coordinatorExecutorBuilder)
            .workerExecutor(workerExecutorBuilderExtended)
            .initialSegmentCount(2)
            .batchSize(100)
            .tokenClaimInterval(10000)
            .claimExtensionThreshold(15000)
            .enableCoordinatorClaimExtension();

Saga #2

    EventProcessingConfigurer.PooledStreamingProcessorConfiguration pspConfigForDemandSaga =
        (config, builder) -> builder
            .coordinatorExecutor(coordinatorExecutorBuilder)
            .workerExecutor(workerExecutorBuilderExtended)
            .initialSegmentCount(2)
            .batchSize(1000)
            .tokenClaimInterval(10000)
            .claimExtensionThreshold(15000)
            //.maxClaimedSegments(16) // enable if needed
            .enableCoordinatorClaimExtension();

Saga #3

    EventProcessingConfigurer.PooledStreamingProcessorConfiguration pspConfigForRecordSaga =
        (config, builder) -> builder
            .coordinatorExecutor(coordinatorExecutorBuilder)
            .workerExecutor(workerExecutorBuilderExtended)
            .initialSegmentCount(2)
            .batchSize(500)
            .tokenClaimInterval(10000)
            .claimExtensionThreshold(15000)
            .enableCoordinatorClaimExtension();
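For a rough sense of scale, the figures in this post can be turned into a per-event time budget. The sketch below is plain Java arithmetic, not Axon code, and it rests on assumptions that may not hold here: that the roughly 110 million events are the backlog of a single processor, that they were appended evenly over the roughly 6 days of lag, and that each segment handles its share of events one at a time. The class and method names are made up for illustration.

```java
public class BacklogBudget {

    /** Average number of events appended per second over the lag window. */
    static double appendRatePerSecond(long backlogEvents, double lagDays) {
        return backlogEvents / (lagDays * 24 * 60 * 60);
    }

    /**
     * Largest average handling time (ms per event) at which a processor with
     * the given number of segments only just keeps pace with the append rate.
     * Assumes events are spread evenly and each segment works sequentially.
     */
    static double maxMillisPerEvent(double appendRatePerSecond, int segments) {
        return 1000.0 * segments / appendRatePerSecond;
    }

    /** Days needed to drain the backlog; infinite if processing is not faster than appending. */
    static double daysToDrain(long backlogEvents, double processRatePerSecond,
                              double appendRatePerSecond) {
        double netRate = processRatePerSecond - appendRatePerSecond;
        if (netRate <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return backlogEvents / netRate / 86_400.0;
    }

    public static void main(String[] args) {
        long backlog = 110_000_000L; // "almost 110 million events" (assumed per processor)
        double lagDays = 6.0;        // "almost 6 days behind"

        double appendRate = appendRatePerSecond(backlog, lagDays);
        System.out.printf("average append rate: %.1f events/s%n", appendRate);

        // Budget for the configured 2 segments and, for comparison, 16 segments.
        for (int segments : new int[] {2, 16}) {
            System.out.printf("%d segments: at most %.1f ms per event to keep pace%n",
                    segments, maxMillisPerEvent(appendRate, segments));
        }

        // Catch-up time if the processor ran at twice the append rate.
        System.out.printf("drain time at 2x append rate: %.1f days%n",
                daysToDrain(backlog, 2 * appendRate, appendRate));
    }
}
```

Under those assumptions the store receives about 212 events per second, so a processor with 2 segments would have to average under roughly 9.4 ms per event just to stop falling further behind, while 16 segments would raise that budget to about 75 ms.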

We are using Postgres as our event store. The system metrics and query metrics on the database suggest that the database is not the bottleneck.

I checked whether the event handlers in the Sagas need any optimisation, but most of the event handling logic is trivial; only two event handlers take around 600 ms per event.
One optimisation I could make is to use CommandGateway.send instead of sendAndWait.
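
To illustrate the difference between the two dispatch styles, here is a minimal, self-contained sketch. It does not use Axon: `FakeGateway` is a hypothetical stand-in that only mirrors the shape of `send` (returns a `CompletableFuture`) and `sendAndWait` (blocks for the result). Whether the same gain appears in the real application is an assumption; it depends on the command bus in use, since a bus that handles commands on the calling thread would still occupy the event handler for the full duration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class DispatchStyles {

    /** Hypothetical stand-in for a command gateway; NOT the Axon interface. */
    interface FakeGateway {
        CompletableFuture<String> send(String command);

        /** Blocks the caller until the command result is available. */
        default String sendAndWait(String command) {
            return send(command).join();
        }
    }

    /** Blocking style: the caller waits for every command in turn. */
    static List<String> dispatchBlocking(FakeGateway gateway, List<String> commands) {
        List<String> results = new ArrayList<>();
        for (String command : commands) {
            results.add(gateway.sendAndWait(command));
        }
        return results;
    }

    /** Non-blocking style: dispatch everything, report failures through a callback. */
    static List<CompletableFuture<String>> dispatchAsync(FakeGateway gateway,
                                                         List<String> commands) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String command : commands) {
            futures.add(gateway.send(command).whenComplete((result, error) -> {
                if (error != null) {
                    // Without a callback like this, failures would go unnoticed.
                    System.err.println("command failed: " + error);
                }
            }));
        }
        return futures;
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Simulated gateway: each command takes about 50 ms on another thread.
        FakeGateway slow = command -> CompletableFuture.supplyAsync(() -> {
            pause(50);
            return "handled " + command;
        });
        List<String> commands = List.of("a", "b", "c", "d");

        long start = System.nanoTime();
        dispatchBlocking(slow, commands);
        long blockingMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        List<CompletableFuture<String>> futures = dispatchAsync(slow, commands);
        long asyncMs = (System.nanoTime() - start) / 1_000_000;
        futures.forEach(CompletableFuture::join); // wait only so the demo exits cleanly

        System.out.println("caller blocked (sendAndWait style): " + blockingMs + " ms");
        System.out.println("caller blocked (send style): " + asyncMs + " ms");
    }
}
```

The trade-off is that with the non-blocking style the event handler returns before the command has been handled, so failures have to be dealt with in the callback rather than by an exception in the handler. For scale, a handler that blocks for about 600 ms limits its segment to fewer than two such events per second.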

Can someone from the Axon team help? Is there anything else I can try?