io.grpc.StatusRuntimeException: CANCELLED: Retries exhausted: 3/3

Hello,

I’m facing the issue below when a command is dispatched to an aggregate command handler.

This aggregate is designed to handle up to 10000 status update commands and publish up to 10000 events, which update the aggregate state in its event sourcing handlers. After some time the aggregate stops accepting commands and the error below is thrown.
I’m using Axon Server EE 4.5.7 with 4 nodes (3 primary and 1 active backup).

[commandA] execution failed:  io.grpc.StatusRuntimeException: CANCELLED: Retries exhausted: 3/3 {}","logger_name":"xxx.interceptor.CommandInterceptor","thread_name":"CommandProcessor-2","level":"WARN","level_value":30000,"stack_trace":"io.axoniq.axonserver.connector.impl.StreamClosedException: io.grpc.StatusRuntimeException: CANCELLED: Retries exhausted: 3/3
	at io.axoniq.axonserver.connector.event.impl.BufferedAggregateEventStream.hasNext(BufferedAggregateEventStream.java:81)
	at io.axoniq.axonserver.connector.event.AggregateEventStream$1.tryAdvance(AggregateEventStream.java:73)
	at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
	at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
	at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
	at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
	at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
	at org.axonframework.eventsourcing.eventstore.IteratorBackedDomainEventStream.hasNext(IteratorBackedDomainEventStream.java:54)
	at org.axonframework.eventsourcing.eventstore.ConcatenatingDomainEventStream.hasNext(ConcatenatingDomainEventStream.java:76)
	at org.axonframework.eventsourcing.EventSourcingRepository.doLoadWithLock(EventSourcingRepository.java:125)
	at org.axonframework.eventsourcing.EventSourcingRepository.doLoadWithLock(EventSourcingRepository.java:52)
	at org.axonframework.modelling.command.LockingRepository.doLoad(LockingRepository.java:128)
	at org.axonframework.modelling.command.LockingRepository.doLoad(LockingRepository.java:56)
	at org.axonframework.modelling.command.AbstractRepository.lambda$load$4(AbstractRepository.java:122)
	at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1133)
	at org.axonframework.modelling.command.AbstractRepository.load(AbstractRepository.java:121)
	at org.axonframework.modelling.command.AggregateAnnotationCommandHandler$AggregateCommandHandler.handle(AggregateAnnotationCommandHandler.java:460)
	at org.axonframework.modelling.command.AggregateAnnotationCommandHandler$AggregateCommandHandler.handle(AggregateAnnotationCommandHandler.java:449)
	at org.axonframework.modelling.command.AggregateAnnotationCommandHandler.handle(AggregateAnnotationCommandHandler.java:172)
	at org.axonframework.modelling.command.AggregateAnnotationCommandHandler.handle(AggregateAnnotationCommandHandler.java:60)
	at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:57)
	at org.axonframework.messaging.interceptors.LoggingInterceptor.handle(LoggingInterceptor.java:83)
	at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:55)
	at org.axonframework.messaging.interceptors.BeanValidationInterceptor.handle(BeanValidationInterceptor.java:67)
	at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:55)
	at org.axonframework.messaging.interceptors.CorrelationDataInterceptor.handle(CorrelationDataInterceptor.java:65)
	at org.axonframework.messaging.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:55)
	at org.axonframework.messaging.unitofwork.DefaultUnitOfWork.executeWithResult(DefaultUnitOfWork.java:74)
	at org.axonframework.commandhandling.SimpleCommandBus.handle(SimpleCommandBus.java:177)
	at org.axonframework.commandhandling.SimpleCommandBus.doDispatch(SimpleCommandBus.java:143)
	at org.axonframework.commandhandling.SimpleCommandBus.dispatch(SimpleCommandBus.java:111)
	at org.axonframework.axonserver.connector.command.AxonServerCommandBus$CommandProcessingTask.run(AxonServerCommandBus.java:274)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Retries exhausted: 3/3
	at io.grpc.Status.asRuntimeException(Status.java:533)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:617)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:803)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:782)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	... 3 common frames omitted

Also, one thing I noticed is that on nodes other than the leader, I’m not able to look up any events for the aggregate whose command handling failed with the above error. When I select Read from leader for the active context, I am able to see the events.

Regards,
Roy

Hello @S.Roy,

A couple of questions: this is just a simple aggregate that handles one command and publishes one event, right? Is there any subscribing event handler that projects these events into a DB?

It is strange that you can only browse events once the Read from leader active context is selected.

Hi @Ivan_Dugalic, it is a simple aggregate that handles one command and publishes one event only. No, there is no subscribing event handler involved in this case.

Yes, it is strange, as events for other transactions published after this one are searchable on all the Axon Server nodes.

Regards,
Roy

Also, to improve performance I’m using EventCountSnapshot. For this transaction there is a discrepancy in the number of snapshots when Read from leader active context is selected.

@S.Roy thanks for sharing. I am raising awareness of this problem!

@S.Roy can you share the logs from the Axon Server nodes? The error indicates that on the Axon Server side there is an exception while reading the aggregate, it is retrying 3 times and then giving up. The logging on the Axon Server node should provide information on why it is not able to read the aggregate.

Hi @Marc_Gathier,

I can’t share the Axon Server logs here; however, they were shared over the Slack channel.
The root cause seemed to be the growing snapshot size, which was going beyond the default max message size of 4 MB.

The quick fix was to increase the max message size in both Axon Server and the clients:

axon.axonserver.max-message-size=8388608
axoniq.axonserver.max-message-size=8388608
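
For anyone reading along: 8388608 is simply 8 * 1024 * 1024 bytes, and the 4 MB default works out to 4194304 bytes. As far as I know, the `axon.axonserver.` property is the one read by the Axon Framework client application and the `axoniq.axonserver.` property is the one read by Axon Server itself, but please verify that against the documentation for your versions. The plain-Java sketch below only illustrates the arithmetic and why a snapshot that grows with every status update can cross the limit. The class `MessageSizeCheck`, its methods, and the figure of 500 bytes per serialized status entry are assumptions made up for illustration; they are not part of Axon, and the check ignores any message envelope overhead.

```java
/** Hypothetical helper for illustration only; not part of Axon Framework or Axon Server. */
class MessageSizeCheck {

    // The 4 MB default mentioned above, in bytes: 4 * 1024 * 1024 = 4194304.
    static final int DEFAULT_MAX_MESSAGE_SIZE = 4 * 1024 * 1024;

    // The raised limit from the two properties above: 8 * 1024 * 1024 = 8388608.
    static final int RAISED_MAX_MESSAGE_SIZE = 8 * 1024 * 1024;

    /** Returns true when a payload of the given size is within the given limit. */
    static boolean fits(long payloadBytes, long maxMessageSize) {
        return payloadBytes <= maxMessageSize;
    }

    /** Stand-in for a serialized snapshot: one fixed-size entry per status update. */
    static byte[] fakeSnapshot(int statusUpdates, int bytesPerEntry) {
        return new byte[statusUpdates * bytesPerEntry];
    }

    public static void main(String[] args) {
        // Assumption: 10000 status entries, roughly 500 bytes each once serialized.
        byte[] snapshot = fakeSnapshot(10_000, 500);
        System.out.println("snapshot bytes: " + snapshot.length);
        System.out.println("fits 4 MB default: "
                + fits(snapshot.length, DEFAULT_MAX_MESSAGE_SIZE));
        System.out.println("fits 8 MB limit: "
                + fits(snapshot.length, RAISED_MAX_MESSAGE_SIZE));
    }
}
```

With those assumed numbers the stand-in snapshot is 5,000,000 bytes, which is over the 4 MB default but under the raised 8 MB limit. Raising the limit only buys headroom; a snapshot that keeps growing will eventually hit the new ceiling too.
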

After that, we cleaned up the aggregate to bring down its maximum snapshot size.
@Ivan_Dugalic, thank you for your input.

Regards,
Roy