Query fails due to connection error

Hello,

I have received multiple complaints from customers that no data is displayed after the initial page load. After checking my logs, I saw that my application was not able to send the query over the QueryGateway due to a connection error with Axon Server (we use AxonIQ Cloud). This usually happens in the morning, so I assume it is somehow related to the application not being used during the night.

2022-10-18 13:18:22.133 ERROR 1 --- [nio-8200-exec-8] o.a.c.c.C.[.[.[.[dispatcherServlet]      : Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/myService] threw exception [Request processing failed; nested exception is java.util.concurrent.ExecutionException: org.axonframework.axonserver.connector.query.AxonServerQueryDispatchException: UNAVAILABLE: io exception] with root cause

org.axonframework.axonserver.connector.query.AxonServerQueryDispatchException: UNAVAILABLE: io exception
        at org.axonframework.axonserver.connector.ErrorCode.lambda$static$17(ErrorCode.java:123) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at org.axonframework.axonserver.connector.ErrorCode.convert(ErrorCode.java:201) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at org.axonframework.axonserver.connector.ErrorCode.convert(ErrorCode.java:232) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at org.axonframework.axonserver.connector.ErrorCode.convert(ErrorCode.java:221) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at java.base/java.util.Optional.map(Optional.java:265) ~[na:na]
        at org.axonframework.axonserver.connector.query.AxonServerQueryBus$ResponseProcessingTask.run(AxonServerQueryBus.java:871) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at org.axonframework.tracing.Span.run(Span.java:72) ~[axon-messaging-4.6.0.jar:4.6.0]
        at org.axonframework.tracing.Span.lambda$wrapRunnable$0(Span.java:90) ~[axon-messaging-4.6.0.jar:4.6.0]
        at org.axonframework.axonserver.connector.PriorityRunnable.run(PriorityRunnable.java:58) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]

The query always succeeds after the first retry. Before I add retries in my frontend, I want to understand why this error occurs and whether we can fix it somehow.

We also sometimes observe the following exception in our logs:

org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:178) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.hasNextAvailable(EventBuffer.java:144) ~[axon-server-connector-4.6.0.jar:4.6.0]
        at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:418) ~[axon-messaging-4.6.0.jar:4.6.0]
        at org.axonframework.eventhandling.TrackingEventProcessor.processingLoop(TrackingEventProcessor.java:318) ~[axon-messaging-4.6.0.jar:4.6.0]
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1145) ~[axon-messaging-4.6.0.jar:4.6.0]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR, debug data: max_age
        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.44.1.jar:1.44.1]
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:479) ~[grpc-stub-1.44.1.jar:1.44.1]
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562) ~[grpc-core-1.44.1.jar:1.44.1]
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.44.1.jar:1.44.1]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743) ~[grpc-core-1.44.1.jar:1.44.1]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722) ~[grpc-core-1.44.1.jar:1.44.1]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.44.1.jar:1.44.1]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.44.1.jar:1.44.1]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        ... 1 common frames omitted

Axon seems to recover from this, but the exception has also led to some command handlers being disconnected, which resulted in numerous bug reports. After a restart, everything was fine.

I hope someone can shed some light on this issue.

Hi Daniel,

Thank you for reaching out.

A few questions here:

  1. Can you confirm the workspace name, context name, and application name where this error occurs?

  2. Application disconnects can typically happen due to a network blip. Can you provide more information on where the client application is hosted? Does the error occur in all the applications?

In the meantime, we will verify if there are any keep-alive parameters that can be increased.

In my local deployment using minikube, I only observe the following exceptions; the other exceptions don’t occur there.

2022-10-19 03:12:49.057  WARN 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [axonserver.cloud.axoniq.io:443] failed: UNAVAILABLE: Unable to resolve host axonserver.cloud.axoniq.io
2022-10-19 03:12:49.057  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
2022-10-19 03:12:51.057  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Connection to AxonServer lost. Attempting to reconnect...
2022-10-19 03:12:51.059  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from axonserver.cloud.axoniq.io:443
2022-10-19 03:12:51.060  WARN 1 --- [ault-executor-0] io.grpc.internal.ManagedChannelImpl      : [Channel<86>: (axonserver.cloud.axoniq.io:443)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host axonserver.cloud.axoniq.io, cause=java.lang.RuntimeException: java.net.UnknownHostException: axonserver.cloud.axoniq.io
        at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
        at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
        at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.UnknownHostException: axonserver.cloud.axoniq.io
        at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
        at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
        at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
        at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
        at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:631)
        at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
        ... 5 more
}
2022-10-19 03:12:36.039  WARN 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [axonserver.cloud.axoniq.io:443] failed: UNAVAILABLE: Unable to resolve host axonserver.cloud.axoniq.io
2022-10-19 03:12:36.039  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
2022-10-19 03:12:38.039  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Connection to AxonServer lost. Attempting to reconnect...
2022-10-19 03:12:38.040  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from axonserver.cloud.axoniq.io:443
2022-10-19 03:12:43.046  WARN 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [axonserver.cloud.axoniq.io:443] failed: DEADLINE_EXCEEDED: Deadline exceeded after 4.999355600s.
2022-10-19 03:12:43.046  INFO 1 --- [8c6b4d9-ndkxf-0] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
2022-10-19 03:12:44.045  WARN 1 --- [ault-executor-0] io.grpc.internal.ManagedChannelImpl      : [Channel<78>: (axonserver.cloud.axoniq.io:443)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host axonserver.cloud.axoniq.io, cause=java.lang.RuntimeException: java.net.UnknownHostException: axonserver.cloud.axoniq.io: Temporary failure in name resolution
        at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
        at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
        at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.UnknownHostException: axonserver.cloud.axoniq.io: Temporary failure in name resolution
        at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
        at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)
        at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
        at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
        at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
        at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
        at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:631)
        at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
        ... 5 more
}

I do not observe this exception in our production environment.

Thank you for confirming the information on the workspace name and usage. Which version of Axon Framework are you using?

In your client application, you can enable heartbeat monitoring. We have already enabled it on the Cloud AxonServer. Here’s more information on it: Heartbeat Monitoring - Axon Reference Guide. Heartbeat monitoring sends frequent heartbeat messages over the connection between the client and the server. Any inactivity between client and server for a few hours could be mitigated by the heartbeat.
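For a Spring Boot client, enabling it typically comes down to the following property (shown here in YAML as an example; exact placement depends on your configuration):

axon:
  axonserver:
    heartbeat:
      enabled: true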

If a connection is not used for 24 hours, it will be terminated, but a new connection will automatically be established on reconnection. We wouldn’t be able to extend that time period (as in the case of the second exception).

For the most recent exception posted, I have a few further questions:

  1. Does the local warning occur frequently?
  2. Is an application restart needed, or does the hostname resolution recover on its own?

I will try to configure the Heartbeat, thanks.

The most recent exception occurs very frequently but seems to resolve itself.

Hi,
the UnknownHostException indicates a DNS lookup failure. This could be caused by the fact that the networking hasn’t been completely set up yet when the application starts. That’s probably why the problem seems to resolve itself.

Hope this helps.

Hi @allardbz,

This exception does not occur on startup, but rather after the application has been running for a couple of hours (in my dev environment).

@Vaish_Sriram I have updated Axon Framework to 4.6.1 and enabled the heartbeat feature. I have not noticed any improvements and the errors persist.

We have five different types of exceptions that are all thrown by the framework. The exceptions are so frequent that I have trouble finding our own application exceptions in our log file. We see the same exceptions in all our services, but one service produces a lot more of them than the others. I have attached a log file; maybe you can figure it out.

Type A

Very frequent across all our services with heartbeat enabled. This exception is new.

org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:178) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.hasNextAvailable(EventBuffer.java:144) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:418) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processingLoop(TrackingEventProcessor.java:318) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1145) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.cleanUp(TrackingEventProcessor.java:1340) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.run(TrackingEventProcessor.java:1317) ~[axon-messaging-4.6.1.jar:4.6.1]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: HTTP status code 408
invalid content-type: text/html; charset=UTF-8
headers: Metadata(:status=408,content-type=text/html; charset=UTF-8,referrer-policy=no-referrer,alt-svc=h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,content-length=279)
DATA-----------------------------

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>408 Request Timeout</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Request Timeout</h1>
<h2>Your client has taken too long to issue its request.</h2>
<h2></h2>
</body></html>

        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.0.jar:1.49.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        ... 1 common frames omitted

Type B

Very frequent across all our services. However, one of our services throws it more often than the others. I assume it is related to the number of active processing groups.

2022-10-21 02:29:06.586  WARN 1 --- [ssor[workers]-0] o.a.e.TrackingEventProcessor             : Error occurred. Starting retry mode.

org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:178) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.peek(EventBuffer.java:137) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WrappedMessageStream.peek(TrackingEventProcessor.java:1078) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:433) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processingLoop(TrackingEventProcessor.java:318) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1145) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.cleanUp(TrackingEventProcessor.java:1340) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.run(TrackingEventProcessor.java:1317) ~[axon-messaging-4.6.1.jar:4.6.1]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.grpc.StatusRuntimeException: CANCELLED: [AXONIQ-0001] our-context@our-workspace: Timeout waiting for permits from client 1@workdutyplanning-775884945f-9rppn
        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.0.jar:1.49.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        ... 1 common frames omitted

Type C

It is only thrown a few times across all our services.

2022-10-21 09:00:41.433  WARN 1 --- [essCompleted]-0] o.a.e.TrackingEventProcessor             : Error occurred. Starting retry mode.

org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:178) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.hasNextAvailable(EventBuffer.java:144) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:418) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processingLoop(TrackingEventProcessor.java:318) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1145) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.cleanUp(TrackingEventProcessor.java:1340) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.run(TrackingEventProcessor.java:1317) ~[axon-messaging-4.6.1.jar:4.6.1]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP status code 502
invalid content-type: text/html; charset=UTF-8
headers: Metadata(:status=502,content-type=text/html; charset=UTF-8,referrer-policy=no-referrer,date=Fri, 21 Oct 2022 09:00:41 GMT,alt-svc=h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,content-length=332)
DATA-----------------------------

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

DATA-----------------------------

        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.0.jar:1.49.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        ... 1 common frames omitted

Type D

2022-10-21 10:21:01.775  WARN 1 --- [lanningPools]-0] o.a.e.TrackingEventProcessor             : Error occurred. Starting retry mode.

org.axonframework.axonserver.connector.AxonServerException: The Event Stream has been closed, so no further events can be retrieved
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.peekNullable(EventBuffer.java:178) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.axonserver.connector.event.axon.EventBuffer.hasNextAvailable(EventBuffer.java:144) ~[axon-server-connector-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:418) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor.processingLoop(TrackingEventProcessor.java:318) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$TrackingSegmentWorker.run(TrackingEventProcessor.java:1145) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.cleanUp(TrackingEventProcessor.java:1340) ~[axon-messaging-4.6.1.jar:4.6.1]
        at org.axonframework.eventhandling.TrackingEventProcessor$WorkerLauncher.run(TrackingEventProcessor.java:1317) ~[axon-messaging-4.6.1.jar:4.6.1]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.grpc.StatusRuntimeException: INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR
        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.0.jar:1.49.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        ... 1 common frames omitted

Type E

Very frequent whenever we make the first API call from our frontend after the initial load. It sometimes also occurs while navigating through the frontend (when API calls are issued).

2022-10-21 10:52:29.639  INFO 1 --- [ult-executor-27] i.a.a.c.query.impl.QueryChannelImpl      : Error on QueryChannel for context our-context@our-workspace

io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.0.jar:1.49.0]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.0.jar:1.49.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

Kind regards
Daniel

Application logs attached as a PDF, since .txt is not allowed:
app.log.pdf (267.9 KB)

Any update @Vaish_Sriram?

Hi Daniel,

By the looks of the exceptions, there seems to be an intermediate network device/service that closes connections. Especially this text triggered me: “Your client has taken too long to issue its request”. gRPC uses two-way streaming, which effectively means both the request and the response (from an HTTP/2 perspective) stay open for a long time. Some load balancer or other network service may recognize the HTTP/2 traffic, think it’s a request that takes too long, and close it.

What does your network topology look like?

Hi @allardbz, thank you for digging through our logs :)

We are currently hosted on AWS. Our application is placed in a private subnet and accesses the internet through a NAT gateway.

I googled a bit, and the NAT gateway seems to close connections after they have been idle for 350 seconds. The recommended solution is to decrease the keep-alive time below this threshold.

This is our current configuration:

axon:
  axonserver:
    servers: ${AXON_URL}
    keep-alive-time: 0
    ssl-enabled: true
    context: ${AXON_CONTEXT}
    token: ${AXON_TOKEN}
    heartbeat:
      enabled: true

I thought the heartbeat would keep the connection from becoming idle. Perhaps I should adjust the keep-alive-time to something like 300000?
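For illustration, and assuming the keep-alive-time value is interpreted in milliseconds, the adjusted configuration would look roughly like this (300000 ms = 5 minutes, below the 350-second NAT idle timeout):

axon:
  axonserver:
    servers: ${AXON_URL}
    keep-alive-time: 300000
    ssl-enabled: true
    context: ${AXON_CONTEXT}
    token: ${AXON_TOKEN}
    heartbeat:
      enabled: true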

Why is it suggested to set this value to 0 in the default configuration?

Hi Daniel,

The keep-alive-time in the Axon configuration indicates the interval at which keep-alive messages are sent on the connection to AxonServer; setting it to 0 disables this check. Setting the value to 0 will prevent the LB from closing connections. Heartbeat monitoring, on the other hand, is an application-level ping between AxonServer and clients.

To resolve the timeout for the NAT gateway, please refer to AWS. This article may be useful: Troubleshoot connectivity when using a NAT gateway on a VPC

@Vaish_Sriram I already found this page. A viable solution for us would be to set the TCP keepalive below 350 seconds. How can we configure this in our Axon applications?

From what I understood, the keep-alive-time is not the same as TCP keepalive. What would be a use case for adjusting the keep-alive-time?

Another possibility would be to add retries to all queryGateway and commandGateway calls. It would be great if this was provided out of the box.

I remember reading some discussions about HTTP/2 keep-alives, and the general consensus was not to use them until you’re in a situation where you really need to. I can’t find the document right now, but it was somewhere in the gRPC repositories on GitHub.

TCP keepalive is there to keep a connection alive between requests. It is indeed different from HTTP/2 keepalive, which is there to validate whether the connection is still alive. The name is somewhat unfortunate, I think, because the setting itself doesn’t keep anything alive, but rather performs a liveness check. But since lower-level constructs call it that way, we didn’t feel like changing the name in our configuration.

Looking at the AWS documentation Vaish referred to, they suggest setting the TCP Keepalive to lower than 350 seconds, which means connections will be closed by the client before the NAT Gateway will.

Given that one of your errors was about a request taking too long, I doubt setting a keep-alive will solve it. It looks like the gateway needs to be configured to allow for long-running HTTP/2 requests. Sometimes setting the gateway to TCP mode helps; in that case, it doesn’t attempt to “understand” the traffic that is going on. We had similar issues in the past when using WebSockets, which effectively do the same, but over HTTP/1. Changing the load balancer to TCP mode solved it.

Note that using client-side heartbeat should prevent idle connection timeouts from happening. In that case, a regular message is sent over the connection, ensuring that the gateway recognizes it as active.

I believe this is currently being discussed. The main challenge is detecting transient exceptions; they aren’t always resolved in a single retry…

Ok, I understand. I didn’t find any way to configure anything meaningful about the NAT gateway in the AWS console.

We already enabled the heartbeat setting, without any change in our logs.

Another possibility would be to add retries to all queryGateway and commandGateway calls. It would be great if this was provided out of the box.

The Reactor Gateways provide this possibility, if you are willing to use the reactive API and the Reactor Extension.
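As a minimal sketch (the query and read-model types are hypothetical, and it assumes the Reactor Extension’s ReactorQueryGateway is on the classpath), a query with a few retries could look roughly like this:

import java.time.Duration;
import java.util.List;

import org.axonframework.extensions.reactor.queryhandling.gateway.ReactorQueryGateway;
import org.axonframework.messaging.responsetypes.ResponseTypes;

import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

public class OrderQueryService {

    // Hypothetical query and read-model types, only used for this sketch.
    public static class FindOrderSummaries { }
    public static class OrderSummary { }

    private final ReactorQueryGateway reactorQueryGateway;

    public OrderQueryService(ReactorQueryGateway reactorQueryGateway) {
        this.reactorQueryGateway = reactorQueryGateway;
    }

    public Mono<List<OrderSummary>> findOrderSummaries() {
        // Dispatch the query reactively and retry up to 3 times with exponential
        // backoff when the call fails transiently (e.g. UNAVAILABLE while reconnecting).
        return reactorQueryGateway
                .query(new FindOrderSummaries(), ResponseTypes.multipleInstancesOf(OrderSummary.class))
                .retryWhen(Retry.backoff(3, Duration.ofMillis(500)));
    }
}

The same retryWhen operator can be applied to the Mono returned by the ReactorCommandGateway’s send(...), so command dispatching can be made resilient in the same way.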