Intermittent error when connecting to Axon Server

Hi all,

(Axon Server 4.4.10, Axon Framework 4.4.5 for Java)

Our services sometimes have trouble establishing connections to Axon Server. We sadly don’t have any way to reliably reproduce the error at the moment, but we noticed that the first service to connect to a context can do so successfully (100% of the time), while the ones that subsequently connect to the same context usually don’t work properly.

The service we are having issues with has to connect to two contexts; I don’t know if that has any impact. It is also a “white-labelled” application, meaning that we deploy multiple (5+) instances of it.

Here’s how we noticed that only one instance of our service can run properly at a time:

  1. We stopped all our dev instances
  2. We started our-app-label-aaa (it’s working)
  3. We started our-app-label-bbb (it’s not working)
  4. We stopped our-app-label-aaa and our-app-label-bbb
  5. We started our-app-label-bbb (it’s working)
  6. We started our-app-label-aaa (it’s not working)

We tried the same kind of combination with some other labels (e.g. ccc, ddd, …) and noticed that the first app we started was always OK, while the second one was (~90% of the time) unable to send a message to the context properly.

So we’re wondering how we can debug this and find out what we should fix. Is it a connection limit on the server side (is there some sort of connection pool)? Or is there something we could fix on the client side? Which loggers should we enable at DEBUG or TRACE level to understand what’s going on?
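Judging by the package names in the stack traces below, we assume these would be the loggers to turn on (Spring Boot application.yml style), but please correct us if others are more useful:

logging:
  level:
    # Axon Server Connector for Java (channel setup, reconnects)
    io.axoniq.axonserver.connector: DEBUG
    # Axon Framework's Axon Server integration (command/query dispatching)
    org.axonframework.axonserver: DEBUG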

The full stack trace is below. We have Istio in our stack, which might help pinpoint the issue. The client side receives a “connection reset”:

org.axonframework.axonserver.connector.query.AxonServerQueryDispatchException: UNAVAILABLE: upstream connect error or disconnect/reset before headers. reset reason: local reset
    at org.axonframework.axonserver.connector.ErrorCode.lambda$static$16(ErrorCode.java:112)
    at org.axonframework.axonserver.connector.ErrorCode.convert(ErrorCode.java:182)
    at org.axonframework.axonserver.connector.ErrorCode.convert(ErrorCode.java:213)
    at org.axonframework.axonserver.connector.ErrorCode.convert(ErrorCode.java:202)
    at java.util.Optional.map(Unknown Source)
    at org.axonframework.axonserver.connector.query.AxonServerQueryBus$ResponseProcessingTask.run(AxonServerQueryBus.java:722)
    ... 5 common frames omitted
Wrapped by: java.util.concurrent.CompletionException: org.axonframework.axonserver.connector.query.AxonServerQueryDispatchException: UNAVAILABLE: upstream connect error or disconnect/reset before headers. reset reason: local reset
    at java.util.concurrent.CompletableFuture.reportJoin(Unknown Source)
    at java.util.concurrent.CompletableFuture.join(Unknown Source)
    at com.our.co.our.app.controller.OurController.findByNumberAndClientNumber(OurController.java:35)
... 19 frames excluded
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at or...

Note that we sometimes see this as well:

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 3.401354560s. [buffered_nanos=3599511605, waiting_for_connection]
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.axoniq.axonserver.grpc.control.PlatformServiceGrpc$PlatformServiceBlockingStub.getPlatformServer(PlatformServiceGrpc.java:250)
	at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.connectChannel(AxonServerManagedChannel.java:115)
	at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.createConnection(AxonServerManagedChannel.java:319)
	at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.ensureConnected(AxonServerManagedChannel.java:299)
	at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.lambda$new$0(AxonServerManagedChannel.java:100)
	...
	at java.lang.Thread.run(Unknown Source)

We seem to have fixed the gRPC connection reset issue by using the list of Axon Server nodes (e.g. axon-1.svc.cluster.local:8124,axon-2.svc.cluster.local:8124,axon-3.svc.cluster.local:8124) in the client’s axon.axonserver.servers property, instead of the load-balanced service URL.
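Roughly, the relevant part of our client configuration now looks like this (application.yml style; the hostnames are of course specific to our cluster):

axon:
  axonserver:
    # comma-separated list of all Axon Server nodes, instead of the single load-balanced service URL
    servers: axon-1.svc.cluster.local:8124,axon-2.svc.cluster.local:8124,axon-3.svc.cluster.local:8124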

Good to hear this, Alexandre!

This Axon blog post is a very good tutorial on deploying Axon Server EE in Kubernetes.

apiVersion: v1
kind: Service
metadata:
  name: axonserver-grpc
  labels:
    app: axonserver
spec:
  ports:
  - name: grpc
    port: 8124
    targetPort: 8124
  clusterIP: None
  selector:
    app: axonserver

The Service for the gRPC port has the default type “ClusterIP”, with clusterIP set to None, making it (in Kubernetes terminology) a Headless Service. This is important because a StatefulSet needs at least one Headless Service to enable DNS exposure within the Kubernetes namespace. Additionally, client applications use long-living connections to the gRPC port and are expected to be able to connect explicitly to a specific node. The client applications will be deployed in their own namespace and can connect to Axon Server using Kubernetes-internal DNS.

Client applications should list all the Axon Server nodes (DNS names/URLs) to improve availability and resiliency.
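As a sketch, assuming the Axon Server StatefulSet is named axonserver, uses the axonserver-grpc headless Service above as its serviceName, has three replicas, and runs in a namespace also called axonserver, a client application could list the individual nodes like this:

axon:
  axonserver:
    # each StatefulSet pod is addressable through the headless Service as
    # <pod>.<headless-service>.<namespace>.svc.cluster.local
    servers: axonserver-0.axonserver-grpc.axonserver.svc.cluster.local:8124,axonserver-1.axonserver-grpc.axonserver.svc.cluster.local:8124,axonserver-2.axonserver-grpc.axonserver.svc.cluster.local:8124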

Load balancers between the client applications and the Axon Server nodes are not needed, and they can cause issues (as you have experienced).

Best.