Better reconnect of services

Hi,

in production we had a case where a service, after a database issue, did not reconnect to Axon Server. It was still visible on the Axon Server dashboard, but we kept seeing
NoHandlerForCommandException in our logs. Restarting the service solved the problem, but I would expect the framework to handle this.

On production we run Axon Server 4.1.7 with Axon Framework 4.1.2; on QA I already run Axon Server 4.2.4, also with Axon Framework 4.1.2.

I read in the 4.2.2 release notes about a heartbeat:
“Optional heartbeat between Axon Server and Axon Framework clients”
Would this improve the situation? Do we also need to upgrade the framework?

What can we do to improve this?

Best, Michael

Hi Michael,

4.2 definitely has some improvements around reconnecting. There was a scenario where a reconnect would cause a deadlock while the framework was attempting to resubscribe its commands and queries. The application-level heartbeat also helps: we found out that in certain scenarios gRPC would not notice for a while that a connection was gone, unless the connection was actively used. Since Axon keeps connections open for an extended period of time, these heartbeats help detect failed connections earlier.
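
If you want to pick up those fixes, upgrading the framework itself is just a dependency bump. A minimal sketch, assuming you use the Spring Boot starter and a Gradle Kotlin DSL build:

`
// build.gradle.kts -- bump the Axon Spring Boot starter to a 4.2.x release
dependencies {
    implementation("org.axonframework:axon-spring-boot-starter:4.2.2")
}
`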

Kind regards,

Hi Allard,

thanks for the quick reply. But do we also have to upgrade the framework to 4.2.x to use this heartbeat? And how can we configure it?

Best, Michael

Hi Michael,

it has to be enabled on both Axon Server and the Framework. Note that the Server will only force-disconnect a client if it has received at least one heartbeat from it. This allows clients that don’t support heartbeats to still connect to the Server.

See https://docs.axoniq.io/reference-guide/operations-guide/setting-up-axon-server/heartbeat-monitoring for details.
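
In a Spring Boot setup it comes down to a property on each side. A rough sketch; the exact property names differ per version, so please verify them against the page above rather than copying this literally:

`
# Axon Server side (axonserver.properties): enable heartbeat monitoring
axoniq.axonserver.heartbeat.enabled=true

# Axon Framework client side (application.properties); the property name below is
# an assumption for 4.2.x, check the heartbeat-monitoring docs page for your version
axon.axonserver.heartbeat.auto-configuration.enabled=true
`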

Cheers,

Hi Michael, hi Allard,

I can confirm that we’ve experienced the same behaviour: first on production, and then we were able to replicate it locally.

With 2 services running, taking Axon Server down and bringing it up again leaves one of the services unable to handle commands anymore, even though the status of the connection seems okay.

The logs of the affected service show:

`

name. status=Status{code=UNAVAILABLE, description=Unable to resolve host eventstore, cause=java.lang.RuntimeException: java.net.UnknownHostException: eventstore
service_1  |    at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:420)
service_1  |    at io.grpc.internal.DnsNameResolver$Resolve.resolveInternal(DnsNameResolver.java:256)
service_1  |    at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:213)
service_1  |    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
service_1  |    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
service_1  |    at java.base/java.lang.Thread.run(Thread.java:834)
service_1  | Caused by: java.net.UnknownHostException: eventstore
service_1  |    at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
service_1  |    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)
service_1  |    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)
service_1  |    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)
service_1  |    at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:640)
service_1  |    at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:388)
service_1  |    ... 5 more
service_1  | }
service_1  | 2020-02-14 06:17:24.962  WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: Unable to resolve host eventstore
service_1  | 2020-02-14 06:17:29.968  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Connecting using unencrypted connection...
service_1  | 2020-02-14 06:17:29.980  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Requesting connection details from eventstore:8124
service_1  | 2020-02-14 06:17:29.999  WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: io exception
service_1  | 2020-02-14 06:17:35.000  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Connecting using unencrypted connection...
service_1  | 2020-02-14 06:17:35.010  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Requesting connection details from eventstore:8124
service_1  | 2020-02-14 06:17:35.024  WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: io exception
service_1  | 2020-02-14 06:17:39.991  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Connecting using unencrypted connection...
service_1  | 2020-02-14 06:17:40.002  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Requesting connection details from eventstore:8124
service_1  | 2020-02-14 06:17:40.370  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Reusing existing channel
service_1  | 2020-02-14 06:17:40.379  INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager      : Re-subscribing commands and queries
service_1  | 2020-02-14 06:17:40.387  INFO 7 --- [ectionManager-0] o.a.a.c.command.AxonServerCommandBus     : Resubscribing Command handlers with AxonServer
service_1  | 2020-02-14 06:17:40.389  INFO 7 --- [ectionManager-0] o.a.a.c.command.AxonServerCommandBus     : Creating new command stream subscriber

`

Axon Server Dashboard shows the service as connected.

But when the other service sends a command, the Axon Server prints:

`

eventstore_1      | 2020-02-14 06:35:48.122  WARN 7 --- [ool-5-thread-14] i.a.a.message.command.CommandDispatcher  : No Handler for command: command.SampleCommand

`

Only restarting the service helps in that scenario.

When performing tests on my local machine I was able to reproduce this behaviour in roughly 60% of cases, using the newest Axon Framework 4.2.2 and Axon Server 4.2.4.
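
For reference, our local reproduction is roughly the following docker-compose setup (illustrative only; the application image names are placeholders, and "eventstore" is the Axon Server container, as in the logs above). Bringing everything up, then stopping and starting the eventstore container, and finally dispatching a command from the second service triggers the issue:

`
# docker-compose.yml -- illustrative reproduction setup, not our exact file
version: "3"
services:
  eventstore:                                # Axon Server, resolved by clients as host "eventstore"
    image: axoniq/axonserver:4.2.4
    ports:
      - "8024:8024"                          # HTTP dashboard
      - "8124:8124"                          # gRPC port used by the framework clients
  handler-service:                           # the service whose logs are shown above (owns the SampleCommand handler)
    image: example/handler-service:latest    # placeholder image
    environment:
      - AXON_AXONSERVER_SERVERS=eventstore:8124
  sender-service:                            # the service that dispatches SampleCommand
    image: example/sender-service:latest     # placeholder image
    environment:
      - AXON_AXONSERVER_SERVERS=eventstore:8124
`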

So far we haven’t introduced heartbeat monitoring. It’s definitely something we would like to try, but we’re wondering whether it will get rid of this bug completely or only improve the probability of a successful reconnect.

Will update you on our findings.

Best regards,
Konrad Garlikowski


Hello again,

unfortunately the heartbeat didn’t help in our case. I’ve created an issue describing the problem: https://github.com/AxonFramework/AxonFramework/issues/1350

Best regards,
Konrad Garlikowski


Thanks,

we had the problem again yesterday: Axon Server ran out of disk space and not all services reconnected properly. We were offline.
We needed to restart all services manually.

This is a real blocker for us. Could you increase the priority of this issue, if possible?

Hi Michael,

we’ve put the issue on the backlog and our team is looking into how to fix this. Do note that if you’re running Axon Server in demanding situations, we recommend Axon Server Enterprise instead, as the failure of a single node will not result in an outage there.
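
When you move to a cluster, also make sure your clients list all nodes so they can fail over. In a Spring Boot application that is the axon.axonserver.servers property, for example (host names are placeholders for your own nodes):

`
# application.properties -- let the client fail over between the cluster nodes
axon.axonserver.servers=axonserver-1:8124,axonserver-2:8124,axonserver-3:8124
`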

Kind regards,

Thanks,

we are upgrading to the Enterprise version right now.
Our tests will include killing one node, but also killing all 3 nodes…

Hopefully killing a single node is then no longer an issue, so the problematic scenario should not happen that often anymore.

One more thing: if we update Axon Server, can this be done node by node in the Enterprise version, or do we need to take down all 3 nodes for this?

I see that 4.3 is already available…

Best, Michael

Hi Michael,

you can upgrade nodes one by one, as long as you upgrade one minor version at a time. We guarantee that any version x.y is compatible with x.[y+1].

Cheers,