Hi Michael, hi Allard,
I can confirm that we have seen the same behaviour, first in production and then locally, where we were able to reproduce it.
With two services running, taking Axon Server down and bringing it back up leaves one of the services unable to handle commands, even though its connection status looks fine.
The service logs show:
`
name. status=Status{code=UNAVAILABLE, description=Unable to resolve host eventstore, cause=java.lang.RuntimeException: java.net.UnknownHostException: eventstore
service_1 | at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:420)
service_1 | at io.grpc.internal.DnsNameResolver$Resolve.resolveInternal(DnsNameResolver.java:256)
service_1 | at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:213)
service_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
service_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
service_1 | at java.base/java.lang.Thread.run(Thread.java:834)
service_1 | Caused by: java.net.UnknownHostException: eventstore
service_1 | at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
service_1 | at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)
service_1 | at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)
service_1 | at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)
service_1 | at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:640)
service_1 | at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:388)
service_1 | ... 5 more
service_1 | }
service_1 | 2020-02-14 06:17:24.962 WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: Unable to resolve host eventstore
service_1 | 2020-02-14 06:17:29.968 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting using unencrypted connection...
service_1 | 2020-02-14 06:17:29.980 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Requesting connection details from eventstore:8124
service_1 | 2020-02-14 06:17:29.999 WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: io exception
service_1 | 2020-02-14 06:17:35.000 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting using unencrypted connection...
service_1 | 2020-02-14 06:17:35.010 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Requesting connection details from eventstore:8124
service_1 | 2020-02-14 06:17:35.024 WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: io exception
service_1 | 2020-02-14 06:17:39.991 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting using unencrypted connection...
service_1 | 2020-02-14 06:17:40.002 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Requesting connection details from eventstore:8124
service_1 | 2020-02-14 06:17:40.370 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Reusing existing channel
service_1 | 2020-02-14 06:17:40.379 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Re-subscribing commands and queries
service_1 | 2020-02-14 06:17:40.387 INFO 7 --- [ectionManager-0] o.a.a.c.command.AxonServerCommandBus : Resubscribing Command handlers with AxonServer
service_1 | 2020-02-14 06:17:40.389 INFO 7 --- [ectionManager-0] o.a.a.c.command.AxonServerCommandBus : Creating new command stream subscriber
`
The Axon Server dashboard shows the service as connected.
However, when the other service sends a command, Axon Server logs:
`
eventstore_1 | 2020-02-14 06:35:48.122 WARN 7 --- [ool-5-thread-14] i.a.a.message.command.CommandDispatcher : No Handler for command: command.SampleCommand
`
Only restarting the service helps in that scenario.
Testing on my local machine, I was able to reproduce this behaviour in roughly 60% of attempts, using the latest Axon Framework 4.2.2 and Axon Server 4.2.4.
We have not enabled heartbeat monitoring yet. It is definitely something we would like to try, but we are wondering whether it will eliminate this bug entirely or only make a successful reconnect more likely.
I will update you on our findings.
Best regards,
Konrad Garlikowski
On Tuesday, 4 February 2020 at 15:39:25 UTC+1, Allard Buijze wrote: