We had an incident in production where, because of a DB issue, a service was not reconnecting to Axon Server. It was still visible on the Axon dashboard, but we had
NoHandlerForCommandException in our logs. Restarting our service solved the problem, but I would expect the framework to handle this.
In production we run Axon Server 4.1.7 with Framework 4.1.2; on QA I already run Axon Server 4.2.4, also with Axon Framework 4.1.2.
I read in the release notes about the heartbeat introduced in 4.2.2:
“Optional heartbeat between Axon Server and Axon Framework clients”
Would this improve the situation? Do we also need to upgrade the framework?
4.2 definitely has some improvements around reconnecting. There was a scenario where a reconnect would cause a deadlock, as the framework was also attempting to resubscribe to commands and queries. The application-level heartbeat also helps: we found that in certain scenarios gRPC would not detect for a while that a connection was gone unless it was actually used. Since Axon keeps connections open for an extended period of time, these heartbeats help detect failed connections earlier.
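For illustration, the detection mechanism described above can be sketched roughly as follows. This is not the Axon API, just a minimal, self-contained example of an application-level heartbeat: the client records when the last heartbeat arrived, and a check against a timeout flags the connection as dead even while the transport still considers the channel healthy. All class and method names here are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch of an application-level heartbeat monitor (hypothetical,
// not the Axon API). The client records the time of the last heartbeat
// received from the server; a check compares that timestamp against a
// timeout and considers the connection dead when too much time has passed.
public class HeartbeatDemo {

    static class HeartbeatMonitor {
        private final long timeoutMillis;
        private volatile long lastBeatMillis;

        HeartbeatMonitor(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
            this.lastBeatMillis = System.currentTimeMillis();
        }

        // Called whenever a heartbeat (or any server message) arrives.
        void onHeartbeat() {
            lastBeatMillis = System.currentTimeMillis();
        }

        // True when no heartbeat arrived within the timeout window;
        // this is the point where a real client would force a reconnect.
        boolean connectionConsideredDead() {
            return System.currentTimeMillis() - lastBeatMillis > timeoutMillis;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatMonitor monitor = new HeartbeatMonitor(100);

        monitor.onHeartbeat();
        System.out.println("alive=" + !monitor.connectionConsideredDead());

        // Simulate the server going silent for longer than the timeout.
        TimeUnit.MILLISECONDS.sleep(150);
        System.out.println("alive=" + !monitor.connectionConsideredDead());
    }
}
```

Without such a check, a connection that was killed while idle only fails the next time it is used, which matches the behaviour described above.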
It has to be enabled on both the Server and the Framework. Note that the Server will only force a disconnection of a client if it has received at least one heartbeat from it. This allows clients that don’t support heartbeats to still connect to the Server.
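If I remember correctly, with Spring Boot this can be switched on through configuration properties on both sides. The property names below are from memory and may be wrong for your versions, so please verify them against the reference guide before relying on them:

```properties
# Client side (Axon Framework application) - assumed property name, verify in the docs
axon.axonserver.heartbeat.enabled=true

# Server side (axonserver.properties) - assumed property name, verify in the docs
axoniq.axonserver.heartbeat.enabled=true
```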
I can actually confirm that we’ve experienced the same behaviour, first in production, and then we were able to replicate it locally.
With two services running, taking Axon Server down and up again leaves one of the services unable to handle commands anymore, even though the status of the connection seems okay.
The logs of the affected service show:
```
name. status=Status{code=UNAVAILABLE, description=Unable to resolve host eventstore, cause=java.lang.RuntimeException: java.net.UnknownHostException: eventstore
service_1 | at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:420)
service_1 | at io.grpc.internal.DnsNameResolver$Resolve.resolveInternal(DnsNameResolver.java:256)
service_1 | at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:213)
service_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
service_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
service_1 | at java.base/java.lang.Thread.run(Thread.java:834)
service_1 | Caused by: java.net.UnknownHostException: eventstore
service_1 | at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
service_1 | at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)
service_1 | at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)
service_1 | at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)
service_1 | at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:640)
service_1 | at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:388)
service_1 | ... 5 more
service_1 | }
service_1 | 2020-02-14 06:17:24.962 WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: Unable to resolve host eventstore
service_1 | 2020-02-14 06:17:29.968 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting using unencrypted connection...
service_1 | 2020-02-14 06:17:29.980 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Requesting connection details from eventstore:8124
service_1 | 2020-02-14 06:17:29.999 WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: io exception
service_1 | 2020-02-14 06:17:35.000 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting using unencrypted connection...
service_1 | 2020-02-14 06:17:35.010 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Requesting connection details from eventstore:8124
service_1 | 2020-02-14 06:17:35.024 WARN 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting to AxonServer node [eventstore]:[8124] failed: UNAVAILABLE: io exception
service_1 | 2020-02-14 06:17:39.991 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Connecting using unencrypted connection...
service_1 | 2020-02-14 06:17:40.002 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Requesting connection details from eventstore:8124
service_1 | 2020-02-14 06:17:40.370 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Reusing existing channel
service_1 | 2020-02-14 06:17:40.379 INFO 7 --- [ectionManager-0] o.a.a.c.AxonServerConnectionManager : Re-subscribing commands and queries
service_1 | 2020-02-14 06:17:40.387 INFO 7 --- [ectionManager-0] o.a.a.c.command.AxonServerCommandBus : Resubscribing Command handlers with AxonServer
service_1 | 2020-02-14 06:17:40.389 INFO 7 --- [ectionManager-0] o.a.a.c.command.AxonServerCommandBus : Creating new command stream subscriber
```
The Axon Server dashboard shows the service as connected.
But when the other service sends a command, Axon Server prints:
```
eventstore_1 | 2020-02-14 06:35:48.122 WARN 7 --- [ool-5-thread-14] i.a.a.message.command.CommandDispatcher : No Handler for command: command.SampleCommand
```
Only restarting the service helps in that scenario.
When testing on my local machine I was able to reproduce this behaviour in roughly 60% of cases, using the newest Axon Framework 4.2.2 and Axon Server 4.2.4.
So far we haven’t introduced heartbeat monitoring. It’s definitely something we would like to try, but we’re wondering whether it will eventually get rid of this bug completely or only improve the probability of a successful reconnect.
We will update you on our findings.
Best regards,
Konrad Garlikowski
On Tuesday, 4 February 2020 at 15:39:25 UTC+1, Allard Buijze wrote:
We had the problem again yesterday: Axon Server ran out of disk space, and not all services reconnected properly. We were offline.
We needed to restart all services manually.
This was a real blocker for us. Could you increase the priority of this, if possible?
We’ve put the issue on the backlog and our team is looking into how to fix it. Do note that if you’re using Axon Server in highly demanding situations, we recommend Axon Server Enterprise instead, as a single node’s failure will then not result in an outage.