I have a problem with a “hanging” application node which blocks further query processing. We are using Axon Framework v4.5.2 and Axon Server SE version 4.5.2.
The scenario is as follows: one frontend node and one backend node are running.
The frontend app sends a lot of commands and queries to Axon Server, and the backend node handles those commands and queries properly. Sometimes we run into a situation where no more queries are processed at all and the frontend app gets stuck, because no more data can be retrieved via queries.
Adding a new node does not help: it is registered properly in Axon Server and is visible in the AS UI, but queries are still not answered. If the first backend node is stopped, queries get answered by the new node and the system behaves normally again. So it looks like the system does not recover without killing/restarting the “hanging” backend process.
We do not see any obvious error messages. The only suspicious message from the Axon Server log is:
WARN 1 --- [MessageBroker-4] i.a.axonserver.message.query.QueryCache : Found 5 waiting queries to delete
WARN 1 --- [MessageBroker-4] i.a.axonserver.message.query.QueryCache : Cancelling query …
Our only assumption is that we run too many queries in a short period of time. But this should not break the system in such a way that it does not recover.
Does anybody have any hints as to what the problem may be?
How can we analyze the problem?