Hanging system: No more queries processed

klauss42 · June 30, 2021, 4:16pm

Hi
I have a problem with a “hanging” application node which blocks further query processing. We are using Axon Framework v4.5.2 and Axon Server SE version 4.5.2.

Scenario is as follows: 1 frontend node, 1 backend node running.
The frontend app sends a lot of commands and queries to Axon Server, and the backend nodes handle those commands and queries properly. Sometimes we run into a situation where no more queries are processed at all and the frontend app gets stuck, because no more data can be retrieved via queries.

Adding a new node does not help: It is registered properly in Axon Server, you can see it in the AS UI. But queries still are not answered. If the first backend node is stopped, queries get answered from the new node and system behaves normally again. But it looks like the system would not recover without killing/restarting the “hanging” backend process.

We do not see any obvious error messages. The only suspicious message from the Axon Server log is:

WARN 1 — [MessageBroker-4] i.a.axonserver.message.query.QueryCache : Found 5 waiting queries to delete
WARN 1 — [MessageBroker-4] i.a.axonserver.message.query.QueryCache : Cancelling query …
…

Our only assumption is that we run too many queries in a short period of time. But this should not break the system so badly that it does not recover.

Does anybody have any hints as to what the problem may be?
How can we analyze the problem?

Thanks
Klaus

Marc_Gathier · July 1, 2021, 9:40am

When this happens again, can you check the /actuator/health page for Axon Server? It contains a section “query” that shows the number of queries queued on Axon Server per client application (should normally be 0) and the number of remaining permits (typically a value between 2500 and 5000).
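To make that check repeatable, here is a minimal Python sketch that groups the entries of the “query” section by client and flags anything that looks off. It assumes the section has a "details" map whose keys end in ".waitingQueries" or ".permits"; the client names in the sample are hypothetical, and the 2500 threshold is simply the lower end of the typical range mentioned above, not an official limit.

```python
import json

# Assumed threshold: the lower end of the "typically 2500 to 5000" permit range.
MIN_PERMITS = 2500


def summarize_query_section(query_section: dict, min_permits: int = MIN_PERMITS) -> dict:
    """Group the flat "details" entries by client and flag suspicious clients."""
    details = query_section.get("details", {})
    clients: dict = {}
    for key, value in details.items():
        # Assumption: the metric name is the part after the last dot in the key.
        client, _, metric = key.rpartition(".")
        if metric in ("waitingQueries", "permits"):
            clients.setdefault(client, {})[metric] = value
    for stats in clients.values():
        # Suspicious means: queries are queued, or permits are unusually low.
        stats["suspicious"] = (
            stats.get("waitingQueries", 0) > 0
            or stats.get("permits", min_permits) < min_permits
        )
    return clients


# Hypothetical sample; real keys contain the actual client id and context.
SAMPLE = json.loads("""
{
  "status": "UP",
  "details": {
    "backend-a.default.waitingQueries": 0,
    "backend-a.default.permits": 4218,
    "backend-b.default.waitingQueries": 7,
    "backend-b.default.permits": 1200
  }
}
""")

if __name__ == "__main__":
    for client, stats in summarize_query_section(SAMPLE).items():
        print(client, stats)
```

Pass in only the “query” object from the health response, since where it sits in the full response can differ between setups.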

klauss42 · July 1, 2021, 9:44am

Thanks, this is a good hint and I will check if this happens again.
What can we do if there are queries queued? Can or should we increase some Axon Server system property?

Marc_Gathier · July 1, 2021, 2:04pm

If queries are queued, it means that the handler is slower than the producer. You can either run more instances of the handler, in which case Axon Server will distribute the queries between the instances, or check whether it makes sense to configure more threads for query handling.
To configure more threads for query handling in the application, set the property axon.axonserver.query-threads (the default value is 10). Note that you may have to increase the size of your database connection pool as well (spring.datasource.hikari.maximum-pool-size).
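As a sketch of these settings, a Spring Boot application.properties might look like the following. The two property names are the ones mentioned in this post; the values are illustrative assumptions only, not recommendations, and should be sized against your own load and database limits.

```properties
# Threads the application uses to handle queries from Axon Server (default: 10).
# The value 20 is an illustrative assumption.
axon.axonserver.query-threads=20

# Query handlers that hit the database each need a connection, so the pool
# may need to grow along with the thread count (assumed value).
spring.datasource.hikari.maximum-pool-size=30
```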

klauss42 · July 5, 2021, 8:48am

Hi again
I still get this problem and I could not see any queries queued in Axon Server, at least the response from health endpoint does not look as if there are queries queued up:

    "query": {
      "status": "UP",
      "details": {
        "1@qa-backend-6d9cbbf9b7-4phvz.0fa75173-e3ad-4f28-956b-efdf39826ac0.default.waitingQueries": 0,
        "1@qa-backend-6d9cbbf9b7-4phvz.0fa75173-e3ad-4f28-956b-efdf39826ac0.default.permits": 4218,
        "1@qa-backend-6d9cbbf9b7-972p8.99a0b0a2-3601-4aae-be54-5b9cef9d99c1.default.waitingQueries": 0,
        "1@qa-backend-6d9cbbf9b7-972p8.99a0b0a2-3601-4aae-be54-5b9cef9d99c1.default.permits": 4221,
        "1@qa-backend-6d9cbbf9b7-4x2q7.064041dc-96d0-4876-a8e3-7c5d05ed7e40.default.waitingQueries": 0,
        "1@qa-backend-6d9cbbf9b7-4x2q7.064041dc-96d0-4876-a8e3-7c5d05ed7e40.default.permits": 4951
      }
    },

At the moment I captured the health info above, the 2 backend nodes 4phvz and 972p8 were registered properly in Axon Server but did not respond to queries. I do not see any errors in the logs of these nodes; they simply do not get requests from Axon Server. After restarting the 2 nodes, queries are handled again.

This is really bad behavior, as it looks like Axon Server is not reliable and we do not get any hints as to what the problem may be.

Any help is appreciated
Klaus

Marc_Gathier · July 15, 2021, 12:40pm

Do you use subscription queries in your application? We have recently discovered an issue where creating a few thousand subscription queries could block query processing. This is fixed in axonserver-connector-java version 4.5.2, which is part of Axon Framework version 4.5.3, to be released shortly. Until that Framework release is available, you can add a dependency on the new axonserver-connector-java in your project.
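For a Maven build, such an override could look roughly like the fragment below. The version is the one named above; the groupId is my assumption of the published coordinates, so please verify it against your existing dependency tree before relying on it.

```xml
<!-- Assumed coordinates; check them against your build's dependency tree. -->
<dependency>
    <groupId>io.axoniq</groupId>
    <artifactId>axonserver-connector-java</artifactId>
    <version>4.5.2</version>
</dependency>
```

Once the project moves to Axon Framework 4.5.3, the explicit override should no longer be needed and can be removed.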

klauss42 · July 15, 2021, 1:21pm

Hi Marc
thanks for the follow-up on this topic.
Yes, we have quite a few subscription queries, but not thousands. We have suspected the subscription queries for a long time already, but could not see a real correlation. I have set up a lot of monitoring in Grafana to check whether any metric (Axon Server, JVM, Boot app) gives a hint about the “freeze”, but could not see anything meaningful so far.

So this is good news that there may be a fix and we will check the new version and see if stability improves.

Thanks
Klaus

klauss42 · July 29, 2021, 12:08pm

Hi
just to give some feedback:
It seems that applying version 4.5.3 fixed the behavior I described in this thread. At least we have not encountered any “hanging queries” since.

Thanks a lot
Klaus

lfgcampos · July 30, 2021, 8:32pm

Good news @klauss42, thanks for letting us know.

In this case, I’ve marked Marc’s response as the solution to the thread.

KR,