Axon Server: Query scalability

Hi

My application consists of 5 client apps and 1 Axon Server running in Kubernetes. All clients run as 2 instances; Axon Server is a single-instance Standard Edition. (Yes, I know a clustered Enterprise Edition would be better, but we are a startup :-).
All versions are the latest: Java 17, Kotlin, ARM processors.

We make heavy use of queries, which has worked well from a functional perspective so far, but we are now running into scalability (performance and stability) issues.
I think I understand the Axon setup pretty well now (~2 years in production), but I’m still struggling with the right configuration for Axon Server and the clients to get good query throughput and stability.

We have query peaks: at times, hundreds or thousands of queries are started at once. Most queries are fast, but there are also some slow ones (> 10 seconds). As data volume and system load increase, we get more and more issues with queries (stream errors from Axon Server, disconnected and restarted clients). It simply feels like the Axon query functionality does not scale well. Maybe we have to redesign parts of our system because of this flaw/bug in Axon? Hopefully not!

Currently, I do not know how to approach these issues. I could not find meaningful hints in the logs and metrics provided by Axon Framework and Axon Server (I use the “standard” Grafana dashboards for Axon Server and clients).

  • What are the metrics I should look for?

Configuration

  • What are the config settings to play with?

Threads:

  • I know there is “query-threads” for server side.
  • But what about the client side? Do I have to set query threads on both the server and the client?
  • The docs do not say much about these settings.

Message size:

  • I know that the message size is 4MB by default, and we have already increased it to 8MB because some queries can have a large result payload.
  • But how is the max-message-size configured properly?
    • Is it required that all clients and the server use the same max-message-size?
    • What happens if client and server use different settings?
  • Nothing in the docs about this.

Hmm, does no one have an opinion on this topic?

Regarding the query threads: setting this on the server side does not increase throughput. Axon Server uses a single stream to send queries to a client that handles the queries, so increasing the number of threads sending messages to this stream only causes more contention on it. You can increase the number of threads in the client application by setting the property axon.axonserver.query-threads. The default value for this property is 10.
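For example, assuming the clients are Spring Boot applications configured via application.properties (the value 25 below is only an illustration, not a recommendation):

```properties
# Client side (Axon Framework): number of threads processing queries
# received from Axon Server. The default is 10.
axon.axonserver.query-threads=25
```

More handler threads only help if the query handlers themselves are the bottleneck, of course; if they are mostly waiting on a slow data store, raising this will not gain much.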
For the message size, it is important to configure the same value for the clients and for the server. If a client has a smaller message size than the server, a query handler could send a large response to Axon Server, but Axon Server would not be able to deliver that message to the requester. You would also be able to store large events in the event store, but not be able to read them again.
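To make that concrete, a minimal sketch assuming Spring Boot style properties files on both sides (the property names and the byte values are to the best of my knowledge; please double-check them against the versions you run):

```properties
# Client applications (Axon Framework), e.g. application.properties
# 8 MB expressed in bytes; keep this in sync with Axon Server
axon.axonserver.max-message-size=8388608

# Axon Server, e.g. axonserver.properties
# Keep identical to the clients' value
axoniq.axonserver.max-message-size=8388608
```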