Axon Server crash when particular handlers connect

Hello,

our Axon Server in one of the development environments has started crashing for no apparent reason. The whole JVM just collapses on itself. I have uploaded the full JVM error log at default paste at 2023-06-13 20:53:59

We have pinpointed that the crash happens right after one of the microservices connects to Axon Server. The event handlers of the microservice start to come online and then, bam, the server is down.

I suspect there is data corruption somewhere in the event store or in the JDBC token store. Any ideas on how to debug and clean this up? And better yet, how to prevent this from happening in the future?

Hi Vilius, sorry to hear about this crash. Could you please add the logs of Axon Server and the microservice? This will show us what they were doing and any possible issues they saw. Also, it would be really helpful to know what versions of Axon Server and the Axon Framework (assuming the microservice is using it) were used.

Also, seeing a reference to “Distroless” and io.axoniq.axonserver.AxonServer in the crash log, I am assuming you’re running Axon Server SE in a container. If that is a container image from Docker Hub, can you provide the full tag for it? Are you running it in Kubernetes, Docker, or using some other platform? What resource limits (CPU and memory) were in effect?

Cheers,
Bert Laverman

@Bert_Laverman sorry for the delay. We bit the bullet and just cleared our Axon Server event store and the JDBC Axon token store. This fixed the crash in the staging environment.

However, today another testing environment started to crash with exactly the same behaviour. This time I noticed that the crashing actually started during normal operations: Axon Server just started crashing every couple of minutes, in cycles. I only noticed this a couple of hours later, so I tried to restart the mentioned microservice again. Obviously this made things worse, because the service cannot start now. Stopping the microservice stops the crashes on the Axon Server side, so it looks like it is the same issue.

I have put the logs at the shared link

We are using Axon Server 4.6.11 and Framework 4.6.7. Axon Server is running in a container using the official image from Docker Hub, axoniq/axonserver:4.6.11. We are running it in Kubernetes with the following limits set:

        resources:
          requests:
            memory: "1Gi"
            cpu: "200m"
          limits:
            memory: "4Gi"
            cpu: "750m"

Ah, ok. We generally recommend starting with 8GiB of memory (2 for the Heap, 2 for “Direct Memory,” and the rest for applications and disk cache) and 2 vCPUs, and only reducing that if monitoring shows the resources are really unused. Axon Server is a stateful infrastructure service, not a stateless microservice. Without enough CPU resources, it may not respond promptly enough for k8s to consider it alive and healthy.
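Translated into a Kubernetes resources block, that starting point would look roughly like this (a sketch only; tune the numbers to what your monitoring actually shows):

        resources:
          requests:
            memory: "8Gi"
            cpu: "1"
          limits:
            memory: "8Gi"
            cpu: "2"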

Bert

Since this is a separate container just for Axon Server, and the applications have their own limits, 4GB of memory should be more than enough. Monitoring also shows that we currently only use ~1GB in total. The same goes for CPU; it hardly goes above 0.2 CPU at any given time.

Forgot to mention: the password for the logs is BadStreak

Monitoring also shows that currently we only use ~1GB in total. The same for CPU, it hardly goes above 0.2 CPU any given time.

Yes, but when it does spike out of need, with such low limits Axon Server won’t have the resources to respond on its probe endpoints, let alone send heartbeats to the other nodes in the cluster, and k8s-enforced restarts can result.

Again: Axon Server is not a microservice. Make sure it has enough resources to handle spikes in load, or k8s will think it is misbehaving and kill it.
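As an illustration of the probe side of this: a liveness probe with generous timings gives Axon Server room to be slow under CPU throttling without being killed. A sketch, assuming the default Axon Server HTTP port 8024 and its standard health endpoint:

        livenessProbe:
          httpGet:
            path: /actuator/health
            port: 8024
          initialDelaySeconds: 60   # allow event store validation to finish at startup
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6       # tolerate brief throttling spikes before restarting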

Bert Laverman

I completely understand that Axon Server is not a microservice. However, even during spikes, which we measure every 5 seconds, Axon Server doesn’t use much CPU, and when running ANY load on Kubernetes it is best practice to set limits at some point. We give it 3-4x more CPU limit than the biggest spike usage.

To tell you the truth, I would gladly run Axon Server on a bare VM; however, we still struggle to find a convenient way to automate such infrastructure configuration via code. A Docker image is a much more viable option to work with than, let’s say, a plain .jar file.

Nevertheless, I have given it 2 full vCPUs now. Let’s see if the problem reappears.

In the meantime, do you have any insight into what could be happening with that event store? Is our only hope to try upgrading to the eclipse-temurin Docker images instead of the distroless ones we are using now, or is there another option we could try?

Hey Vilius,
Without version information and logs, as I said in my first reply, I cannot say more. The Axon Server logging will show what it is doing, when client apps connect or disconnect, and if it encountered any errors. The fact that the error disappeared when you cleaned out the Event Store files suggests there was a potential problem in those files, but again, I would expect something about that to appear in the logging.

Bert

I have provided full logs of Axon Server and the microservice, as well as version information, in my replies above. Resending once more:

I have put the logs at the shared link
The password for the logs is BadStreak

We are using Axon Server 4.6.11 and Framework 4.6.7. Axon Server is running in a container using the official image from Docker Hub, axoniq/axonserver:4.6.11

Ok, I downloaded the files.

The Axon Server logs show it crashing pretty quickly after starting up, which is when you’d expect Axon Server to be pretty busy. Further, I see 20 GC events, which would be 10 collection cycles, and that is pretty worrying for so short a run. The JVM parameters show it takes 50% for the heap, which leaves 2GiB for DirectMemory, the JVM itself, any disk I/O buffers, and the OS. Again, very tight.

Again, I suggest you set those limits at the recommended levels and see if you still have these problems with those settings.

Bert

So what are the recommended levels when running Axon Server in a container? I tried to find them in the documentation and have asked a couple of times in these forums. The last time, I got a response from someone at AxonIQ that 2GB of heap and 2GB for direct memory and buffers (a 50% heap-to-off-heap ratio) should be enough to start with. When run in a container, the JVM and OS take a couple of hundred megabytes at most.

Yes, those are the numbers in the Axon Server training. However, the Kubernetes examples in the “Running Axon Server” GitHub repository, at 3-k8s/4-k8s-ee-ssts-tls/axonserver-sts.yml.tmpl, actually use:

        resources:
          limits:
            cpu: "2"
            memory: "12Gi"
          requests:
            cpu: "1"
            memory: "8Gi"

Could you elaborate on these parameters?

Previously you said:

        We generally recommend starting with 8GiB of memory (2 for the Heap, 2 for “Direct Memory,” and the rest for applications and disk cache) and 2 vCPUs.

According to the Kubernetes settings above, that leaves Axon Server with 3 GB of heap (25% of 12 GB, which is the default JVM heap ratio) and 9 GB for everything else. Isn’t this overkill, considering that our monitoring measures 1-1.5 GB total memory usage at all times?

Also, what are those “applications” you mentioned? Isn’t Axon Server the only application in the container?

Well, first of all “we generally recommend” is not “for Kubernetes we recommend.” On a (physical or virtual) server there are always other apps. Second, it is better not to depend on some percentage of available memory. Axon Server does not use that much heap, so for SE 2GiB should be enough. If you don’t set an explicit limit on the JVM’s heap, it will consume more, resulting in unnecessarily large GC workloads.

But most importantly: if you are having a problem, and the JVM error log shows there were 10 GC runs due to a full heap, then it may be that less is enough later on, but certainly not during startup. With the recommended limits we have not seen such behavior. Problems due to too-low CPU limits are also something we have seen, especially during startup. Not only is there a lot of checking work (verifying the Event Store files and indexes), but all clients also clamor for attention, registering handlers, requesting replays, and so on. Having all problems disappear when you throw away the Event Store does not necessarily mean file corruption; it can also mean there is less of this startup pressure.

I guess this makes sense. I have adjusted CPU to recommended levels already.

The problem I see is with the memory configuration. The JVM by default uses 25% of the total memory available on the system for the heap. In the Kubernetes world, this means 25% of the memory available on the Kubernetes node OR 25% of the memory set in the container limits, if they are set. Generally, if you want to run multiple applications on the same Kubernetes cluster node, and not dedicate it to Axon Server alone, then you must set limits on the containers of all these applications (otherwise you risk running into resource overload/overlap and container eviction storms).

So if I translate your physical/virtual VM 8 GB recommendation to Kubernetes, that would be an 8-9 GB limit under default JVM settings, which leaves 2-2.25 GB for the heap and 6-6.75 GB for direct memory and everything else. Or I would set the JVM heap ratio to 50% using JVM options and then set the container limit to 4-5 GB, which will leave Axon Server with 2-2.5 GB of heap and 2-2.5 GB for direct memory and everything else.
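The second variant, sketched as a container-spec fragment (assuming the 50% ratio is set via the -XX:MaxRAMPercentage option passed through JAVA_TOOL_OPTIONS):

        resources:
          limits:
            memory: "4Gi"
        env:
          - name: JAVA_TOOL_OPTIONS
            value: "-XX:MaxRAMPercentage=50.0"   # heap = 50% of the 4Gi container limit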

I would prefer the second variant, because there is no other application in the container, and OS/JVM consumption in a container is very minimal (I have measured this multiple times, for multiple Java services). If you say that Axon Server’s memory usage is a ~50/50 heap-to-off-heap ratio (2GB heap, 2GB direct memory), then a 12 GB limit and 8 GB request configuration is, IMHO, a total waste of resources, as they will never be occupied by anything else. Even more so, I would say that the Kubernetes or JVM configuration in the official Axon Server Docker images should then have the heap ratio adjusted.

Well, our experience is that you get the best and most predictable setup by not depending on defaults but rather using JVM parameters for heap and direct memory, giving each a 2GiB limit. You can use the JAVA_TOOL_OPTIONS environment variable for this.
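For example, as the env entry of a Kubernetes container spec (a sketch of the explicit 2GiB-each settings mentioned above):

        env:
          - name: JAVA_TOOL_OPTIONS
            value: "-Xmx2g -XX:MaxDirectMemorySize=2g"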

If you are referring to the -Xmx and -XX:MaxDirectMemorySize properties, then our current configuration does exactly that. The only difference is that we control memory size not via these JVM properties, but via a -XX:MaxRAMPercentage and container limit combination. The limits are still explicit, but this way they are controlled by the container settings instead of the JVM. And yes, currently they are 2GB + 2GB.

Ok, if that way of configuring it works for you, then by all means, use it.

Just as a side note, have you considered running Axon Server in a VM instead of Kubernetes? Since you’re using a non-clustered instance of Axon Server, adding k8s into the setup increases your chances of having downtime. For example, when the Pod is vacated from the Worker Node because of a k8s upgrade, Axon Server will be down during the migration to another node. Using a VM gives you not just enhanced control over CPU, memory, and storage, but also less upgrade-related downtime. I’ll immediately agree deployment is easier on Kubernetes, but it kind of feels like a waste to have a single Worker node just for Axon Server, if I understand your earlier remarks correctly. Even a dedicated Docker/containerd setup would probably improve this.

To tell you the truth, I would gladly run Axon Server on a bare VM; however, we still struggle to find a convenient way to automate such infrastructure configuration via code. A Docker image is a much more viable option to work with than, let’s say, a plain .jar file.

I’m not sure a VM would be less work, though. You still have to upgrade the OS on the VM, and failing over to a different VM is more difficult than failing over to a different Kubernetes node during maintenance.

We run more applications than just Axon Server on the same node; that’s why we have limits set on the containers of all these applications, so that we have complete control over CPU and memory resources.