AxonServer 4.0.2 and 4.0.3 on Kubernetes

Hi,

I’m experiencing some strange behavior with AxonServer 4.0.2 and 4.0.3 on Kubernetes
(I’m using the official Docker image from https://hub.docker.com/r/axoniq/axonserver/ )

  • So far, the 4.0 version has been working without any issues.
    After switching to 4.0.3 (I also tried 4.0.2), I started to see out-of-memory errors
    from AxonServer:

Exception in thread "event-stream-1" io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 469762327, max: 477626368) at io.netty.util.internal.PlatformDependent.incrementMemoryCounter

I noticed the startup script in the official image has a hardcoded -Xmx512m memory setting:
https://github.com/AxonIQ/axon-server-dockerfile/blob/master/src/main/docker/startup.sh

I guess this should be configurable; it doesn’t appear to be sufficient for the newer versions. Anyway, I built a custom image,
raised the memory limit, and the out-of-memory errors went away, but then I noticed another thing.
This occurs during replay:

"Connecting to AxonServer node axonserver:8124 failed: UNAVAILABLE: Unable to resolve host axonserver"
[io.grpc.internal.ManagedChannelImpl-16] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host axonserver, cause=java.net.UnknownHostException: axonserver at java.net.InetAddress.getAllByName0(InetAddress.java:1281) at java.net.InetAddress.getAllByName(InetAddress.java:1193) at java.net.InetAddress.getAllByName(InetAddress.java:1127) at io.grpc.internal.DnsNameResolver$JdkResolver.resolve(DnsNameResolver.java:497) at io.grpc.internal.DnsNameResolver$1.run(DnsNameResolver.java:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) }

(this happens only a couple of times, and then it reconnects)
I have no clue why this pops up; nothing has changed except the AxonServer version upgrade from 4.0 to 4.0.3.
I’ve never had such an issue with 4.0.

Another thing I see after replay: when I check the indexes of replayToken.getTokenAtReset() and replayToken.getCurrentToken(),
even after the replay has finished and the event processor has caught up, currentToken’s global index remains at 46946
although tokenAtReset’s global index is 46962. They should end up with the same index once all events have been replayed, right?
It has always been like that so far; what does this mean? Was not everything replayed?
I can’t see any event processing errors (in the logs of my projections) or any other exceptions for that matter, only
the ‘Unable to resolve host axonserver’ errors.
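
For context, this is roughly how I read those tokens, via the processor status (a simplified Kotlin sketch; “MyProjection” is a placeholder for my actual processing group, and eventProcessingConfiguration is injected):

import org.axonframework.config.EventProcessingConfiguration
import org.axonframework.eventhandling.ReplayToken
import org.axonframework.eventhandling.TrackingEventProcessor

fun logTokens(eventProcessingConfiguration: EventProcessingConfiguration) {
    eventProcessingConfiguration
        .eventProcessorByProcessingGroup("MyProjection", TrackingEventProcessor::class.java)
        .ifPresent { processor ->
            processor.processingStatus().forEach { (segment, status) ->
                val token = status.trackingToken
                if (token is ReplayToken) {
                    // Compare the position reached so far with the position captured at reset
                    println(
                        "segment=$segment caughtUp=${status.isCaughtUp} replaying=${status.isReplaying} " +
                            "currentToken=${token.currentToken} tokenAtReset=${token.tokenAtReset}"
                    )
                }
            }
        }
}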

These problems only occur on GKE; running on my local machine everything looks fine.
Has something changed since 4.0 that I might have overlooked?
Meanwhile I have reverted to 4.0, and things are back to normal with zero issues.

Thanks,
Regards

Hi Vilmos,

How did you configure your limits on GKE? The “unable to resolve host” error might be an indication that K8s stopped a pod and was in the process of restarting it.
Note that -Xmx does not have any influence on the OutOfDirectMemoryError. The latter occurs when Netty hits its limit for off-heap memory; the former is a limit on the amount of heap AxonServer may use. AxonServer uses quite a bit of off-heap memory for its processes (especially for buffering data for storage and networking), but doesn’t require much heap.
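
If you want to see which limits a JVM is actually running with, a quick sketch like the following (just an illustration, assuming Netty 4.1+ is on the classpath) prints the heap ceiling that -Xmx controls next to the direct-memory ceiling Netty tracks, which can be raised with -XX:MaxDirectMemorySize or -Dio.netty.maxDirectMemory:

import io.netty.util.internal.PlatformDependent

fun main() {
    // Heap limit: this is what -Xmx controls
    println("Heap max:           ${Runtime.getRuntime().maxMemory()} bytes")
    // Off-heap ceiling and usage as Netty tracks them
    // (usedDirectMemory() may return -1 if Netty is not tracking allocations in this setup)
    println("Direct memory max:  ${PlatformDependent.maxDirectMemory()} bytes")
    println("Direct memory used: ${PlatformDependent.usedDirectMemory()} bytes")
}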

Which AxonServer Connector version are you using?

Cheers,

Allard

Hi Allard,

Thanks for the answer!

Some updates on the topic:

  • The replay and reset token index discrepancy is solved: I dropped the Mongo read DBs and token stores. No idea what could have
    gone wrong there; something with the tracking tokens, perhaps?

  • Another thing I noticed: once I do a replay, from that point on the token of that processor is always a replay token,
    even after the replay has finished (Mongo token store, version 4.0.1 of the axon-mongo extension, 4.0.3 for everything else).
    Is that supposed to be normal? This is from a normal startup of my projection, without any replay:

INFO 19368 --- [rdProjection]-3] o.a.e.TrackingEventProcessor: Fetched token: ReplayToken{currentToken=IndexTrackingToken{globalIndex=46965}, tokenAtReset=IndexTrackingToken{globalIndex=46965}} for segment: Segment[2/3]

As a consequence, trackerStatus.isReplaying() always returns true, even when it is not replaying anymore.
By the way, isCaughtUp() always returns true no matter what; here it is during replay:

(attachment: replay.png)

I’m doing the replay like this:

getProcessingGroups().forEach(processingGroup -> {
   eventProcessingConfiguration
         .eventProcessorByProcessingGroup(processingGroup, TrackingEventProcessor.class)
         .ifPresent(trackingEventProcessor -> {
            trackingEventProcessor.shutDown();
            trackingEventProcessor.resetTokens();
            trackingEventProcessor.start();
         });
});

and I trigger it right after application startup:

@EventListener(ApplicationReadyEvent::class)
fun on() {
    if (needsReplay) {
        replayHandler.doReplay()
    }
}

Am I doing something wrong here?

  • Regarding the connection errors, I’ll play with the resource settings. I did not set any limits on AxonServer, nor can I see a pod restart, but I’ll do some more testing. Thanks!

Regards

Hi Vilmos,

The “replay” flag staying active with the token you provide is a known issue, which has been fixed and will be part of the next release. Note that the flag is switched off as soon as a single event passes by that is not part of the replay. The fix is for the case where tokenAtReset and currentToken are equal, in which case the replay flag should also be disabled.
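
If you need to tell “still replaying” apart from “replay effectively done” before that release, a rough, untested sketch of such a check (assuming the 4.x ReplayToken and TrackingToken API) could look like this:

import org.axonframework.eventhandling.ReplayToken
import org.axonframework.eventhandling.TrackingToken

fun effectivelyReplaying(token: TrackingToken?): Boolean {
    val replayToken = token as? ReplayToken ?: return false
    // If the replay has not advanced yet, there is no current position; treat it as still replaying
    val current = replayToken.currentToken ?: return true
    // Once the position reached during the replay covers the position captured at reset,
    // the replay is effectively done, even though the ReplayToken wrapper is still stored
    return !current.covers(replayToken.tokenAtReset)
}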

The “caught up” flag is set to true once the event stream did not provide a next event for processing. I’ll have a look into why there is such an obvious false positive here.

Cheers,

Allard
