AxonServer 4.4.1 doesn't seem to find old (4.2.4) snapshots

Hi!

Here’s the situation: we need to update from 4.2.4 to 4.4.1 (due to this issue: https://github.com/AxonIQ/axon-server-se/issues/105, 4.2.4 doesn’t start anymore).

When we start up AxonServer 4.4.1 on our current data with (/opt/axonserver/data is a Docker-mounted directory with all the files in it):

- AXONIQ_AXONSERVER_EVENT_STORAGE=/opt/axonserver/data
- AXONIQ_AXONSERVER_SNAPSHOT_STORAGE=/opt/axonserver/data
- AXONIQ_AXONSERVER_CONTROLDB-PATH=/opt/axonserver/data

It starts up just fine, no issues. But once we fire a new event on any of our Aggregates (which have many thousands events on them), AxonServer doesn’t seem to find our snapshots and starts rebuilding the aggregate from scratch.

I know this because I see in the logs that thousands of upcasters are executed (and also it takes a loooooooong time to process. In fact I don’t know how long because I have never let it finish and just terminted after some minutes).

So, any ideas why 4.4.1 doesn’t find our snapshots?
Or, can I somehow enable “debug logging for snapshot logic” to get some more insight into why it doesn’t load the snaphots?

Thanks, Martin

Hi Martin,
When you have Axon Server 4.4.1 running can you perform the following checks:

  • on the ://actuator/env URL and search for AXONIQ_AXONSERVER_SNAPSHOT_STORAGE
  • on the ://actuator/metrics/axon.snapshot.lastToken URL and check the value

If you are using docker-compose can you share the configuration for Axon Server?
I have tried to reproduce this locally and here it works as expected.

Marc

In the /env JSON I see:

{
    "name": "systemEnvironment",
    "properties": {
      "AXONIQ_AXONSERVER_SNAPSHOT_STORAGE": {
        "value": "/opt/axonserver/data",
        "origin": "System Environment Property \"AXONIQ_AXONSERVER_SNAPSHOT_STORAGE\""
      }
    }

So, that seems to work. But I also see:

{
    "name": "applicationConfig: [file:./axonserver.properties]",
    "properties": {
      "axoniq.axonserver.snapshot.storage": {
        "value": "/eventdata",
        "origin": "URL [file:./axonserver.properties]:12:36"
      }
  }

Maybe the system property does not correctly override the axonserver.properties property? :thinking:

The /axon.snapshot.lastToken gives me:

{
  "name": "axon.snapshot.lastToken",
  "description": "Last snapshot token in the event store",
  "baseUnit": null,
  "measurements": [{
    "statistic": "VALUE",
    "value": 140723.0
  }],
  "availableTags": [{
    "tag": "axonserver",
    "values": ["3d6e37ffe8c6"]
  }, {
    "tag": "context",
    "values": ["default"]
  }]
}

Which seems correct to me. The new snapshot file has the name: 00000000000000138647.snapshots.

So it seems to be loaded. But: Why isn’t it used then? :thinking:

Martin

Oh and here is the docker-compose that we use:

version: '3.7'
volumes:
  axon-data:
services:
  axon-server:
    image: axoniq/axonserver:4.4.1
#    image: axoniq/axonserver:4.2.4-jdk11
    network_mode: bridge
    expose:
      - 8024
      - 8124
    ports:
      # Uncomment this for local development
       - 127.0.0.1:8124:8124
       - 127.0.0.1:8024:8024
    volumes:
      - axon-data:/opt/axonserver/data
    environment:
      - AXONSERVER_TOKEN=##AXON_TOKEN##
      - AXONIQ_AXONSERVER_EVENT_STORAGE=/opt/axonserver/data
      - AXONIQ_AXONSERVER_SNAPSHOT_STORAGE=/opt/axonserver/data
      - AXONIQ_AXONSERVER_CONTROLDB_PATH=/opt/axonserver/data
      - AXONIQ_AXONSERVER_ACCESSCONTROL_TOKEN=##AXON_TOKEN##
      - AXONIQ_AXONSERVER_ACCESSCONTROL_ENABLED=true
    restart: always

So, from Axon Server side all seems to be working as expected. One additional thing you could do is to call http://localhost:8024/v1/events?aggregateId=, with an aggregate id that you know has a snapshot and check if it starts with a snapshot.

Do you have any snapshotFilters defined in the application?

So I managed to enable debug logging and found these lines:

2020-09-23 09:07:41.464 DEBUG 1 --- [           main] i.a.a.l.file.InputStreamEventSource      : Open file /opt/axonserver/data/default/00000000000027816162.events
2020-09-23 09:07:41.602  INFO 1 --- [           main] i.a.a.l.file.PrimaryEventStore           : Opening file /opt/axonserver/data/default/00000000000000138647.snapshots
2020-09-23 09:07:41.608 DEBUG 1 --- [           main] i.a.a.localstorage.file.Synchronizer     : Initializing at WritePosition{sequence=140724, position=5589898, segment=138647}
2020-09-23 09:07:41.609 DEBUG 1 --- [           main] i.a.a.localstorage.file.Synchronizer     : Scheduled syncJob
2020-09-23 09:07:41.609 DEBUG 1 --- [           main] i.a.a.localstorage.file.Synchronizer     : Scheduled forceJob
2020-09-23 09:07:41.609 DEBUG 1 --- [           main] i.a.a.l.file.StandardIndexManager        : default: open bloom filter for 0
2020-09-23 09:07:41.610 DEBUG 1 --- [           main] i.a.a.l.file.PersistedBloomFilter        : Opening bloom filter : /opt/axonserver/data/default/00000000000000000000.sbloom
2020-09-23 09:07:41.610 DEBUG 1 --- [           main] i.a.a.l.file.StandardIndexManager        : 0: open /opt/axonserver/data/default/00000000000000000000.sindex
2020-09-23 09:07:41.612 DEBUG 1 --- [           main] i.a.a.l.file.SegmentBasedEventStore      : First valid index: 0
2020-09-23 09:07:41.613 DEBUG 1 --- [nPool-worker-19] i.a.a.l.file.SegmentBasedEventStore      : default: Validating SNAPSHOT segment: 138647
2020-09-23 09:07:41.613 DEBUG 1 --- [           main] i.a.a.l.file.SegmentBasedEventStore      : default: Validating SNAPSHOT segment: 0
2020-09-23 09:07:41.613 DEBUG 1 --- [           main] i.a.a.l.file.SegmentBasedEventStore      : Get eventsource: 0
2020-09-23 09:07:41.613 DEBUG 1 --- [           main] i.a.a.l.file.InputStreamEventSource      : Open file /opt/axonserver/data/default/00000000000000000000.snapshots

Looks pretty good to me.

I tried with several different aggregates and found some strange behavior:
On most of the aggregates, snapshots seem to work. the event stream starts with a high sequence number and shows less then our snapshot limit events.
But on some aggregates it doesn’t seem to be working. It just loads and loads until the timeout is reached. So it doesn’t seem to be related to our application, or its upcasters or anything. But rather is a problem inside AxonServer with those aggregates.

Should I open another thread for this or should we continue here?

And no, I don’t think we have snapshotFilters defined (since I just heared about their existance from you :wink: ).

So here is a new insight:

As I already wrote in the opening post, we see a lot of upcasters being executed for those aggregates that are problematic. During this upcaster-marathon, we just saw this log entry:

org.axonframework.axonserver.connector.AxonServerException: OUT_OF_RANGE: [AXONIQ-2000] Invalid sequence number while storing snapshot

So, apparently, AxonServer doesn’t use the stored snaphots for these aggregates. Then it starts building the aggregate from scratch. And during this process, it, of course, tries to create snapshots. Which it cannot store (I guess, because there already are snapshots stored for these aggregates).

The corresponding code line is: https://github.com/AxonIQ/axon-server-se/blob/6eb98fbd79e88a0f6576b230cfeb98af71add342/axonserver/src/main/java/io/axoniq/axonserver/localstorage/LocalEventStore.java line 206.

Does this help in any way?

One more log I just found in the AxonServer debug log:

2020-09-23 10:23:53.030 DEBUG 1 --- [data-fetcher-18] i.a.a.l.file.StandardIndexManager        : default: lookupAggregate chargepoint_9d812e37-0669-4665-b76e-43c2de38dfb5 minSequenceNumber 0, lastSequenceNumber 9223372036854775807
2020-09-23 10:23:53.030  WARN 1 --- [data-fetcher-19] i.a.a.message.event.EventDispatcher      : appendSnapshot:  Error on connection from event store: [AXONIQ-2000] Invalid sequence number while storing snapshot

What’s up with that absurdly high lastSequenceNumber?`🤔
(just found out that it’s the max values of unsigned longs)

The large lastSequenceNumber is just to indicate that it has to find all events for the aggregate (not limiting to a specific sequence number). The Invalid sequence number message is happening when the sequence number in the snapshot is larger than the last sequence number for an event for the aggregate.
This can happen with Axon Framework versions before 4.3.4, where the snapshot could be stored before the events causing the snapshot are stored.

Alright, so that seems to be what happened to us. But how can we fix this now?

If you cannot upgrade the clients, you can switch of this check in Axon Server by setting the property:
axoniq.axonserver.check-sequence-nr-for-snapshots=false.
Using Docker this is easiest using environment variable
JAVA_TOOL_OPTIONS="-Daxoniq.axonserver.check-sequence-nr-for-snapshots=false"

Will try this. I just tried upgrading the clients which results in tons of issues :confused:

The one I’m currently facing last is:
The first like 10 requests go by just fine, but I start seeing this and every requests just hangs until timed out:

Failed to acquire lock for aggregate identifier(3e6f7029-786e-46cb-a546-a04111f67d70), maximum attempts exceeded (100))

It seems to be related to JPA/HikariPool. With the new Axon framework version it seems like this new version tries to open a new connection for every request it does. And the limit seems to be 10. Any idea how or why this might happen?