Trying to set up a clone of our Axon Server on my local machine for development

Hi there, I’m trying to run a clone of our shared environment’s Axon Server on my local machine so that I can test out new features without impacting other team members.

Our shared environment runs an Enterprise Edition 3-node cluster in Kubernetes, but I only want to run a single server/node on my local machine. To that end, I’m using docker-compose to run the following 9 Docker containers (rough compose sketch after the list):

  • postgres
  • Axon Server EE
  • (our 7 client apps)
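
For reference, here is a trimmed-down sketch of my docker-compose.yml (image names and credentials are placeholders; the EE image comes from our private registry, since Enterprise Edition isn’t on the public Docker Hub):

version: "3.8"
services:
  postgres:
    image: postgres:14                # assumption: match whatever major version QA runs
    environment:
      POSTGRES_USER: dev              # placeholder credentials
      POSTGRES_PASSWORD: dev
    ports:
      - "5432:5432"

  axonserver:
    image: our-registry/axonserver-enterprise:latest  # placeholder EE image name
    hostname: axonserver
    ports:
      - "8024:8024"   # HTTP / dashboard
      - "8124:8124"   # client gRPC
      - "8224:8224"   # internal (cluster) gRPC

  client-app-number-1:
    image: our-registry/client-app-number-1:latest    # one of our 7 Spring Boot apps
    depends_on:
      - postgres
      - axonserver
  # ... plus the remaining 6 client apps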

I’ve read and followed all of the instructions on the Backups and Recovery pages (the resulting volume mounts are sketched after this list), namely:

  • Created a backup of the control database, unzipped it, and copied its contents into the Docker container at:
    • /axonserver/data/axonserver-controldb.mv.db
  • Queried the “events” and “snapshots” filenames, and copied these files into the Docker container at:
    • /axonserver/events/our-app-context/00000000000000000000.events
    • /axonserver/events/our-app-context/00000000000000000000.snapshots
  • And queried the log file names, and copied the single file the query returned (even though the URL name is plural, and I can see more *.log files inside the Docker container in our shared environment?) into the Docker container at:
    • /axonserver/log/default/00000000000000000001.log
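
To keep these files in place across container restarts, I mount them in via docker-compose rather than copying them in by hand (the host-side paths here are just my local layout):

  axonserver:
    volumes:
      # restored control DB backup
      - ./backup/data/axonserver-controldb.mv.db:/axonserver/data/axonserver-controldb.mv.db
      # restored event store segments for our context
      - ./backup/events/our-app-context:/axonserver/events/our-app-context
      # restored replication log segment(s)
      - ./backup/log/default:/axonserver/log/default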

Next, I created a cluster-template.yaml file containing the configuration that I wish to run on my local machine:

axoniq:
  axonserver:
    cluster-template:
      first: ${LOCAL_AXONSERVER}

      users:
        - roles:
            - context: _admin
              roles:
                - ADMIN
            - context: our-app-context
              roles:
                - USE_CONTEXT
          password: "@dmin"  # quoted: a YAML plain scalar cannot start with ‘@’
          userName: admin

      replicationGroups:
        - roles:
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}
          name: _admin
          contexts:
            - name: _admin
              metaData:
                event.index-format: JUMP_SKIP_INDEX
                snapshot.index-format: JUMP_SKIP_INDEX
        - roles:
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}
          name: default
          contexts:
            - name: our-app-context
              metaData:
                event.index-format: JUMP_SKIP_INDEX
                snapshot.index-format: JUMP_SKIP_INDEX

      applications:
        - token: ${CLIENT_APP_TOKEN}
          name: client-app-number-1
          roles:
            - roles:
                - USE_CONTEXT
              context: our-app-context
          description: ""
        # etc. for the remaining 6 apps

But as you can probably already see by now, I run into a dilemma:

Approach 1: include control DB:

If I copy “axonserver-controldb.mv.db” into the Axon Docker container, I get the following message in the console:

Current node name has changed, new name axonserver. Start AxonServer with recovery file.

But then if I include a recovery.json file:

[
  {
    "name": "axonserver1",
    "oldName": "axonserver-0-0",
    "hostName": "axonserver",
    "internalHostName": "axonserver",
    "internalGrpcPort": 8224,
    "httpPort": 8024,
    "grpcPort": 8124
  }
]

…and add axoniq.axonserver.recoveryfile=/axonserver/config/recovery.json into my “axonserver.properties” file, then at least it successfully renames my node, but then I get the following error:

Unknown host: axonserver-1-0.axonserver-svc.axonserver-ee.svc.cluster.local

…which suggests that Axon Server is still looking for the other 2 nodes, which the control database says should be there. But as I said at the start, I don’t want to run all 3 nodes of the cluster on my machine, just 1 of them. And so I assume that I don’t want to go down this path… (am I correct?)

Which leads me to:

Approach 2: only copy over events & snapshots:

With this approach, Axon Server starts up fine, and I can open the dashboard at http://localhost:8024/#query and see my events. But then once my first Spring Boot client application starts up, it begins sourcing the events, and I get a bunch of these errors:

2023-06-07 03:09:35.162 (Application trying to apply various events)
2023-06-07 03:10:05.373 Error occurred. Starting retry mode.
java.lang.IllegalStateException: The UnitOfWork is in an incompatible phase: NOT_STARTED
	at org.axonframework.common.Assert.state(Assert.java:44)
	at org.axonframework.messaging.unitofwork.AbstractUnitOfWork.rollback(AbstractUnitOfWork.java:123)
	at org.axonframework.messaging.unitofwork.UnitOfWork.attachTransaction(UnitOfWork.java:276)
	at org.axonframework.eventhandling.TrackingEventProcessor.processBatch(TrackingEventProcessor.java:459)
  ...
	at java.base/java.lang.Thread.run(Thread.java:832)
2023-06-07 03:10:05.374 Releasing claim on token and preparing for retry in 1s
2023-06-07 03:10:05.529 Error:
io.axoniq.dataprotection.api.DataException: ADPM-5010. SQL Exception.
	at io.axoniq.dataprotection.internal.y.G.d(uk:311)
	at io.axoniq.dataprotection.cryptoengine.JdbcCryptoEngine.getKey(jka:103)
  ...
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30008ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:696)
  ...
	at io.axoniq.dataprotection.cryptoengine.JdbcCryptoEngine.getKey(jka:51)
	... 58 common frames omitted
  
// more of the same

Am I going about this the right way? Is the problem in our app’s code, or how I’ve configured my server, or both, or something else?

Hi there!

I see you have tried several things, but let me start with a few questions:

  1. Is your goal to test new features in Axon Server, the Axon Framework, or your own application?
  2. Just to make sure: is the “shared environment” the production environment, or a non-critical pre-production/QA/test environment?
  3. Do you really need all the data from production (the shared environment) on your dev setup?
  4. Do you need a highly available cluster of Axon Server nodes or “just Axon Server”?

I ask this because I think you may have made your setup a bit more complex than needed.

About the different files you copied over:

  • The control database contains all the configuration data for the cluster, including application registrations (with hashed tokens) and user accounts (with hashed passwords). If you are building a laptop-local environment just to run tests, you will definitely run into the DNS configuration issues you noted, and you will also have to use production tokens and passwords, which is normally not a good idea.
  • You did not mention this, but there are actually two backup endpoints for the Event Store, one of which does not include the active segment. This is described on the page you refer to. You will miss some of the newer events if you leave out the active segment, but if you are just trying out new features that may not be a problem.
  • The replication log contains events that are being sent between the nodes in the cluster. You normally only copy those if you aim to create a mirror image of the data. Again, if you are just trying to get an environment for testing, you can leave it out. Additionally, since they are used for synchronizing the data between nodes, be advised that in an active environment, it may turn out to be difficult to get a coherent set across all nodes.
  • The data in the replication log is cleaned according to a configurable schedule. Older segments for which the data has already been replicated and committed, but which have not yet been cleaned, can be safely left out. I suspect this is the reason for the discrepancy between the provided list and the files you actually found.

Using a recovery file only to rename node 1 will indeed still cause problems for the other nodes. As I said above, the control database contains the configuration data for the entire cluster, so the node will keep trying to reach the other nodes at the shared environment’s DNS names.

Finally, the errors you are getting appear to point to the key store for the Data Protection module being unavailable. Is that perhaps also a DNS renaming issue?

Hi Bert, thanks so much for the quick reply! To answer your questions:

  1. My main goal is to test out replaying events (and possibly, event upcasting) without worrying about impacting others’ use of our shared environment. But I also like the idea of being able to test out any future development locally first.
  2. The shared environment in question is our QA / Test environment.
  3. I’m open to any suggestions here, but I assumed that concepts like event replay & upcasting carried a higher risk of “screwing something up”, and so I should test them out locally first. In theory I don’t need all of the data from our shared environment in my local setup, just enough to be able to test replays & upcasting.
  4. Just Axon Server. If possible, I’d like to run just a single Axon Server (EE) instance, not a cluster. But the only way I could see to export our QA env’s Axon Server configuration was the “Export cluster-template.yaml” button on the Settings page in the Dashboard, and I thought running a 1-node cluster would be close enough. I guess I could also try running a regular single-server instance and manually configuring it (i.e., users, contexts, and apps) to match our QA env; see the sketch below. I will try this and report back, but I wonder if I’ll still run into the same problems outlined above?
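
For the record, my understanding is that EE can also bootstrap a standalone node via the “autocluster” properties instead of a cluster template, roughly like this in axonserver.yml (this is just my reading of the docs, so the property names may be off):

axoniq:
  axonserver:
    autocluster:
      first: axonserver                   # this node bootstraps the (1-node) cluster
      contexts: _admin,our-app-context    # contexts to create on first start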

If there’s a simpler way to achieve my goal here, I’m all ears. :slight_smile:

Also, thanks for clarifying the purpose of each backup file. So it sounds like for my situation, I only need to worry about copying over the “.events” and “.snapshots” files, is that correct?

Regarding those stack traces I shared, thanks for pointing out they were related to the Data Protection module keys. I see now from the setup steps article for Data Protection that we need a database table in which to store the cryptographic keys. I can see this table in our QA environment (ours is located at "public.data_protection_keys"), and so I exported all of its records into an equivalent table in my local Postgres, and tried re-running my first client Spring Boot application, but unfortunately I get the same errors (i.e., The UnitOfWork is in an incompatible phase: NOT_STARTED). Edit: Never mind, I just realised that not having a key in the table shouldn’t generate an error, as this would be akin to crypto shredding.

Are there some further Data Protection configurations that I need to apply for this to work, or am I going down the wrong path here?

So, given that the only scenario covered in the documentation is restoring an Axon Server configuration exactly as it was into a new environment, I tried a new approach instead:

Approach 3: run a full 3-node cluster locally w/ control DB, events, & snapshots:

I am now running 3 Axon Server nodes as individual Docker services, changing their hostnames with the following recovery.json file:

[
  {
    "name": "axonserver-1",
    "oldName": "axonserver-0-0",
    "hostName": "axonserver-1",
    "internalHostName": "axonserver-1",
    "internalGrpcPort": 8224,
    "httpPort": 8024,
    "grpcPort": 8124
  },
  {
    "name": "axonserver-2",
    "oldName": "axonserver-1-0",
    "hostName": "axonserver-2",
    "internalHostName": "axonserver-2",
    "internalGrpcPort": 8224,
    "httpPort": 8024,
    "grpcPort": 8124
  },
  {
    "name": "axonserver-3",
    "oldName": "axonserver-2-0",
    "hostName": "axonserver-3",
    "internalHostName": "axonserver-3",
    "internalGrpcPort": 8224,
    "httpPort": 8024,
    "grpcPort": 8124
  }
]

And with the following changes to cluster-template.yaml:

      # ... rest of file

      replicationGroups:
        - roles:
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}-1
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}-2
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}-3
          name: _admin
          contexts:
            - name: _admin
              metaData:
                event.index-format: JUMP_SKIP_INDEX
                snapshot.index-format: JUMP_SKIP_INDEX
        - roles:
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}-2
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}-1
            - role: PRIMARY
              node: ${LOCAL_AXONSERVER}-3
          name: default
          contexts:
            - name: our-app-context
              metaData:
                event.index-format: JUMP_SKIP_INDEX
                snapshot.index-format: JUMP_SKIP_INDEX

      # ...

And this part works fine. All 3 nodes come online with new hostnames, a leader is elected, etc. And when I launch my first client application, it successfully connects to the leader directly, and then to the other 2 nodes indirectly.
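
For reference, the Axon Server part of my docker-compose.yml now looks roughly like this (the image name and volume layout are placeholders; all three nodes share the same recovery.json):

  axonserver-1:
    image: our-registry/axonserver-enterprise:latest   # placeholder EE image
    hostname: axonserver-1
    volumes:
      - ./node-1/data:/axonserver/data        # per-node control DB
      - ./node-1/events:/axonserver/events    # per-node event store
      - ./config:/axonserver/config           # axonserver.properties + recovery.json
    ports:
      - "8024:8024"

  axonserver-2:
    image: our-registry/axonserver-enterprise:latest
    hostname: axonserver-2
    volumes:
      - ./node-2/data:/axonserver/data
      - ./node-2/events:/axonserver/events
      - ./config:/axonserver/config
    ports:
      - "8025:8024"   # offset the host port so each node’s dashboard stays reachable

  # axonserver-3: same pattern again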

So if my understanding is correct, and I performed all of the documented steps correctly, then my local setup should be pretty much an exact clone of our QA environment, and this should all theoretically work, right?

But unfortunately, shortly after launching my first client app, I still get those same errors as before, i.e.:

2023-06-08 17:48:28.104 Error occurred. Starting retry mode.
java.lang.IllegalStateException: The UnitOfWork is in an incompatible phase: NOT_STARTED

I can’t find much about these errors online. What do they mean, exactly? Should I just ignore them and set my data source timeout to a lower value, so that it doesn’t block for 30 seconds each time?

Ok, I think we need someone with a bit more knowledge of the Framework. It could be related to a replay trying to continue a transaction that was never started, or to saved Saga state.

And I got a reply from a colleague: The first stacktrace showed the db connection error as the root cause. This happens after 30 seconds of trying to connect. Most likely the database container takes (too) long to start up and the framework app tries to access it before the db is ready.

Hi Bert, thanks for your replies. I have some new information to share, and apologies if this is no longer related to my original question (but in my mind, it all falls under the umbrella of setting up a local test environment :slight_smile:):

Regarding the errors above, it seems that our Hikari connection pool is getting exhausted. The default max pool size is 10, and in the logs I can see my application creating 10 DB connections. And even if I increase Hikari’s max pool size from the default of 10 to 100, the app will simply create 100 DB connections instead of 10, and go back to throwing the same errors. One of my colleagues steered me towards the documentation for event streaming which states:

Streaming Processors use separate threads to process the events retrieved from the StreamableMessageSource. Using separate threads decouples the StreamingEventProcessor from other operations (e.g., event publication or command handling), allowing for cleaner separation within any application.

Is this threading / parallelisation of streaming the culprit behind all of these database connections that I’m seeing, or have we configured something wrong in our Axon Server / Hikari? If it’s the former, how do we limit the number of threads (and by extension, database connections) created by Axon during event streaming?
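
For what it’s worth, the closest knob I’ve found so far is the per-processing-group section of the Axon Spring Boot properties, something like this (the group name is made up, and I haven’t verified that this actually limits the connection count):

axon:
  eventhandling:
    processors:
      my-processor-group:   # hypothetical processing-group name
        mode: tracking
        thread-count: 2     # caps the number of worker threads for this TEP
        batch-size: 10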

Edit: The above might no longer be relevant, but I left it there just in case.

As a troubleshooting step, this afternoon I increased our Hikari idleTimeout, connectionTimeout, and leakDetectionThreshold values via the Spring Boot datasource properties, roughly like this (values illustrative):
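
spring:
  datasource:
    hikari:
      maximum-pool-size: 20             # values here are illustrative
      connection-timeout: 60000         # ms; the default of 30000 matches the 30s timeouts above
      idle-timeout: 600000              # ms
      leak-detection-threshold: 60000   # ms; disabled (0) by default

The result: I no longer see any SQLTransientConnectionException exceptions, but I’m now seeing the following errors in our client application logs: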

Aggregate cannot handle command [com.[…]$MyCommand], as there is no entity instance within the aggregate to forward it to.

Command ‘com.[…].commands$MyCommand’ resulted in org.axonframework.commandhandling.CommandExecutionException(Failed to acquire lock for identifier(dd428d7a-9537-4895-971e-f179b123a20e), maximum attempts exceeded (6000))

Command ‘com.[…].commands$MyCommand’ resulted in org.axonframework.commandhandling.CommandExecutionException(Cancelled by AxonServer due to timeout)

I see that there are a few forum posts on here containing these error messages, so I will look into those tomorrow. But any guesses as to what the root cause is?

My situation is: our QA environment had been up and running for 2–3 years before I joined the company, and now I’m just trying to run it on my local machine for the first time, with the exact same events that are in QA. So the idea here is that I’m not adding anything that isn’t already present (and presumably, already working) in our QA environment.

Ok.
To start, I have to admit I’m not an expert in Streaming processor configuration, but I do know that there is a big difference in behavior between the TrackingEventProcessor and the PooledStreamingEventProcessor. I think raising these questions in a fresh thread (with a matching subject) would help draw attention. Also, if I’m not mistaken, you’re discussing this on the Slack support channel as well, and Allard has posted a reply there. I suggest you continue that discussion there.

Cheers,
Bert

By the way, I finally figured out the problem after all!

The fix: I had to manually export & import all of the records from our QA environment’s "public".tokenentry table into the same table in my local Postgres database.

Explanation: As I understand it, and please correct me if I’m wrong:

  • as I mentioned previously, my local environment contains all of the events from our company’s QA environment
  • and — as with any local “first startup”, regardless of what’s in the event store — when Axon client apps start up for the first time, they look inside the "public".tokenentry table to get their bearings
  • and because my local table was empty, my client apps wouldn’t find any records, and therefore would start processing all of the events in the event store from the beginning, and save a new token record inside this table to track their progress
  • interestingly, my client apps would then process the events through all of their processor streams, including (importantly, in our case) any processor groups marked as @DisallowReplay, of which we had many, all annotated that way because they contain event handlers that dispatch new commands (which we don’t want executing again, as the events those commands originally produced are already in our event store)
  • and so, because Axon Framework was pushing all events through these “forbidden” processing streams, a pile of new commands got dispatched, which in turn resulted in a pile of errors and a batch of new, “duplicate” events being applied

Looking back, I think the biggest clue that everyone overlooked was: why are my client apps dispatching commands on their first startup?? :slight_smile: (They should be processing events, and nothing more)

After having said all of that, I’m left with two final questions:

Question #1: Could I request that the online documentation be updated to include this crucial step (i.e., exporting & importing records from "public".tokenentry) when setting up a clone of an existing Axon environment? It does not work without this step (unless you have a codebase without any @DisallowReplay code!).

Question #2: Initially I was surprised to find that Axon Framework was executing event handlers annotated with @DisallowReplay, but after doing some tests, I realised this is likely by design / outside of Axon Framework’s control in a situation like this: it doesn’t know that it is “replaying events”; it just thinks these events are coming in for the first time. Is this correct? Put another way, NOT having a tracking token for a given processor inside the "public".tokenentry table (or deleting it from the table) is akin to some kind of “super reset”, then? All the more reason, in my mind, to include in the online documentation the vital step of exporting & importing the "public".tokenentry records into the new database when cloning / migrating environments. :slight_smile:

Congrats on finding the real problem and solution! When I discussed this with a colleague from the Framework team, he confirmed this behavior, which is actually described on the page concerning Streaming Event Processors. I guess we could use a top-level page on backing up and restoring Axon Framework applications that links to all the relevant bits for the different details.

So on “Question #1”, the answer is: can you please indicate which pages you consulted, so we can verify they’re not missing important facts? If you feel like it, feel free to use your experience to suggest content for this new top-level page. :wink:

As for “Question #2”: we think this is sufficiently explained on the page I linked above. Can you provide us with suggestions on how we could improve it?

Cheers,
Bert

Hi Bert, thanks for linking to that section of the SEP docs. I’m sure that I must’ve looked over that section at some point during the past couple of weeks, but I probably didn’t understand what I was reading at the time. It all makes a lot more sense to me now, having “lived” the experience for myself. :slight_smile:

I think a top-level page on backing up and restoring (or perhaps, “cloning”?) Axon Framework applications would be a good idea. And yes, I would be happy to suggest content for such a page. Should I fork and create a PR to the “reference-guide” repo?

Regarding which pages I consulted during my task, most of the relevant information that I ended up using came from the Backups page (I did consult the Recovery page, but I ended up not needing these steps for my particular task). As you know, my main objective was less about backing up & recovering an Axon environment (we already have Kubernetes backups for that in both QA and Prod), and more about cloning an Axon environment to use for development purposes. I guess another way of putting it is that backing up & recovery is a superset of the steps that I used. So putting that all together, I think we need the following changes to the Axon docs:

  • the pages on backing up & recovering an Axon environment should explicitly make reference to:
    • exporting & importing the tracking token records into the new environment’s database (mandatory), and maybe also
    • exporting & importing the data protection keys, for test environments only (unless developers are ok with seeing null for any fields marked as @PersonalData and @DeepPersonalData)
  • a new section / page on cloning an Axon environment for development purposes, which makes reference to some of the steps from the backup & recovery process

On that note, one final thought: when chatting with a colleague yesterday I realised that with one simple modification, the above process could also be used to create a privacy-compliant clone of a Prod environment onto a developer’s local machine, as long as any sensitive customer data has been properly encrypted using the @PersonalData and @DeepPersonalData annotations:

  • perform all of the steps documented above (i.e.: copy the “.events” and “.snapshots” files, export & import tracking token records, etc.), but
  • DON’T export & import the data protection key records from Prod into your local database (or, simply copy over the single key for your own personal Prod test account, etc.)

That way, you will have a full “Prod-like” environment, but with all sensitive customer data properly scrambled, or “cryptographically-shredded” as Axon puts it!