How to diagnose production issues

edreyer · October 2, 2021, 2:23am

Hello,

We are at a phase with our development where we are comfortable developing with Axon (still early days for us).

Before taking this to production, our team has a number of questions about how to diagnose production issues. Everyone on the team has deep experience with traditional Spring Boot apps that use JPA and an RDBMS or NoSQL system as the system of record.

We have zero experience administering and diagnosing issues in a production environment using an event sourced system like Axon.

Is there any documentation or examples, or really anything, that can help give us the confidence we need to put this system into production?

Some questions:

Axon server is down. What now?
Axon server seems to be running, but we’re losing some (or all) events. How do we diagnose?
For a business logic question, what’s the best way to peek into the system to get the current state of various entities? (Similar to how we’d do this with a RDBMS)
If we find that the system is somehow in an inconsistent state (e.g. due to a bug), how can we fix that? (e.g. in an RDBMS you can always run a SQL statement to fix the state.)

I’m sure there are other questions, but this gives the general idea.

Thanks for any info you can provide.

Bert_Laverman · October 6, 2021, 9:34am

Hi Erik, good questions!

To start with the first point: provided Axon Server is still responding to HTTP requests, you can get detailed information from the so-called “Health Actuator” at “/actuator/health”. If it isn’t responding, your only recourse is a restart. If it is responding, we need to distinguish between Axon Server SE and EE, because EE is typically deployed as a (Raft-based) cluster, potentially serving multiple contexts. With EE, “DOWN” is a per-context state and you’ll need to look at the information provided by the actuators and the UI to determine root causes and possible actions. With Axon Server SE, the possible causes are more limited and most likely related to disk space or networking issues. Generally, you’ll want to stay ahead of such scenarios and ensure you have monitoring installed. Axon Server is built on Spring Boot and through that supports exporting metrics to tools such as Prometheus.
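As a rough illustration of polling that health endpoint from a script or a small monitoring job, here is a minimal sketch in plain Java (11 or later). The class name, the "UNREACHABLE" and "UNKNOWN" labels, and the base URL `http://localhost:8024` are assumptions for the example, so adjust them to your deployment. It also assumes the overall status appears before any per-component status in the JSON body, which is how Spring Boot normally renders it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Axon Server or Axon Framework.
class AxonServerHealthProbe {

    // Picks up the first "status" field in the body (assumed to be the overall status).
    private static final Pattern STATUS =
            Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Extracts the first status value from a health response body, or "UNKNOWN". */
    static String parseStatus(String body) {
        Matcher m = STATUS.matcher(body);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    /** Calls /actuator/health; returns "UNREACHABLE" if no HTTP response arrives. */
    static String fetchStatus(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/health"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            // A DOWN status is still a valid HTTP response, so the body is parsed either way.
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return parseStatus(response.body());
        } catch (IOException e) {
            return "UNREACHABLE";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "UNREACHABLE";
        }
    }

    public static void main(String[] args) {
        // Assumed base URL; pass your own as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8024";
        System.out.println("Axon Server health: " + fetchStatus(baseUrl));
    }
}
```

An "UNREACHABLE" result corresponds to the "not responding to HTTP" case above; anything else means the actuator answered and the details in the response body are worth reading.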

For the second, my first response would be to ask how the Event Store is configured, because applying an event should throw an exception if it cannot be stored, and once it is stored it cannot get lost. Please note that the messaging functionality in Axon Server is not a generic Message Bus; it is part of the (Axon Framework supported) CQRS and Event Sourcing infrastructure. This implies that an Event Store must be configured and events can always be replayed from there. “Losing Events” could only happen if they are skipped, or if the component generating those Events ignores Exceptions.

In a CQRS application, you can use State Stored Aggregates, which implies you could request that state from the database. However, since all commands (normally) result in Events notifying the rest of the system of the changes to the aggregate, you could also build the current state using a replay of those Events and get rid of the database. This can also be used for the command handler itself and is what we call Event Sourcing. The more natural way to “peek” at the current state is by building a View Model based on the Events, which can be tailored and optimized to the kind of Queries that are going to be run against it. This way of structuring your application is what the acronym CQRS refers to; it stands for “Command Query Responsibility Segregation”.
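To make the two ideas concrete, here is a minimal sketch in plain Java. It deliberately does not use the Axon Framework API; the `BalanceChanged` event, the account ids, and the method names are all made up for illustration. `replayBalance` rebuilds one aggregate's state from its events (Event Sourcing), while `projectBalances` builds a query-friendly View Model from the same stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain-Java stand-in for an event stream, not Axon code.
class EventReplaySketch {

    /** Hypothetical event: an account balance changed by `delta` (negative = withdrawal). */
    static final class BalanceChanged {
        final String accountId;
        final long delta;

        BalanceChanged(String accountId, long delta) {
            this.accountId = accountId;
            this.delta = delta;
        }
    }

    /** Event sourcing: derive one aggregate's current state by replaying its events. */
    static long replayBalance(List<BalanceChanged> events, String accountId) {
        long balance = 0;
        for (BalanceChanged event : events) {
            if (event.accountId.equals(accountId)) {
                balance += event.delta;
            }
        }
        return balance;
    }

    /** View model: fold all events into a map (accountId -> balance) that queries can read. */
    static Map<String, Long> projectBalances(List<BalanceChanged> events) {
        Map<String, Long> view = new HashMap<>();
        for (BalanceChanged event : events) {
            view.merge(event.accountId, event.delta, Long::sum);
        }
        return view;
    }

    public static void main(String[] args) {
        List<BalanceChanged> store = Arrays.asList(
                new BalanceChanged("acc-1", 100),
                new BalanceChanged("acc-2", 50),
                new BalanceChanged("acc-1", -30));
        System.out.println("acc-1 by replay: " + replayBalance(store, "acc-1"));
        System.out.println("view model: " + projectBalances(store));
    }
}
```

In a real application the view model would usually live in its own store and be kept up to date by an event handler, so "peeking" at current state becomes an ordinary query against that store.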

Lastly, if you want to fix an inconsistent state, you need to provide the ability to send commands to fix that. Having an unmodifiable audit trail of all changes thanks to the Event Store is one of the strong points of CQRS and Event Sourcing applications. That said, there are ways to actually change the Event Store files on disk, but this is something that should be done with extreme care.
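A minimal sketch of what such a correcting command could look like, again in plain Java rather than the Axon Framework API. The `Account` aggregate, the `BalanceCorrected` event type, and the incident reference are hypothetical. The point it tries to show is that the fix is appended as a new event with a reason attached, so the history (including the mistake) stays intact.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: names and structure are assumptions, not Axon code.
class CompensationSketch {

    static final class Event {
        final String type;
        final long delta;
        final String reason;

        Event(String type, long delta, String reason) {
            this.type = type;
            this.delta = delta;
            this.reason = reason;
        }
    }

    /** Hypothetical event-sourced aggregate with an append-only event log. */
    static final class Account {
        private final List<Event> log = new ArrayList<>();

        /** Regular business command. */
        void deposit(long amount) {
            log.add(new Event("Deposited", amount, ""));
        }

        /** Compensating command: records the difference as a new, explained event. */
        void correctBalance(long expectedBalance, String reason) {
            long diff = expectedBalance - balance();
            if (diff != 0) {
                log.add(new Event("BalanceCorrected", diff, reason));
            }
        }

        /** Current state is always derived from the full log. */
        long balance() {
            long sum = 0;
            for (Event event : log) {
                sum += event.delta;
            }
            return sum;
        }

        /** Read-only view of the audit trail. */
        List<Event> history() {
            return Collections.unmodifiableList(log);
        }
    }

    public static void main(String[] args) {
        Account account = new Account();
        account.deposit(100);
        account.deposit(100); // pretend a bug recorded this deposit twice
        account.correctBalance(100, "duplicate deposit, incident INC-1 (made-up reference)");
        System.out.println("balance: " + account.balance());
        System.out.println("events in history: " + account.history().size());
    }
}
```

Compared with editing rows in a database, the correction itself becomes part of the audit trail, and downstream consumers see it as just another event.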

I hope this helps. Let me know if you have further questions or want to set up a call to discuss this in more detail. I would also recommend looking at the training courses available in the Academy and trying to join the Axon Server Online training sessions. There is one coming up next week, on Wednesday and Thursday, October 13th to 14th, in the mornings (9:00-11:30 CET). We have another one in November in the afternoons, and an evening one in December.

Cheers,
Bert Laverman

edreyer · October 6, 2021, 8:59pm

Thanks Bert! Really great writeup and very helpful!

I’d like to find out a bit more about this bit:

“Lastly, if you want to fix an inconsistent state, you need to provide the ability to send commands to fix that.”

I’m wondering what the best practices are here. If people end up with an inconsistent state due to a bug, a partial outage, or really for any reason, you’re saying that we need to author a code path in the product that will allow us to compensate for the problem.

In practice, this seems like we’d have to go through a whole development cycle where a story is created, a developer picks up the story to author a solution, the code is then QA’d in a staging environment, then finally pushed to production where that fix could be executed.

If a data inconsistency results in a high priority/severity problem that requires a compensating command to correct, how are teams handling this when time is of the essence and this full dev cycle may take more time than a business can afford?

Thanks,
Erik

Bert_Laverman · October 7, 2021, 1:07pm

Erik, thank you! Too long since the last time I was able to climb on a soapbox.

Ok, there are a few things at play here. Believe me, what I’m writing about is backed by experience. I have worked on systems used for county administration and later at an insurance company, and I have been involved in precisely that type of situation you describe. More than once!

Basically, what this type of procedure does is introduce an unaudited data fix. If that doesn’t sound scary enough: what if other systems receive the initially incorrect data and start processing it? We can hope they fail due to the inconsistencies, but they are just as likely to introduce downstream problems. As a result, we potentially need to fix a large set of data stored in several locations, while processing of the faulty data has already started and been logged. The result is that the audit logs we have already produced will no longer match the data.

Having a “traditional” database allows us to do such a fix outside of the application, but the application itself needs to be fixed also, so we are looking at two high-prio fixes. These are likely to claim senior developers (as an alternative to a lot of tests), plus business involvement to verify the resulting data. So the costs of this emergency fix are high! Please note that with a CQRS (Axon Framework) application we could still choose this approach by using state-stored aggregates.

If we look at the common practice for highly-audited systems, then a data fix is “just another transaction”. Fixing bugs by “just” fixing the data is risky, and we still need to fix the bug.

What I’m working towards is actually old hat: we need to drive down the cost of change, and the biggest factor is the speed with which we can move changes to production. This starts with introducing DevOps practices for automated builds and testing, supported by using Agile “methodologies”. Those quotes are on purpose, because (as I’ve seen in reality) just forcing all IT Project Managers to become Scrum certified generally doesn’t provide long-term benefits. You need to bring the horizon of change closer, so the sizes of individual changes go down and deployments to production can happen more often. And, lo and behold, this will also drive down the size of deployed components, so we don’t get hung up on all the dependencies between different parts of our codebase, the so-called “Ball of Mud.” Tests become a lot easier also, and automation of tests using Cucumber brings us towards the point that even Business will accept that they cover sufficient ground for a production push, because they can specify them in (relatively) normal language, without having to resort to coding. And then the surprise when someone looks at the changed codebase and asks when you switched to a Micro-Services Architecture…

Cheers,
Bert

edreyer · October 12, 2021, 5:14pm

Thanks Bert!

Very helpful. Thanks so much for the thoughtful response. It seems, as you suggest, that you want to create as streamlined a process as possible so that you can get bug fixes to production quickly and efficiently.

With state-stored aggregates, you could deploy the bug fix and a DB patch (via Flyway, or similar) to fix any broken data.

With event-sourced aggregates, apart from fixing the original bug, you’d create a codepath through Axon to perform compensating operations. I could see how, at times, those codepaths could be short-lived, e.g. you add them to a release to help correct some data, and remove them in a subsequent release after the data has been corrected.

The event-sourced approach to fixing a bug seems a bit more complex, but not all that different really.

Cheers,
Erik