Best practices for zero downtime deployments / event replay / rebuilding projections

I had a look around this forum and the official documentation, but I couldn’t find much information on how to approach / set up blue-green environments for zero-downtime deployments. This is especially important for our team, as past event replays have taken up to 2 hours to finish in our Production environment.

Below is what I had in mind for a zero-downtime deployment process:

  • Background: our system is essentially just a REST API that receives incoming HTTP requests, processes them, and sends back a response. Our Axon Server, DBMS, and client applications all run inside of k8s pods.
  • introduce a new message queueing (MQ) layer in our backend that caches incoming HTTP requests for a defined amount of time (the length of the deployment)
  • when deployment begins, pause / queue up incoming HTTP requests until the Axon event store has finished processing any in-progress transactions, and make a copy of the latest “.events” and “.snapshots” files in the primary Axon k8s pod (a rough sketch of such a “pause gate” follows this list)
  • once the “in use” event store files have been copied, resume sending HTTP requests to the “live” / green environment (returning any responses back to the clients), and queue up a duplicate set of these HTTP requests for the blue environment to receive once it’s ready
    • (and obviously copy over all the other .events and .snapshots files after this too)
  • next, spin up a new “dark” / blue environment the same way we do now, including, if applicable, any event replays / projection refreshing that we need to run (this part could take hours to complete)
  • once the new blue environment is online, bring it up to date with green by “flushing the queue” / sending it all of the HTTP requests we have queued up
  • once blue’s queue is empty, cut over to the new environment (i.e. start forwarding HTTP responses from blue instead of green) and take down the old environment
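For reference, here’s a rough sketch of how I imagine the “pause gate” part could look, assuming a Spring Boot filter in front of the client apps (Spring Boot 3 / jakarta.servlet; on older versions it would be javax.servlet). The class name and the pause/resume triggers are placeholders, and blocked requests hold a container thread, so this only covers a short pause window:

```java
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Blocks (and thereby queues) incoming requests while the deployment copies the
// event store files. Blocked requests occupy container threads, so this only
// works for a short pause window; a real MQ layer would be needed for longer ones.
@Component
public class DeploymentGateFilter extends OncePerRequestFilter {

    // An open latch (count 0) lets requests straight through.
    private final AtomicReference<CountDownLatch> gate =
            new AtomicReference<>(new CountDownLatch(0));

    // Called by the deployment pipeline, e.g. via an internal endpoint (not shown).
    public void pause() {
        gate.set(new CountDownLatch(1));
    }

    public void resume() {
        gate.get().countDown();
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        try {
            gate.get().await(); // park the request until resume() is called
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            response.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        filterChain.doFilter(request, response);
    }
}
```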

Is the above a good approach, and / or is there a better way of achieving what we’re after?

I don’t really understand why you need to pause/copy/resume/replay the server. If you need a new projection, or want to change an existing one, you could use a new name for the processing group and run a separate app that creates the new projection before switching over.

The nice thing about CQRS is that the events stay the same. So you can simply read all the events from the beginning, while processing continues.
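For example, something like this minimal sketch (assuming Axon 4.x with Spring Boot; the event, entity, and repository names are made up, only @ProcessingGroup and @EventHandler are real Axon annotations):

```java
import org.axonframework.config.ProcessingGroup;
import org.axonframework.eventhandling.EventHandler;
import org.springframework.stereotype.Component;

@Component
@ProcessingGroup("order-summary-v2") // new name => new tracking token => replay from the start
public class OrderSummaryV2Projection {

    private final OrderSummaryV2Repository repository; // placeholder repository for the new read model

    public OrderSummaryV2Projection(OrderSummaryV2Repository repository) {
        this.repository = repository;
    }

    @EventHandler
    public void on(OrderPlacedEvent event) {
        // Build the new read model in its own table; the existing
        // "order-summary" processor keeps serving queries meanwhile.
        repository.save(new OrderSummaryV2(event.getOrderId(), event.getTotal()));
    }
}
```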

Hi,
I hope you don’t mind if I add a rather obvious plug here for a licensed version of Axon Server. It uses consensus-based clustering to not only allow zero-downtime upgrades but also prevent unplanned downtime due to Kubernetes maintenance actions. Don’t forget, if k8s wants to upgrade a Worker node, it will vacate that node, causing temporary downtime while the Pod migrates.

Using a 3-node cluster (odd numbers guarantee a majority is possible) will get rid of all of the above. Depending on how much formal support you want, the costs can be pretty acceptable compared to the hoops you have to jump through with clever scripting… (which I have done my bit of in the past)

Cheers,
Bert

Hi @Gerard, sorry but I don’t follow… Could you please elaborate a bit more? Ideally, I’d like to simply modify our deployment process to create a zero-downtime “cut-over”, such as with a blue-green environment, and not have to change processing group names or deploy some kind of temporary “helper application” alongside our other client applications every time we want to refresh projections (if I’m understanding you correctly). I want the code that we deploy to be 100% production code, every time, and not temporary / helper code.

@Bert_Laverman we’re already using the licensed “Enterprise Edition”, and our Axon Server is deployed as a 3-node cluster, each node running inside a k8s pod. Can you please explain how a clustered Axon Server would allow us to rebuild projections “in the background” as I described above?

You could have all that in code. It doesn’t need to be a separate application; you could use a profile. And you can script the deployment. So first deploy the new ‘blue’ app with a migration profile that just builds up the new projection. Once some endpoint reports it’s done, replace both the ‘green’ app and the migrating ‘blue’ app with a ‘blue’ app that no longer has the migration profile.
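A minimal sketch of such a “done” endpoint could look like this (assuming Axon 4.x; the processor name and path are placeholders, while EventProcessingConfiguration, StreamingEventProcessor, and EventTrackerStatus#isCaughtUp() are real Axon APIs):

```java
import org.axonframework.config.EventProcessingConfiguration;
import org.axonframework.eventhandling.StreamingEventProcessor;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class RebuildStatusController {

    private final EventProcessingConfiguration processingConfiguration;

    public RebuildStatusController(EventProcessingConfiguration processingConfiguration) {
        this.processingConfiguration = processingConfiguration;
    }

    @GetMapping("/internal/rebuild-status")
    public boolean rebuildFinished() {
        // "Done" here means every segment of the new projection's processor
        // has caught up with the head of the event store.
        return processingConfiguration
                .eventProcessor("order-summary-v2", StreamingEventProcessor.class)
                .map(processor -> processor.processingStatus().values().stream()
                        .allMatch(status -> status.isCaughtUp()))
                .orElse(false);
    }
}
```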

@0xFFFF_FFFF Your posting title starts with “Best practices for zero downtime deployments.” That is what my comment targeted. You also mentioned copying event store files in the “primary Axon k8s pod”, which sounded to me like a single-instance setup. What I missed is that you are only talking about deployments of the Axon Framework apps, not Axon Server.

@Bert_Laverman Gotcha, sorry if I wasn’t clear initially. To reiterate my original question, I’m (still) looking for a way to deploy a full, typical Axon Server & Axon Framework-based backend — which in our case means:

  • “the Axon Server”, an Enterprise Edition 3-node Axon Server cluster
  • “the database”, a single postgres DBMS
  • “the client apps”, of which we have 6

…in such a way that results in zero downtime for the end user (where “zero downtime” is defined as no data lost, all requests answered within, say, 10 seconds).

Where Axon Server / Axon Framework comes into the picture: this deployment process should also provide a way to do event replays & projection rebuilding (which takes over 2 hours in our Prod environment), also with zero downtime.

So, given all of this information, does my proposed deployment solution described in my original post above make sense, and / or is there a better way to do it?

Ok, in that case I would suggest considering separate contexts rather than fully separate deployments of Axon Server. Copying/cloning the Event Store can then happen “locally”, which should improve speed. Also, there is then no need to deploy and switch over to a different Axon Server cluster.

Likewise, given the coordinated fashion of your process, you could consider copying the PostgreSQL database with the view models rather than rebuilding them, given that the Event Processor tokens stand for “a moment in time.” Assuming your view-model-building handlers can deal with events that were already processed, this approach should considerably reduce the “rebuilding time.”
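For example, a handler written in upsert style tolerates replays of events that are already in the copied database. A minimal sketch (the entity, event, and repository types are placeholders):

```java
import org.axonframework.eventhandling.EventHandler;
import org.springframework.stereotype.Component;

@Component
public class OrderSummaryProjection {

    private final OrderSummaryRepository repository; // placeholder Spring Data-style repository

    public OrderSummaryProjection(OrderSummaryRepository repository) {
        this.repository = repository;
    }

    @EventHandler
    public void on(OrderPlacedEvent event) {
        // Upsert rather than insert: replaying an event that the copied
        // database already contains simply overwrites the row with the
        // same values instead of failing on a duplicate key.
        OrderSummary summary = repository.findById(event.getOrderId())
                .orElseGet(() -> new OrderSummary(event.getOrderId()));
        summary.setTotal(event.getTotal());
        repository.save(summary);
    }
}
```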

As a Software Architect, I would even go a step further. The trigger for this is that you appear forced to build a complete environment next to your existing one and carefully coordinate a switch-over. This indicates to me that the size of your “minimal deployable unit” is practically the entire application, which is holding you back.

I was thinking along the lines of what GitHub is doing. Here the software, structured in small components, is aware of the possibility of having multiple versions of the same component running. The old version keeps working normally, while the new version’s output is compared with the old rather than stored. When enough test time has verified that the results are as expected, a (feature) switch makes the new version leading.

This is definitely not a simple change, but it could be an approach to strive towards.
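A very rough sketch of that compare-instead-of-store idea, not tied to any Axon API (all names are placeholders):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Objects;
import java.util.function.Supplier;

// Runs the old and new version side by side, logs any mismatch, and only
// returns the new result once the (feature) switch makes it leading.
public class ShadowedQuery<T> {

    private static final Logger log = LoggerFactory.getLogger(ShadowedQuery.class);

    private final Supplier<T> oldVersion;
    private final Supplier<T> newVersion;
    private final boolean newVersionLeading; // the (feature) switch

    public ShadowedQuery(Supplier<T> oldVersion, Supplier<T> newVersion, boolean newVersionLeading) {
        this.oldVersion = oldVersion;
        this.newVersion = newVersion;
        this.newVersionLeading = newVersionLeading;
    }

    public T execute() {
        T oldResult = oldVersion.get();
        try {
            T newResult = newVersion.get();
            if (!Objects.equals(oldResult, newResult)) {
                log.warn("Shadow mismatch: old={} new={}", oldResult, newResult);
            }
            if (newVersionLeading) {
                return newResult;
            }
        } catch (Exception e) {
            log.warn("Shadow version failed", e);
        }
        return oldResult;
    }
}
```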

Cheers,
Bert Laverman


We had the same requirements, and we built a Kubernetes Operator to automate the orchestration of the Blue and Green services. Let me try to explain.

The k8s operator drives the process. It tells the Blue and Green services which phase of the process they’re in. The phases are SCHEDULED, INITIALISING, SERVING, STANDING_BY, CLEANING_UP, and OBSOLETE.
Those phases make the services operate in different modes. They are injected as env variables that we then use as Spring Profiles. We make sure only the service in the SERVING phase has @QueryHandlers enabled. A newly deployed Green service would be marked as INITIALISING. During this phase, it reads the event store and creates a new projection from scratch; the service initialises itself. The phase usually ends when the event processor has reached the head of the event store and the service has run some tests that verify everything went fine. It then notifies the operator that it’s ready.
The operator would then mark this new service as SERVING and the previous one (Blue) as STANDING_BY.
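To illustrate the SERVING gate, a minimal sketch (assuming the operator injects the phase as an active Spring profile, e.g. SPRING_PROFILES_ACTIVE=SERVING; the query, entity, and repository types are placeholders):

```java
import java.util.List;

import org.axonframework.queryhandling.QueryHandler;
import org.springframework.context.annotation.Profile;
import org.springframework.stereotype.Component;

// Only registered when this instance is in the SERVING phase; the INITIALISING
// and STANDING_BY instances simply have no query handlers.
@Component
@Profile("SERVING")
public class OrderQueryHandler {

    private final OrderSummaryRepository repository; // placeholder read-model repository

    public OrderQueryHandler(OrderSummaryRepository repository) {
        this.repository = repository;
    }

    @QueryHandler
    public List<OrderSummary> handle(FindAllOrdersQuery query) {
        return repository.findAll();
    }
}
```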

Each service suffixes the event processor name and the database name with either “-blue” or “-green”. (Actually, we use major.minor numbers instead of blue or green but forget that for simplicity). This suffix is what allows the services to read the event store from the beginning and create their own projection in a new database that started empty.
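And a minimal sketch of the name suffixing, assuming Axon 4.x and a hypothetical “deployment.color” property; the projection marker class is a placeholder too, while EventProcessingConfigurer#assignHandlerTypesMatching is a real Axon API:

```java
import org.axonframework.config.EventProcessingConfigurer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BlueGreenProjectionConfig {

    @Autowired
    public void configure(EventProcessingConfigurer configurer,
                          @Value("${deployment.color}") String color) {
        // e.g. "order-projection-blue": a processor name nobody used before has
        // no tracking token yet, so this service replays the full event stream
        // into its own (equally suffixed, initially empty) database.
        configurer.assignHandlerTypesMatching(
                "order-projection-" + color,
                type -> OrderProjection.class.isAssignableFrom(type));
    }
}
```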

If we detect any problem, we can rollback Green and the operator will mark Blue again as SERVING.
If everything went ok, after some configurable time the operator would move the STANDING_BY Blue service into the CLEANING_UP phase where it should delete its database.

Hope you get the general idea.

Do you think it would work for your use case too?


Hey Rafa,
That sounds like a nice approach, and it would also work with different contexts rather than different Axon Server instances, as (from the client app’s perspective) the only difference would be a change in the “axon.axonserver.context” property rather than “axon.axonserver.servers”. Note that in either case, the change in value would require the app to restart to pick up the new value. AFAIK a value changed at runtime would not be noticed and would not cause connections to be closed and re-opened.

Bert, we don’t change the context or servers. The new service starts a new projection in its database by changing the event processor name. Both the old and new services read the same events in the same context.