I’m working on making my single-node Axon application support zero-downtime software upgrades. One sticking point is that there will be a period where the new version is running and accepting requests while the old version is still shutting down; “shutting down” can include finishing in-flight requests to external services, and those can still cause Axon events to be published and commands to be issued.
For event-sourced aggregates, this isn’t a big deal; the existing conflict detection logic is sufficient. But for sagas, I could theoretically run into a case where the old and new versions both deliver an event to the same saga instance at the same time, each loads and stores the saga concurrently, and one update silently overwrites the other, resulting in data loss.
One approach is to use AMQP or something similar to publish all events, at which point the old version can just stop listening to the queue. But that will be a substantial performance hit; my application uses lots of ephemeral events that don’t otherwise ever need to be serialized or sent over the wire anywhere.
Another approach that seems like it would solve my problem would be some kind of exclusive locking mechanism for sagas that I can activate for the duration of the handoff from the old version to the new (which should ordinarily be on the order of seconds, not hours) and then turn off afterwards. Given the short duration I’m concerned with, the performance impact of acquiring and releasing the locks is not really important; if saga event processing performance drops by 90% for 30 seconds during a code deploy, that’s fine with me. The solution also doesn’t have to scale to massive numbers of events per second or deal with network partitions or any of the other things that make distributed locking a hard problem.
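To make the toggle concrete, here’s roughly what I’m picturing. This is just a sketch: UpgradeWindow is a name I made up, and how it actually gets flipped (JMX, an admin endpoint, watching a file) is still an open question on my end.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical process-wide "we're in a deploy handoff" flag.
    public final class UpgradeWindow {

        private static final AtomicBoolean ACTIVE = new AtomicBoolean(false);

        private UpgradeWindow() {
        }

        // Both versions turn this on just before the handoff starts.
        public static void enter() {
            ACTIVE.set(true);
        }

        // Turned off once the old version has fully drained and exited.
        public static void exit() {
            ACTIVE.set(false);
        }

        public static boolean isActive() {
            return ACTIVE.get();
        }
    }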
In Axon 2, it seems like the easiest approach would be to subclass the saga SQL schema (my application uses the JDBC saga repository) and have it issue SELECT ... FOR UPDATE instead of a plain SELECT while the “I’m doing an upgrade” flag is turned on. I’m less sure about Axon 3, but something similar seems like it should work there too. Since I’m using asynchronous event delivery, in theory I don’t think a lock would remain held across multiple event handlers and cause deadlocks.
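For the Axon 2 side, I was picturing something along these lines. Again just a sketch, not tested: it assumes the sql_loadSaga(Connection, String) override point on GenericSagaSqlSchema and the default SagaEntry table/column names, both of which I’d want to double-check against the exact 2.x version I’m on.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.axonframework.saga.repository.jdbc.GenericSagaSqlSchema;

    public class LockingSagaSqlSchema extends GenericSagaSqlSchema {

        @Override
        public PreparedStatement sql_loadSaga(Connection connection, String sagaId)
                throws SQLException {
            if (!UpgradeWindow.isActive()) {
                // Normal operation: same plain SELECT as the stock schema.
                return super.sql_loadSaga(connection, sagaId);
            }
            // During the handoff window, take a row lock so the old and new
            // versions can't load and rewrite the same saga concurrently.
            // Column list assumed to mirror the generic schema's query.
            PreparedStatement statement = connection.prepareStatement(
                    "SELECT serializedSaga, sagaType, revision FROM SagaEntry "
                            + "WHERE sagaId = ? FOR UPDATE");
            statement.setString(1, sagaId);
            return statement;
        }
    }

The idea would be to hand this schema to the JdbcSagaRepository in place of the generic one, so that outside the upgrade window the behavior is exactly what it is today.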
Has anyone done anything like this before? Is that naive approach going to come back to bite me later, or does it sound workable for my very constrained needs?
-Steve