I’m working on making my single-node Axon application support zero-downtime software upgrades. One sticking point is that there will be a period where the new version is running and accepting requests while the old version is still shutting down; “shutting down” can include finishing in-flight requests to external services, and those can still cause Axon events to be published and commands to be issued.
For event-sourced aggregates, this isn’t a big deal; the existing conflict detection logic is sufficient. But for sagas, I could theoretically run into a case where the old and new versions both deliver an event to the same saga instance at the same time, each loads and stores the saga concurrently, and one update silently overwrites the other, resulting in data loss.
One approach is to use AMQP or something similar to publish all events, at which point the old version can just stop listening to the queue. But that will be a substantial performance hit; my application uses lots of ephemeral events that don’t otherwise ever need to be serialized or sent over the wire anywhere.
Another approach that seems like it would solve my problem would be some kind of exclusive locking mechanism for sagas that I can activate for the duration of the handoff from the old version to the new (which should ordinarily be on the order of seconds, not hours) and then turn off afterwards. Given the short duration I’m concerned with, the performance impact of acquiring and releasing the locks is not really important; if saga event processing performance drops by 90% for 30 seconds during a code deploy, that’s fine with me. The solution also doesn’t have to scale to massive numbers of events per second or deal with network partitions or any of the other things that make distributed locking a hard problem.
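To make the toggle concrete, here’s roughly what I’m picturing. This is just a sketch: UpgradeWindow is a name I made up, and how it actually gets flipped (JMX, an admin endpoint, watching a file) is still an open question on my end.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical process-wide "we're in a deploy handoff" flag.
    public final class UpgradeWindow {

        private static final AtomicBoolean ACTIVE = new AtomicBoolean(false);

        private UpgradeWindow() {
        }

        // Both versions turn this on just before the handoff starts.
        public static void enter() {
            ACTIVE.set(true);
        }

        // Turned off once the old version has fully drained and exited.
        public static void exit() {
            ACTIVE.set(false);
        }

        public static boolean isActive() {
            return ACTIVE.get();
        }
    }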
In Axon 2, it seems like the easiest approach would be to subclass the saga SQL schema (my application uses the JDBC saga repository) and have it issue SELECT ... FOR UPDATE instead of a plain SELECT while the “I’m doing an upgrade” flag is turned on. I’m less sure about Axon 3, but something similar seems like it should work there too. Since I’m using asynchronous event delivery, in theory I don’t think a lock would remain held across multiple event handlers and cause deadlocks.
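For the Axon 2 side, I was picturing something along these lines. Again just a sketch, not tested: it assumes the sql_loadSaga(Connection, String) override point on GenericSagaSqlSchema and the default SagaEntry table/column names, both of which I’d want to double-check against the exact 2.x version I’m on.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.axonframework.saga.repository.jdbc.GenericSagaSqlSchema;

    public class LockingSagaSqlSchema extends GenericSagaSqlSchema {

        @Override
        public PreparedStatement sql_loadSaga(Connection connection, String sagaId)
                throws SQLException {
            if (!UpgradeWindow.isActive()) {
                // Normal operation: same plain SELECT as the stock schema.
                return super.sql_loadSaga(connection, sagaId);
            }
            // During the handoff window, take a row lock so the old and new
            // versions can't load and rewrite the same saga concurrently.
            // Column list assumed to mirror the generic schema's query.
            PreparedStatement statement = connection.prepareStatement(
                    "SELECT serializedSaga, sagaType, revision FROM SagaEntry "
                            + "WHERE sagaId = ? FOR UPDATE");
            statement.setString(1, sagaId);
            return statement;
        }
    }

The idea would be to hand this schema to the JdbcSagaRepository in place of the generic one, so that outside the upgrade window the behavior is exactly what it is today.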
Has anyone done anything like this before? Is that naive approach going to come back to bite me later, or does it sound workable for my very constrained needs?
-Steve