Some of my sagas track interactions with remote systems. If my application is restarted, some of those interactions are guaranteed to no longer be active, e.g., REST API requests that were in flight at the time of the restart.
Right now I’m dealing with this using Quartz-backed watchdog timer events. Whenever a saga calls out to a service, I first schedule a “call timed out” event, which I cancel when the service finishes. (The service classes run asynchronously and send their results to the Axon event bus so the sagas don’t block.)
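For illustration, here is a minimal sketch of that watchdog pattern in plain Java. It makes no claim to match my actual code: a ScheduledExecutorService stands in for Quartz, a plain callback stands in for publishing the “call timed out” event on the Axon event bus, and the names CallWatchdog, callStarted and callFinished are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for the Quartz-backed watchdog described above.
final class CallWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> pending = new HashMap<>();
    private final Consumer<String> onTimeout; // stand-in for the "call timed out" event

    CallWatchdog(Consumer<String> onTimeout) {
        this.onTimeout = onTimeout;
    }

    // Schedule the timeout before the remote call is made.
    synchronized void callStarted(String callId, long timeout, TimeUnit unit) {
        ScheduledFuture<?> future =
                scheduler.schedule(() -> fire(callId), timeout, unit);
        pending.put(callId, future);
    }

    // Cancel the timeout when the service reports a result.
    // Returns false if the call is unknown or has already timed out.
    synchronized boolean callFinished(String callId) {
        ScheduledFuture<?> future = pending.remove(callId);
        if (future == null) {
            return false;
        }
        future.cancel(false);
        return true;
    }

    // Runs on the scheduler thread; only fires if the call is still pending.
    private synchronized void fire(String callId) {
        if (pending.remove(callId) != null) {
            onTimeout.accept(callId);
        }
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

Note that the pending map in this sketch lives in memory and is lost on restart, whereas a persistent Quartz job store survives it. That difference is why the real timeouts still fire after a restart, only later than I would like.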
I want this in place anyway to handle legitimate timeouts where the remote service is unresponsive. It also works as a recovery mechanism for restarts, but it isn’t ideal: if the timeout period is long, recovery is delayed unnecessarily, because the saga has to wait out the full timeout before it learns that the call died with the old process.
One approach I was thinking of taking is to publish an “application starting up” event from my initialization code, using the server ID as the association property. All saga classes that fire off asynchronous tasks, such as remote server requests, would listen for the event. Any saga that starts an asynchronous operation would associate itself with the ID of the server it is currently running on, so that after a restart it receives the startup event and can immediately clean up any operations that were in progress. It would remove the association once it has no more pending operations on that server. There would probably also have to be a “server is offline” event with similar semantics for cases where we’re reducing the size of a cluster.
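To make the bookkeeping concrete, here is a rough model of the idea in plain Java. It is only a sketch under my own assumptions: in the real thing the association would go through Axon’s saga association mechanism rather than an in-memory map, and the names ServerAssociations, operationStarted, operationFinished and serverStarting are invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of "associate the saga with its server ID while it has
// pending asynchronous operations there".
final class ServerAssociations {
    // serverId -> sagaId -> IDs of operations still pending on that server
    private final Map<String, Map<String, Set<String>>> byServer = new HashMap<>();

    // Called when a saga fires off an asynchronous operation on a server.
    synchronized void operationStarted(String serverId, String sagaId, String opId) {
        byServer.computeIfAbsent(serverId, k -> new HashMap<>())
                .computeIfAbsent(sagaId, k -> new HashSet<>())
                .add(opId);
    }

    // Called when the operation completes; drops the association once the
    // saga has nothing left pending on that server.
    synchronized void operationFinished(String serverId, String sagaId, String opId) {
        Map<String, Set<String>> sagas = byServer.get(serverId);
        if (sagas == null) return;
        Set<String> ops = sagas.get(sagaId);
        if (ops == null) return;
        ops.remove(opId);
        if (ops.isEmpty()) sagas.remove(sagaId);
        if (sagas.isEmpty()) byServer.remove(serverId);
    }

    synchronized boolean isAssociated(String serverId, String sagaId) {
        Map<String, Set<String>> sagas = byServer.get(serverId);
        return sagas != null && sagas.containsKey(sagaId);
    }

    // Models the "application starting up" (or "server is offline") event:
    // returns every saga with abandoned operations on that server and clears
    // the associations so each saga can clean up immediately.
    synchronized Map<String, Set<String>> serverStarting(String serverId) {
        Map<String, Set<String>> abandoned = byServer.remove(serverId);
        return abandoned == null ? Collections.emptyMap() : abandoned;
    }
}
```

In this model, a saga with two pending operations on "server-1" shows up in the result of serverStarting("server-1") with both operation IDs, while a saga whose operations all finished before the restart is not notified at all.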
Is that a reasonable approach or are there hidden gotchas? What other techniques have people found work well?
-Steve