Strategies for dealing with server restarts?

Some of my sagas track interactions with remote systems. If my application is restarted, some of those interactions are guaranteed to not be active any more, e.g., REST API requests that were in flight at the time of restart.

Right now I’m dealing with this using Quartz-backed watchdog timer events. Whenever a saga calls out to a service, I first schedule a “call timed out” event, which I cancel when the service finishes. (The service classes run asynchronously and send their results to the Axon event bus so the sagas don’t block.)
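
The watchdog pattern described above can be sketched with plain `java.util.concurrent` standing in for Axon's Quartz-backed event scheduler; the class and method names here are hypothetical, not Axon APIs — in Axon itself you'd schedule a timeout event via the `EventScheduler` and cancel it with the returned token.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: schedule a "call timed out" action before invoking a
// remote service, and cancel it when the service finishes in time.
public class WatchdogSketch {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive just for the watchdog
                return t;
            });

    // Schedule the timeout action before making the remote call.
    public static ScheduledFuture<?> scheduleTimeout(Runnable onTimeout, long timeoutMillis) {
        return SCHEDULER.schedule(onTimeout, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        ScheduledFuture<?> watchdog =
                scheduleTimeout(() -> System.out.println("call timed out"), 5_000);
        // ... the remote call completes in time, so cancel the watchdog ...
        System.out.println("watchdog cancelled: " + watchdog.cancel(false));
    }
}
```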

This is something I want to have in place to handle legitimate timeouts where the remote service is unresponsive. As a recovery mechanism for restarts it also works, but it isn’t ideal: if the timeout period is long, recovery is unnecessarily delayed.

One approach I was thinking of taking was publishing an “application starting up” event in my initialization code, using the server ID as the association property. All saga classes that fire off asynchronous tasks like remote server requests would listen for the event. Any saga that started an asynchronous operation would associate itself with its current server ID, and would thus be able to immediately clean up any in-progress operations after a restart. It would remove the association when there were no more pending operations on a given server. There would probably also have to be a “server is offline” event with similar semantics for cases where we’re reducing the size of a cluster.
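
A minimal sketch of the association idea described above: sagas with pending asynchronous work register an association on the current server ID, and the "application starting up" event for that ID would be routed to all of them so they can clean up immediately. All names here are hypothetical, not Axon APIs; in Axon this bookkeeping would be the saga manager's association mechanism.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of routing a "server started" / "server offline" event
// to every saga that has in-progress operations on that server.
public class ServerAssociationSketch {
    private final Map<String, Set<String>> sagasByServerId = new HashMap<>();

    // Called when a saga starts an asynchronous operation on this server.
    public void associate(String sagaId, String serverId) {
        sagasByServerId.computeIfAbsent(serverId, k -> new HashSet<>()).add(sagaId);
    }

    // Called when a saga has no more pending operations on the server.
    public void removeAssociation(String sagaId, String serverId) {
        Set<String> sagas = sagasByServerId.get(serverId);
        if (sagas != null) {
            sagas.remove(sagaId);
        }
    }

    // On "application starting up" (or "server is offline"), these sagas
    // would receive the event and cancel their in-progress operations.
    public Set<String> sagasToNotify(String serverId) {
        return sagasByServerId.getOrDefault(serverId, Collections.emptySet());
    }
}
```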

Is that a reasonable approach or are there hidden gotchas? What other techniques have people found work well?

-Steve

Hi,

the first thing that comes to mind is that this approach adds a lot of complexity.

How about blocking a shutdown while calls are in progress? Do these calls really last that long?

Cheers,

Allard

When I say “restart” I’m including “the server crashes/gets rebooted unexpectedly.” Actually that’s the case I’m most worried about since we have no control over when and how often it happens, though obviously it’s not a daily occurrence. For controlled shutdowns we initiate ourselves, we can be more graceful and wait for everything to drain.

The interactions range from a couple hundred milliseconds to several minutes depending on what we’re doing and which service provider we’re talking to.

-Steve

Hi Steven,

I don’t think there is a general one-solution-fits-all approach here. Your saga is triggered by an event message. The processing the saga does may or may not be executed completely transactionally (e.g. communication with external systems, changing state, etc). It is really up to you to decide if a message should be processed at-most-once or at-least-once.
In the first case, acknowledge a message when it is received. In the latter, acknowledge it when processing completes. If you use a message broker (e.g. RabbitMQ), you can configure this on the MessageListenerContainer. If you use a SimpleEventBus, you’re always processing in the thread that also processed the command, usually within the same transaction.
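
The difference between the two acknowledgement strategies can be sketched with a plain in-memory queue standing in for the broker; the method names are illustrative, not RabbitMQ or Axon APIs.

```java
import java.util.Deque;
import java.util.function.Consumer;

// Illustrative contrast: where you acknowledge determines what a crash costs you.
public class AckSketch {
    // At-most-once: acknowledge (remove) before processing, so a crash
    // during processing loses the message.
    public static void consumeAtMostOnce(Deque<String> queue, Consumer<String> handler) {
        String msg = queue.poll();              // ack first
        if (msg != null) {
            handler.accept(msg);                // then process
        }
    }

    // At-least-once: process first, acknowledge after, so a crash during
    // processing leaves the message to be redelivered (possibly processed twice).
    public static void consumeAtLeastOnce(Deque<String> queue, Consumer<String> handler) {
        String msg = queue.peek();
        if (msg != null) {
            handler.accept(msg);                // process
            queue.poll();                       // ack only after success
        }
    }
}
```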

I don’t think the basic approach to this changes with the use of CQRS.

Cheers,

Allard

Hi Steven,

I realize this is a really old thread, but I've been thinking along similar lines to what you lay out here, so I'd be really interested in what kind of solution you ended up using.

Cheers,
Faik

We ended up not implementing the “server started” event. After analyzing the problem further, we reached the same conclusion Allard did: it added too much complexity to be worth the benefit. Unexpected server restarts do happen (and we’ve had them in production) but they are pretty rare in practice, and the timeout events that we need to schedule anyway to handle unresponsive external services are adequate for crash recovery. Recovery is slower than it could theoretically be, but not by a big enough margin to make it worth maintaining a bunch of event handlers everywhere.

However, I should add that the reason we can get away with this is because we implemented a graceful shutdown process that smoothly hands off work to the rest of the hosts in the application cluster. Unexpected restarts are rare, but planned ones are very common. Shutting down gracefully is kind of a subtle problem that has a decent amount of application-dependent logic and depends on how the application distributes work across its cluster, but leaving aside those details, we:

  1. Stop accepting new client requests
  2. Wait for any existing client requests to finish
  3. Stop the event scheduler from delivering any new scheduled events
  4. Set our load factor to 0 on the distributed command bus so other nodes stop sending us new commands and so we send all locally-generated commands to other nodes
  5. Wait for any outbound requests to external services to finish
  6. Wait for the local command bus to finish any queued commands
  7. Wait for the local event bus to finish any queued events
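
The sequencing of the steps above can be sketched as stages shut down in flow order (client request -> command -> service call -> event): each stage stops accepting new work and drains before the next stage is touched, so no stage can generate more work for the ones shut down after it. The `Stage` interface here is illustrative; the real steps are application-specific.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of draining shutdown stages upstream-first.
public class GracefulShutdownSketch {
    public interface Stage {
        String name();
        void stopAccepting();   // refuse new work (e.g. steps 1, 3, 4 above)
        void awaitDrain();      // wait for in-flight work (e.g. steps 2, 5, 6, 7)
    }

    // Stages must be listed upstream-first (client requests before commands, etc.).
    public static List<String> shutDownInOrder(List<Stage> stages) {
        List<String> drained = new ArrayList<>();
        for (Stage stage : stages) {
            stage.stopAccepting();
            stage.awaitDrain();
            drained.add(stage.name());
        }
        return drained;
    }
}
```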

We’re still on Axon 2.4, so some of the details might differ on newer Axon versions, but the general concept should still apply: follow the general flow of control in the application (client request -> command -> service invocation -> event) and shut down each step as it finishes its work so it can’t generate more work for the subsequent steps. Getting all that to work right did require a fair bit of careful design and no small amount of trial and error.

-Steve

Thanks Steve for taking the time to describe your process. These are some really valuable insights.

Cheers,
Faik