Best approach for a persistent retry mechanism

So, there is a saga event handler, which fires a bunch of commands asynchronously. Once it has fired all commands, it considers the event handled and that's it. Now I can only rely on my command handlers to actually finish the job. Each of these command handlers calls a third-party API, which can be down. I have a RetryScheduler configured, so each command is retried properly, but if I kill my server, it does not continue any retries on startup, and the event handler looks successful even though some commands were never executed.

So, I was thinking of removing that RetryScheduler and publishing a TransientExceptionEvent inside the command handler, so that I can have an event handler for the retry. I know that when I restart my server it will continue handling events where it left off, but I feel like I am reinventing the wheel. Is there a proper way of doing retries that is crash-proof?

I don’t fully understand the use case where you want to retry something but are okay with it not being reliable. If I want to retry, I want to be absolutely sure that it completes no matter what happens; otherwise, my event chain will be stuck in an awkward spot.

just my idea:

Your saga could, when processing event (A), schedule a deadline event using the deadline manager (an Axon feature). Your event handler could then send command (B).
Meanwhile, your saga waits for an event (B) resulting from the command (B) it has sent.
When processing event (B), you can unschedule the deadline event (using the deadline manager) and then close the saga.
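The flow above can be sketched as a small state machine. This is a hypothetical plain-Java simulation, not Axon API: the field names (`scheduledDeadlineId`, `lastCommandSent`) stand in for what a real saga would do through `DeadlineManager` and `CommandGateway` with `@SagaEventHandler` methods.

```java
import java.util.Optional;

// Hypothetical sketch of the saga flow: schedule a deadline on event A,
// send command B, then cancel the deadline and close on event B.
class SagaSketch {
    boolean active = false;
    Optional<String> scheduledDeadlineId = Optional.empty();
    String lastCommandSent = null;
    private int deadlineCounter = 0;

    // On event A: schedule a deadline, then send command B.
    void onEventA() {
        active = true;
        scheduledDeadlineId = Optional.of("deadline-" + (++deadlineCounter));
        lastCommandSent = "CommandB";
    }

    // On event B (confirmation from the command handler):
    // cancel the pending deadline and close the saga.
    void onEventB() {
        scheduledDeadlineId = Optional.empty();
        active = false;
    }
}
```

Because both the deadline schedule and the saga state are persisted by the framework, a crash between event A and event B does not lose the pending retry.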

Axon can work with a deadline manager based on Quartz, which persists to the database.

If you do not want to use Quartz, there is an alternative:
Create a view based on event (A). On event (B), update your view to mark the work as done.
Use the Spring scheduler to query the view every N seconds (or hours) and (re)send command B for all work that has not been done yet.
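The polling alternative can be illustrated with a minimal in-memory sketch. `WorkView` here is a hypothetical stand-in for a projection built from event (A); in a real application `poll()` would run inside a Spring `@Scheduled` method and dispatch commands through the `CommandGateway` rather than returning strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical view-polling sketch: track started work, mark it done on
// confirmation, and re-send command B for anything still outstanding.
class WorkView {
    private final Map<String, Boolean> done = new LinkedHashMap<>();

    void onEventA(String workId) { done.putIfAbsent(workId, false); } // work started
    void onEventB(String workId) { done.put(workId, true); }          // work confirmed

    // Would run on a schedule: (re)send command B for every unfinished item.
    List<String> poll() {
        List<String> resent = new ArrayList<>();
        done.forEach((id, isDone) -> {
            if (!isDone) resent.add("CommandB(" + id + ")");
        });
        return resent;
    }
}
```

Since the view is rebuilt from persisted events, a restart simply resumes polling and the outstanding work is retried again.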

Thanks, that’s interesting. So, you suggest removing the RetryScheduler and only using the deadline feature instead? Otherwise, I’m not sure I can prevent the RetryScheduler AND the deadline event from sending the same API request in parallel. Or I could set the deadline so far out that the RetryScheduler would be exhausted by then.

No indeed, I would use the deadline feature and not the retry scheduler.

The fact that you need to keep retrying until a certain condition is met is a business transaction, managed by the saga.
You will need to get some confirmation from your external API that the call was handled correctly (or something you can check). And your external API needs to be idempotent, because in the worst case your saga sent command B, the deadline manager fired the expiration event, and the saga sent command B again. (Your saga needs to listen to the expiration event that it registered earlier with the deadline manager.)

Your saga will schedule the deadline, but also immediately send the required command.
If executing the API call was successful, you should capture that with an event, to which your saga can listen, so that it can be closed.
But if this does not happen, the deadline manager will fire an expired event, to which your saga must also be listening.
The saga will then send the command again (and schedule a new expiration with the deadline manager).
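The retry loop on expiration, including the idempotency point raised above, might look like this. It is a hypothetical sketch with invented names; a real Axon saga would call `commandGateway.send(...)` and `deadlineManager.schedule(...)` inside a `@DeadlineHandler` instead of the plain method below.

```java
import java.util.UUID;

// Hypothetical sketch of the retry-on-expiration loop. Every retry re-sends
// command B with the SAME idempotency key, so the external API can safely
// deduplicate if an earlier attempt actually succeeded.
class RetryingSagaSketch {
    final String idempotencyKey = UUID.randomUUID().toString();
    int attempts = 0;
    boolean closed = false;

    // Called on saga start AND again each time the deadline expires.
    String sendCommandB() {
        attempts++;
        // Real saga: send the command, then schedule a fresh deadline here.
        return "CommandB[key=" + idempotencyKey + ", attempt=" + attempts + "]";
    }

    // Confirmation event B arrived: cancel the deadline and close the saga.
    void onEventB() { closed = true; }
}
```

The stable key across attempts is what makes "send it again on expiration" safe even when the first call succeeded but its confirmation was lost.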

Also bear in mind that the persistent implementation of Quartz is not very accurate. There is no point scheduling a deadline in 5 seconds, because it might fire after 5 minutes; Quartz (the JDBC implementation) is just not that accurate.
But if you schedule a deadline in 1 minute and your application goes down for 5 minutes and comes back up, the deadline manager will still fire the event.


Thanks! So, I’ve decided to keep the RetryScheduler only for transient errors of Axon and the database. For third-party systems I publish a TransientErrorEvent and retry manually, with each retry accompanied by a Quartz deadline (retry interval + 1 min for the third party to respond). So, if my DB is down, I retry in-memory; if my third-party system is down, I also retry in-memory, but Quartz makes sure I keep doing that even if I go down and come back up. Very satisfied with that solution, feels indestructible! :smiley:
