Async Saga

Now that I have moved all of the validation to sagas I am seeing a dramatic increase in speed of the system. The pattern I'm using for my sagas is to start/react based upon domain events and then schedule a "Retry" event in 5ms to being the actual validation piece.

The SagaEventListener for the retry does the following:

- cancel the scheduleToken if it is not null
- schedule another retry event in 5 minutes time
- in a try/catch block:
-- communicate with external source to validate data
-- end the saga if the validation is successful

So I am ending the saga within the try/catch. And the previous schedule will automatically retry on any errors.

In a normal flow this seems to work great. But I like to push things to stress it a little and I found a crack. So by sequentially dispatching the commands that would come from the UI the first validation saga causes a deadlock. If I change the schedule time from 5ms to 2s then it is usually starts after the last command has been dispatched.

So I looked into async sagas and added an "<axon:async />" element to my saga manager. I supplied it my Spring JPA transaction manager and a dedicated thread pool defined as "<task:executor id=".." pool-size="15" />". No other parameters are given to the async element.

The following stacktrace is thrown:

Caused by: org.quartz.SchedulerException: Unable to unschedule trigger [DEFAULT.6da64b5bd2ee-011b7330-9262-4b79-a481-851858a8712c] while deleting job [AxonFramework-Events.event-e8cd543d-3297-4c9a-936b-86a80b2a5ed5]
  at org.quartz.core.QuartzScheduler.deleteJob(QuartzScheduler.java:948)
  at org.quartz.impl.StdScheduler.deleteJob(StdScheduler.java:292)
  at org.axonframework.eventhandling.scheduling.quartz.QuartzEventScheduler.cancelSchedule(QuartzEventScheduler.java:128)

At this point the Quartz scheduler is dead and holding onto resources. Even the Oracle tables are locked. The only way to stop the servlet container is to kill the java process from the system.

I moved everything in the "retry" SagaEventListener into individual try/catch blocks and this at least doesn't kill the Quartz scheduler.

So am I missing something wrt the setup of the saga manager or is it simply best practice to have quite defensive code in a Quartz job?

It certainly does help to write out a situation to be able to see the problem. This one was a head slapping moment for me.

I am calling “end()” within my code to validate from an external system. In that validate method I’m dispatching a command relating to the success of that validation. And here it is… I should be listening for when the aggregate sends out the “Valid” and “Invalid” events and THEN end my saga.

I love the event-driven architecture and I’m reaping in the benefits on the UI side. Once in a while I just need to remind myself not to think procedurally.

Randy.

Hi Randy,

I am glad you solved the problem. I did have a look into this issue yesterday, but didn’t get to sending a response.

It seems that deadlocks in Quartz are not uncommon, but are mostly related to incorrect use. At Quartz, they also state that some databases (especially Oracle RAC) have some difficulties with locking. There’s some info on their FAQ: http://quartz-scheduler.org/documentation/faq#FAQ-cmtDead

The event-driven approach is quite addictive. I am doing two projects concurrently at the moment (besides Axon), one is EDA, the other is ‘traditional’. The traditional approach often leads to issues that are a lot easier to solve in an EDA. And then I am not even talking about (perceived) performance in the UI…

Cheers,

Allard

Thanks for the links, Allard. I have previously read the issue and that said I’m quite confident that my setup is correctly configured for Oracle.

That said I have noticed a new problem in that sagas are not starting under simulated stress by executing five (5) contracts. I have two sagas which manage different lifecycles but both start with the same domain event. One of the sagas always starts and is always in the SAGA_ENTRY table. The other saga will generate and persist a few but then the rest are never started. And then I have three validation sagas which start based upon the changing of a field. Two of these sagas always start but the third is hit and miss in even starting.

I came across an older post from you where you had suggested creating an individual saga manager for each saga. I just tried that and now all my sagas are starting, executing and ending appropriately. I expanded the stress simulation by creating 100 contracts and then individually verified that each contract was properly validated by the 3 validation sagas and that each one had 1 of each of the two lifecycle sagas. While I’m content in managing sagas with an individual managers I suspect that it does bring my Axon (and not Quartz) configuration into question.

Is managing multiple sagas in a single async saga manager something that I am pushing too much?

Randy.

Hi Randy,

although there might be a cause and effect relationship between this issue and the quartz problem, I don’t think they are directly related.

Whether you choose to use a single Saga manager for multiple saga types, or one per type is a configuration choice. An async saga manager uses a fixed number of threads (typically 2, but it’s configurable). Some sagas need to perform better than others, so some will have their own manager, while others share one.

So the fact that behavior is different in these scenario’s makes me believe there is an issue somewhere. I’ll look into it. Meanwhile, using 1 saga type per manger sounds like a viable workaround.

Cheers,

Allard

Hi Randy,

I had some time to kill today on my way to Denmark, so I looked into the issue. It didn’t take long to reproduce, and the fix was also very simple. Finding the cause was a bit harder, though.

Anyway, a new snapshot has been deployed that contains the fix for the async saga problem.

Cheers,

Allard

That’s fantastic Allard! Thank you for looking into this and for making the change so quickly.

Hi Allard,

So I continue to come up against issues with sagas which causes the saga manager to crash. Here is the desired workflow:

  • user sets the customer code for order
  • creates saga to validate customer code against external system (uses RabbitMQ to communicate)
  • saga receives result and fires command for either found or not found
  • aggregate processes command and generates events for customer code found or not found
  • saga listens for found or not found events and wraps with an @EndSaga annotation

The user will only likely set the customer code only once. However the quality assurance team likes to push the boundaries with what-ifs and therefore changes the customer code a number of times. After only a few changes the system will no longer validate customer codes. Even on new order aggregates. It doesn’t even create the saga.

From the logs I see that the saga manager is going to each thread (I’m continuing to use async sagas) and is starting a Unit of Work with multiple threads for this single event. I would on occasion also get exceptions about committing an saga that have already ended.

So as a test I removed the association values in the @EndSaga annotated methods. This allowed the user to change the customer code a number of times without having any issues. The worker threads would only create on unit of work. And no more exceptions.

But this just seemed to prolong the cause of the issue which I haven’t nailed down yet. Give it a day of users making multiple changes and eventually the saga manager will stop registering sagas. And by not registering I mean: the saga class isn’t instantiated and therefore not persisted or executing events.

I still have individual saga managers for each saga. Each manager has its own task executor. All methods which communicate to external services are wrapped in try/catch.

I’m at a loss for what to do next other than rewriting the validation to not use sagas.

Any assistance would be greatly appreciated.

Randy.

Hi Randy,

this sounds like a nasty one, and I’ll be looking into this. But before I do, are you aware of the fact that @StartSaga will only cause a Saga to be invoked if no saga of that type already exists that is associated with that event? There is an attribute one the @StartSaga annotation that allows you to force creation.

What Saga repository do you use?

Cheers,

Allard

Hi Allard,

I am using the JPA repository as defined using the <axon:jpa-saga-repository /> element.

I do believe that I understand how the sagas are matched to handle events. Is it true that I can have any number of saga definitions (as classes not instances) that are started from the same type of event/association value, ie: ValidateClientSaga.java, ValidateWidgetSaga.java, etc fired with the same OrderCreatedEvent and associated with the same orderId?

That said, here is how my saga fires for validating the client:

  1. @StartSaga assocated with the orderId from an ClientCodeEstablishedEvent
  2. schedule a ValidationScheduleEvent to fire in 50ms (this event is a protected static class defined in the saga class)
  3. @SagaEventHandler associated with the orderId from the ValidationScheduleEvent
  4. try/catch block to validate client with external system; command fired for either found/not found; exception caught and event rescheduled in 60 seconds
  5. @EndSaga associated with orderId from the ClientFoundEvent
  6. @EndSaga associated with orderId from the ClientNotFoundEvent

Let’s assume that the external server is in good working order so that #4 never has to reschedule. And let’s just pick the happy path that client was found.

That means that #5 will execute once the ClientFoundEvent has been fired from the aggregate. Saga should be done. And a query of the SAGA_ENTRY table does prove that the saga has been removed.

The above will also show in the logs that only one Unit of Work was created.

Now let’s change the client code which will create a new saga starting with #1. However now there will be two threads which have a Unit of Work created. There will be two processes which fire up and try to validate the client. And then the original one will throw and exception that it the saga changes cannot be committed as it has already been committed.

And this will continue on and create additional Unit of Work each time I change my client ID.

But as I said this error and additional Unit of Work can be eliminated if, in #5 and #6, I unassociate the saga with all of its associated values. I will do another run through but it’s like once a thread has been associated with a saga it remains associated even after the saga was to have ended.

Also, I am defining my executor as:

<task:executor id=“clientSagaManagerExecutor” pool-size=“30” />

And the saga manager:

<axon:saga-manager id=“clientSagaManager” event-bus=“eventBus” saga-repository=“sagaRepository”>
<axon:async processor-count=“30” executor=“clientSagaManagerExecutor” transaction-manager=“transactionManager” />
axon:types
org.example.processing.ValidateClientSaga
</axon:types>
</axon:saga-manager>

Thanks,
Randy.

Hi,

maybe this should have been the first question to ask: which version are you using (milestone or smapshot)?

Sorry I usually state that. I am using the 2.0-SNAPSHOT from a git pull from yesterday morning.

Hi Randy,

I went through the code last night, and did find some issues with wagerly create Unt of Work. This results in you seeing two uow created on the second event. However, it does not explain the two validate actions being executed.

You said you saw some exceptions. Could you send their stack trace? Maybe they provide some insight in the cause-effect tree.

Cheers,

Allard

Hi Allard,

Instead of using the debugger I placed a log statement and I was encountering only one execution. It was the debugger in IntelliJ that was breaking into the validation code twice in the same thread.

I went back through my notes and found that the stacktrace was due to a null being associated and blowing up on line #146 in JpaSagaRepository.

With regards to this issue it was the log entry produced from line #166 in the JpaSagaRepository.

Randy.

Hi,

after some investigation, I noticed that the problem is probably something else. Unlike the “regular” SagaManager, the async version didn’t properly check for “null” associations. This happens when an event doesn’t return a value for the association property. In your case, clients that were created without the reference number you want to check, will slowly cause the SagaManager’s threads to die. One at a time.

I’ve committed a fix and built a fresh snapshot version. Can you verify if it has solved the problem?

Cheers,

Allard

Hi Allard,

coincidences sometimes are surprising. In recent weeks, we were experiencing a curious behavior by the AsyncAnnotatedSagaManager:
After a few infrastructural exceptions (very limited fortunately) SagaManager suddenly ceased to handle events associated with the Sagas (no matter if new or existing) for no apparent reason. At that point the only option was to restart the application.

After some investigation my colleague has identified the cause of the problem in the FatalExceptionHandler used by default in the Disruptor.
This ExceptionHandler first writes to the log the exception and then raises again causing the end of the active thread.
I think the above is consistent with what you say and with was described in this post https://groups.google.com/d/msg/lmax-disruptor/_qvNtWiP0HM/0UZesotGK6oJ

We believe we have solved the problem by initializing the Disruptor with a custom ExceptionHandler that merely just log the exception, but does not throw it again.
We were testing the fix to be 100% sure that this was indeed the case before opening a related issue when I read your post…

Apart this story I wonder if the use of an ad-hoc ExceptionHandler is safer in the case of exceptions within the EventProcessor.

What do you think about it?

Best Regards,

Domenico

Hi Domenico,

Thanks for sharing your insights on this one.
I think adding a different exception handler to the Disruptor is a good part of the solution. I read that Akka, for example, adopts the ‘let it crash’ ideology. When a Fatal Exception occurs, they basically do a System.exit. I am not sure I’d want the same for Axon. I’d probably want to adopt ‘the show must go on’ ;-).

Another part of the solution will be to make the Async Processor implementation more robust against failure. When persistence fails (with an IO related issue), the disruptor should either block until the connection is back alive, or keep a Saga in memory until it’s capable of storing it in the database.

I will provide a quick fix on the short term, but expect the final 2.0 version to have a more robust solution.

Cheers,

Allard

Hi Allard,

Improving robusteness of sagas against oddities would be a great plus.
I cannot agre more.
BTW considered that the ExceptionHandler interface has the following
signature:

void handleEventException(Throwable ex, long sequence, Object event);

Leveraging the sequence argument maybe it should be possible to
realize a configurable implementation capable of applying the
appropriate strategy depending on the transient nature of the
exception and respecting events ordering.
For example something like a RetryStrategy for the transient
exceptions and an IgnoreAndMoveOn for the non transient ones.

Cheers,
Domenico

Hi Allard,

I’ve been testing your changes since Friday and it has been rock solid. I even reverted the code I had in my sagas to an age of two weeks ago and there has been no breakage for initiating the saga.

Thanks a lot for your attention on this matter.

Randy.