Deadlock using PessimisticLockManager

Shea_Kelly · May 16, 2011, 4:04am

When using the PessimisticLockManager we are getting deadlocks between
different threads processing commands. The system currently processes
commands that come in via a JMS queue, and we have Sagas running that
also use the command bus to dispatch commands. The system freezes and
the deadlock is detected in JConsole.

I think the problem is in the inner DisposableLock class. When a
UnitOfWork is cleaning up the acquired DisposableLock instances,
DisposableLock.unlock() is called and the ReentrantLock is unlocked.
Then disposeIfUnused() is called but cannot execute because another
thread is in the DisposableLock.lock() method, which is synchronised.
What I don't understand is why the ReentrantLock.unlock() call is not
scheduling the other thread so that it releases its lock on the
DisposableLock instance. Perhaps there is a timing issue.

Here are the thread dumps:

Name: taskExecutor-19
State: WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@154fe2f7 owned by: org.springframework.jms.listener.DefaultMessageListenerContainer#0-1
Total blocked: 4 Total waited: 71

Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
org.axonframework.repository.PessimisticLockManager$DisposableLock.lock(PessimisticLockManager.java:118)
- locked org.axonframework.repository.PessimisticLockManager$DisposableLock@3f46b2b0
org.axonframework.repository.PessimisticLockManager$DisposableLock.access$100(PessimisticLockManager.java:95)
org.axonframework.repository.PessimisticLockManager.obtainLock(PessimisticLockManager.java:60)
org.axonframework.repository.LockingRepository.load(LockingRepository.java:117)
com.maptek.minesuite.pluto.domain.planning.PlannedActivityGraphCommandHandler.handle(PlannedActivityGraphCommandHandler.java:153)
sun.reflect.GeneratedMethodAccessor330.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.axonframework.util.Handler.invoke(Handler.java:110)
org.axonframework.util.AbstractHandlerInvoker.invokeHandlerMethod(AbstractHandlerInvoker.java:77)
org.axonframework.commandhandling.annotation.AnnotationCommandHandlerAdapter.handle(AnnotationCommandHandlerAdapter.java:76)
org.axonframework.commandhandling.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:62)
org.axonframework.commandhandling.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:68)
org.axonframework.commandhandling.interceptors.TransactionInterceptor.handle(TransactionInterceptor.java:42)
org.axonframework.commandhandling.DefaultInterceptorChain.proceed(DefaultInterceptorChain.java:60)

Allard · May 16, 2011, 6:39am

Hi,

I’ll need some version info to get started on this. Which version of Axon and Java are you using?

Cheers,

Allard

Shea_Kelly · May 16, 2011, 6:42am

Axon 1.0 release and JDK 1.6.0_21

Allard · May 16, 2011, 7:41am

Hi,

I did a little testing this morning. When I did a thread dump, I saw a pattern similar to the one you saw (2 threads blocking on one thread, that one thread waiting on DisposableLock). So I thought: gotcha! However, it seems that this was not a deadlock at all. A second thread dump showed a completely different configuration, proving that the threads were moving along.

I have modified one of the tests to run for a long time. Even after several runs, I did not manage to get a deadlock. Are you able to reproduce the problem? Can you confirm that multiple thread dumps in a row show a different wait-configuration?
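
As a rough aid for telling a momentary wait apart from a true deadlock, the JVM can be asked directly instead of comparing dumps by eye. The sketch below uses only the standard `java.lang.management.ThreadMXBean`; the class name `DeadlockProbe` is made up for illustration and is not part of Axon. It assumes a JVM that supports monitoring of ownable synchronizers (such as ReentrantLock); on one that does not, `findDeadlockedThreads()` throws `UnsupportedOperationException`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Axon.
class DeadlockProbe {

    /** Describes the threads the JVM considers deadlocked, or returns "" if there are none. */
    static String describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Considers both monitors (synchronized) and ownable synchronizers such as ReentrantLock.
        long[] ids = threads.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder report = new StringBuilder();
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.append(info.getThreadName())
                  .append(" waits for ").append(info.getLockName())
                  .append(" held by ").append(info.getLockOwnerName())
                  .append('\n');
        }
        return report.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

Calling `describeDeadlocks()` a few times in a row from a diagnostic thread gives the same information as repeated thread dumps: threads that are merely queueing for a lock are not reported, while a genuine cycle is reported on every call.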

Cheers,

Allard

Shea_Kelly · May 16, 2011, 7:48am

We have seen this happen a number of times, after which the system never recovers; only a restart fixes it. The threads seem to stay blocked in the configuration I posted.

Shea_Kelly · May 16, 2011, 7:59am

Also, the 2 executor threads are blocked waiting on the JMS thread, but in the dumps posted the JMS thread is waiting on one of the executor threads.

Allard · May 16, 2011, 2:50pm

Ladies and gentlemen… we’ve got him.

I managed to reproduce the problem. There was a slight hint in your stacktrace. The problem seems to occur when an event handler dispatches a command which is handled by the aggregate that also generated the event.

I’ve created an integration test that can detect the problem and managed to fix the issue.
I will do my best to create a 1.0.1 version within a couple of hours. This is a bad enough one to fix fairly quickly.

Thanks a lot for reporting this.

Cheers,

Allard

Shea_Kelly · May 17, 2011, 2:07am

Thanks for the fast turnaround on this, Allard. Unfortunately, after
upgrading to Axon 1.0.1 this morning, the deadlock issue remains.

I have been able to reproduce the issue with another test case in the
ConcurrentModificationTest_PessimisticLocking class. I have emailed you
an svn patch containing the test and associated classes. I have run
the test against the trunk.
Basically, what is happening is that we have commands that load multiple
aggregates for validation. We load the aggregate we are changing and a
secondary aggregate to validate against. We have a few cases like
this. The order in which aggregates are loaded seems to be important.
You will see this in the test.

As a workaround we have serialised the processing of commands. But
obviously this reduces potential throughput.

I suspect the solution may be to change the PessimisticLockManager to
prioritise the acquisition of locks based on the order in which
commands are handled. I am not sure how to achieve this, however.

Allard · May 17, 2011, 6:56am

Hi Shea,

I have some good news, and some bad news for you.

The good news is that the test doesn’t show a bug in Axon itself. The bad news is that the “bug” is in the way you are using the locking mechanism.
What is in fact happening is very simple, and the textbook example of a deadlock:
process 1: lock A, then B
process 2: lock B, then A.
You’ll only need 1 or 2 runs to get the deadlock.
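
The lock ordering above can be reproduced in a small, self-contained sketch. It uses plain `ReentrantLock` instances named A and B rather than Axon's lock manager, and the class and method names are invented for illustration. A `CyclicBarrier` forces the unlucky interleaving on every run, and `tryLock` with a timeout stands in for the blocking `lock()` call so the demo terminates instead of hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only; plain JDK locks, not Axon's PessimisticLockManager.
class LockOrderingDemo {

    /** Locks 'first', waits until the other thread holds its own first lock, then tries 'second'. */
    static boolean lockBoth(ReentrantLock first, ReentrantLock second, CyclicBarrier barrier)
            throws Exception {
        first.lock();
        try {
            barrier.await(); // both threads now hold their first lock
            // A plain second.lock() would block forever here; the timeout keeps the demo finite.
            boolean gotSecond = second.tryLock(200, TimeUnit.MILLISECONDS);
            barrier.await(); // nobody releases anything until both attempts are over
            if (gotSecond) {
                second.unlock();
            }
            return gotSecond;
        } finally {
            first.unlock();
        }
    }

    /** Runs "A then B" against "B then A" and returns how many threads obtained both locks. */
    static int runOpposingOrder() throws Exception {
        ReentrantLock a = new ReentrantLock();
        ReentrantLock b = new ReentrantLock();
        CyclicBarrier barrier = new CyclicBarrier(2);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> process1 = pool.submit(() -> lockBoth(a, b, barrier));
            Future<Boolean> process2 = pool.submit(() -> lockBoth(b, a, barrier));
            return (process1.get() ? 1 : 0) + (process2.get() ? 1 : 0);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("threads that obtained both locks: " + runOpposingOrder());
    }
}
```

Neither thread obtains its second lock, because each one is holding the lock the other needs. With blocking `lock()` calls, as in a pessimistic lock manager, the same interleaving never resolves.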

A very important topic in my workshops is the “Aggregate Boundary”. It is a consistency boundary in which all changes are atomic (definition from DDD). That means that an aggregate must be locked when accessed.
In general, it is considered good practice to execute a command on exactly one aggregate. Of course, that may result in events that trigger more commands on other aggregates. But these are no longer within the same atomic transaction and won’t cause deadlocks.

So the solution is in remodeling your domain, I’m afraid. Either you have split something into separate aggregates that should really be one, or you are processing 2 aggregates atomically, where it should be 2 separate operations. A third solution is to consider the “node visiting” a query process that triggers commands on nodes in a certain state. But then, a command should only influence a single node.

Hope this clarifies the situation a bit. If not, let me know.

Cheers,

Allard

Shea_Kelly · May 17, 2011, 7:10am

Thanks for the clarification. We are only using the other aggregates for validation. No events are raised against them. They are effectively read-only. The aggregate boundary concept makes sense, but does that mean you cannot read other aggregates?

Allard · May 17, 2011, 8:33am

The whole idea is that for reading, you don’t use the aggregates in the command model. You use the query database. The data in there can be optimized (i.e. brought to a specific level of normalization) for the purpose of validation.

Cheers,

Allard

Shea_Kelly · May 22, 2011, 3:58am

I think I understand what you are saying. But let me restate it.

When you need to validate a command in a command handler, you query
the query database (aka query model, view model) for data from other
aggregates.

I have read some posts on the DDD/CQRS Google group and it seems that
querying the query model from the command model is considered bad
practice due to the eventual consistency of the query model, and the
issue is complicated when the query model is scaled out.

What do you think?

A lot of the validation is cross-aggregate association validation. An
example would be a command that creates a new account for a customer.
The customer aggregate identifier is included in the command along
with other details about the account. I want to validate that the
customer aggregate exists. Pretty straightforward. I call
Repository<Customer>.load(customerId) and check if an
AggregateNotFoundException is thrown. If not, the customer exists and
the account is created. Unfortunately, this locks the customer, so if
another command comes in that also does something with the same
account and customer, then bang: deadlock.

A simple Repository.exists method would assist here.

Michael_Schnell · May 22, 2011, 4:46am

Hi sheak,

I have read some post on the DDD/CQRS google group and it seems that
querying the query model from the command model is considered bad
practice due to the eventual consistency of the query model and the
issue is complicated when the query model is scaled out.

This depends only on whether you can live with eventual
consistency or not. If the use case really requires 100%
consistency, I'd simply create a command-side database that keeps
track of the unique keys. An example for a unique email address can be
found here:

http://code.google.com/p/axon-auction-example/source/browse/trunk/auction-command-server/src/main/java/org/fuin/auction/command/server/handler/RegisterUserCommandHandler.java

It uses a "ConstraintSet" that keeps track of the uniqueness of the
email. In this special case I think it would have been also possible
to use the query side but I consider that a matter of taste.
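
To make the idea of a command-side uniqueness store concrete, here is a minimal in-memory sketch. It is not the "ConstraintSet" from the linked example; `UniqueEmailRegistry` and its methods are hypothetical names, and a real deployment would more likely rely on a database table with a unique constraint so that the guarantee survives restarts and works across nodes.

```java
import java.util.Locale;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical command-side store; not the ConstraintSet from the linked example.
class UniqueEmailRegistry {
    private final ConcurrentMap<String, UUID> ownerByEmail = new ConcurrentHashMap<>();

    /** Atomically claims the address; returns false if a different user already holds it. */
    boolean claim(String email, UUID userId) {
        UUID previous = ownerByEmail.putIfAbsent(normalize(email), userId);
        return previous == null || previous.equals(userId);
    }

    /** Frees the address, but only if it is still held by the given user. */
    void release(String email, UUID userId) {
        ownerByEmail.remove(normalize(email), userId);
    }

    private static String normalize(String email) {
        return email.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        UniqueEmailRegistry registry = new UniqueEmailRegistry();
        UUID first = UUID.randomUUID();
        UUID second = UUID.randomUUID();
        System.out.println(registry.claim("user@example.com", first));
        System.out.println(registry.claim("USER@example.com", second));
    }
}
```

The command handler would call `claim` before creating the aggregate and `release` if the creation fails. Because `putIfAbsent` is atomic, two concurrent registrations for the same address cannot both succeed, and no aggregate has to be loaded or locked for the check.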

Cheers,
Michael

Allard · May 23, 2011, 5:11pm

Hi,

I am not able to write long emails at the moment. I will get back to you as soon as I can.

Cheers,

Allard

Allard · May 27, 2011, 9:45am

Hi,

Sorry for the late reaction. Due to some personal circumstances, I wasn’t able to respond any earlier.

When choosing aggregate boundaries, you also choose the consistency boundaries. Per definition. That means that if 100% consistency between two entities is important, they must belong to the same aggregate. If each customer needs an account, and an account must belong to exactly 1 customer, then they probably belong to a single aggregate. But beware, this may very well depend on the context. There might be more than 1 representation of a “Customer”, if you have more than one context.

If entities belong to different aggregates, like your current situation with Customer and Account, you should use information in the query database. I don’t agree with it being a bad practice. My simple rule is always: the aggregate should get all the information it needs to decide what to do. Some of that information is contained in the aggregate itself (because it logically belongs there), the rest should be provided via the command (method). Since commands (in Axon) are objects that are remote in many cases, I like to keep them small, and gather more information in the command handler, which passes all required information to the command method on the aggregate.
Do keep in mind that you don’t really have “the” query model. Each target audience has one, the command side included, for cross-aggregate information sharing. For maintenance purposes, though, I personally let these components share a data-source. I don’t like having 3 identical tables, just because three audiences need the same data.
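
A minimal sketch of that approach for the Customer/Account case might look as follows. All names here (`CustomerView`, `OpenAccountHandler`, `Account`) are hypothetical and no Axon API is used; the view is an in-memory stand-in for a query-side table that event handlers keep up to date.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical read-side view of customers, maintained by event handlers. */
class CustomerView {
    private final Map<UUID, String> nameById = new ConcurrentHashMap<>();

    void onCustomerRegistered(UUID customerId, String name) {
        nameById.put(customerId, name);
    }

    /** Returns the customer's name, or null if the query side does not know the customer. */
    String findName(UUID customerId) {
        return nameById.get(customerId);
    }
}

/** Hypothetical aggregate: it receives everything it needs through its constructor. */
class Account {
    final UUID accountId = UUID.randomUUID();
    final UUID customerId;
    final String holderName;

    Account(UUID customerId, String holderName) {
        this.customerId = customerId;
        this.holderName = holderName;
    }
}

/** Hypothetical command handler: validates against the view, never loads the Customer aggregate. */
class OpenAccountHandler {
    private final CustomerView customers;

    OpenAccountHandler(CustomerView customers) {
        this.customers = customers;
    }

    Account handle(UUID customerId) {
        String name = customers.findName(customerId);
        if (name == null) {
            throw new IllegalArgumentException("Unknown customer: " + customerId);
        }
        // Only the Account aggregate is touched here, so only one lock would be needed.
        return new Account(customerId, name);
    }

    public static void main(String[] args) {
        CustomerView view = new CustomerView();
        UUID customerId = UUID.randomUUID();
        view.onCustomerRegistered(customerId, "Example Customer");
        Account account = new OpenAccountHandler(view).handle(customerId);
        System.out.println("opened account for " + account.holderName);
    }
}
```

The handler reads from the view and then touches a single aggregate, so the two-lock ordering problem cannot arise. The trade-off is the one discussed in this thread: the view may lag slightly behind the command side.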

I have had the discussion of the Repository.exists() method more than once. My counter-question is always: if a customer with a certain ID does not exist… how on earth did you get a hold of that ID? If it’s a user-provided ID, you should always validate it against the query database. Usually, I use UUIDs as aggregate identifiers, and prevent the users from entering them. They usually provide a more human-friendly identifier, typically functionally defined.

The eventual consistency is really not as dangerous and complicated as it seems. In most scenarios, it means that the query database is a few milliseconds behind. Of course you should consider what happens if things go wrong, but don’t over-engineer it. You’ll be fine. If you really won’t be fine, then it’s probably an indication that you chose the wrong aggregate boundaries.

In my workshops, I give hints on how to find aggregate boundaries: think of scenarios. What does a user want to do, and what are side effects? Typically, side-effects can easily be eventually consistent. The user entering the data doesn’t care about it.

Hope this helps.

Cheers,

Allard

Shea_Kelly · June 7, 2011, 2:49am

Thanks, Allard. There are a few things we need to review, I think.