CommandBus.dispatch erratically hangs

coderinabstract · September 6, 2011, 7:48am

This problem is very hard to reproduce.

In an Eclipse environment this problem never occurs, even with tons of testing. However, with a standalone Tomcat or Jetty server and a webapp with Axon on the server side, a call to the commandBus just randomly hangs (i.e. no logs from that call)… Saga processing is happening in the background and I can see its log info. After retrying the same call a few times it starts working and the commandBus progresses OK. I also noticed that only one command handler method shows this behavior; the others don't. Sorry for the vagueness: I first thought this was a client issue, so I had not looked at this call chain before, and I can't think of why it hangs.

This call does, however, manipulate the aggregate and its interactions with Sagas very significantly, and it talks to other bounded contexts in a fairly chatty manner. I will continue to dig further. What bothers me is that this has never shown up in tons of app testing within Eclipse and its embedded web server environment.

Again… any thoughts are welcome, as I am struggling to reproduce this reliably enough to get to a root cause.

Thanks

Allard · September 6, 2011, 5:08pm

Hi,

which version of Axon are you using?
Do I understand that the application hangs, but then continues normally when more requests arrive?
Have you tried using jstack to investigate the call stack when your application hangs? That will give you information about the exact call or lock that the application is waiting for.
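
If attaching jstack to the server process is awkward, roughly the same information can be collected from inside the JVM with the standard `java.lang.management` API. The following is only a minimal sketch: the class name `WaitingThreadReport` and its method are hypothetical helpers, not part of Axon, and the report will also list idle pool threads that are waiting for perfectly normal reasons.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an Axon class): lists threads that are parked,
// waiting or blocked, together with the lock they wait for and its owner.
final class WaitingThreadReport {

    static List<String> describeWaitingThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only ask for lock details the JVM can actually provide.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        List<String> report = new ArrayList<>();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            Thread.State state = info.getThreadState();
            boolean stuckCandidate = state == Thread.State.WAITING
                    || state == Thread.State.TIMED_WAITING
                    || state == Thread.State.BLOCKED;
            if (!stuckCandidate) {
                continue;
            }
            LockInfo lock = info.getLockInfo();        // null if not waiting on a lock
            String owner = info.getLockOwnerName();    // null if no owner is known
            report.add(info.getThreadName() + " [" + state + "]"
                    + " waiting on " + (lock == null ? "(no lock info)" : lock.toString())
                    + ", held by " + (owner == null ? "(unknown)" : owner));
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : describeWaitingThreads()) {
            System.out.println(line);
        }
    }
}
```

Calling `describeWaitingThreads()` from a diagnostic endpoint or a scheduled task while a request is hanging should show which thread, if any, still owns the lock the hanging thread is parked on.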

Cheers,

Allard

coderinabstract · September 6, 2011, 9:34pm

version 1.1.1

tried jstack as suggested… but cannot understand why the calling thread is parked (waiting). I cannot figure out what it is waiting for. The problem is that the command handler is hanging while trying to load the aggregate.

yes… tried it and everything seems to be looking ok… I have attached two logs which show the jstack output and the Axon library debug log. The jstack output shows one instance of the waiting thread. The problem is that the web browser calls this asynchronously on the web server and hangs as well, i.e. in Firebug I see the POST spinning as in progress forever. Very challenging, and I hope to get some insight, as I have never seen this when running within the Eclipse embedded server. I still cannot figure out why it works perfectly within Eclipse but fails on standalone Tomcat or Jetty.

Also, the weird thing is that if I keep hitting the UI button, it eventually executes all the previous tries; however, the hung browser requests keep hanging indefinitely, as observed from the in-progress POSTs in Firebug. Very baffling.

Thanks for all your insight as this is a blocker for progress.

Cheers…

AxonHangingThreadApp.log (28.7 KB)

jstack.log (43.4 KB)

coderinabstract · September 6, 2011, 10:12pm

One more observation… even after repeated tries, when a new thread succeeds and processes the same transaction multiple times (once for each previous wait/failure), the waiting threads are still hanging, i.e. the number of hanging threads keeps growing…

coderinabstract · September 7, 2011, 1:48am

I also realized that Eclipse seems to force everything to run on one thread, as the thread number in the log in the Eclipse console is the same for everything… I think that's the case. However, I am at a serious impasse and this is a major concern. I will be happy to provide any other info you may need, as this feels like an issue with the management of multiple threads, and a complex one.

Cheers…

Allard · September 7, 2011, 7:06am

Hi,

a colleague of mine has seen a similar issue before. They accidentally had an older version of Axon on the classpath as well. Are you 100% sure there are no older versions of Axon on the classpath, next to 1.1.1?
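
One way to check for a duplicate jar from inside the running webapp is to ask the class loader for every location that provides a given class file. This is only a sketch: `ClasspathDuplicates` is a hypothetical helper, and the default class name `org.axonframework.commandhandling.CommandBus` is an assumption about the package layout of the Axon version in use; pass a different name if it does not match.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not an Axon class): finds every classpath location
// that contains the class file for a given fully qualified class name.
final class ClasspathDuplicates {

    static List<URL> locate(ClassLoader loader, String className) {
        String resource = className.replace('.', '/') + ".class";
        try {
            // getResources returns one URL per jar or directory providing the resource.
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed class name; override it with the first program argument.
        String className = args.length > 0
                ? args[0]
                : "org.axonframework.commandhandling.CommandBus";

        // In a webapp the context class loader is the one that sees WEB-INF/lib.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }

        List<URL> locations = locate(loader, className);
        System.out.println(className + " found in " + locations.size() + " location(s)");
        for (URL location : locations) {
            System.out.println("  " + location);
        }
    }
}
```

More than one location in the output would suggest that two copies of the library are visible to the webapp, for example one in WEB-INF/lib and one in the server's shared lib directory.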

Did you use the “-l” (letter L) option on jstack? It is supposed to provide more information about locks.

Cheers,

Allard

coderinabstract · September 7, 2011, 1:27pm

WEB-INF/lib in the webapp has both the Axon core and integration 1.1.1 jars. That's all packaged by Maven in the war; I unzipped the war and also checked it in the web server.

Also, let me elaborate on the use case which makes this aggregate unstable and causes this leaking of hung threads. It feels like the lock logic is missing a thread/event in this scenario and then catching up on the pending events later, but not releasing the threads… something along those lines.

Aggregate A sends an event which Saga A intercepts; Saga A sends a command to Aggregate B, which sends an event that is intercepted by Saga B; Saga B sends a command to Aggregate A, which sends an event; the Updater on the query side intercepts that event and updates the query db. There is nesting happening here.

After all of this, Aggregate A does not get loaded. I have attached the jstack -l output with more detail for reference.

…I am really stuck with a major issue here, and there is no exception to warn of anything wrong/illegal in the app code.

Thanks and cheers…

jstackWithL.log (28.1 KB)

Allard · September 7, 2011, 7:13pm

Hi,

I’ve got some good news. I managed to reproduce, locate and solve the issue.
Technical details: triple nesting of unit of work (which is what happened in your scenario) would not properly call the cleanup callback. And that’s where locks are being released.
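
To illustrate the kind of mechanism described here, the following is a simplified model, not Axon's actual code: all class and method names are made up for this sketch. Each nested unit of work hands its cleanup callbacks to its parent when it commits, and only the outermost unit runs them, so the lock releases survive any nesting depth in this model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative model only; these are NOT Axon's classes.
final class NestedUnitOfWork {

    // Units of work currently active on this thread, innermost on top.
    private static final ThreadLocal<Deque<NestedUnitOfWork>> ACTIVE =
            ThreadLocal.withInitial(ArrayDeque::new);

    private final NestedUnitOfWork parent;
    private final List<Runnable> cleanupCallbacks = new ArrayList<>();

    private NestedUnitOfWork(NestedUnitOfWork parent) {
        this.parent = parent;
    }

    static NestedUnitOfWork start() {
        Deque<NestedUnitOfWork> stack = ACTIVE.get();
        NestedUnitOfWork unit = new NestedUnitOfWork(stack.peek()); // null for the root
        stack.push(unit);
        return unit;
    }

    void onCleanup(Runnable callback) {
        cleanupCallbacks.add(callback);
    }

    void commit() {
        ACTIVE.get().pop();
        if (parent != null) {
            // Nested unit: defer cleanup to the parent. Repeating this at every
            // level means the callbacks reach the root regardless of depth.
            parent.cleanupCallbacks.addAll(cleanupCallbacks);
        } else {
            // Outermost unit: this is where the locks are actually released.
            for (Runnable callback : cleanupCallbacks) {
                callback.run();
            }
        }
        cleanupCallbacks.clear();
    }

    // Demo: nests 'depth' units, each holding one lock; returns locks still held.
    static int locksLeftAfter(int depth) {
        List<ReentrantLock> locks = new ArrayList<>();
        nest(depth, locks);
        int stillHeld = 0;
        for (ReentrantLock lock : locks) {
            if (lock.isLocked()) {
                stillHeld++;
            }
        }
        return stillHeld;
    }

    private static void nest(int remaining, List<ReentrantLock> locks) {
        if (remaining == 0) {
            return;
        }
        NestedUnitOfWork unit = start();
        ReentrantLock lock = new ReentrantLock();
        lock.lock();                     // stands in for an aggregate lock
        locks.add(lock);
        unit.onCleanup(lock::unlock);    // released only when cleanup runs
        nest(remaining - 1, locks);      // inner command handled inside this unit
        unit.commit();
    }
}
```

In this model, a bug that drops the handover at one nesting level would leave that level's lock held forever, which matches the symptom of later callers parking while trying to load the same aggregate.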

I am building a 1.1.2 release right now. It will probably be available in maven central within a few hours.
Let me know if this also solves your issue.

Cheers,

Allard

coderinabstract · September 7, 2011, 9:31pm

Wow…Thanks… that was fast… I tried it and it worked.

Some complexity there… Quick question… I saw the diff… the test case tests a nest of 4 interactions… however, the code change is recursive and should theoretically work for any amount of nesting? Is that a correct assumption?

Thank you again for the quick response and continued innovation on this framework.