We are having repeated issues with Processors (two of them in particular) getting hung randomly in a 2 node deployment. New events still make their way to the domain_event_entry table, but the processors will end up unable to update their token.
QuartzDeadlineManager (non-clustered, I have tried to change the clustering in our QA environment, and that has not resolved the issue)
IntervalRetryScheduler (single thread scheduled executor)
The underlying DB is MySql Aurora. I did notice that the default transaction isolation level is REPEATABLE_READ instead of READ_COMMITTED. I am in the process of changing this and running tests to see if that is the issue.
In the meantime, I was able to re-create one of the hung processors in QA. It was due to a deadlock in the database. Running
show engine innodb status
yeilded the following:
—TRANSACTION 1206527554, ACTIVE 37 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 85499, OS thread handle 0x2ab972fc6700, query id 318671615 appuser statistics
select tokenentry0_.processor_name as processo1_26_0_, tokenentry0_.segment as segment2_26_0_, tokenentry0_.owner as owner3_26_0_, tokenentry0_.timestamp as timestam4_26_0_, tokenentry0_.token as token5_26_0_, tokenentry0_.token_type as token_ty6_26_0_ from token_entry tokenentry0_ where tokenentry0_.processor_name=‘MySagaProcessor’ and tokenentry0_.segment=0 for update
------- TRX HAS BEEN WAITING 37 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 417 page no 120 n bits 3 index
PRIMARY of table
token_entry trx id 1206527554 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 8; compact format; info bits 0
So the saga processor was trying to update its token, which caused a DB deadlock and hung that Processor for hours. It finally cleared out on its own, which I assume at this point is due to a pre-configured DB timeout.
The domain_event_entry table has a little over 1M entries at this point. My current working theory of the problem this: ProcessorA is on instance 1, and ProcessorB is on instance 2. ProcessorA goes to update its token entry, at more or less the same time that ProcessorB attempts to do the same. Then, because the transaction isolation level is set to REPEATABLE_READ, enough rows get locked on the domain_event_entry table by ProcessorA, that it causes ProcessorB's attempt to update its token to timeout (though I don't know why the timeout would take hours to clear out). Does this hypothesis seem plausible? Obviously, the select statement above is not doing any joins to the domain_event_entry table, but I can't understand why a simple update of the tracking token for a Processor would hang it for multiple hours.
There is another processor that is hung as well, but not due to a DB deadlock (at least the DB reports that there are no hung transactions). Regardless, it's token index refuses to update on the token_entry table, even though the timestamp is updating infrequently. Sadly, this is in production, so I don't have a JMX connection to see the status of the thread, but since the timestamp is infrequently updating I assume the thread cannot be completely blocked.
These issues only seem to creep up in a clustered evironment. Previously, Dev and QA were single instance only, and we never saw any of these issues (so it was never caught in testing). This issues are also pretty new, as we have been in production for about 4 months now, largely without issue. So, it would seem likely the problem comes from a combination of 1) clustering and 2) database size.
Is there a good way to diagnose Processor deadlock in Axon? Especially as it pertains to a clustered environment?
Any help is greatly appreciated,