Sagas and Event Processors hanging in clustered deployment

Greetings,

We are having repeated issues with Processors (two of them in particular) randomly hanging in a two-node deployment. New events still make their way to the domain_event_entry table, but the processors end up unable to update their token.

Using:

TrackingEventProcessor
SimpleCommandBus
QuartzDeadlineManager (non-clustered; I have tried changing the clustering setting in our QA environment, and that has not resolved the issue)
IntervalRetryScheduler (single thread scheduled executor)
SpringBoot
Hikari
Hibernate

The underlying DB is MySQL Aurora. I did notice that the default transaction isolation level is REPEATABLE_READ instead of READ_COMMITTED. I am in the process of changing this and running tests to see whether that is the issue.
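
Roughly, the change amounts to forcing the isolation level at the connection-pool level. A minimal sketch, assuming HikariCP is configured programmatically (the JDBC URL and credentials below are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ReadCommittedDataSource {

    // Force READ_COMMITTED on every pooled connection, overriding Aurora's
    // default of REPEATABLE_READ for the token_entry / domain_event_entry access.
    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://aurora-host:3306/MyDatabase"); // placeholder
        config.setUsername("appuser");                                 // placeholder
        config.setPassword("changeme");                                // placeholder
        config.setTransactionIsolation("TRANSACTION_READ_COMMITTED");
        return new HikariDataSource(config);
    }
}

With Spring Boot, setting spring.datasource.hikari.transaction-isolation=TRANSACTION_READ_COMMITTED in the application properties should have the same effect.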

In the meantime, I was able to re-create one of the hung processors in QA. It was due to a deadlock in the database. Running


show engine innodb status

yielded the following:


---TRANSACTION 1206527554, ACTIVE 37 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 85499, OS thread handle 0x2ab972fc6700, query id 318671615 appuser statistics
select tokenentry0_.processor_name as processo1_26_0_, tokenentry0_.segment as segment2_26_0_, tokenentry0_.owner as owner3_26_0_, tokenentry0_.timestamp as timestam4_26_0_, tokenentry0_.token as token5_26_0_, tokenentry0_.token_type as token_ty6_26_0_ from token_entry tokenentry0_ where tokenentry0_.processor_name='MySagaProcessor' and tokenentry0_.segment=0 for update
------- TRX HAS BEEN WAITING 37 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 417 page no 120 n bits 3 index PRIMARY of table MyDatabase.token_entry trx id 1206527554 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 8; compact format; info bits 0


So the saga processor was trying to update its token, which caused a DB deadlock and hung that Processor for hours. It finally cleared out on its own, which I assume at this point is due to a pre-configured DB timeout.

The domain_event_entry table has a little over 1M entries at this point. My current working theory of the problem is this: ProcessorA is on instance 1, and ProcessorB is on instance 2. ProcessorA goes to update its token entry at more or less the same time that ProcessorB attempts to do the same. Then, because the transaction isolation level is set to REPEATABLE_READ, enough rows get locked on the domain_event_entry table by ProcessorA that it causes ProcessorB's attempt to update its token to time out (though I don't know why the timeout would take hours to clear out). Does this hypothesis seem plausible? Obviously, the select statement above is not doing any joins to the domain_event_entry table, but I can't understand why a simple update of the tracking token for a Processor would hang it for multiple hours.

There is another processor that is hung as well, but not due to a DB deadlock (at least the DB reports that there are no hung transactions). Regardless, its token index refuses to update on the token_entry table, even though the timestamp is still being updated, if infrequently. Sadly, this is in production, so I don't have a JMX connection to see the status of the thread, but since the timestamp is updating I assume the thread cannot be completely blocked.

These issues only seem to creep up in a clustered environment. Previously, Dev and QA were single instance only, and we never saw any of these issues (so it was never caught in testing). These issues are also pretty new, as we have been in production for about 4 months now, largely without issue. So, it would seem likely the problem comes from a combination of 1) clustering and 2) database size.

Is there a good way to diagnose Processor deadlock in Axon? Especially as it pertains to a clustered environment?

Any help is greatly appreciated,

The exception messages in the logs for the other Processor are:

`
"org.hibernate.engine.jdbc.spi.SqlExceptionHelper SQL Error: 1062, SQLState: 23000"
"org.hibernate.engine.jdbc.spi.SqlExceptionHelper Duplicate entry '-1756868226' for key 'complete_hash'"

"org.hibernate.internal.ExceptionMapperStandardImpl HHH000346: Error during managed flush [org.hibernate.exception.ConstraintViolationException: could not execute statement]"
`

Clearly a duplicate key violation, based on the MySQL state codes. I am reviewing the code to see why such an error would be happening, but I still don't understand why it would prevent the token on the token_entry table from updating.

The output of show processlist on MySQL shows that the state of the thread doing the query is 'statistics'. Online searches indicate that this means the query is doing a lot of disk-bound IO, which doesn't make any sense, as this select for update should be almost instant…

My apologies for continuously deleting my posts; it's just that Google Groups does not provide the ability to edit an existing post for a correction, so all I can do is delete it and re-post.

Hi David,

Firstly, no worries for the spamming.

We can still follow your train of thought as it is.

When it comes to your issue, I’d definitely adjust the transaction isolation level from REPEATABLE_READ to READ_COMMITTED.

As Allard for example stated in this reply here, adjusting this should resolve the potential for concurrent inserts of a given event seqNo-aggregateId pair in the store.

As you've likely noticed, the aggregate identifier - sequence number pair is required to be unique, so that Axon is able to re-hydrate your Aggregate based on its events in the right order.

Having the transaction isolation level on REPEATABLE_READ would thus also explain why you're seeing the duplicate key violation you mentioned earlier.

So, would you mind adjusting the transaction isolation level and checking whether the exceptions are still occurring?

Cheers,
Steven

Hi Steven,

Good to hear from you again.

I changed the transaction isolation level to READ_COMMITTED, but unfortunately was still able to re-create the error (though still only in a clustered environment). I will continue to use the READ_COMMITTED transaction isolation level, however, since it is the one recommended for Axon. But alas, I am still at a loss as to what is locking the record for this processor on the token_entry table. This might sound strange, but is there anything special with regard to the name of the processor? This particular processor actually begins with the letter 'A', and is the only processor that is seizing up (out of a total of about nine). Would there be a situation in which the first processor in a clustered environment might lock as the instances get started and fight over processor ownership?

I have to apologize regarding the error from the second processor (the one suffering the duplicate key violation). Because both of these processors (one starting with ‘A’, and the duplicate key one) started seizing at more or less the same time, I thought they were suffering from the same problem. Events in the last 24 hours have shown me that this is not the case, and as such I have been able to resolve the issue with the duplicate key processor.

Thank you again Steven,

David

Hi David,

No worries about the second-processor error.
I fully understand that you'd pair both problems if they occur at the same time; don't we all make such assumptions at times?

To circle back to your question about the event processor naming, I can be pretty clear.
Axon does not do anything special with the names you choose for your Event Processors and Processing Groups.
It just uses the name to figure out where to place your Event Handling Components.
Nothing stops you from trying a different name to see whether the outcome differs, but to be honest it would surprise me if that resolved the problem you're encountering.

Let me ask you something different to get a grasp of your setup.

What exactly does this “Token-Table-Locking Event Processor” do upon event handling?
Simple read model updates? External service calls? Sending SMS messages?

Let’s try to figure this out David!

Cheers,
Steven

Hey Steven,

The “Token-Table-Locking Event Processor” sets up deadlines and queues up work for the “work-horse” aggregate in the system. It doesn’t have to make any off-JVM calls of any kind. Really it just acts as a traffic cop to get work to the work-horse.

So things got a little bit more interesting this morning. I was trying to re-create the error again today in QA (which now has two nodes in the cluster, lacks cluster configuration for Quartz, and still has the DB set to REPEATABLE_READ). I queued up a couple thousand events for processing in the work-horse, and then pulled the trigger to let them all loose at once. This caused a flood of events to be sent out. Afterwards, when I checked the token_entry table, I found that all the Processors whose threads are on the “leader” node in the cluster are up to date (meaning that the index of their GapAwareTrackingToken is equal to the latest global_index value in the domain_event_entry table). Meanwhile, the indexes for the Processors whose threads are on the “follower” node got stuck and haven't moved. Their GapAwareTrackingToken index is about 4,000 events behind. But their timestamp continues to get updated, so the threads on that node are still alive and kicking.

DB-side, show engine innodb status shows no DB locks of any kind, nor do I see any errors coming out of the TrackingEventProcessor related to DB transactions. In fact, there are no errors in the logs of any kind.

A thread dump sadly yields nothing of interest. The stack trace on the working threads of each node in the cluster looks the same (regardless of the index of their GapAwareTrackingToken):

`

{
  "methodName": "park",
  "fileName": "Unsafe.java",
  "lineNumber": -2,
  "className": "sun.misc.Unsafe",
  "nativeMethod": true
},
{
  "methodName": "parkNanos",
  "fileName": "LockSupport.java",
  "lineNumber": 215,
  "className": "java.util.concurrent.locks.LockSupport",
  "nativeMethod": false
},
{
  "methodName": "await",
  "fileName": "AbstractQueuedSynchronizer.java",
  "lineNumber": 2163,
  "className": "java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject",
  "nativeMethod": false
},
{
  "methodName": "peekGlobalStream",
  "fileName": "EmbeddedEventStore.java",
  "lineNumber": 373,
  "className": "org.axonframework.eventsourcing.eventstore.EmbeddedEventStore$EventConsumer",
  "nativeMethod": false
},
{
  "methodName": "peek",
  "fileName": "EmbeddedEventStore.java",
  "lineNumber": 350,
  "className": "org.axonframework.eventsourcing.eventstore.EmbeddedEventStore$EventConsumer",
  "nativeMethod": false
},
{
  "methodName": "hasNextAvailable",
  "fileName": "EmbeddedEventStore.java",
  "lineNumber": 333,
  "className": "org.axonframework.eventsourcing.eventstore.EmbeddedEventStore$EventConsumer",
  "nativeMethod": false
},
{
  "methodName": "processBatch",
  "fileName": "TrackingEventProcessor.java",
  "lineNumber": 344,
  "className": "org.axonframework.eventhandling.TrackingEventProcessor",
  "nativeMethod": false
},
{
  "methodName": "processingLoop",
  "fileName": "TrackingEventProcessor.java",
  "lineNumber": 269,
  "className": "org.axonframework.eventhandling.TrackingEventProcessor",
  "nativeMethod": false
}

`

These inconsistencies in reproducibility lead me more and more to think that this is due to some misconfiguration in how we are clustering.

Do you have any recommendations regarding proper configuration for the following in a clustered environment:

TrackingEventProcessorConfiguration
EventBus
PlatformTransactionManager/SpringTransactionManager

Thanks again Steven,

David

Oh, and to be clear, the “Token-Table-Locking Event Processor” was on the “leader” node this time, so its GapAwareTrackingToken did not get locked or stuck in the past. It is all caught up and ready to process…

Hi David,

For distributed setups, we typically suggest the following approach:

  • Use a dedicated distributed routing solution for your messaging components. This could be Axon Server for ease of use as it’s a one-stop-shop for commands, events and queries. If you cannot use Axon Server for some reason, you will have to resort to a Distributed Command Bus implementation (either JGroups or Spring Cloud) for your commands. For events, you can either use AMQP/Kafka to distribute your events or share the Event Store among your instances. For simplicity, I’d suggest the latter.
  • Tracking Event Processors that are duplicated because a given application is deployed on several nodes should always use the same token_entry table. Without sharing the token_entry table, they have no means to communicate which processor is handling which part of the Event Stream (see the configuration sketch after this list).
  • As Axon does not do any distributed transaction management, the auto-configured SpringTransactionManager should be just fine within a given JVM.
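
To make the second point concrete, here is a rough Axon 4-style sketch (Axon 3 does the equivalent through the EventHandlingConfiguration; the thread and segment counts are placeholders):

import org.axonframework.config.EventProcessingConfigurer;
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TrackingProcessorConfig {

    // Both nodes point at the same database, so they share the token_entry table.
    // The token claims stored there decide which node processes which segment;
    // no application-side leader election is needed.
    @Autowired
    public void configure(EventProcessingConfigurer configurer) {
        configurer.registerTrackingEventProcessorConfiguration(
                conf -> TrackingEventProcessorConfiguration
                        .forParallelProcessing(2)     // threads per node (placeholder)
                        .andInitialSegmentsCount(2)); // segments to split the stream into
    }
}

Whichever node claims a segment first will process it; the other node's thread simply waits for that claim to be released.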

I have one question back for you, David: what do you mean by “leader” node in your setup?

Do your application instances have some election algorithm in place?

When it comes to work division among Tracking Event Processors, I'd let the Tracking Token logic already in place in the framework take that job.
By updating the token claims, the framework ensures that distinct threads on (potentially distinct) Tracking Event Processors do not accidentally duplicate event handling.

Apart from that, I'd like to circle back to what you're using this Saga for.
Essentially, you are setting up deadlines for a “big” aggregate if I understand correctly.

What scale does this Aggregate cover and what role does it have?

A potential warning sign of too large an Aggregate, which I'd like to call a “system-Aggregate”, is if it is handling commands for virtually every part of the application.

This will make it a big congestion point, as all parts of the application will have to pass through this system-Aggregate's consistency boundary to perform an operation.

Additionally, when it comes to scheduling operations for a specific Aggregate instance, I'd suggest making the Aggregate instance itself in charge of doing so.
Axon Framework to that end provides the DeadlineManager, which allows you to schedule a deadline in an Aggregate directly.
This will simplify the loop you have, as you are effectively eliminating the Aggregate-Saga-Aggregate loop by using it.
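
As a minimal sketch of that idea, assuming an Axon 4-style API (the aggregate, command, event, and deadline names are illustrative placeholders rather than your actual classes):

import org.axonframework.commandhandling.CommandHandler;
import org.axonframework.deadline.DeadlineManager;
import org.axonframework.deadline.annotation.DeadlineHandler;
import org.axonframework.modelling.command.AggregateIdentifier;
import org.axonframework.modelling.command.AggregateLifecycle;
import org.axonframework.modelling.command.TargetAggregateIdentifier;
import org.axonframework.spring.stereotype.Aggregate;

import java.time.Duration;

@Aggregate
public class BillableSketch {

    @AggregateIdentifier
    private String billableId;

    protected BillableSketch() {
        // required by Axon
    }

    @CommandHandler
    public void handle(ScheduleBillingDeadline command, DeadlineManager deadlineManager) {
        // The aggregate schedules its own billing deadline; no Saga round-trip needed.
        deadlineManager.schedule(Duration.ofDays(30), "billingDue");
    }

    @DeadlineHandler(deadlineName = "billingDue")
    public void onBillingDue() {
        // When the deadline fires, the aggregate itself decides what happens next.
        AggregateLifecycle.apply(new BillableBecameDue(billableId));
    }

    // Illustrative placeholder message types.
    public static class ScheduleBillingDeadline {
        @TargetAggregateIdentifier
        public final String billableId;

        public ScheduleBillingDeadline(String billableId) {
            this.billableId = billableId;
        }
    }

    public static class BillableBecameDue {
        public final String billableId;

        public BillableBecameDue(String billableId) {
            this.billableId = billableId;
        }
    }
}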

Quite some info in here; I hope it guides you on the resolution path, David.
Know that if this problem breaks your production environment and the matter is very pressing, it’s quicker to contact us at AxonIQ directly instead of this user group.

Cheers,
Steven

Hi Steven,

Sorry for the delayed response, and thank you again for your feedback.

We do not have any election algorithm in place for designating a “leader” node. When I say “leader”, what I am referring to is that, in our two-node deployment, one of the nodes stands up first, and so that node claims ownership of most or all of the processors (so all the processor threads are active on that node, and those threads move back and forth between the running and waiting states). The “follower” node is the second one to stand up, by which time all the processors have been claimed by the “leader” node, so most of the time all of the processor threads on the “follower” are in the sleeping state, since the “leader” has already claimed ownership.

As for the architecture we have implemented, I did not go into a lot of detail about it in the last post, because I did not think that the Saga index locking we are seeing had much to do with it, but allow me to elaborate a little more. We have two important aggregates. There is a BillableAggregate, which represents a unit of billable work, keeps track of its own deadline for billing, and is obligated to be part of an Org. When the deadline expires, it is added to the BillingAggregate, and there is one BillingAggregate for every Org. So, there could be many BillableAggregates in the BillingAggregate that manages the billing for org “ABC”. The conversation for moving BillableAggregates into and out of the BillingAggregate is facilitated by the Saga (the one that keeps locking).

The BillingAggregate is an opinionated queuing system for the billables of that Org. It enforces the rules and decides which of the BillableAggregates that have reached their deadline get billed, and when.

I’m not sure if the BillingAggregate is a “system-Aggregate”, but it can have a lot of work on its plate, especially if many BillableAggregates reach their deadline at around the same time. The actual billing takes place outside of the application, so the request is sent off-process asynchronously, and we get a response message back when billing has completed.

Sadly, this continues to be an issue in production, and so we are having to act as the BillingAggregate at the moment. We have an override in place so that we can manually trigger the billing.

I have started to research options for distributing the command bus. I will have to start creating local POCs to see if I can get it to work.

That said, the warning messages in the logs are almost always something related to the SQL connection, or the Hikari pool, which still makes me think we have something misconfigured.

Thanks again,

David

Some additional updates after a looooong day of trying to diagnose the issue:

Despite the fact that we have had the BillingAggregate turned off in production for almost a month now, the PessimisticLockException continues to happen in the Saga that facilitates its processes. One other Saga has also had this issue, but of the 8537 times this exception has been thrown in the past two months, over 99.998% have come from the BillingAggregate’s Saga.

When the PessimisticLockException is thrown, it is thrown once per minute, for hours at a time. As such, the exception logs tend to be concentrated in certain time intervals, and then go silent thereafter.

I have been somewhat able to re-create the exception locally, but only in a very forced way, with direct intervention. If I run two nodes locally and put a break-point on the JpaTokenStore.extendClaim(String, int) method, and leave it there for 2-3 minutes, I will eventually see the same exception. I am getting that method to run by setting the owner on the token_entry table to NULL for the processor, which triggers the running of the TrackingEventProcessor.WorkerLauncher.run() method. Curiously, if I explicitly lock the token_entry table by issuing lock tables token_entry write; via command line, I do not get the same exception: I can see that the processing threads stop being able to write to the table, but they just patiently wait for the table to be available again.

Turning on Hibernate SQL statement debugging, I have also noticed that whichever thread owns the processor updates its timestamp on the token_entry table via an UPDATE ... command (meaning it tries to get a table-level lock). The other node (that does not own the processor) is constantly trying to do an update of the same record, but via a select ... for update (so it's attempting to perform a row-level lock). I have torn through the Axon source code, and thus far have not been able to figure out what/who would be creating this select ... for update query, since the JpaTokenStore explicitly uses the UPDATE command. I can see that the select ... for update comes from the staticLoadQuery of the org.hibernate.loader.entity.plan.AbstractLoadPlanBasedEntityLoader, but that is as far as I have been able to trace it. Nevertheless, this select ... for update does not appear to break anything (at least as near as I can tell). The LockModeType on the select ... for update query is PESSIMISTIC_WRITE.
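
For illustration, this is the kind of JPA call that Hibernate renders as a select ... for update; the entity below is a stand-in of my own, not Axon's real TokenEntry class:

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.LockModeType;
import javax.persistence.Table;

// Stand-in entity for illustration only; the real token_entry table uses a
// composite key of processor_name and segment.
@Entity
@Table(name = "token_entry")
class TokenRowSketch {
    @Id
    String processorName;
}

class PessimisticFindSketch {

    // A JPA find() with a PESSIMISTIC_WRITE hint is what Hibernate on MySQL
    // renders as "select ... from token_entry ... for update" on that row.
    static TokenRowSketch loadWithRowLock(EntityManager em, String processorName) {
        return em.find(TokenRowSketch.class, processorName, LockModeType.PESSIMISTIC_WRITE);
    }
}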

The top portion of the stack trace is:

"thread": "EventProcessor[BillingSagaProcessor]-0",
"level": "INFO",
"loggerName": "org.axonframework.eventhandling.TrackingEventProcessor",
"message": "An error occurred while attempting to claim a token for segment: 0. Will retry later…",
"thrown": {
  "commonElementCount": 0,
  "localizedMessage": "could not extract ResultSet",
  "message": "could not extract ResultSet",
  "name": "javax.persistence.PessimisticLockException",
  "cause": {
    "commonElementCount": 16,
    "localizedMessage": "could not extract ResultSet",
    "message": "could not extract ResultSet",
    "name": "org.hibernate.PessimisticLockException",
    "cause": {
      "commonElementCount": 16,
      "localizedMessage": "Lock wait timeout exceeded; try restarting transaction",
      "message": "Lock wait timeout exceeded; try restarting transaction",
      "name": "com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException",

Back to the codes…

Does your TokenEntry table have the proper indices registered? This should be done by default, but you never know. If for some reason it performs a table scan, it might just revert to table locking instead of row locking.

We did see something similar happen on MSSQL. That appeared to be an inconsistency between how Strings are represented in MSSQL and how the JDBC client sent parameters across. Due to this transformation, it would simply ignore any indices that had been defined.

Hope this gives you some clarity.
Unfortunately, I’m completely in the dark as to what could be causing this.

Cheers,


Greetings,

I don’t think this is due to an indexing issue. The TokenEntry table is very small, and even a full scan would take microseconds at most.

At this point I am also not so sure it is due to the clustering configuration. I was finally able to create the same exception by running a single instance locally, and then issuing the following command to the MySQL instance from my terminal:

BEGIN; SELECT * FROM token_entry WHERE processor_name = 'BillableAggregateSaga' FOR UPDATE;

And then just waiting 50 seconds for the innodb_lock_wait_timeout limit to be reached. This re-created the issue identically, and continued to produce the exception once every 50 seconds until I killed my terminal session.

This would indicate that, through some combination of logic, the processor is attempting to update its token by creating a transaction, and then sitting on that transaction forever. Eventually the transaction gets forcefully killed by the MySQL server when it hits the wait_timeout of 8 hours. This would also indicate that there is an application-side deadlock that the DB cannot detect.

The BillableAggregateSaga has an interval timer in it to refresh the quota of things that it can bill in a given time period. It is currently using the QuartzDeadlineManager to create this interval functionality, by scheduling a new deadline every time an existing deadline is reached. Though I still have not been able to naturally re-create the issue, I am starting to think that there is some edge-case cyclic dependency where Axon is trying to lock the BillingAggregate for event A, and then also trying to lock the BillingAggregate for event B, and somehow there is a dependency between events A and B, so that neither one can advance, and both wait indefinitely for each other to be created and/or committed.
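
In outline, the interval pattern looks like this (a rough sketch only; the deadline name, interval, and quota logic are placeholders, and the DeadlineManager is assumed to be resource-injected into the Saga by Spring):

import org.axonframework.deadline.DeadlineManager;
import org.axonframework.deadline.annotation.DeadlineHandler;
import org.springframework.beans.factory.annotation.Autowired;

import java.time.Duration;

public class QuotaRefreshSagaSketch {

    // Resource-injected into the Saga; transient so it is not serialized
    // along with the Saga's state.
    @Autowired
    private transient DeadlineManager deadlineManager;

    @DeadlineHandler(deadlineName = "quotaRefresh")
    public void onQuotaRefresh() {
        refreshQuota();
        // Re-arm the same deadline, turning a one-shot deadline into an interval timer.
        deadlineManager.schedule(Duration.ofMinutes(15), "quotaRefresh");
    }

    private void refreshQuota() {
        // placeholder for the actual quota-refresh logic
    }
}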

That cyclic-dependency explanation seems plausible, except that I have not been able to re-create the exception this way so far. I'm working on getting some thread dump code into production at this point, since that is the only place where we have ever seen the issue (Murphy's law: this problem never happens in Dev or QA, nor have we been able to re-create it there).

Hi David,

Did you manage to reproduce this in a way that you could get a thread dump?

Do you by any chance use sendAndWait() anywhere? The EventProcessor holding a lock isn’t a strange thing. All processing is always done while holding a lock on the entry (which is done to prevent another processor thread from claiming that segment when processing takes longer than the claim timeout).

Cheers,

Hi @Allard

Sorry about the late reply, I didn’t see any notification that you had responded.

I extended the JpaTokenStore and monitored calls to the fetchToken and storeToken methods, trying to listen for the Processor being locked again.

I was able to do so, and got a stack trace of the thread while it was “locked”:

"EventProcessor[BillingSagaProcessor]-0" Id=49 RUNNABLE (in native)
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    - locked java.lang.Object@6bb165f1
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)

Number of locked synchronizers = 2
    - java.util.concurrent.locks.ReentrantLock$NonfairSync@7b9e215e
    - java.util.concurrent.locks.ReentrantLock$NonfairSync@1b604d31

So the Processor’s thread is not blocked, but it is locked trying to read from the SSL socket that connects to the DB, while another thread is also doing the same. The threads cannot read from the same connection at the same time obviously, but I still cannot explain why the other thread would not finish reading from the connection, so that other threads could continue normal processing.

I looked in our code, and we do actually use sendAndWait in one place. It is used when we are sending the billing request to a JMS topic, but by a separate saga (not the saga that is blocking). This JMS request is handled by the service that actually charges the client’s CC.

Would that be the culprit?

To add a little more detail about the sendAndWait question, we are sending the billing command to the Publish object that has the reference to the JMS topic. This method has a ‘void’ return type obviously, since we cannot know when the billing service will respond.
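
For reference, the difference between the two dispatch styles as I understand it (the command payload here is just a plain Object placeholder, not our actual billing command class):

import org.axonframework.commandhandling.gateway.CommandGateway;

import java.util.concurrent.CompletableFuture;

class BillingDispatchSketch {

    private final CommandGateway commandGateway;

    BillingDispatchSketch(CommandGateway commandGateway) {
        this.commandGateway = commandGateway;
    }

    void dispatchBlocking(Object publishBillingCommand) {
        // Blocks the calling thread (in a Saga, the event processor's thread)
        // until the command handler has finished.
        commandGateway.sendAndWait(publishBillingCommand);
    }

    void dispatchAsync(Object publishBillingCommand) {
        // Returns immediately; the handler result can be inspected later if needed.
        CompletableFuture<Object> result = commandGateway.send(publishBillingCommand);
    }
}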

Hi David,

This looks like a Thread that’s reading data from an SSL connection. It’s RUNNABLE, so it’s not waiting for any lock to free up. It has all the locks it needs.

To be able to really see what’s happening, I’d need to see the AxonFramework entries in the stack trace as well.

Whether it's the JMS connection that causes things to hang is uncertain. For deadlocks, at least two stack traces are needed to do a proper analysis of what's going on.

Cheers

Hi Allard,

Thank you for the reply.

I will try and get the complete thread dump soon. The default thread dump from the ManagementFactory object sadly only gives the first 8 frames of the stack trace. I will have to deploy different code that gets the full trace. This will not happen until January at this point, however, due to the Christmas holiday here in the US.

Is there something important to know about the sendAndWait() method? I did not add it, and there is no note as to why it was added in the first place. Looking at the code, I cannot see why we would need to sendAndWait for a message being sent to a JMS topic.

Also, would you have a good recommendation for detecting when one of the Processor threads has become non-responsive, since the thread state is not BLOCKED? I don’t know if the Axon framework has a more native way of meeting this need than my implementation.

At the moment my method of detection has been to extend the JpaTokenStore. Then I have overridden the fetchToken and storeToken methods to store the processorName in a static ConcurrentHashMap<ProcessorName, Timestamp> (which in reality is a <String, Long>) before calling super.fetchToken(). Then every five minutes I check the map to see if there are any processors that have not updated their timestamp in over five minutes (which is pretty generous, since they are invoking the fetchToken and storeToken methods once per second).
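
In rough outline, the watchdog looks like this (a sketch only; the constructor and import paths assume the Axon 3.x JpaTokenStore, and the actual five-minute check runs elsewhere on a scheduled task):

import org.axonframework.common.jpa.EntityManagerProvider;
import org.axonframework.eventhandling.tokenstore.jpa.JpaTokenStore;
import org.axonframework.eventsourcing.eventstore.TrackingToken;
import org.axonframework.serialization.Serializer;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MonitoredJpaTokenStore extends JpaTokenStore {

    // processorName -> epoch millis of the last fetchToken/storeToken call seen
    private static final Map<String, Long> LAST_SEEN = new ConcurrentHashMap<>();

    public MonitoredJpaTokenStore(EntityManagerProvider entityManagerProvider, Serializer serializer) {
        super(entityManagerProvider, serializer);
    }

    @Override
    public TrackingToken fetchToken(String processorName, int segment) {
        LAST_SEEN.put(processorName, System.currentTimeMillis());
        return super.fetchToken(processorName, segment);
    }

    @Override
    public void storeToken(TrackingToken token, String processorName, int segment) {
        LAST_SEEN.put(processorName, System.currentTimeMillis());
        super.storeToken(token, processorName, segment);
    }

    // Called from a scheduled task: processors whose last token access is older
    // than the given threshold are likely hung, even if their thread is not BLOCKED.
    public static Map<String, Long> staleProcessors(long maxSilenceMillis) {
        long cutoff = System.currentTimeMillis() - maxSilenceMillis;
        Map<String, Long> stale = new HashMap<>();
        LAST_SEEN.forEach((name, lastSeen) -> {
            if (lastSeen < cutoff) {
                stale.put(name, lastSeen);
            }
        });
        return stale;
    }
}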

Thanks again for the help,

David

Though I do not recommend this solution for long-term use, as it seems to have a small memory leak that takes up to two weeks to become apparent.