Axon hanging on database inserts

Hi all.

We have been using Axon 2.4.3 for quite some time now in 2 different apps. The first has been running without problems for 4 or 5 months; the other one was deployed in production 2 weeks ago and is giving us a s**t load of problems. To go right to the point: we have some batch jobs that end up creating Axon processes, and when they run, connections to the database stall. We see lots of connections like this:

INSERT INTO axon.task_domain_event_entry (event_identifier, aggregate_type, aggregate_identifier, sequence_number, timestamp, payload_type, payload_revision, payload, metaData) VALUES ($1,$2,$3,$4,$5,$6,$7,XML($8),XML($9))
INSERT INTO axon.task_domain_event_entry (event_identifier, aggregate_type, aggregate_identifier, sequence_number, timestamp, payload_type, payload_revision, payload, metaData) VALUES ($1,$2,$3,$4,$5,$6,$7,XML($8),XML($9))
INSERT INTO axon.task_domain_event_entry (event_identifier, aggregate_type, aggregate_identifier, sequence_number, timestamp, payload_type, payload_revision, payload, metaData) VALUES ($1,$2,$3,$4,$5,$6,$7,XML($8),XML($9))
INSERT INTO axon.task_domain_event_entry (event_identifier, aggregate_type, aggregate_identifier, sequence_number, timestamp, payload_type, payload_revision, payload, metaData) VALUES ($1,$2,$3,$4,$5,$6,$7,XML($8),XML($9))
INSERT INTO axon.task_domain_event_entry (event_identifier, aggregate_type, aggregate_identifier, sequence_number, timestamp, payload_type, payload_revision, payload, metaData) VALUES ($1,$2,$3,$4,$5,$6,$7,XML($8),XML($9))
INSERT INTO axon.task_saga_entry(saga_id, revision, saga_type, serialized_saga) VALUES($1,$2,$3,XML($4))

These build up to a number that stalls our database: in this instance, 193 connections from Axon this morning. They had been hanging for more than 1 hour when we decided to shut down the database. There were 4 of those batch jobs creating Axon processes today:

11:30 - 20 processes
11:00 - 36 processes
09:30 - 19 processes
08:30 - 51 processes

To reach 192 active connections, some of them must have been open for 3 hours or so… I checked the database for locks, but it only had the 2 that are always there.

The other application has similar jobs running every hour, creating around 50 processes, and we have never had this problem there. We looked for differences in configuration, both on the database and in Axon, but couldn’t find anything specific.

Has anyone had similar problems, or can anyone advise on what to look for?

Many thanks.

Forgot to say: we are using a PostgreSQL database, with PRIMARY KEY (aggregate_identifier, sequence_number, aggregate_type); all of those columns are text.

Cheers.

Hi again, we just tested directly, without the batch jobs, starting 2000 processes. Of those, 250 were successfully written to the Axon tables, and then it hung with 150 connections.

Cheers again.

Hi Antonio,

I recommend updating to 2.4.5 and trying to reproduce it there.
How do you connect to the database? Using a connection pool? How large is the pool?
Also check the locking configuration of the database. Some databases use page-level locking by default. Using row-level locking is highly recommended.
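
If it helps to narrow things down, you can also ask PostgreSQL directly what those stalled connections are doing, i.e. whether they are waiting on a lock or just sitting idle in an open transaction. A rough JDBC sketch against pg_stat_activity (the URL and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Lists non-active backends and how long they have been in their current state.
    // Connections reported as "idle in transaction" mean the statement itself finished,
    // but the surrounding transaction was never committed or rolled back.
    public class StalledConnectionCheck {

        public static void main(String[] args) throws Exception {
            // Placeholder connection details; point this at the database Axon writes to.
            String url = "jdbc:postgresql://localhost:5432/axon";
            String sql = "SELECT pid, state, now() - state_change AS in_state_for, query "
                       + "FROM pg_stat_activity "
                       + "WHERE state <> 'active' "
                       + "ORDER BY in_state_for DESC";
            try (Connection con = DriverManager.getConnection(url, "axon", "secret");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("pid=%s state=%-22s in_state_for=%s last_query=%s%n",
                            rs.getString("pid"), rs.getString("state"),
                            rs.getString("in_state_for"), rs.getString("query"));
                }
            }
        }
    }

If most of the stuck connections show up as “idle in transaction” rather than waiting on a lock, the problem is on the application side: something opened a transaction, did the INSERT, and never committed, rolled back or closed the connection.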

Hope one of these pointers helps.
Cheers,

Allard

Hi Allard, thanks for your reply.

We did upgrade to 2.4.5 yesterday in our UAT system and were able to reproduce the problem there. However, it seems those INSERTs were actually executed, and it was suggested that the connections stay in “idle” mode because nobody closes them after the insert. They do eventually die after some 10 minutes, but the problem is that the number of connections can quickly reach the limit (we have 500 in UAT and 200 in Production).

Does this ring any bells for you? Does Axon explicitly close the connections? I wrote some code on top of Axon’s persistence, so I’m going to check that as well.

Thanks for your help.

We are using HikariCP with a pool size of 200 and a timeout of 2 minutes.
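
As a side note: HikariCP’s leak detection can log a warning, including the stack trace of the code path that borrowed a connection, whenever a connection stays checked out longer than a configurable threshold. A minimal sketch of such a pool setup (the connection details are placeholders, and the 2-minute timeout is assumed here to be the connection timeout):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    // Pool setup with leak detection: if a connection is held longer than the
    // threshold without being returned, Hikari logs a warning with the stack
    // trace of the code that borrowed it, which should point at whatever is
    // not closing connections after the INSERT.
    public class PoolSetup {

        public static HikariDataSource dataSource() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://localhost:5432/axon"); // placeholder
            config.setUsername("axon");                                 // placeholder
            config.setPassword("secret");                               // placeholder
            config.setMaximumPoolSize(200);            // the pool size mentioned above
            config.setConnectionTimeout(120_000L);     // assumed: the 2-minute timeout above
            config.setLeakDetectionThreshold(60_000L); // warn when a connection is out > 60 s
            return new HikariDataSource(config);
        }
    }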

Hi Antonio,

I don’t have any specific solution to this problem, but I do have a fair bit of experience with Axon-based applications, so I can give some general guidance in the hopes that it may give you some new paths for debugging.

One of the things that I love most about Axon is that it forces clean separation of application components (aggregates, domain vs views, core logic vs ancillary processes). This has also been the hardest part to learn for me – other application design approaches are much more forgiving, and let you get away with a lot of workarounds (until it inevitably becomes unmaintainable).

Since this is a new application, my first guess would be that some of the interactions between different parts of the application are causing a deadlock on the Java side of things. I think the database would be more “vocal” if there were a transaction-level deadlock (one of the connections would get killed), but the database has no visibility into the Java mutexes that may be held.

In our very first Axon design, we had tremendous issues with deadlocks because the aggregates were too fine-grained. There was a lot of aggregate-to-aggregate communication, and we used nested Units of Work. A single top-level UoW might look like this:

T1: Command -> [ Aggregate 1 -> Event -> Event Handler -> Command -> [ Aggregate 2 -> Event ]]
Simultaneously, another command might look like this:

T2: Command -> [ Aggregate 2 -> Event -> Event Handler -> Command -> [ Aggregate 1 -> Event ]]

Note that T1 first acquires a lock on Aggregate 1, then attempts to lock Aggregate 2 in the nested UoW. The other thread, T2, acquires the same locks in the reverse order. Both transactions will have some events flushed to the database, but neither can ever commit.
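
To make the mechanics concrete, here is a stripped-down, plain-Java sketch of that lock-ordering situation. The ReentrantLocks merely stand in for the per-aggregate locks the locking repository takes, so nothing here is Axon-specific:

    import java.util.concurrent.locks.ReentrantLock;

    // Two threads take the same two locks in opposite order and neither can ever
    // proceed. The database never reports a deadlock, because the blocking happens
    // entirely on the Java side.
    public class LockOrderingDeadlock {

        private static final ReentrantLock aggregate1 = new ReentrantLock();
        private static final ReentrantLock aggregate2 = new ReentrantLock();

        public static void main(String[] args) {
            new Thread(() -> handle(aggregate1, aggregate2), "T1").start();
            new Thread(() -> handle(aggregate2, aggregate1), "T2").start();
            // T1 holds aggregate1 and waits for aggregate2;
            // T2 holds aggregate2 and waits for aggregate1. Both park forever.
        }

        private static void handle(ReentrantLock outer, ReentrantLock inner) {
            outer.lock();
            try {
                sleep(100);   // give the other thread time to grab its first lock
                inner.lock(); // the nested unit of work locks the second aggregate
                try {
                    // ... apply events, flush them to the event store ...
                } finally {
                    inner.unlock();
                }
            } finally {
                outer.unlock();
            }
        }

        private static void sleep(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }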

This is pretty easy to spot with a debugger:

  • Wait for the application to stall
  • kill -3 (or collect a thread dump using jvisualvm/debugger/etc.)
  • Examine all of the threads, looking for calls to the LockingRepository and similar
  • Track down which commands are being handled by those threads

A similar deadlock can occur when the event handler itself attempts to acquire a lock: the Saga repository does enforce locking, so if you have two aggregates that each send events to the same two sagas, you can run into the same deadlock.
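
Besides reading the thread dump by hand, you can also ask the JVM which threads it considers deadlocked. A small sketch using the standard ThreadMXBean; note that it only detects deadlocks on monitors and java.util.concurrent locks, so depending on how the repository locks are implemented it may or may not flag the problem, and the thread dump remains the authoritative view:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Prints any threads the JVM considers deadlocked on monitors or
    // java.util.concurrent locks. Handy to call from a scheduled task or an
    // admin endpoint while the application is stalled.
    public class DeadlockProbe {

        public static void report() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            long[] deadlocked = threads.findDeadlockedThreads(); // null if none found
            if (deadlocked == null) {
                System.out.println("No JVM-level deadlock detected");
                return;
            }
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, 50)) {
                System.out.println(info); // thread name, the lock it waits for, its owner
            }
        }

        public static void main(String[] args) {
            report();
        }
    }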

If you do run into this (like we did), a possible solution is to never use nested Units of Work. Instead, we’ve used two strategies:

  1. When an event handler needs to send a command, we force that interaction to be asynchronous.
     • Option 1: Use an async command bus. Keep in mind that dispatch is not guaranteed, and the sender MUST NOT wait for a result.
     • Option 2: Use a (short-lived) Saga. This is, after all, one of their purposes. See #2.
     • Option 3: Use a ‘durable’ async command gateway. The gateway will:
       • Journal the command to a database table
       • Wait for transaction commit (this ensures commands aren’t “lost”)
       • After transaction commit, dispatch the command
  2. When a saga needs to send a command, NEVER send it in the same transaction (see the sketch after this list). Instead:
     • Schedule a timer for “0 ms” (using a scheduler backed by the UoW transaction, e.g. Quartz using JDBC)
     • Wait for the current UoW to commit the timer
     • When the timer fires, a new UoW is created for the saga to handle the scheduled event; now dispatch the command
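
To illustrate strategy 2, here is a rough sketch of a saga that schedules a “0 ms” trigger instead of dispatching a follow-up command directly. It assumes the Axon 2.x saga and EventScheduler APIs (with Joda-Time durations); TaskCoordinationSaga, TaskCreatedEvent, DispatchFollowUp and FollowUpCommand are made-up names for illustration only:

    import org.axonframework.commandhandling.gateway.CommandGateway;
    import org.axonframework.eventhandling.scheduling.EventScheduler;
    import org.axonframework.saga.annotation.AbstractAnnotatedSaga;
    import org.axonframework.saga.annotation.SagaEventHandler;
    import org.axonframework.saga.annotation.StartSaga;
    import org.joda.time.Duration;

    // The saga never dispatches a follow-up command inside the unit of work that
    // delivered the triggering event. It schedules an event for "0 ms" instead;
    // a scheduler backed by the same transaction (e.g. Quartz with a JDBC job
    // store) only fires it after that transaction commits, so the command is
    // handled in a fresh unit of work with no aggregate locks held.
    public class TaskCoordinationSaga extends AbstractAnnotatedSaga {

        private transient EventScheduler eventScheduler; // injected by the resource injector
        private transient CommandGateway commandGateway; // injected by the resource injector

        @StartSaga
        @SagaEventHandler(associationProperty = "taskId")
        public void on(TaskCreatedEvent event) {
            // Do NOT send the command here: the aggregate lock of the current
            // transaction is still held. Schedule a trigger event instead.
            eventScheduler.schedule(Duration.ZERO, new DispatchFollowUp(event.getTaskId()));
        }

        @SagaEventHandler(associationProperty = "taskId")
        public void on(DispatchFollowUp trigger) {
            // The timer fired after the original transaction committed;
            // dispatch the command fire-and-forget in this new unit of work.
            commandGateway.send(new FollowUpCommand(trigger.getTaskId()));
        }

        public void setEventScheduler(EventScheduler eventScheduler) {
            this.eventScheduler = eventScheduler;
        }

        public void setCommandGateway(CommandGateway commandGateway) {
            this.commandGateway = commandGateway;
        }

        // Made-up event/command types, only here to keep the sketch self-contained.
        public static class TaskCreatedEvent {
            private final String taskId;
            public TaskCreatedEvent(String taskId) { this.taskId = taskId; }
            public String getTaskId() { return taskId; }
        }

        public static class DispatchFollowUp {
            private final String taskId;
            public DispatchFollowUp(String taskId) { this.taskId = taskId; }
            public String getTaskId() { return taskId; }
        }

        public static class FollowUpCommand {
            private final String taskId;
            public FollowUpCommand(String taskId) { this.taskId = taskId; }
            public String getTaskId() { return taskId; }
        }
    }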

I’ve done a fair bit of performance testing with Axon 2.4.x, most recently using PostgreSQL, and I haven’t seen any database-related issues like this. I’m afraid the most likely issue is that you’ve overlooked some interaction between aggregates, views, sagas and other event handlers.

Hope this gives you some new insight and leads you to tracking down the root cause!

~Patrick