Transactions kept open by idle async saga manager threads?

Even when no requests are active, PostgreSQL shows “idle in transaction” status for almost all the connections in my connection pool, meaning there is no query currently running but the connections have open transactions. A thread dump confirms that no threads are in code that is actively using a connection (they are all waiting for work). I’m using an async saga manager with a Spring transaction manager and a ScheduledThreadPoolExecutor with a core pool size greater than the saga manager’s processor count. Everything is on one node at the moment.
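For context, the thread-pool setup looks roughly like this (the numbers here are placeholders, not my real config; the point is just that core threads outnumber the saga processors, so some workers are always parked waiting for events):

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;

public class PoolSetup {
    public static void main(String[] args) {
        int sagaProcessorCount = 4;  // placeholder for the saga manager's processor count
        // Core pool size deliberately larger than the processor count, so some
        // worker threads will always be idle, waiting for work.
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(8);
        System.out.println(executor.getCorePoolSize() > sagaProcessorCount);
        executor.shutdown();
    }
}
```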

What I think is happening: When an event arrives, all the idle threads in the saga manager’s thread pool wake up. They end up running AsyncSagaEventProcessor.doProcessEvent(), which calls prepareSagas(), which ensures there’s an active UnitOfWork, which begins a database transaction. One of the threads ends up handling the event, writing the saga to the database, committing its UnitOfWork, and releasing its connection back to the connection pool. So far so good.

The problem is that the remaining threads, each of which has already started its own UnitOfWork, have empty lists of processed sagas, so when they call persistProcessedSagas(), they skip the code path that commits the active UnitOfWork. Those threads go back to waiting for more events, but their UnitOfWork instances are still active, holding connections with open transactions.

This seems to get masked by the fact that when one of these threads processes another event, it reuses the existing UnitOfWork rather than starting a new one, thanks to the logic in ensureActiveUnitOfWork(), so the transactions do eventually get committed. That makes me think this might be by design, to reduce event-processing latency: Axon isn’t constantly creating and destroying UnitOfWork instances in threads that end up not handling any events. However, PostgreSQL’s VACUUM can’t reclaim dead rows that are still visible to an open transaction, and with the behavior I’m seeing there will essentially never be a time when there aren’t open transactions as long as the application is running, even when it’s sitting idle. It also means I have to oversize my database connection pool to account for idle saga threads holding onto connections.
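To make the suspected lifecycle concrete, here’s a plain-JDK model of what I think is happening. This mirrors my reading of the code, not Axon’s actual implementation; all the names are stand-ins:

```java
import java.util.ArrayList;
import java.util.List;

public class UnitOfWorkLeakSketch {
    // Stand-in for the number of pooled connections with an open transaction.
    static int openTransactions = 0;

    // Simplified model of one saga processor thread's per-event state.
    static class Worker {
        boolean uowActive = false;
        final List<String> processedSagas = new ArrayList<>();

        // Models ensureActiveUnitOfWork(): lazily start a UnitOfWork,
        // which begins a database transaction on a pooled connection.
        void prepareSagas() {
            if (!uowActive) {
                uowActive = true;
                openTransactions++;
            }
        }

        // Models persistProcessedSagas(): only the worker that actually
        // handled the event has sagas to persist, so only it commits.
        void persistProcessedSagas() {
            if (!processedSagas.isEmpty()) {
                processedSagas.clear();
                uowActive = false;
                openTransactions--;  // commit + return connection to the pool
            }
            // Workers with nothing to persist skip the commit and go back
            // to waiting, leaving their transaction open.
        }
    }

    public static void main(String[] args) {
        List<Worker> workers = List.of(new Worker(), new Worker(), new Worker());

        // An event arrives: every idle worker wakes up and prepares.
        workers.forEach(Worker::prepareSagas);
        workers.get(0).processedSagas.add("saga-1");  // one worker handles the event
        workers.forEach(Worker::persistProcessedSagas);

        // The other two workers still hold "idle in transaction" connections.
        System.out.println(openTransactions);
    }
}
```

Running this prints 2: the two workers that didn’t handle the event never commit, which matches the idle-in-transaction connections I see in pg_stat_activity.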

Screenshot from IntelliJ’s debugger of the situation (waiting for an event with an active UnitOfWork): https://imgur.com/CGk8uLx

pg_stat_activity contents with an idle Axon application: http://pastebin.com/qdbZinas

Hopefully I’m just doing something stupid and this isn’t the expected behavior. Happy to post details of my configuration if that’s helpful. Obviously, take my analysis with a grain of salt since I’m not all that familiar with Axon’s internals, but the idle-in-transaction connections are for real.

-Steve

Hi Steve,

it seems that you’re not doing anything wrong. There appears to be a subtle connection leak (strictly speaking not a leak, since the connection does eventually come back) in the AsyncEventProcessor used by the AsyncAnnotatedSagaManager. It starts a transaction, but keeps it open until a Saga is actually persisted. As a result, transactions can remain open for a long period of time when the application is idle.

The fix for this was pretty easy. I’ll see if I can prepare a bugfix release that includes it (and potentially a few other fixes).

Cheers,

Allard

Hi Allard,

Was this fixed? I spent some time tracking this down myself, as we use HikariCP to pool the connections. It has a housekeeping job which fires periodically and logs connection leak exceptions. I thought we were doing something wrong with the saga managers until I came across this post! It might also cause our Production Support team to panic a bit if they see things like this in the logs… If it’s benign, it would be good if I could use the ‘fixed’ version or suppress the warnings.
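In the meantime I’m considering quieting the warnings via Hikari’s leak detector threshold. If I understand the config correctly, something like this controls it (the 60-second value is arbitrary; this is just a sketch of the setting, not our real setup):

```java
import com.zaxxer.hikari.HikariConfig;

public class HikariLeakTuning {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Warn only if a connection stays checked out for longer than 60s;
        // a value of 0 disables Hikari's leak detection entirely.
        config.setLeakDetectionThreshold(60_000);
        System.out.println(config.getLeakDetectionThreshold());
    }
}
```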

Paul

Hi Paul,

I do recall fixing this, yes. There should be no more “escape routes” for connections to leak out. Steve is still active on this mailing list, so hopefully he can confirm that the issue has been resolved properly.

Cheers,

Allard

The fix for this did indeed resolve the problem in our application.

-Steve

Thanks guys. I’ll try and track down which version it was resolved in, unless you know off the top of your head…

Paul