JDBC connection exceptions appear to be solved by using newFixedThreadPool Executor for AsynchronousCommandBus

Greetings,

We are running Axon 4 with Spring Boot. The database is MySQL 5.7, and we are using mysql-connector-java to connect to it.

The ORM is Hibernate, by way of Spring Data JPA. Hikari manages the database connection pool.

Deployment-wise, we have two instances in the cluster.

Configuration-wise, we are using the AsynchronousCommandBus with the SpringTransactionManager implementation.

We have been getting intermittent exceptions in production stating that the socket connecting to the database was unexpectedly closed, more or less like this:

Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 306788ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:676)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:190)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:155)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128)
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:35)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:106)

I was able to re-create the environment and the exception locally by load testing, though I still don’t know what is causing the exception. Interestingly, this only happens when there are at least two instances of the application running. In this scenario, each request in the load test is for a unique record (unique aggregate id).

While debugging, I noticed that the default Executor implementation for the AsynchronousCommandBus is

Executors.newCachedThreadPool(
        new AxonThreadFactory(AsynchronousCommandBus.class.getSimpleName())
);

I also came across an article stating that newCachedThreadPool can be considered harmful: http://dev.bizo.com/2014/06/cached-thread-pool-considered-harmlful.html

As a result, I changed the Executor implementation to

Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())

And this appears to have fixed the problem (at least re-running locally).
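For completeness, this is roughly how we wired the custom Executor into the command bus via Spring configuration. Treat it as a sketch of our setup rather than a prescribed recipe: the bean method and the reuse of AxonThreadFactory are our own choices, and it assumes a TransactionManager bean (the SpringTransactionManager) is available for injection, as it is in our application.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.axonframework.commandhandling.AsynchronousCommandBus;
import org.axonframework.commandhandling.CommandBus;
import org.axonframework.common.AxonThreadFactory;
import org.axonframework.common.transaction.TransactionManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CommandBusConfiguration {

    @Bean
    public CommandBus commandBus(TransactionManager transactionManager) {
        // Bounded pool: at most one command-handling thread per core competes
        // for a database connection, instead of the unbounded cached pool.
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors(),
                new AxonThreadFactory(AsynchronousCommandBus.class.getSimpleName()));

        return AsynchronousCommandBus.builder()
                .executor(executor)
                .transactionManager(transactionManager)
                .build();
    }
}

(Shutdown handling of the ExecutorService is omitted here for brevity.)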

Does anyone have any insight as to what/where to look for the root cause, or why a fixed thread pool seems to fix the issue?

Thank you,

David

------------UPDATE

A new theory came to mind today. We are setting the Hikari connection pool size to 20, but newCachedThreadPool “creates new threads as needed”. It also looks like Hikari throws an exception if it cannot get a connection for the caller within its timeout. So what if the incoming requests are creating too many threads, exhausting the connection pool and eventually producing the exception…
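To make that mismatch concrete, here is a sketch of the two limits involved. The JDBC URL and timeout below are placeholders and the pool size of 20 simply mirrors what we run; Hikari hands out at most maximumPoolSize connections and raises the SQLTransientConnectionException above when a caller cannot get one within connectionTimeout, while the default cached thread pool puts no upper bound on the number of threads asking for a connection.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolLimitsSketch {

    public static HikariDataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/app"); // placeholder URL
        // At most 20 physical connections are ever handed out.
        config.setMaximumPoolSize(20);
        // A caller that cannot obtain a connection within this window gets the
        // "Connection is not available" SQLTransientConnectionException.
        config.setConnectionTimeout(30_000); // illustrative; Hikari's default is 30s

        // The default AsynchronousCommandBus executor, by contrast, is
        // Executors.newCachedThreadPool(...), so under load far more than 20
        // command-handling threads can be blocked in getConnection() at once.
        return new HikariDataSource(config);
    }
}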

Hi David,

Firstly, thanks for sharing that post; I wasn’t aware of its existence, to be honest.
A quick search through the Axon projects shows that the cached thread pool executor is only used for the Async and Disruptor command buses.
Thus, for the threads dispatching and handling a command.
These operations should typically be short-lived, which alleviates some of the concerns, imo.
Regardless, the problem described in the blog might still arise; I’ll check with the rest of the team if we want to take a different stance on this.

Secondly, for your problem at hand, I was leaning towards the connection pool size myself.
As your update describes a similar idea, I feel we’ve found the culprit in this scenario.
You can thus either increase the connection pool size or adjust the ExecutorService of the AsynchronousCommandBus; whichever you feel suits your situation best.

That’s my two cents on the situation.

Cheers,
Steven

Hi Steven,

Thank you for the response, very much appreciated.

The only piece of it that I still cannot reason away is why I only seem to be able to reproduce this when running in clustered mode. Try as I might, I am unable to reproduce it locally with just a single node.
Per this link, I enabled the leak detection in Hikari (https://github.com/brettwooldridge/HikariCP/issues/1111#issuecomment-373111306) and ran the load test locally to re-create the issue. It only ever appeared on the node that was the second one to the party, the first node chugged along happy and healthy:

15:08:29.796 [HikariPool-1 housekeeper] WARN com.zaxxer.hikari.pool.ProxyLeakTask - Connection leak detection triggered for com.mysql.cj.jdbc.ConnectionImpl@421259a3 on thread AsynchronousCommandBus-484, stack trace follows
java.lang.Exception: Apparent connection leak detected
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-3.2.0.jar:?]
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:35) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:106) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:136) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getConnectionForTransactionManagement(LogicalConnectionManagedImpl.java:254) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.begin(LogicalConnectionManagedImpl.java:262) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.begin(JdbcResourceLocalTransactionCoordinatorImpl.java:236) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.hibernate.engine.transaction.internal.TransactionImpl.begin(TransactionImpl.java:86) ~[hibernate-core-5.3.9.Final.jar:5.3.9.Final]
at org.springframework.orm.jpa.vendor.HibernateJpaDialect.beginTransaction(HibernateJpaDialect.java:183) ~[spring-orm-5.1.6.RELEASE.jar:5.1.6.RELEASE]
at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:401) ~[spring-orm-5.1.6.RELEASE.jar:5.1.6.RELEASE]
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:378) ~[spring-tx-5.1.6.RELEASE.jar:5.1.6.RELEASE]
at org.axonframework.spring.messaging.unitofwork.SpringTransactionManager.startTransaction(SpringTransactionManager.java:59) ~[axon-spring-4.1.1.jar:4.1.1]
at org.axonframework.messaging.unitofwork.UnitOfWork.attachTransaction(UnitOfWork.java:272) ~[axon-messaging-4.1.1.jar:4.1.1]
at org.axonframework.commandhandling.SimpleCommandBus.handle(SimpleCommandBus.java:172) ~[axon-messaging-4.1.1.jar:4.1.1]
at org.axonframework.commandhandling.AsynchronousCommandBus.lambda$handle$0(AsynchronousCommandBus.java:89) ~[axon-messaging-4.1.1.jar:4.1.1]
at org.axonframework.commandhandling.AsynchronousCommandBus$$Lambda$2068.000000005C06C020.run(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:813) [?:1.8.0_212]

And it just appears that the thread (in this case AsynchronousCommandBus-484) is sitting on the connection longer than Hikari wants, because shortly thereafter everything resumes functioning as normal and threads are able to get DB connections. Later on in the logs we can even see where the thread finally returns the connection (just shy of 40 seconds later):

15:09:08.201 [AsynchronousCommandBus-484] INFO com.zaxxer.hikari.pool.ProxyLeakTask - Previously reported leaked connection com.mysql.cj.jdbc.ConnectionImpl@421259a3 on thread AsynchronousCommandBus-484 was returned to the pool (unleaked)

Granted this is still all within the context of local testing.
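For reference, enabling the leak detection mentioned above boils down to a single Hikari setting. A minimal sketch, done programmatically here, with an illustrative 30-second threshold:

import java.util.concurrent.TimeUnit;

import com.zaxxer.hikari.HikariConfig;

public class LeakDetectionSketch {

    public static void enableLeakDetection(HikariConfig config) {
        // Any connection held longer than this threshold is reported by the
        // housekeeper thread as an "Apparent connection leak detected" warning,
        // and marked "unleaked" once it is finally returned, as in the logs above.
        config.setLeakDetectionThreshold(TimeUnit.SECONDS.toMillis(30));
    }
}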

We are still not sure whether it would be better to use the fixed thread pool or to increase the number of DB connections in the Hikari pool. Hikari’s page indicates that increasing the connection pool size does not increase performance, and can actually be detrimental, which has made us a bit unsure about how best to proceed: https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
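For what it’s worth, the starting-point formula on that wiki page works out to a fairly small pool. A rough sketch, where the core count refers to the database server and the inputs below are purely illustrative:

public class PoolSizingSketch {

    // Rough starting point from the HikariCP "About Pool Sizing" wiki:
    // connections = (core_count * 2) + effective_spindle_count.
    public static int suggestedPoolSize(int dbCoreCount, int effectiveSpindleCount) {
        return (dbCoreCount * 2) + effectiveSpindleCount;
    }

    public static void main(String[] args) {
        // e.g. a 4-core database server with one effective spindle -> 9 connections,
        // noticeably smaller than our current pool of 20.
        System.out.println(suggestedPoolSize(4, 1));
    }
}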

Does anything jump out at you as to why the multi-node arrangement at this time is the only way I can seem to re-create?

Thanks much,

Hi David,

Does anything jump out at you as to why the multi-node arrangement at this time is the only way I can seem to re-create?

To be honest, no, I do not fully comprehend why this is the case.
Maybe the test scenario you’ve written brings something to light.

Apart from this though, I’ve worked at a client where we had set up Hikari and the AsynchronousCommandBus in a multi-node setup.
The pool size was set to 5 there, if I recall correctly, and without any of the issues you are encountering (sadly).

Anyhow, maybe your test scenario points to something obvious.

So if you’re able to share it, that could be worthwhile.

Cheers,
Steven