Reconnect to Database fails

Michael_Dempfle1 · May 26, 2020, 8:31am

Hi,

We previously had the problem that reconnects to Axon Server failed when Axon Server was restarted. This seems to be solved in Axon Framework 4.3.2.

Now we had a reconnect problem with the database itself. Our MySQL database was down for a couple of minutes on prod, and on QA it was down for 4 hours.

The following error can be seen when you go to the tracking processors:
Processor in error state: org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is org.hibernate.TransactionException: JDBC begin transaction failed:

Restarting the tracking processors in the dashboard makes the system work again. We are using the standard version, and there it does not happen automatically.

Questions:

Is there a configuration to fix this behavior, such as an endless retry? It does not help if, for example, the processor is marked as failed after 10 retries.
Is the Enterprise version monitoring this and restarting the tracking processors automatically?

Best, Michael

allardbz · May 26, 2020, 1:08pm

Did you restart the processors or the entire application?

Michael_Dempfle1 · May 26, 2020, 2:36pm

I only restarted the processors.

This time it helped. But sometimes we also need to restart the whole service. We don’t see a real pattern yet.

Best, Michael

allardbz · May 26, 2020, 2:51pm

Honestly, I don’t think this is an Axon issue. Axon just uses (in this case) Spring to manage the transactions and connections (using Hikari). It looks like the connections that are taken out of the Hikari pool are unchecked and may have failed. Hikari can be configured to do a health check on connections when taking them out of the pool and to refresh them if they have become stale.
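A minimal sketch of such a pool configuration, assuming HikariCP is on the classpath and the pool is built by hand. The JDBC URL, credentials, and all timeout values are hypothetical placeholders, not recommendations. As far as I know, Hikari already validates a connection on checkout once it has been idle for a short while, so the settings that matter most are how long that validation may take and how long a connection is allowed to live:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfigSketch {

    public static HikariDataSource buildDataSource() {
        HikariConfig config = new HikariConfig();

        // Hypothetical connection details; replace with your own.
        config.setJdbcUrl("jdbc:mysql://db.example.internal:3306/app");
        config.setUsername("app_user");
        config.setPassword("change-me");

        // Maximum time (ms) a caller waits for a connection from the pool.
        config.setConnectionTimeout(10_000);

        // Maximum time (ms) a connection aliveness check may take.
        // Must be lower than the connection timeout.
        config.setValidationTimeout(3_000);

        // Retire connections after this many ms, so connections that
        // predate a database outage do not linger in the pool.
        // Assumed value; keep it below any server-side idle timeout.
        config.setMaxLifetime(600_000);

        // Note: with default settings the constructor tries to connect
        // immediately and throws if the database is unreachable.
        return new HikariDataSource(config);
    }
}
```

In a Spring Boot application the same settings are usually set as properties instead, for example `spring.datasource.hikari.max-lifetime` and `spring.datasource.hikari.validation-timeout`.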

When a transaction fails, Axon’s TrackingProcessor will release the segment that it is processing and schedule a retry. The time between retries doubles each time, with a maximum wait time of 60 seconds. If you have multiple versions of an application deployed, it should allow other applications to claim that segment and proceed, if they manage to get access to the database. However, if there are sufficient stale connections in the pool, it’s easy to get a long retry timeout to get them all out of the way.
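To illustrate the retry schedule described above, here is a small standalone sketch. It is not Axon's code: the class and method names are made up, and the initial wait of 1 second is an assumption. Only the doubling and the 60-second cap come from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryBackoffSketch {

    // Upper bound on the wait between retries, in seconds.
    static final long MAX_WAIT_SECONDS = 60;

    // Returns the wait (in seconds) before each of the first `attempts` retries.
    static List<Long> waitTimes(long initialWaitSeconds, int attempts) {
        List<Long> waits = new ArrayList<>();
        long wait = initialWaitSeconds;
        for (int i = 0; i < attempts; i++) {
            waits.add(wait);
            // Double the wait after every failure, capped at the maximum.
            wait = Math.min(wait * 2, MAX_WAIT_SECONDS);
        }
        return waits;
    }

    public static void main(String[] args) {
        // Assumed initial wait of 1 second.
        System.out.println(waitTimes(1, 9));
        // prints [1, 2, 4, 8, 16, 32, 60, 60, 60]
    }
}
```

Under these assumptions, every failed attempt that hits a stale connection pushes the next attempt further out, so after a handful of failures the processor only tries once a minute.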

Hope this helps.
Cheers,

Michael_Dempfle1 · May 28, 2020, 7:54am

Hi Allard,

Thanks for the feedback.

What do you mean by “However, if there are sufficient stale connections in the pool, it’s easy to get a long retry timeout to get them all out of the way.”?

Also, is there an improvement when using the Enterprise version? Would this be detected and fixed automatically, given that it already does automatic re-balancing?

Best, Michael