Hi dear Axon-Community
i’ve got a brainteaser for you on this wonderful monday morning:
We sometimes encounter a problem with long running querys on our database.
We see significantly increased iops resulting in high load and accordingly slow processing and high cost. This happens about once every second day but very irregularly and mostly in the evening or on the weekend.
When this happens we see “hanging” querys on the database, usually the client_address is that of on of our read-services on the reader-instance of the database. By hanging i mean the state is ‘active’ or ‘idle in transaction’ and its been running unusually long. We found those with queries like “… FROM pg_stat_activity WHERE state = ‘active’ …”. We don’t see indications that we have a deadlock of some sort.
A manual cancellation of the querys and/or restart of the service fixes the issue momentarily. When nothing is done manually, the issue fixes itself after about 8 hours - our guess is this might be due to some standard timeout.
Technical background:
Our services are running in aws, axon-based services in ec2 or fargate with a postgress (engine version 13.8) as event store
the services are springboot 2.6.6 applications built with axon-framework in version 4.7.2
The problem was first observed a short time after upgrading to db engine 13.7, upgrade to 13.8 was done automatically but didn’t improve things.
Do you have any ideas why this might happen or some suggestions what we should further look into? Or did someone encounter a similar problem?
We are currently trying to increase logging, find out specifics on these queries and are looking into Postgress specifically - in the meantime any help is greatly appreciated.