Long-running queries on Aurora PostgreSQL event store

Hi dear Axon community,
I’ve got a brainteaser for you on this wonderful Monday morning:

We sometimes encounter a problem with long-running queries on our database.
We see significantly increased IOPS, resulting in high load and accordingly slow processing and high cost. This happens roughly every other day, but very irregularly, and mostly in the evening or on the weekend.
When this happens we see “hanging” queries on the database; usually the client_addr is that of one of our read services, connected to the reader instance of the database. By hanging I mean the state is ‘active’ or ‘idle in transaction’ and the query has been running unusually long. We found those with queries like “… FROM pg_stat_activity WHERE state = ‘active’ …”. We don’t see any indication of a deadlock.
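For reference, a query along these lines (not our exact one; the five-minute threshold is just an arbitrary cut-off) is what surfaces those sessions for us:

```sql
-- Sessions that have been 'active' or 'idle in transaction' for suspiciously long
-- (the 5-minute threshold is arbitrary; adjust to taste)
SELECT pid,
       usename,
       client_addr,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS query_runtime,
       now() - xact_start  AS xact_runtime,
       left(query, 120)    AS query_snippet
FROM pg_stat_activity
WHERE state IN ('active', 'idle in transaction')
  AND query_start < now() - interval '5 minutes'
ORDER BY query_start;
```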
Manually cancelling the queries and/or restarting the service fixes the issue for the time being. When nothing is done manually, the issue resolves itself after about 8 hours - our guess is that this might be due to some default timeout.
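For completeness, this is roughly how we cancel the hanging sessions by hand (the pid is only illustrative, taken from the query above), plus how the relevant server-side timeouts can be checked - the 8 hours is purely our observation, we haven’t found it configured anywhere yet:

```sql
-- Cancel the running statement, or terminate the whole backend if cancelling isn't enough
SELECT pg_cancel_backend(12345);      -- pid is illustrative
SELECT pg_terminate_backend(12345);

-- Which timeouts are configured? (0 means disabled; on Aurora these are changed
-- via the DB parameter group rather than ALTER SYSTEM)
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('statement_timeout',
               'idle_in_transaction_session_timeout',
               'tcp_keepalives_idle');
```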

Technical background:
Our services run in AWS; the Axon-based services run on EC2 or Fargate, with Aurora PostgreSQL (engine version 13.8) as the event store.
The services are Spring Boot 2.6.6 applications built with Axon Framework 4.7.2.
The problem was first observed a short time after upgrading to DB engine 13.7; the upgrade to 13.8 happened automatically but didn’t improve things.

Do you have any ideas why this might happen, or suggestions on what we should look into further? Has anyone encountered a similar problem?
We are currently increasing logging to find out the specifics of these queries and are looking into PostgreSQL itself - in the meantime any help is greatly appreciated.
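For anyone following along, these are the kinds of PostgreSQL logging parameters we are looking at (only a sketch of where we are poking; on Aurora the values are set via the DB parameter group, not ALTER SYSTEM):

```sql
-- Current values of the logging-related parameters:
--   log_min_duration_statement   - log statements slower than N ms
--   log_lock_waits               - log sessions waiting longer than deadlock_timeout on a lock
--   log_autovacuum_min_duration  - log autovacuum runs slower than N ms
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('log_min_duration_statement',
               'log_lock_waits',
               'log_autovacuum_min_duration')
ORDER BY name;
```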


Hi Flauber,

I’m not sure this is the reason, but since you mention it only happens occasionally, could it be related to a vacuum process running? I read here that TOAST might have a negative impact on vacuum performance:

“Vacuum performance: PostgreSQL runs a process called “vacuum,” which reclaims disk space from deleted or updated rows to maintain the database’s performance. The vacuum process could be slow when there are a lot of large data objects stored in TOAST tables. To address this issue, try running the vacuum process during periods of low database activity, or consider using a storage solution optimized for handling large data objects such as file systems or object storage.”

Typically, when using JPA with Axon and Postgres, all the event payloads are stored using TOAST. So it could be that disabling it, by following the instructions in this blog post, might improve things.
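To check whether vacuum is the culprit, something along these lines might help - the table names assume Axon’s default JPA mapping on Postgres, so adjust them to your schema:

```sql
-- Is a vacuum running right now, and on which table?
SELECT p.pid, c.relname, p.phase, p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;

-- How many dead tuples are pending cleanup on the event store tables,
-- and when did (auto)vacuum last run? (table names are Axon defaults)
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('domainevententry', 'snapshotevententry', 'tokenentry');
```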

Hi Gerard,

First of all thank you for your quick answer!
Unfortunately, we are already running a solution similar to the one given in the blog post.

I don’t have any other suggestions. Hopefully, you can supply some additional information soon.

Upon further investigation, it seems the DB engine upgrade might not be the issue after all, as we discovered similar-looking spikes in IOPS from before the upgrade. Overall load has increased, so that may be why we see the issue more regularly now.