Tracking processor falling behind the event store

So, we are having some trouble with our tracking processors. I will try to describe it with the information I have been able to collect so far.

Our system is composed of several microservices, each with its own query model and reading from and writing to the event store as usual. We have a Hikari connection pool for managing connections to the database (a MySQL database in this case). Under certain circumstances, some of the services are not able to get a connection from the connection pool. This is a separate issue I am also trying to solve. However, it does not happen at the same time for all services, so we can end up in a situation where the rest of the services are applying events but the affected service is stuck (as it cannot read from the event store).

Eventually, the affected service is able to get a connection again, and this is when we start having issues. The tracking processors living in that service get stuck and fall behind the event store. Events generated during the “down” period are not handled, and events applied after the connection is obtained are not handled either. We can see this warning in the logs:

2018-05-03 15:26:41.756 WARN 22 --- [dedEventStore-1] o.a.e.eventstore.EmbeddedEventStore : An event processor fell behind the tail end of the event store cache. This usually indicates a badly performing event processor.

This led us to this line of code. Apparently, it is intended that the processors get stuck when the consumer is marked as behindGlobalCache.

So, I guess my question is what would be the intended way to gracefully recover from this. Any other information about the cause of the issue will also be super helpful, although I guess I have not provided enough details for anybody to know.

The only way we’ve managed to recover from this was by restarting the service. Upon restart, tracking processors are able to consume all events and get up to date.

The EmbeddedEventStore will give you that warning if the EventConsumer can no longer keep up with the global event stream.

This global event stream is used to minimize the number of open database connections needed to the domain event entry table.

In such a scenario, the EventConsumer thread will automatically switch to a local event stream.
So you should not have to do anything specific within Axon, as the consumer will proceed on its own.
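
To make the fallback described above a bit more concrete, here is a minimal toy model in plain Java. It is not Axon's implementation: the names FallBehindDemo, EventLog, Consumer and privateReads are all made up for illustration, and an in-memory list stands in for the event table. It only shows the idea that a consumer reads from a small shared cache of the newest events while it keeps up, and switches to its own reads against the store once the event it needs has been evicted from that cache.

```java
import java.util.ArrayList;
import java.util.List;

public class FallBehindDemo {

    /** Hypothetical event log: a full store plus a bounded cache of the newest events. */
    static final class EventLog {
        private final List<String> store = new ArrayList<>(); // stand-in for the event table
        private final List<String> cache = new ArrayList<>(); // shared tail, newest events only
        private final int cacheSize;
        private long firstCachedIndex = 0; // global index of cache.get(0)

        EventLog(int cacheSize) {
            this.cacheSize = cacheSize;
        }

        void append(String event) {
            store.add(event);
            cache.add(event);
            if (cache.size() > cacheSize) {
                cache.remove(0); // evict the oldest cached event
                firstCachedIndex++;
            }
        }

        long size() {
            return store.size();
        }

        /** Returns the cached event at the given global index, or null if it is not cached. */
        String fromCache(long index) {
            long offset = index - firstCachedIndex;
            if (offset < 0 || offset >= cache.size()) {
                return null;
            }
            return cache.get((int) offset);
        }

        /** Direct read from the store, modelling a consumer's private stream. */
        String fromStore(long index) {
            return store.get((int) index);
        }
    }

    /** Hypothetical consumer that prefers the shared cache and falls back to the store. */
    static final class Consumer {
        private final EventLog log;
        private long next = 0;
        private int privateReads = 0;
        private boolean behindCache = false;

        Consumer(EventLog log) {
            this.log = log;
        }

        /** Returns the next event in order, or null when the consumer has caught up. */
        String poll() {
            if (next >= log.size()) {
                return null;
            }
            String event = log.fromCache(next);
            if (event == null) {
                // The event was evicted before we got to it: read it privately.
                behindCache = true;
                privateReads++;
                event = log.fromStore(next);
            } else {
                behindCache = false; // back on the shared cache
            }
            next++;
            return event;
        }

        boolean isBehindCache() {
            return behindCache;
        }

        int privateReads() {
            return privateReads;
        }
    }

    public static void main(String[] args) {
        EventLog log = new EventLog(3);
        Consumer consumer = new Consumer(log);
        log.append("e0");
        log.append("e1");
        consumer.poll();
        consumer.poll();
        // The consumer stalls here while other services keep appending events.
        for (int i = 2; i < 7; i++) {
            log.append("e" + i);
        }
        String event;
        while ((event = consumer.poll()) != null) {
            System.out.println(event + (consumer.isBehindCache() ? " (private read)" : " (shared cache)"));
        }
    }
}
```

In this model falling behind costs extra reads but no events are skipped, which is the behaviour the warning is supposed to go along with.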

From your message, however, I get the impression that it just shuts down any further event processing.
Is this actually the case? Have you been able to confirm whether your TrackingEventProcessors were still receiving events while that warning popped up?

The reason you’re receiving this warning is very likely (assumption incoming!) that your TrackingEventProcessor tries to update a view and requires a connection from Hikari for that, but since it doesn’t get one, it waits until it receives one.

Waiting for that connection takes so long, however, that the EventConsumer for the given TrackingEventProcessor has fallen behind compared to the others.
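
As a rough illustration of that assumption, here is a small plain-Java sketch in which a Semaphore stands in for the connection pool. It does not use the Hikari or Axon APIs; PoolStarvationDemo, TinyPool and ViewUpdater are hypothetical names. It shows a view updater that needs one connection per event and gives up after a bounded wait when the pool is exhausted, so the event is not applied until a connection is free again.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PoolStarvationDemo {

    /** Hypothetical stand-in for a connection pool: each permit is one connection. */
    static final class TinyPool {
        private final Semaphore permits;

        TinyPool(int size) {
            this.permits = new Semaphore(size);
        }

        /** Waits up to timeoutMillis for a connection; returns false if none became free. */
        boolean borrow(long timeoutMillis) {
            try {
                return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }

        void giveBack() {
            permits.release();
        }
    }

    /** Hypothetical view updater that needs one connection per handled event. */
    static final class ViewUpdater {
        private final TinyPool pool;
        private int applied = 0;

        ViewUpdater(TinyPool pool) {
            this.pool = pool;
        }

        /** Returns true if the event was applied, false if no connection was available in time. */
        boolean handle(String event, long timeoutMillis) {
            if (!pool.borrow(timeoutMillis)) {
                return false; // pool exhausted: the event stays unhandled for now
            }
            try {
                applied++; // a real handler would update the view here
                return true;
            } finally {
                pool.giveBack();
            }
        }

        int applied() {
            return applied;
        }
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(1);
        ViewUpdater updater = new ViewUpdater(pool);

        pool.borrow(0); // something else holds the only connection
        System.out.println("handled while pool is exhausted: " + updater.handle("e0", 50));

        pool.giveBack(); // the connection is returned
        System.out.println("handled after connection is free: " + updater.handle("e0", 50));
    }
}
```

While the handler is waiting like this, the rest of the system keeps appending events, which is how the processor would end up behind the cached tail of the stream.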

As said, it will (or should, if this is not the case in your application) now consume events in a separate thread.

Anyway, our system has evolved a lot since I asked this question, and now we are not having this problem anymore, thankfully!