So, we are having some trouble with our tracking processors. I will try to describe it with the information I have been able to collect so far.
Our system is composed of several microservices, each with its own query model, reading from and writing to the event store as usual. We use a Hikari connection pool to manage connections to the database (a MySQL database in this case). Under certain circumstances, some of the services are not able to get a connection from the pool. That is a separate issue I am also trying to solve. However, it does not happen at the same time for all services, so we can end up in a situation where the other services are applying events while the affected service is stuck (as it cannot read from the event store).

Eventually, the affected service gets a connection again, and that is when the real trouble starts. The tracking processors living in that service get stuck and fall behind the event store: events generated during the "down" period are never handled, and events applied after the connection is re-obtained are not handled either. We see this warning in the logs:
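(Aside: while digging into the connection-starvation side of this, these are the kinds of Hikari knobs we have been experimenting with. The property names assume Spring Boot's `spring.datasource.hikari.*` binding, and the values below are purely illustrative, not our production settings:)

```properties
# Fail fast instead of blocking event processor threads for the default 30 s
spring.datasource.hikari.connection-timeout=5000
# Size the pool so the tracking processors and the command side don't starve each other
spring.datasource.hikari.maximum-pool-size=10
# Log connections held longer than this (ms, 0 = disabled) to spot leaks
spring.datasource.hikari.leak-detection-threshold=60000
```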
2018-05-03 15:26:41.756 WARN 22 --- [dedEventStore-1] o.a.e.eventstore.EmbeddedEventStore : An event processor fell behind the tail end of the event store cache. This usually indicates a badly performing event processor.
That led us to this line of code. Apparently, stalling the processor is intended behaviour when the consumer falls behind the global cache.
So, I guess my question is: what is the intended way to gracefully recover from this? Any other insight into the cause of the issue would also be super helpful, although I may not have provided enough details for anybody to pinpoint it.
The only way we’ve managed to recover from this was by restarting the service. Upon restart, tracking processors are able to consume all events and get up to date.
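To avoid bouncing the whole service, we have been considering restarting just the affected processor programmatically. Here is a minimal sketch of the idea, assuming Axon's `EventProcessingConfiguration` API (method and package names are from Axon 4 and may differ in 3.x); the processor name `"myProcessor"` is a placeholder, and whatever triggers the restart (a health check, a scheduled task) is left out:

```java
import org.axonframework.config.EventProcessingConfiguration;
import org.axonframework.eventhandling.TrackingEventProcessor;

// Hypothetical helper: restart a single stuck tracking processor instead of
// restarting the entire service.
public class ProcessorRestarter {

    private final EventProcessingConfiguration eventProcessing;

    public ProcessorRestarter(EventProcessingConfiguration eventProcessing) {
        this.eventProcessing = eventProcessing;
    }

    public void restart(String processorName) {
        eventProcessing.eventProcessor(processorName, TrackingEventProcessor.class)
                       .ifPresent(processor -> {
                           // shutDown() stops the worker threads; start() opens a
                           // fresh stream from the stored tracking token, so the
                           // processor catches up from where it fell behind,
                           // just like a full service restart does for us today.
                           processor.shutDown();
                           processor.start();
                       });
    }
}
```

I have no idea whether this is the recommended recovery path or just papering over the underlying cache-eviction behaviour, which is partly why I am asking.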
Some help and/or thoughts would come in very handy!
Thank you very much in advance,