Tracking event processor is idle most of the time, keeps pausing for five to six minutes before progressing

Hi,

We are using tracking event processors in our solution to eventually process all events that occur in our system. What we observe is that some of those processors are very slow: they process about 100 events and then sit idle for about five or six minutes before processing the next few. During that period the token is not updated in the database.
We also enabled trace logging, but it shows nothing during those five to six minutes; literally nothing happens in the meantime.

TrackingEventProcessors in other applications against the same database perform well.

We are using Axon 3.0.5 in a Spring Boot configuration and use a GapAwareTrackingToken. We do have a lot of gaps: about 500,000 in a total event set of about 6,500,000.

We have configured MAX_GAPS to be 1000, but in the token we see a little more than 100 gaps. For the batch size we use Axon’s default.
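
For reference, this is roughly what such a token looks like; the values below are made up, and this assumes Axon 3’s GapAwareTrackingToken API (newInstance, getIndex, getGaps):

import java.util.Arrays;
import org.axonframework.eventsourcing.eventstore.GapAwareTrackingToken;

public class GapTokenIllustration {
    public static void main(String[] args) {
        // The token records the highest global index processed so far, plus the
        // skipped global indexes ("gaps") that may still be filled in later.
        GapAwareTrackingToken token = GapAwareTrackingToken.newInstance(
                6_500_000L,                              // highest global index processed
                Arrays.asList(4_999_998L, 4_999_999L));  // indexes not (yet) seen
        System.out.println(token.getIndex()); // 6500000
        System.out.println(token.getGaps());  // [4999998, 4999999]
    }
}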

An application that works well easily catches up and then overtakes the non-performing tracking event processor. Therefore we do not expect that this is due to the events that are actually stored, or to the gaps that we encounter.
Also, we have introduced new tracking event processors in our solution fairly often, and that did not cause these incidents before. We use them temporarily to re-process the events to enhance our query and view models, and as soon as they are finished we remove them again. So currently there is only a single tracking event processor active in the system, and it is the one causing the issues.

How can we solve this and make all tracking event processors run fast?

Kind regards,

Patrick

Hi Patrick,

what is it doing when sitting idle? Is it just polling for events and not receiving any? Or is it waiting on a lock?

There has been a report of an issue where a Tracking Processor would believe there are no events to process when the number of consecutive gaps is larger than the batch size. This would only occur when using the JdbcEventStorageEngine, though.

Can you at least see the “claim” on the token being refreshed when a Tracking Processor hangs? This would mean that it’s not receiving events, but at least is “alive”.
Cheers,

Allard

Hi Allard,

Our process is truly idle. It does nothing. There are no errors logged: neither at the beginning of the processing nor at the end of the processing. Nothing happens and nothing special is logged.

The TokenEntry table is not updated during this period, and there is no lock on the table either. When the next ‘mini-batch’ is processed, the token is updated, but the owner remains unchanged.

We have seen the other issue you mention and checked this as well. There are no gaps beyond the index of the token we checked; it was a consecutive global index number.

Kind regards,
Patrick

Hi Patrick,

a thread can’t really be doing nothing. It must either be doing something useless, or it’s waiting on something. I assume from your description that this is reproducible? If so, could you share two thread dumps taken a few seconds apart? If you can isolate the thread owned by the processor, that would be great. Maybe I can see something based on the dumps.

For verification: which Axon version are you using?
There is one condition I can think of where a processor will go into wait mode: when it fails to acquire a Token, because it is already claimed by another node. It compares the claim timestamp with the system time, and will “steal” the claim when it is older than 10 seconds. If your systems’ clocks are too far apart, a processor might think that it can’t yet claim the token. Can you confirm that the owner of the Token’s claim (mentioned in the table with the tokens) is actually the name of the JVM that the blocking processor is in?
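
For illustration, the claim decision boils down to something like the sketch below. This is not Axon’s actual code, just the logic described above; note how a skewed clock on the claiming node can keep the last condition false for much longer than the intended 10 seconds.

import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

public class TokenClaimSketch {

    // A node may process a token when it is unclaimed, when the claim is its own,
    // or when the existing claim is older than the claim timeout.
    static boolean mayClaim(String claimOwner, Instant claimTimestamp,
                            String thisNodeId, Duration claimTimeout, Clock clock) {
        return claimOwner == null
                || claimOwner.equals(thisNodeId)
                || claimTimestamp.isBefore(clock.instant().minus(claimTimeout));
    }

    public static void main(String[] args) {
        Clock clock = Clock.systemUTC();
        // Claim by another node, 5 seconds old: not stealable yet with a 10 second timeout.
        System.out.println(mayClaim("node-B", clock.instant().minusSeconds(5),
                "node-A", Duration.ofSeconds(10), clock)); // false
        // The same claim, 30 seconds old: stale, so node-A would "steal" it.
        System.out.println(mayClaim("node-B", clock.instant().minusSeconds(30),
                "node-A", Duration.ofSeconds(10), clock)); // true
    }
}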

Cheers,

Allard

Hi Allard,

Thanks for your response. What I assume is that the thread is in a wait state, or perhaps suspended; I do not see any activity on this thread. We have not made a dump yet, so we really cannot tell right now.

The situation is that we have a single instance of the application running. It comprises our CommandHandlers and Subscribing Event Handlers as well, and they work well. Here we added a TrackingEventProcessor, and that one is failing. Our other apps, which only have a TrackingEventProcessor and do not contain CommandHandlers, are running fine, except for one that sometimes fails and sometimes works well. The setup of these apps is similar.

Since we have a single instance, I suppose there will be no issue in claiming or stealing the token.

Some observations we made:

  • This is the only app that has more than one row in the TOKENENTRY table. The other rows are leftovers from tracking processors that we used to have but that are no longer active.
  • The name of the failing processor is longer than the names of the ones that are running fine (the threshold seems to be around 26 characters).
  • It processes some events (typically around 100-200) and then freezes for about five minutes. This is too consistent to be a one-off incident. I have also seen it process a few batches, as the timestamp is updated more than once before it freezes.
  • The application is quite busy during the day, so we restarted it at a quieter moment, but this did not make a difference either.
Nothing really explains the behaviour, but this is what we see.

What is really awkward is that last night, without any change or redeployment, the Tracking Event Processor started working at a fast pace. When we saw this in the morning, the gap list in the token was empty (which to me is unexpected), and up to now it performs well; it also contains a longer gap list again at the moment. We cannot explain why it did not work (while the rest of the applications did), and now we cannot explain why it started working. This makes it hard to make predictions about the performance of our platform.
Because of this auto-healing we cannot reproduce the issue at the moment.

It brought back some memories of our first steps, when we sometimes witnessed our Tracking Event Processor take a 15-minute break before continuing. Then, too, this was autonomous behaviour of the application. We used Axon 3.0.2 or 3.0.3 at that moment; now we are using Axon 3.0.5.

Migrating to 3.1.1 is on our backlog.

Cheers,
Patrick

I think the token-stealing issue was actually reported by us :wink:

Hi Patrick,

a colleague of mine pointed me to a situation where this happened to them in a project. We have managed to write a unit test to simulate and confirm this behavior.

There are multiple applications involved. The event-producing application produces more types of events than the event-consuming application knows about. In other words, the consumer doesn’t have all the event classes the producer has.

When the consumer reads a batch of events and this batch only contains “unknown” event types, these events are ignored, but the token is not updated. As long as no “known” events are encountered while reading, the token is not updated; the timestamp on the token is refreshed, though.
So while the processor looks like it has stopped mid-stream somewhere, it is actually reading (and waiting) at the head of the stream, but not updating its token. As soon as an event is available that the processor does know, it continues processing from that event, even if the number of unknown events exceeds the batch size.

If this occurs again, could you check the following:

  • take a thread dump to see what the threads are up to
  • check whether the timestamp on the token is updated, even if the token itself isn’t
  • check that the owner mentions the name of the JVM that runs the processor
  • assert that any events since the stored token are known to the JVM that runs the processor (see the sketch after this list)
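
For that last check, something along these lines could help. It is only a rough sketch: how you obtain the stored payload type names (for example from the payload type column of your event table) is up to you, and the class names in main are made up.

import java.util.Arrays;
import java.util.List;

public class PayloadTypeCheck {

    // Report every stored payload type name that this JVM cannot load.
    static void reportUnknownPayloadTypes(List<String> payloadTypeNames) {
        for (String payloadType : payloadTypeNames) {
            try {
                Class.forName(payloadType);
            } catch (ClassNotFoundException e) {
                System.out.println("Not on this JVM's classpath: " + payloadType);
            }
        }
    }

    public static void main(String[] args) {
        reportUnknownPayloadTypes(Arrays.asList(
                "com.example.AnInterestingEvent",
                "com.example.AnEventOnlyTheProducerKnows"));
    }
}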

Let me know if any of the assumptions above are also valid for your case.
Cheers,

Allard

Hi Allard,

To answer one of the questions, the answer is no:

“check if timestamp on the token is updated, even if the token itself isn’t”

The token and its timestamp are updated together, every 5 or 6 minutes.
But given the problem you described, would an empty @EventHandler method solve the problem then, so that the tracking processor “handles” the event?

regards Ramon

Hi Ramon, Patrick,

An empty or Object event parameter type in an @EventHandler annotated function will not solve the issue, as the parameter type does not determine the type Axon will use to deserialize the event.
It will still use the payloadType as present in the Event Message to deserialize the message’s payload (FYI: the payload is the parameter which is supplied to your @EventHandler annotated function).
As it does not know the payloadType, it will not deserialize the payload, and that specific event will not be put in the stream of events provided to the TrackingEventProcessor to handle.

This thus also means the token will not be updated.

Apart from that, we’re not quite certain why the token’s timestamp would be updated only every 5-6 minutes.
Thus I’d suggest the same as Allard requested if you see this behavior occurring again.

Cheers,

Steven

Hi Steven,

I’m not sure we mean the same thing, but I meant: besides the @EventHandler methods for every event we DO process, adding an @EventHandler for every event we don’t process, but with no code, like the code below.
Is that also what you meant?
Or do you mean the situation where the processor doesn’t have one (or more) of the specific event class(es) on its classpath? This is not the case in our situation.

regards Ramon

@Component
@ProcessingGroup("OurTrackingViewProcessor")
public class OurTrackingViewProcessor {

    @Autowired
    private SomeService someService;

    @EventHandler
    public void handle(final AnInterestingEvent event) {
        someService.doStuff(event.getSomeStuffFromEvent());
    }

    @EventHandler
    public void handle(final AnotherInterestingEvent event) {
        someService.doSomeOtherStuff(event.getSomeStuffFromEvent());
    }

    @EventHandler
    public void handle(final AnNonInterestingEvent event) {
        // NO-OP
    }

    @EventHandler
    public void handle(final AnotherNonInterestingEvent event) {
        // NO-OP
    }
}

Hi Ramon,

Ah, you’re right there; I was pointing to the situation where you’ve got two different Axon applications publishing events to one event store, with events that are distinct to either one of them. Thus, they’re not aware of each other’s event classes.

Apart from that though, it will still not make a difference in your situation if you add the no-op event handlers for those events.

The TrackingEventProcessor will receive the stream of events and check whether any of its Event Handling Components (in your situation, the ‘OurTrackingViewProcessor’ class) is interested in each event.
If none of its Event Handling Components is interested in a given event, it will still update the token, as the TrackingEventProcessor has done its job of handling that event.
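
So, given the above, the no-op handlers can simply be left out; a trimmed version of your class (using the same hypothetical event and service types from your snippet) is sufficient:

@Component
@ProcessingGroup("OurTrackingViewProcessor")
public class OurTrackingViewProcessor {

    @Autowired
    private SomeService someService;

    @EventHandler
    public void handle(final AnInterestingEvent event) {
        someService.doStuff(event.getSomeStuffFromEvent());
    }

    @EventHandler
    public void handle(final AnotherInterestingEvent event) {
        someService.doSomeOtherStuff(event.getSomeStuffFromEvent());
    }
}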

Cheers,

Steven

The fact that the processor doesn’t update timestamps makes me believe it is busy with something outside of the processing loop itself. In the processing loop, there is a query for events and, if no events arrive within 1 second, the timestamp is updated and the cycle starts over.

What DBMS are you using?
We have also seen a situation at a customer where an implicit type conversion in the MS SQL driver caused a table scan for all queries (see https://blogs.msdn.microsoft.com/sqlcat/2010/04/05/character-data-type-conversion-when-using-sql-server-jdbc-drivers/). While a table scan would normally just slow a query down, when doing many updates it would basically block the entire table, allowing only a single thread at a time.
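
For completeness, and only relevant if you were on SQL Server: a common mitigation for that particular implicit conversion is to stop the driver from sending Java Strings as Unicode. The connection details below are placeholders; please verify the property against your driver’s documentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class SqlServerConnectionHint {
    public static void main(String[] args) throws SQLException {
        // sendStringParametersAsUnicode=false makes the Microsoft JDBC driver bind
        // Strings as VARCHAR instead of NVARCHAR, avoiding the implicit conversion
        // (and the resulting scans) described in the linked post.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=eventstore;"
                + "sendStringParametersAsUnicode=false";
        try (Connection connection = DriverManager.getConnection(url, "user", "password")) {
            // run the usual queries through this connection / data source
        }
    }
}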

Anyway, a ThreadDump should show where the thread is blocking. Until then, we could only be guessing.

Cheers,

Allard

Hi Allard/Steve,

we are using Oracle (11, I think). For now it works again, so I guess if it happens again we will take a thread dump to see what is happening.
And reading your comment about the JDBC driver (although it’s another DBMS), I think it leaves us with one possible hint where to look.

cheers Ramon

All,

Thanks for all the help and hints. There is still no explanation for the behaviour we see, though.

What we’ll do is to update to the most recent version of Axon and then see how the platform behaves. If we encounter this behaviour again, we now have some useful tips to better explore what’s going on.

Regards and all the best for 2018,

Patrick