We use axon-kafka to publish events from the event store to a Kafka topic, using the following configuration:
we intentionally provide no default values from here on to have a fail-fast behaviour
($ is not a legal character for a kafka topic name or a server URL)
"max.in.flight.requests.per.connection": "1" # Note that if this setting is greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled).
The application runs in an OpenShift cluster with two replicas.
This has worked pretty well so far, but a couple of days ago we noticed a problem for the first time that we could trace back to the ordering of events in the Kafka topic. Specifically, the events that should have been sent, in this order, are:
This is also what we see in the domain_event_entry database table. What we see in the Kafka topic is:
with the Kafka CreateTimestamp of the last message being earlier than the one of the first message.
The only explanation we could come up with is this:
1. Pod 1 tries to send the Create message and fails. A retry is triggered in the Kafka Producer.
2. After 1 second, the axon-kafka adapter times out and reports an error to the TrackingEventProcessor, which goes into error mode and releases the tracking token.
3. Pod 2 picks up the tracking token and retries sending the event. This retry is different from the one built into the Kafka Producer.
4. Pod 2 succeeds in sending the Create, Assign and Complete messages to Kafka.
5. After that, the Kafka Producer running on Pod 1 retries sending the message and succeeds. The message appears in the commit log after the ones sent by Pod 2.
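To make the suspected sequence easier to reason about, here is a toy model of it in plain Java. It is only a sketch of our hypothesis, not of how axon-kafka or the Kafka Producer work internally: the class name, the method and the "(Pod n)" labels are made up for illustration, and the commit log is just a list.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryReorderingSketch {

    /**
     * Replays the suspected sequence against a toy commit log.
     * The returned list holds the messages in append order.
     */
    static List<String> simulate(boolean producerRetriesEnabled) {
        List<String> commitLog = new ArrayList<>();
        String pendingRetryOnPod1 = null;

        // Pod 1: the first attempt to send Create fails. If producer retries
        // are enabled, the message stays queued inside Pod 1's producer.
        if (producerRetriesEnabled) {
            pendingRetryOnPod1 = "Create (Pod 1)";
        }

        // The adapter times out and the tracking token is released.
        // Nothing has been written to the log at this point.

        // Pod 2: claims the token and sends all three events successfully.
        commitLog.add("Create (Pod 2)");
        commitLog.add("Assign (Pod 2)");
        commitLog.add("Complete (Pod 2)");

        // Pod 1: the producer-level retry finally succeeds and is appended last.
        if (pendingRetryOnPod1 != null) {
            commitLog.add(pendingRetryOnPod1);
        }
        return commitLog;
    }

    public static void main(String[] args) {
        System.out.println("retries enabled:  " + simulate(true));
        System.out.println("retries disabled: " + simulate(false));
    }
}
```

With retries enabled, the model ends with Pod 1's Create after Complete; with retries disabled, the late message never appears. Note that the model does not cover a request that is already on the wire when the adapter times out, so it may be more optimistic than the real system.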
Note that a duplicate Create message wouldn’t be a problem in our setup per se. It only becomes a problem because it appears after the other messages.
We then started thinking about how to avoid this behavior. The easiest solution that came to mind is to disable retries in the Kafka Producer (by setting axon.kafka.producer.retries to 0) and let the TrackingEventProcessor handle all the retrying. Does this generally sound like a good idea, or is there anything else we need to consider?
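For reference, this is roughly what the change would amount to at the level of the plain Kafka producer properties. It is a sketch under assumptions: we assume that axon.kafka.producer.retries ends up as the producer's `retries` setting, the class and method names are made up, and the bootstrap address is a placeholder. Serializers and the rest of our configuration are left out, so this is not enough to construct a working producer.

```java
import java.util.Properties;

public class ProducerRetryConfigSketch {

    /** Builds only the producer settings relevant to the ordering problem. */
    static Properties producerProperties(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Already part of our configuration: at most one unacknowledged request per connection.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        // Proposed change: no producer-level retries, so a failed send surfaces
        // as an error and the TrackingEventProcessor is the only component that retries.
        props.setProperty("retries", "0");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder, not our real broker address.
        Properties props = producerProperties("localhost:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```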
Part of the problem is that the issue seems extremely hard to reproduce, so we can only reason about it and might never know whether we have really solved it.
Thanks and Best Regards