Avoid re-ordering of events when using axon-kafka on multiple instances

Hi all,

We use axon-kafka to publish events from the event store to a Kafka topic, using the following configuration:

```yaml
axon:
  kafka:
    clientid: ${STAGE:local}-${APPLICATION_NAME:processengine}-${HOSTNAME:localhost}
    # We intentionally provide no default values from here on to have a fail-fast behaviour.
    # ($ is not a legal character for a Kafka topic name or a server URL.)
    defaulttopic: ${KAFKA_TOPIC_EVENTBUS}
    producer:
      retries: 5
      bootstrapservers: ${KAFKA_BOOTSTRAP_SERVERS}
      properties:
        # Note that if this setting is set to be greater than 1 and there are failed sends,
        # there is a risk of message re-ordering due to retries (i.e., if retries are enabled).
        "max.in.flight.requests.per.connection": "1"
    properties:
      security.protocol: ${KAFKA_SECURITY_PROTOCOL_CONFIG}
      sasl.jaas.config: ${TASKLIST_EVENTBUS_SASL_JAAS_CONFIG}
      sasl.kerberos.service.name: ${KAFKA_SASL_KERBEROS_SERVICE_NAME}
    event-processor-mode: "TRACKING"
```

The application runs in an OpenShift cluster with two replicas.

This has worked pretty well so far, but a couple of days ago we noticed, for the first time, a problem that we could trace back to the ordering of events in the Kafka topic. To be specific, the events should have been sent in this order:

  1. Create

  2. Assign

  3. Complete

This is also the order we see in the domain_event_entry database table. What we see in the Kafka topic, however, is:

  1. Create
  2. Assign
  3. Complete
  4. Create
    with the Kafka CreateTimestamp of the last message being earlier than that of the first message.

The only explanation we could come up with for how this could have happened is this:

  • Pod 1 tries to send the Create message and fails. A retry is triggered in the Kafka Producer.

  • After 1 second, the axon-kafka adapter times out and reports an error to the TrackingEventProcessor, which goes into error mode and releases the tracking token

  • Pod 2 picks up the tracking token and retries sending the event. This retry is different from the one built into the Kafka Producer.

  • Pod 2 succeeds in sending the Create, Assign and Complete messages to Kafka

  • After that, the Kafka Producer running on Pod 1 retries sending the message and succeeds. The message appears in the commit log after the other ones sent by Pod 2.
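
To put rough numbers on that last step, here is a sketch based on our configuration above and on the documented Kafka producer defaults (which we did not override):

```yaml
axon:
  kafka:
    producer:
      retries: 5    # our setting: up to 5 internal retries per record
      properties:
        "max.in.flight.requests.per.connection": "1"
        # Kafka producer defaults we did not override:
        #   request.timeout.ms:  30000   - each send attempt may wait up to 30 s
        #   retry.backoff.ms:    100     - pause between internal retries
        #   delivery.timeout.ms: 120000  - total time a record may stay in flight
```

The adapter's publisherAckTimeout, on the other hand, is only 1 second by default, so the TrackingEventProcessor gives up (and releases its token) long before the producer on Pod 1 stops retrying.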

Note that a duplicate Create message wouldn’t be a problem in our setup per se. It only becomes a problem because it appears after the other messages.

Now we started thinking about what we could do to avoid this behavior. The easiest solution that came to mind would be to disable retries in the Kafka Producer (by setting axon.kafka.producer.retries to 0) and let the TrackingEventProcessor handle all the retrying. Does this generally sound like a good idea, or is there anything else we need to consider?
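
Concretely, the change we have in mind is just this (sketched against the configuration above):

```yaml
axon:
  kafka:
    producer:
      retries: 0   # no internal retries in the Kafka Producer;
                   # the TrackingEventProcessor becomes the only retry mechanism
```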

Part of the problem is that the issue seems extremely hard to reproduce, so we can only reason about it and might never know whether we really solved it.

Thanks and Best Regards
Lars

Hi Lars,

I would indeed pick one of the retry mechanisms.
Either Axon’s (i.e. the one that comes with using the TrackingEventProcessor) or Kafka’s.

But as you stated, I’d be working on a hunch here too…

I haven’t seen this behaviour myself before, and reproducing it sounds iffy.

That’s my two cents.

Cheers,
Steven

Hi Steven,

Thanks for sharing your assessment! I tried setting the retries property to zero, but it did not have the expected effect. It seems that Kafka queues the message if it does not have a working connection to the broker and sends it as soon as the connection is up again, without counting that as a retry.

My solution is to set the publisherAckTimeout in the Kafka Publisher to a value higher than the delivery.timeout.ms property in the Kafka Producer (30 + 1 seconds in my case). This way, Kafka should time out the message before the publisherAckTimeout kicks in. As I said above, I cannot test it properly, but I hope this avoids the situation.
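
In configuration terms, the producer side of that looks roughly like this (a sketch; the publisherAckTimeout of 31 seconds is configured on the KafkaPublisher itself and is not shown here):

```yaml
axon:
  kafka:
    producer:
      properties:
        # Make the producer give up on a record after 30 s, i.e. before the
        # publisherAckTimeout of 31 s expires on the Axon side.
        "delivery.timeout.ms": "30000"
```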

Still, I wonder if it wouldn’t be cleaner to cancel the pending message whenever the publisherAckTimeout expires. Looking at the Kafka code, it seems that cancelling the Future has no effect, but maybe the producer could be closed and re-created whenever the timeout occurs? What do you think?

As I increased the publisherAckTimeout from the default of 1 second to 31 seconds, I also wonder why the default was chosen to be this short and whether my change might cause any problems.

Thanks
Lars