Distributed Command Bus with guaranteed delivery

aldibella · January 12, 2016, 2:21pm

Hi,

The out-of-the-box implementation of the CommandBusConnector for the Distributed Command Bus is based on JGroups. As discussed in other threads, it only offers best-effort delivery and is not compatible with cloud deployment (at least in Cloud Foundry).

Has anybody implemented or found a third party implementation of the CommandBusConnector designed for higher reliability?

Thanks

Alessandro

Allard · January 13, 2016, 3:12pm

The best reliability is achieved when the sending component is able to retry a command when it fails or times out. Even in the JGroups connector, the callback will be invoked when the message can’t be delivered, or when the node where a message was delivered drops before sending a reply. That invocation allows you to either initiate a retry, or do whatever you think is suitable in that situation.

Cheers,

Allard

aldibella · January 13, 2016, 10:23pm

Hi Allard,

I partially disagree. Retrying is certainly a good tool to have but it could very easily lead to exhaustion of resources without a circuit breaker.
Imagine a system with a high volume of transactions where one of the services is off for a few hours.
A typical retry policy with an exponential back-off algorithm (2, 4, 8 seconds, up to 1 minute) would quickly create a large number of sleeping threads or elements in a waiting queue.
A circuit breaker would prevent a system collapse by inhibiting the retry but the command is still lost. Also, an in-memory retry mechanism would not survive a system restart.
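
To put rough numbers on that back-off schedule, here is a minimal Java sketch. The class and method names are hypothetical (they are not part of Axon or JGroups), and it assumes that "up to 1 minute" means the delay is capped at 60 seconds.

```java
import java.time.Duration;

class BackoffSchedule {

    // Delay before the given retry attempt (1-based): 2, 4, 8, ... seconds,
    // capped at maxDelay. Assumption: the cap applies to each individual delay.
    static Duration delayFor(int attempt, Duration maxDelay) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Clamp the shift so the value cannot overflow for large attempt numbers.
        long seconds = 1L << Math.min(attempt, 30);
        return seconds >= maxDelay.getSeconds() ? maxDelay : Duration.ofSeconds(seconds);
    }

    // Number of retries a single command makes while the target stays
    // unavailable for the whole outage.
    static int retriesDuring(Duration outage, Duration maxDelay) {
        if (maxDelay.isZero() || maxDelay.isNegative()) {
            throw new IllegalArgumentException("maxDelay must be positive");
        }
        int attempts = 0;
        Duration elapsed = Duration.ZERO;
        while (true) {
            Duration next = elapsed.plus(delayFor(attempts + 1, maxDelay));
            if (next.compareTo(outage) > 0) {
                return attempts;
            }
            elapsed = next;
            attempts++;
        }
    }
}
```

Under these assumptions, retriesDuring(Duration.ofHours(2), Duration.ofMinutes(1)) returns 123: every command sent during a two-hour outage stays pending for the full outage and is retried more than a hundred times, so the number of pending entries grows with the incoming command rate.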

In my opinion, better results are achieved by using an application designed for resilient message delivery. Tools like RabbitMQ or ActiveMQ offer a large set of configuration options to deal with failed deliveries, such as dead letter queues, routing, bouncing, etc.

Retry attempts can be configured to span from seconds to days without any detrimental effect on the producers.

Regards,

Alessandro

Allard · January 14, 2016, 12:56pm

I don’t see the disagreement. All I mentioned was retrying. I didn’t mention that it should be done in-memory or in a persistent manner. Obviously, for long-term retries, persistent is safer. Generally, I advise a single retry in case of a transient exception that may have been caused by topology changes.

And a message to anyone reading this: please, please don’t ever, ever, EVER, block a thread for a retry.
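
As an illustration of retrying without blocking a thread, here is a minimal sketch that uses only the JDK. NonBlockingRetry and sendWithRetry are hypothetical names, not Axon API, and the send operation is assumed to report its outcome through a CompletableFuture. On failure, the next attempt is handed to a scheduler, so no thread sleeps during the delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class NonBlockingRetry {

    // Hypothetical helper: runs an asynchronous send and, if it fails,
    // schedules another attempt instead of sleeping on the calling thread.
    static <T> CompletableFuture<T> sendWithRetry(Supplier<CompletableFuture<T>> send,
                                                  int retriesLeft,
                                                  long delayMillis,
                                                  ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(send, retriesLeft, delayMillis, scheduler, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> send,
                                    int retriesLeft,
                                    long delayMillis,
                                    ScheduledExecutorService scheduler,
                                    CompletableFuture<T> result) {
        CompletableFuture<T> inFlight;
        try {
            inFlight = send.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure like an asynchronous one.
            inFlight = new CompletableFuture<>();
            inFlight.completeExceptionally(e);
        }
        inFlight.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft > 0) {
                // The scheduler's timer holds the pending retry; no thread waits.
                scheduler.schedule(
                        () -> attempt(send, retriesLeft - 1, delayMillis, scheduler, result),
                        delayMillis, TimeUnit.MILLISECONDS);
            } else {
                // Out of retries: surface the last failure to the caller.
                result.completeExceptionally(error);
            }
        });
    }
}
```

Calling sendWithRetry with retriesLeft set to 1 matches the single retry advised above. Like any in-memory mechanism, the pending retry is lost if the process restarts, so long-term retries still call for persistent storage.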

aldibella · January 14, 2016, 2:35pm

Then I agree :-). Thank you for your feedback.