My team has put off clustering because we've been wondering about the same issues. We also deal with the same corporate constraints and standardized deployments, but fortunately being forced to cluster wasn't one of them. We'll have to deal with this in the future though, as we want HA/DR.
We've been experimenting with your idea of using an interceptor to write commands to a log (and delete the log entry on tx commit) with some success. We use this for AsynchronousCommandBus -- it's the same problem whether clustering or merely async, on the command side at least. If something goes wrong, we can "retry" the command later. (So far the retry is not automatic.) Part of the problem is assuming the server can just "go down" at any moment. When shutting down a server, can you bleed off users and flush in-progress commands? If so, then maybe a "crash" becomes something you can detect and use to trigger retry from the command log.
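To make the command log idea concrete, here is a rough sketch of what we mean. It deliberately uses no Axon classes: CommandLog, LoggingDispatcher and retryUnfinished are made-up names, the log is an in-memory map standing in for a durable table or file, and the plain Consumer stands in for "handle the command and commit the tx". Treat it as an illustration under those assumptions, not as how it has to be wired into an interceptor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical command log; a real one would be durable (DB table, file).
final class CommandLog {
    private final Map<String, Object> pending = new LinkedHashMap<>();

    synchronized void record(String commandId, Object command) {
        pending.put(commandId, command);
    }

    synchronized void clear(String commandId) {
        pending.remove(commandId);
    }

    // Snapshot of commands that were logged but never cleared.
    synchronized Map<String, Object> unfinished() {
        return new LinkedHashMap<>(pending);
    }
}

// Hypothetical dispatcher playing the role of the interceptor.
final class LoggingDispatcher {
    private final CommandLog log;
    private final Consumer<Object> handler; // stands in for handling + tx commit

    LoggingDispatcher(CommandLog log, Consumer<Object> handler) {
        this.log = log;
        this.handler = handler;
    }

    void dispatch(String commandId, Object command) {
        log.record(commandId, command); // written before handling starts
        handler.accept(command);        // if this throws, the entry stays in the log
        log.clear(commandId);           // reached only after successful handling
    }

    // Manual retry: replay whatever is still in the log.
    int retryUnfinished() {
        int retried = 0;
        for (Map.Entry<String, Object> entry : log.unfinished().entrySet()) {
            dispatch(entry.getKey(), entry.getValue());
            retried++;
        }
        return retried;
    }
}
```

After a crash (or a failed handler), anything still returned by unfinished() is a candidate for retry. Note that a retried command may have partially run before, so the handlers need to tolerate being invoked again.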
If you've already informed your user their command "will" be processed, well, is that part of the problem too? Commands are asynchronous, yes, but can you wait for a command success callback before updating the UI? If so, and if you can mostly rely on shutting down cleanly, then maybe the whole issue is moot. Worth asking yourself.
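For what it's worth, "wait for the callback before updating the UI" could look roughly like the following. This is plain java.util.concurrent, not Axon's callback API; AsyncCommandGateway and OrderScreen are invented names, and the status field is just a stand-in for whatever your UI actually shows.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical async gateway: runs the handler on another thread.
final class AsyncCommandGateway {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Consumer<Object> handler;

    AsyncCommandGateway(Consumer<Object> handler) {
        this.handler = handler;
    }

    CompletableFuture<Void> send(Object command) {
        return CompletableFuture.runAsync(() -> handler.accept(command), executor);
    }

    void shutdown() {
        executor.shutdown();
    }
}

// Hypothetical UI component: stays "pending" until the callback fires.
final class OrderScreen {
    volatile String status = "pending";

    CompletableFuture<Void> submit(AsyncCommandGateway gateway, Object command) {
        // The user is told "confirmed" only after the command has succeeded.
        return gateway.send(command).handle((ok, error) -> {
            status = (error == null) ? "confirmed" : "failed";
            return null;
        });
    }
}
```

The point is only that the user never sees "confirmed" for a command that is still in flight, so a crash mid-command leaves them at "pending" rather than with a promise you can't keep.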
As for your questions about duplicate event handling, I'm not sure I understand how you could handle an event twice if you're using a JMS queue. A queue delivers each message to only one consumer (barring redelivery after a rollback or failed acknowledgement). That said, why do you need to cluster event handling at all? It seems like there are two ways you can design it -- and yours is not one of them. (Granted I have not personally tried either of these yet, but my team has put some thought into it.)
1) Use the JGroups command bus (as you are), and since you have the same WAR deployed everywhere, do NOT distribute event handling using Spring Integration. Events will be handled on the same instance that handled the command, which is a good thing for caching, locking, etc. as well, since the command bus will already route commands for the same aggregate to the same instance. (Distributing events over a JMS queue will kill your caching and expose you to out-of-order delivery and other race conditions, as each event will be delivered to a random cluster member.) This also reduces load on JMS, and avoids the need for transaction synchronization between DB and JMS (are you using XA/2PC? Do you trust it?)
2) Use pub/sub (JMS topics) for event distribution and designate each server to run a mutually exclusive subset of event handlers. For example, one server generates the UI views, and the other generates reports. They're both connected to the same database of course, so either server could serve UI requests. I don't think this makes sense in your case but I hope it illustrates that the reason you'd want to distribute events over JMS doesn't seem to jibe with your scenario.
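To illustrate why option 1 plays well with caching and locking, here is a toy version of "commands for the same aggregate go to the same instance". AggregateRouter and the member names are made up, and this uses plain hash-modulo, which is an assumption on my part for brevity; a real distributed command bus would more likely use something like consistent hashing so that a member joining or leaving doesn't remap most aggregates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical router: picks a cluster member from the aggregate id.
final class AggregateRouter {
    private final List<String> members;

    AggregateRouter(List<String> members) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("at least one member required");
        }
        this.members = new ArrayList<>(members);
    }

    // Same aggregate id -> same member, as long as membership is unchanged.
    String memberFor(String aggregateId) {
        int index = Math.floorMod(aggregateId.hashCode(), members.size());
        return members.get(index);
    }
}
```

If events are then handled locally on whichever member memberFor picked, all work for one aggregate stays on one instance, which is what keeps its cache and locks meaningful.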
There's another question of how to distribute Sagas that neither of us has addressed. Are you using sagas?
Hope that helps. I'm *very* interested to hear if you're successful in clustering!