As mentioned here, we’ve had some issues with a not-so-graceful shutdown after a cluster upgrade during a GKE maintenance window.
Let’s first talk about what happened (as far as I can tell). We run Axon Server SE on Google Kubernetes Engine (GKE). These clusters have maintenance windows on weekends, during which GKE performs node pool upgrades. From what I can tell from the documentation, it does a kubectl drain of each node. By default, the drain waits up to 30 seconds for a pod to shut down (the pod’s default termination grace period) before forcing the issue.
My thinking is that our issue was caused by a shutdown taking longer than 30 seconds. If so, the solution would be to raise the pod’s terminationGracePeriodSeconds above the 30-second default.
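For reference, here is a minimal sketch of where that setting would go, assuming Axon Server runs as a StatefulSet. The resource names, labels, image name, and the 120-second value are all placeholders I made up for illustration, not recommended values:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: axonserver              # hypothetical name
spec:
  serviceName: axonserver       # hypothetical headless service name
  replicas: 1
  selector:
    matchLabels:
      app: axonserver           # hypothetical label
  template:
    metadata:
      labels:
        app: axonserver
    spec:
      # Pod-level setting; Kubernetes uses 30 seconds when this is omitted.
      # 120 is a placeholder, not a recommended value for Axon Server.
      terminationGracePeriodSeconds: 120
      containers:
        - name: axonserver
          image: axoniq/axonserver   # assumed image name; pin a tag in practice
```

The field sits under the pod spec (spec.template.spec), not under the container, so it applies to the pod as a whole. Whether a longer period is enough depends on how long the shutdown really takes.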
So what I’m asking is: how long does Axon Server normally take to shut down gracefully? Is there a recommended shutdown period?
Could you share the logs from the final moments of Axon Server? That should indicate whether it had started the shutdown process, and maybe give some insight into what it was waiting for…
There is no specific logging on shutdown; what I see is apps disconnecting and connecting, at least part of which can be explained by those deployments ‘node hopping’ as well. I masked the names, but they are several different microservices disconnecting and connecting.
As discussed with @Marc_Gathier, Axon Server 4.5.2 contains some improvements with regard to graceful shutdowns. We’re going to upgrade, and we hope to no longer have issues in this regard.