[Completion] Emergency Maintenance for Shared Hyper-V Failover Cluster, October 22, 2021
Posted by David Cunningham on 22 October 2021 05:46 PM
Completion, Fri 22 Oct 2021 09:25:05 PM MDT: This maintenance appears to have been a success: in our testing, we can no longer reproduce the issues we were seeing during mass live migration traffic. All changes were made without production impact.
We'll be performing some updates on the cluster tonight as a further test, and observing carefully to confirm that the issue is resolved: this would not be the first time initial testing gave us that impression.
Update, Fri 22 Oct 2021 08:30:21 PM MDT: We are now proceeding with this maintenance, and will provide further updates as appropriate.
Update, Fri 22 Oct 2021 07:00:08 PM MDT: This maintenance will be postponed for the moment, pending the completion of a pre-maintenance backup. We'll update this thread when maintenance begins.
Purpose of Work:
We will be making some after-hours changes to our primary shared Hyper-V failover cluster, which hosts a smaller proportion of highly available VPS instances that customers without a dedicated private cloud may rely on.
First, a single node will be paused and drained of its workload, after which some network interface options will be changed on that node.
Second, the storage network will temporarily handle cluster communication while the cluster communication (CC) network is adjusted. Cluster communication will then be returned to the CC network.
Once this is complete, a stress test of the primary cluster will follow, with sensitive workloads moved away from problem hosts.
Impact of Work:
Work will begin at 7PM (MDT) tonight.
No impact should occur, in theory, but it is possible that a subset of VMs will experience brief outages if the instability we're attempting to resolve is not fixed as a result of this maintenance.
If that is the case, we will implement mitigations immediately, possibly blending both maintenance events to try to prevent any more incidents while we're at it.
Any customer VMs that experience issues as a result of this maintenance will be recovered ASAP, with customers informed individually if their VM is going to experience a longer-than-reboot outage as a result of any events.
We will inform you when maintenance is complete.
Please contact us with any questions / comments / concerns.