Update: Fri 17 Sep 2021 09:31:45 PM MDT
During maintenance, a cluster node self-isolated, causing its workload to require reboots. The reboots are complete, and the smaller subset of VMs with lingering issues is being triaged by NOC staff.
Both events seemed to occur on the same node, so sensitive workloads are being adjusted to no longer use it.
We will fast-track migration of sensitive workloads to the Secondary Shared Hyper-V cluster, and make plans to switch over all workloads to it in the near future.
In the meantime, we will resume our no-change window on the Primary Shared Hyper-V cluster and have Microsoft support revisit their attempt to identify the root cause of the trouble. We will announce any maintenance that is required.
Update: Fri 17 Sep 2021 08:49:39 PM MDT
Maintenance and follow-up testing of the cluster are now underway. We will gather diagnostics and cease testing if we detect any issues.
Update: Fri 17 Sep 2021 06:24:13 PM MDT
Earlier today, we had another event on the Primary Shared Hyper-V cluster, this time on a new host. We've made some adjustments to the networking device that provides the layer-3 cluster VLAN for this cluster, and will adjust Hyper-V-specific parameters before running more extensive testing tonight to see if the issue persists.
Since last night's maintenance, there has at least been a marked decrease in outage incidents triggered by mass live-migration traffic.
Tonight's maintenance will begin at 8:30 PM MDT.
Conclusion: Thu 16 Sep 2021 08:49:48 PM MDT
Maintenance is complete. All objectives of maintenance were achieved with no production impact, and the cluster was tested and validated to no longer face instability after a large live-migration event (such as pausing a node for maintenance). We will continue to keep a close eye on this cluster overnight, but those with Highly-Available VMs should see no impact from here.
Update: Thu 16 Sep 2021 08:03:06 PM MDT
The second part of this maintenance (removal of the problematic nodes from our Hyper-V failover cluster) will begin shortly, on schedule.
Update: Fri 10 Sep 2021 07:58:34 PM MDT
The second part of the maintenance is now rescheduled to take place at 8:00 PM on the 16th, rather than 7:00 PM, to take the availability of Microsoft escalation support into account.
We still do not expect any issues from that maintenance, but will have Microsoft escalation support on standby as a precaution.
Update: Fri 10 Sep 2021 07:30:57 PM MDT
The first part of this maintenance was completed without incident. We will follow up on Thursday the 16th for the next step.
Purpose of Work:
We will be making some after-hours changes to our primary shared Hyper-V failover cluster, which hosts a portion of the highly-available VPS instances that customers without a dedicated private cloud may rely on.
The maintenance will have two phases:
On September 10, at 7PM MDT, we will be changing the cluster role's possible owners to exclude two hypervisors that have shown intermittent issues responding to calls from our backup appliance. We expect no impact from this event, beyond having backups work more reliably after the fact.
On September 16, at 7PM MDT, we will be removing those two hypervisors from the cluster to see if those same hypervisors are the cause of the occasional instability on the cluster that we've been working with Microsoft to resolve.
We have not had any of these events on this specific cluster since 8/12/2021. Now that a considerable amount of workload is on our secondary shared Hyper-V failover cluster, we want to see whether we can eliminate even the possibility of these events.
Impact of Work:
Work will begin at 7PM (MDT) each night.
For both events, no impact should occur in theory, but it is possible that a subset of VMs will experience brief outages if the instability we're attempting to resolve is not fixed as a result of this maintenance.
If that is the case, we will implement mitigations immediately, possibly combining both maintenance events to try to prevent any further incidents.
Any customer VMs that experience issues as a result of this maintenance will be recovered ASAP, with customers informed individually if their VM is going to experience a longer-than-reboot outage as a result of any events.
We will inform you when maintenance is complete.
Please contact us with any questions / comments / concerns.