Latest Updates
Jul
27

[Completion, Tue 27 Jul 2021 11:46:42 PM MDT]: Maintenance has concluded, with all required diagnostics gathered.  We were able to anticipate issues and move VMs to other nodes for all but one requested reproduction of the issue.  In that one case, 22 VMs had to be shut down and started again over the span of 5 minutes to recover, but all of the VMs recovered this way were running less sensitive workloads.

Alerts for these VMs are clearing normally.  If any alerts persist beyond what we'd expect from a normal reboot, support will be able to reach me or the on-call engineer, and will notify any affected clients.



Purpose of Work:


We will be gathering diagnostics on our shared failover cluster overnight, while reproducing a specific issue that has a chance of affecting the workload of one of two member hypervisors (out of the seven in the cluster).

Sensitive workloads (such as specific mail servers, specific database servers, RDS hosts, or RAS hosts) will be moved away from the problem HVs prior to diagnostics being run, and will also be configured to avoid the two problematic nodes in the future, until this issue is resolved.

Said issue only occurs during mass live migrations (such as when a node is paused for updates), and thus will not occur during production hours under normal circumstances.

Diagnostics were requested by Microsoft Business Support, and will be relayed to their escalation personnel.


Impact of Work:

Work will begin at 9PM (MDT) tonight.

No impact should occur, in theory, but it is possible that a subset of VMs will experience brief outages resulting in a read-only filesystem.

Any customer VMs that experience issues as a result of this maintenance will be rebooted for recovery purposes ASAP, with customers informed individually if their VM is going to experience a longer-than-normal outage as a result of any events.

We will inform you when maintenance is complete.


Please contact us with any questions / comments / concerns.





Jul
20

[Update, Tue 20 Jul 2021 11:10:50 PM MDT] Mitigations recommended by the vendor did not have the desired effect; a subset of VMs had to be rebooted, but did recover normally with minor intervention.

We will continue to engage the vendor and announce further scheduled tests.  Work for tonight is concluded.


Purpose of Work:


We will be testing a new configuration with our primary Hyper-V failover cluster, which hosts our highly-available VPS instances that customers without a dedicated private cloud may rely on.

A new virtual network will be introduced to the cluster, configured to handle cluster and live migration traffic, and then several tests of cluster functionality will follow.

The aim of this configuration change (recommended by Microsoft support) is to mitigate occasional instability we've noticed this week when re-introducing a node to the cluster after a maintenance event on said node.


Impact of Work:

Work will begin at 9PM (MDT) tonight.

No impact should occur, in theory, but it is possible that a subset of VMs will experience brief outages if the instability we're attempting to resolve is not fixed as a result of this maintenance.

Any customer VMs that experience issues as a result of this maintenance will be recovered ASAP, with customers informed individually if their VM is going to experience a longer-than-normal outage as a result of any events.

We will inform you when maintenance is complete.


Please contact us with any questions / comments / concerns.





Jul
16
[Update 1] Emergency Maintenance for Primary Monitoring Server, July 16 2021
Posted by David Cunningham on 16 July 2021 02:54 PM

[Update, Sat 17 July 2021 1:12PM MDT]: Maintenance is complete.


[Update, Sat 17 July 2021 12:18 PM MDT]:
Our team is resuming work to expand storage resources within our monitoring infrastructure.

[Update, Fri 16 Jul 2021 08:32:16 PM MDT]:
Maintenance was aborted at 6:13PM, and we have been monitoring normally since then, resolving any pending alerts that we could not see during this maintenance.  We will continue this maintenance over the weekend, updating this post when a time is decided.



Purpose of Work:


Our primary monitoring server will be brought offline today, with data migrated to a higher capacity drive to account for the growing size of the database and mitigate the potential for emergency space conditions.


Impact of Work:

The automated monitoring that support staff relies on will be offline for an estimated period of 1 hour, starting at 4PM MDT.  

During this time we will not have any monitoring data for fully managed servers, with clustered websites being the exception in some cases (as some of those rely on another service in addition to our primary monitoring). 

Any trouble with your services can still be reported to the support team via helpdesk, email or phone number, as usual.  ([email protected], 303-414-6910 x2)

We will inform you when maintenance is complete.


Please contact us with any questions / comments / concerns.





Jul
13
Purpose of Work:
July's Patch Tuesday has arrived, and while the PrintNightmare vulnerability and its out-of-band patching by Microsoft certainly made the most news (https://helpdesk.handynetworks.com/supportsuite/index.php?/News/NewsItem/View/306/emergency-out-of-band-security-patching-for-fully-managed-windows-2012-servers----july-8-2021), it's not the only vulnerability of note this month.

First up, it's important to note that the PrintNightmare patches themselves ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-34527 ) have had more than a few revisions, even since we last patched.  These may already be applied as part of automatic update schedules for some hosts, but not all of them.  Applying these revisions is of course a good idea.

Secondly, we have another RCE vulnerability that is already being exploited 'in the wild', affecting Server 2008 and up: https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-34448.  This RCE vulnerability requires user interaction, as it leverages the scripting engine (which requires that you run a script in the first place).  As such, the hosts most likely to be targeted by it are RDS hosts and user workstations.  It also does not involve automatic escalation of privilege, as PrintNightmare did.

The third vulnerability of note is yet another RCE vulnerability affecting Server 2008 and up, this time leveraging the DNS Server role: https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-34494.  This would normally be in the top 2, but since it's not being actively exploited in the wild, the timetable for getting this patched is more forgiving.  That said, it's still a DNS service RCE exploit, and as such, privileges are going to be automatically escalated, since it will run code in the user context of a privileged service (basically giving whoever does it admin access).  Microsoft says that the privileges required to do this are 'low' instead of 'none', so it's possible it will not be wormable, but we'll be proceeding on the assumption that, like most RCE vulnerabilities targeting the DNS Server role in the past, it will be.

The fourth on the list is a kernel RCE vulnerability affecting Server 2016 and up, leveraging the Hyper-V role: https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-34458.  This being a kernel vulnerability affecting a privileged service, it will automatically run any code in the System user context, meaning automatic privilege escalation.  The vulnerability also demonstrates the ability to spread attacks laterally through an entire Hyper-V environment, allowing guests to be a vector towards attacking other vulnerable guests, or even the hypervisor itself, meaning the vulnerability is wormable.  It's worth noting that this vulnerability is not yet confirmed to be exploited publicly, and that it only seems to affect Hyper-V hosts that are exposing hardware to guests via SR-IOV.  To my knowledge, we do not do this in any of our Hyper-V environments, but in case the vulnerability has more factors to it than are being shared at this time (as was the case with PrintNightmare...), we're applying the patch just to be safe.


Impact of Work:


All affected hosts that are 2012 and up will be rebooted automatically / ASAP to propagate fixes, starting at 11:00PM, with some exceptions.

Internal systems on Windows 2012 and up (such as the management portal) may be temporarily impacted in the time it takes to reboot them.  Mail delivery to our helpdesk may be temporarily halted while our mail servers are updated as part of this patch cycle.  If you receive a delivery failure, you can still reach us by logging into the helpdesk and submitting a ticket via the portal, or, for emergencies, by calling us at 303-414-6910 x2.


Hypervisors in a failover cluster will have rolling reboots done, in order to eliminate VPS downtime on said clusters.  Hypervisors not in a failover cluster will either be updated overnight, or have their updates scheduled, depending on customer policy / VM density.
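
For anyone curious why a rolling reboot avoids VPS downtime, here is a minimal Python sketch of the idea: each node is drained (its VMs are moved to the other nodes) before it is rebooted, so no VM is ever on a node that is restarting.  This is only an illustration under simplifying assumptions; the function name, node names and the "least-loaded node" placement rule are hypothetical and do not represent our actual tooling or the cluster's real placement logic.

```python
from typing import Dict, List


def rolling_reboot(placement: Dict[str, List[str]]) -> List[str]:
    """Simulate a rolling reboot of every node in a cluster.

    `placement` maps a (hypothetical) node name to the VMs it hosts.
    It is modified in place. Returns a log of the actions taken.
    """
    if len(placement) < 2:
        raise ValueError("a rolling reboot needs at least two nodes")

    log: List[str] = []
    for node in list(placement):
        others = [n for n in placement if n != node]
        # Drain: move each VM to whichever other node hosts the fewest VMs.
        for vm in list(placement[node]):
            target = min(others, key=lambda n: len(placement[n]))
            placement[node].remove(vm)
            placement[target].append(vm)
            log.append(f"migrate {vm}: {node} -> {target}")
        # The node is now empty, so rebooting it interrupts no VM.
        log.append(f"reboot {node}")
    return log


if __name__ == "__main__":
    cluster = {"hv1": ["vm-a", "vm-b"], "hv2": ["vm-c"], "hv3": []}
    for line in rolling_reboot(cluster):
        print(line)
```

The sketch ignores capacity limits and migration failures, which a real cluster has to account for.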

Any hosts where updates are managed directly by the customer (or an approval process is required for zero-day updates) will not be impacted; the controlling organizations will be notified separately.


Please contact us with any questions / comments / concerns.



Jul
8
Purpose of Work:
We will be patching all 2012+ hosts to resolve the "PrintNightmare" zero-day vulnerability.  Said vulnerability is an RCE exploit that (based on the findings of independent security researchers) seems to have varying effects depending on security policy, applications and patching level, but the effects range from Escalation of Privilege to fully elevated Remote Code Execution.  https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-34527

This vulnerability was made public before Microsoft had a patch out, so it has been exploited in the wild since Friday.  Initial mitigations were put in place over the weekend; now that Microsoft is releasing their official fix for it, we'll be applying it ASAP.


Impact of Work:


All affected hosts that are 2012 and up will be rebooted automatically / ASAP to propagate fixes, starting at 7:00PM, with some exceptions.

Internal systems on Windows 2012 and up (such as the management portal) may be temporarily impacted in the time it takes to reboot them.  Mail delivery to our helpdesk may be temporarily halted while our mail servers are updated as part of this patch cycle.  If you receive a delivery failure, you can still reach us by logging into the helpdesk and submitting a ticket via the portal, or, for emergencies, by calling us at 303-414-6910 x2.


Hypervisors in a failover cluster will have rolling reboots done, in order to eliminate VPS downtime on said clusters.  Hypervisors not in a failover cluster will either be updated overnight, or have their updates scheduled, depending on customer policy / VM density.

Any hosts where updates are managed directly by the customer (or an approval process is required for zero-day updates) will not be impacted; the controlling organizations will be notified separately.


Please contact us with any questions / comments / concerns.