News
Sep
14
Purpose of Work:
September's Patch Tuesday has arrived, and as usual, there are enough vulnerabilities to justify day-one overnight patching.


First off, there's a zero-day RCE vulnerability leveraging the ActiveX controls in the MSHTML feature, affecting Windows Server 2008 and up ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-40444 ).  This vulnerability requires minimal user interaction, and executes code in the context of the affected user.  It's currently being exploited in the wild, using Office documents with malicious web content as the delivery mechanism for malicious payloads.  Because of that, Microsoft released this patch out-of-band last week.  We've already patched most remote desktop environments for this; general server patching will follow tonight.

Secondly, there's a zero-interaction RCE vulnerability leveraging the Windows WLAN AutoConfig service, affecting Windows Server 2008 and up ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-36965 ).  As we are a datacenter, Wi-Fi is not in use on our server workload, so this patch will be applied only incidentally, as part of the monthly rollup.  However, it's worth announcing, as it is wormable whenever a rogue or infected host shares a Wi-Fi network with devices running this service.  Organizations running mobile workstations should take notice.

Third, there's a memory corruption vulnerability leveraging the Windows Scripting Engine that affects Windows Server 2008+ ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-26435 ).  This vulnerability requires user interaction, either via opening a file, or via visiting a webpage with a malicious file embedded in it.  It has not currently been detected in active exploitation.  In general, memory corruption vulnerabilities require more creative exploits to be leveraged successfully.

Fourth, there are various elevation of privilege vulnerabilities, leveraging several roles (and one kernel vulnerability), affecting Windows Server 2008+ ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-36974, https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-40447, https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-38671 ).  EOP vulnerabilities in general are a can of worms on any webserver, since a compromised website can easily turn into a compromised server.

Finally, Microsoft has disclosed several Elevation of Privilege vulnerabilities for various system components on Windows 2008 and 2008 R2 ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-36968, https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-38625, https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-38626 ).  These are a great demonstration of why it's important that any hosts on 2008 or below are upgraded to 2012+; said patches are not available without ESU licensing.


Impact of Work:


All affected hosts that are 2012 and up will be rebooted automatically / ASAP to propagate fixes, starting at 9:30PM, with some exceptions.

Internal systems on Windows 2012 and up (such as the management portal) may be temporarily impacted in the time it takes to reboot them.  Mail delivery to our helpdesk may be temporarily halted while our mail servers are updated as part of this patch cycle.  If you receive a delivery failure, you can still reach us by logging directly into the helpdesk and submitting a ticket directly via the portal, or calling us at 303-414-6910 x2, for emergencies.  


Hypervisors in a failover cluster will have rolling reboots done, in order to eliminate VPS downtime on said clusters.  Hypervisors not in a failover cluster will either be updated overnight, or have their updates scheduled, depending on customer policy / VM density.

Any hosts where updates are managed directly by the customer (or an approval process is required for zero-day updates) will not be impacted; the controlling organizations will be notified separately.
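As a rough illustration of how the rules above sort hosts into patch-night actions, here is a minimal Python sketch.  The Host fields, the Action names, and the plan() helper are hypothetical and exist only for this example; they do not describe our actual tooling.  Treating hosts older than 2012 as out of scope for tonight's automatic reboots is also an assumption made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REBOOT_TONIGHT = "reboot automatically, starting 9:30PM"
    ROLLING_REBOOT = "rolling reboot with the rest of the cluster"
    OVERNIGHT_OR_SCHEDULED = "update overnight or schedule per customer policy"
    NOTIFY_ONLY = "no change; notify the controlling organization"
    OUT_OF_SCOPE = "not part of tonight's automatic reboots"


@dataclass
class Host:
    name: str
    server_year: int                  # e.g. 2012, 2016, 2019
    is_hypervisor: bool = False
    in_failover_cluster: bool = False
    customer_managed: bool = False    # customer controls or approves updates


def plan(host: Host) -> Action:
    # Customer-managed hosts are never touched; their owners are notified.
    if host.customer_managed:
        return Action.NOTIFY_ONLY
    # Assumption: hosts older than 2012 are outside the automatic reboots.
    if host.server_year < 2012:
        return Action.OUT_OF_SCOPE
    # Hypervisors are handled by cluster membership rather than rebooted at once.
    if host.is_hypervisor:
        if host.in_failover_cluster:
            return Action.ROLLING_REBOOT
        return Action.OVERNIGHT_OR_SCHEDULED
    return Action.REBOOT_TONIGHT


if __name__ == "__main__":
    for h in (
        Host("web01", 2016),
        Host("hv01", 2019, is_hypervisor=True, in_failover_cluster=True),
        Host("hv09", 2016, is_hypervisor=True),
        Host("cust-db", 2019, customer_managed=True),
    ):
        print(h.name, "->", plan(h).value)
```

The customer-managed check comes first on purpose, so that no other rule can schedule a reboot on a host whose updates we do not control.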


Please contact us with any questions / comments / concerns.



Sep
10

Update: Fri 17 Sep 2021 09:31:45 PM MDT

During maintenance, a cluster node self-isolated, causing its workload to require reboots.  The reboots are completed, and any VMs having lingering issues (a smaller subset) are being triaged by NOC staff.

Both events seemed to occur on the same node, so sensitive workloads are being adjusted to no longer use it.


We will fast-track migration of sensitive workloads to the Secondary Shared Hyper-V cluster, and make plans to switch over all workloads to it in the near future.

In the meantime, we will resume our no-change window for the primary shared Hyper-V cluster and have Microsoft support revisit their attempt to identify the root cause of the trouble, announcing any maintenance that is required.


Update: Fri 17 Sep 2021 08:49:39 PM MDT

Maintenance and follow-up testing of the cluster is now underway.  We will gather diagnostics and cease testing if we detect any issues.

Update: Fri 17 Sep 2021 06:24:13 PM MDT

Earlier today, we had another event on the Primary Shared Hyper-V cluster, this time on a new host.  We've made some adjustments to the networking device that provides the layer-3 cluster VLAN for this cluster, and will make further adjustments to Hyper-V-specific parameters before doing further, more extensive testing tonight to see if the issue persists.

There has at least been a marked decrease in the occurrences of outage incidents in response to mass live migration traffic since last night's maintenance.


Tonight's maintenance will begin at 8:30pm.


Conclusion: Thu 16 Sep 2021 08:49:48 PM MDT

Maintenance is complete.  All objectives of maintenance were achieved with no production impact, and the cluster was tested and validated to no longer face instability after a large live-migration event (such as pausing a node for maintenance).  We will continue to keep a close eye on this cluster overnight, but those with Highly-Available VMs should see no impact from here.


Update: Thu 16 Sep 2021 08:03:06 PM MDT

The second part of this maintenance (removal of the problematic nodes from our Hyper-V failover cluster) will begin on schedule, shortly.


Update: Fri 10 Sep 2021 07:58:34 PM MDT

The second part of the maintenance is now rescheduled to take place at 8:00 PM on the 16th, rather than 7:00 PM, to take the availability of Microsoft escalation support into account.

We still do not expect any issues from that maintenance, but will have them on standby as a precaution.

Update: Fri 10 Sep 2021 07:30:57 PM MDT

The first part of this maintenance is completed, without incident.  We will follow up on Thursday the 16th for the next step.


Purpose of Work:


We will be making some after-hours changes with our primary shared Hyper-V failover cluster, which hosts a portion of highly-available VPS instances that customers without a dedicated private cloud may rely on.

The maintenance will have two phases:

On September 10, at 7PM MDT, we will be changing the cluster role's possible owners to exclude two hypervisors that have shown intermittent issues responding to calls from our backup appliance.  We expect no impact from this event, beyond having backups work more reliably after the fact.


On September 16, at 7PM MDT, we will be removing those two hypervisors from the cluster to see if those same hypervisors are the cause of the occasional instability on the cluster that we've been working with Microsoft to resolve. 

We have not had any of these events on this specific cluster since 8/12/2021.  Now that a considerable amount of workload is on our secondary shared Hyper-V failover cluster, we'll be seeing if we can eliminate even the possibility of these events.
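For readers unfamiliar with the term, a role's "possible owners" is the list of cluster nodes allowed to host that role.  The Python sketch below only models the idea: the node names and both helper functions are hypothetical, and this is not how the cluster service itself is implemented or configured.

```python
from typing import List, Optional, Set


def exclude_owners(all_nodes: List[str], excluded: Set[str]) -> List[str]:
    """Return the possible-owner list with the excluded nodes removed."""
    return [node for node in all_nodes if node not in excluded]


def failover_target(possible_owners: List[str], online: Set[str],
                    current: str) -> Optional[str]:
    """Pick the first online possible owner other than the current one."""
    for node in possible_owners:
        if node != current and node in online:
            return node
    return None  # no allowed node is available to take over the role


if __name__ == "__main__":
    nodes = ["hv1", "hv2", "hv3", "hv4"]           # hypothetical node names
    owners = exclude_owners(nodes, {"hv2", "hv3"})
    print(owners)
    print(failover_target(owners, {"hv2", "hv3", "hv4"}, "hv1"))
```

In this model, the excluded nodes stay in the cluster and stay online, but the role can no longer land on them.  That matches the first phase, while the second phase removes the nodes entirely.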



Impact of Work:

Work will begin at 7PM (MDT) each night.


For both events, no impact should occur, in theory, but it is possible that a subset of VMs will experience brief outages if the instability we're attempting to resolve is not fixed as a result of this maintenance.

If that is the case, we will implement mitigations immediately, possibly blending both maintenance events to try and prevent any more incidents while we're at it.

Any customer VMs that experience issues as a result of this maintenance will be recovered ASAP, with customers informed individually if their VM is going to experience a longer-than-reboot outage as a result of any events.


We will inform you when maintenance is complete.

Please contact us with any questions / comments / concerns.





Aug
10
Purpose of Work:
August's Patch Tuesday has arrived, and we have more than a few vulnerabilities that are publicly known or being leveraged in attacks this month.

First off, there's an Elevation of Privilege vulnerability for Windows 10 / Windows Server 2019+, leveraging the "Windows Update Medic Service", a new auto repair service in newer releases of Windows ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-36948 ).  Microsoft has detected active exploitation of this vulnerability.  Elevation of privilege is of course going to be especially alarming in the context of a webserver, or any web-accessible service that is both easily discovered and easily interacted with.

Secondly, we have an LSA spoofing vulnerability, affecting all versions of Windows Server since 2008 ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-36942 ).  It seems this vulnerability can be used to trigger unexpected behavior in hosts through the LSARPC interface (an RPC interface exposed over SMB).  This appears to be most impactful when an attacker without any level of access uses it to force a domain controller to authenticate against another server using NTLM.  This vulnerability is currently publicly known, and we'll be patching it tonight, in addition to reviewing the additional guidance section throughout the week.

Third of all, we have yet another Remote Code Execution vulnerability, affecting all versions of Windows since Server 2008, and leveraging the Print Spooler service ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-36936 ).  This one is listed as requiring 'low' privileges, so it likely isn't as much of a showstopper as last month's "PrintNightmare" bug.  It is, however, another publicly disclosed vulnerability.

Fourth on the menu, there is an RCE vulnerability affecting all versions of Windows since Server 2008, and leveraging the TCP/IP stack ( https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-26424 ).  The specific example given in the executive summary is "This is remotely triggerable by a malicious Hyper-V guest sending an ipv6 ping to the Hyper-V host. An attacker could send a specially crafted TCPIP packet to its host utilizing the TCPIP Protocol Stack (tcpip.sys) to process packets."  I do not know if this affects Hyper-V exclusively, but that may be the case.

Fifth up, we have a slightly unusual one: an RCE vulnerability leveraging the Remote Desktop Client.  This one affects versions of Windows since 2008 R2, and seems to be public at time of writing.  It's also one that would primarily affect endpoints connecting to a compromised server; for example, opening the Hyper-V console to look at a compromised guest, or connecting via RDP to a compromised server directly.


There are other critical vulnerabilities, but those are enough reason for us to proceed with the usual round of reboots on patch night, rather than letting things update on an automatic schedule.


Impact of Work:


All affected hosts that are 2012 and up will be rebooted automatically / ASAP to propagate fixes, starting at 9:30PM, with some exceptions.

Internal systems on Windows 2012 and up (such as the management portal) may be temporarily impacted in the time it takes to reboot them.  Mail delivery to our helpdesk may be temporarily halted while our mail servers are updated as part of this patch cycle.  If you receive a delivery failure, you can still reach us by logging directly into the helpdesk and submitting a ticket directly via the portal, or calling us at 303-414-6910 x2, for emergencies.  


Hypervisors in a failover cluster will have rolling reboots done, in order to eliminate VPS downtime on said clusters.  Hypervisors not in a failover cluster will either be updated overnight, or have their updates scheduled, depending on customer policy / VM density.
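The rolling reboot described above can be pictured with a small Python simulation: each node is drained to its peers, rebooted while empty, and then becomes available again as a target.  This is only a sketch under stated assumptions.  The node and VM names are made up, live migration is reduced to moving a name between lists, and "least-loaded peer" is a placeholder for whatever placement logic the cluster really uses.

```python
from typing import Dict, List


def rolling_reboot(placement: Dict[str, List[str]]) -> List[str]:
    """Drain and reboot each node in turn; returns a log of the steps taken.

    placement maps a node name to the VMs currently running on it, and is
    updated in place as VMs are moved.
    """
    log: List[str] = []
    for node in list(placement):
        peers = [n for n in placement if n != node]
        if not peers:
            raise ValueError("a rolling reboot needs at least two nodes")
        # Drain: move every VM to whichever peer currently holds the fewest.
        to_move = placement[node]
        placement[node] = []
        for vm in to_move:
            target = min(peers, key=lambda n: len(placement[n]))
            placement[target].append(vm)
            log.append(f"migrate {vm}: {node} -> {target}")
        # The node is empty here, so rebooting it interrupts no VM.
        log.append(f"reboot {node}")
    return log


if __name__ == "__main__":
    cluster = {"hv1": ["a", "b"], "hv2": ["c"], "hv3": []}
    for line in rolling_reboot(cluster):
        print(line)
```

The important property is the ordering: a node is only rebooted once nothing is running on it, which is why VPS downtime is avoided on clustered hypervisors but not on standalone ones.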

Any hosts where updates are managed directly by the customer (or an approval process is required for zero-day updates) will not be impacted; the controlling organizations will be notified separately.


Please contact us with any questions / comments / concerns.



Jul
27

[Completion, Tue 27 Jul 2021 11:46:42 PM MDT]: Maintenance has concluded, with all required diagnostics gathered.  We were able to anticipate issues and move VMs to other nodes for all but one requested reproduction of the issue.  22 VMs had to be shut down and started again over the span of 5 minutes to recover in that case, but all VMs recovered in such a way were VMs with less sensitive workloads.  

Alerts for these VMs are clearing normally.  If any alerts persist beyond what we'd expect from a normal reboot, support will be able to reach me or the on-call engineer, in addition to notifying any affected clients.



Purpose of Work:


We will be gathering diagnostics on our shared failover cluster overnight, while reproducing a specific issue that has a chance of affecting the workload of one of two member hypervisors (out of the seven in the cluster).

Sensitive workloads (such as specific mail servers, specific database servers, RDS hosts, or RAS hosts) will be moved away from the problem HVs prior to diagnostics being run, and will also be configured to avoid the two problematic nodes in the future, until this issue is resolved.

Said issue only occurs during mass live migrations (such as when a node is paused for updates), and thus will not occur during production hours under normal circumstances.

Diagnostics were requested by Microsoft Business Support, and will be relayed to their escalation personnel.


Impact of Work:

Work will begin at 9PM (MDT) tonight.

No impact should occur, in theory, but it is possible that a subset of VMs will experience brief outages resulting in a read-only filesystem.

Any customer VMs that experience issues as a result of this maintenance will be rebooted for recovery purposes ASAP, with customers informed individually if their VM is going to experience a longer-than-normal outage as a result of any events.

We will inform you when maintenance is complete.


Please contact us with any questions / comments / concerns.





Jul
20

[Update, Tue 20 Jul 2021 11:10:50 PM MDT] Mitigations recommended by the vendor did not have the desired effect; a subset of VMs had to be rebooted, but did recover normally with minor intervention.

We will continue to engage the vendor and announce further scheduled tests.  Work for tonight is concluded.


Purpose of Work:


We will be testing a new configuration with our primary Hyper-V failover cluster, which hosts our highly-available VPS instances that customers without a dedicated private cloud may rely on.

A new virtual network will be introduced to the cluster, configured to handle cluster and live migration traffic, and then several tests of cluster functionality will follow.

The aim of this configuration change (recommended by Microsoft support) is to mitigate occasional instability we've noticed this week when re-introducing a node to the cluster after a maintenance event on said node.


Impact of Work:

Work will begin at 9PM (MDT) tonight.

No impact should occur, in theory, but it is possible that a subset of VMs will experience brief outages if the instability we're attempting to resolve is not fixed as a result of this maintenance.

Any customer VMs that experience issues as a result of this maintenance will be recovered ASAP, with customers informed individually if their VM is going to experience a longer-than-normal outage as a result of any events.

We will inform you when maintenance is complete.


Please contact us with any questions / comments / concerns.





Jul
16
[Update 1] Emergency Maintenance for Primary Monitoring Server, July 16 2021
Posted by David Cunningham on 16 July 2021 02:54 PM

[Update, Sat 17 July 2021 1:12PM MDT]: Maintenance is complete.


[Update, Sat 17 July 2021 12:18 PM MDT]:
Our team is resuming work to expand storage resources within our monitoring infrastructure.

[Update, Fri 16 Jul 2021 08:32:16 PM MDT]:
Maintenance was aborted at 6:13PM, and we have been monitoring normally since then, resolving any pending alerts that we could not see during this maintenance.  We will continue this maintenance over the weekend, updating this post when a time is decided.



Purpose of Work:


Our primary monitoring server will be brought offline today, with data migrated to a higher capacity drive to account for the growing size of the database and mitigate the potential for emergency space conditions.


Impact of Work:

The automated monitoring that support staff relies on will be offline for an estimated period of 1 hour, starting at 4PM MDT.  

During this time we will not have any monitoring data for fully managed servers, with clustered websites being the exception in some cases (as some of those rely on another service in addition to our primary monitoring). 

Any trouble with your services can still be reported to the support team via helpdesk, email or phone number, as usual.  ([email protected], 303-414-6910 x2)

We will inform you when maintenance is complete.


Please contact us with any questions / comments / concerns.

