Aug 22
[RESOLVED] Outages of some VMs on shared hypervisors
Posted by Lindsay Schweitzer on 22 August 2019 02:15 PM

UPDATE: This issue is resolved and all VMs should be back online. If you're experiencing any lingering issues related to this outage, please open a ticket on our helpdesk so we can address it. Thank you for your patience.

We’re currently addressing issues related to a DDoS attack against a VM on one of our shared hypervisors. We’re working to resolve the issues related to this attack and bring the services of all affected clients back up as quickly as possible.






Aug 18

[Completion] All but one of the dedicated clusters are upgraded, all 2016+ non-HV servers are upgraded, and the upgrade process can be left unattended at this point.  

If you have your own fully managed hypervisors, or a fully managed server on its own domain, and would like us to manage your updates, please contact us for update scheduling as soon as is feasible, to ensure these vulnerabilities are patched.


[Update 3, Tue Aug 20 20:12:10 MDT 2019]
As announced yesterday, fully managed Windows Server patching and automatic reboots for certain servers on our domain will resume for 2016 build 1703 through 2019 servers (and Windows 10 hosts on those builds), starting now.

As before, this will exclude managed clients' standalone hypervisors. Hypervisors in failover clusters will have rolling updates applied manually tonight, where this has not already been done.

Fully managed Windows servers not joined to any domain (including ours) will be updated (and manually configured to use our WSUS) on a case-by-case basis.


[Update 2, Tue Aug 20 00:46:33 MDT 2019]
All pending updates have been confirmed as applied or initiated for fully managed Windows servers on our domain with OS levels of baseline 2016 and below.

As covered in the previous update, newer servers will have these updates applied tomorrow night, once all pending updates have downloaded to the WSUS server.

[Update, Mon Aug 19 22:02:58 MDT 2019] A required option for WSUS to download the relevant updates for Server 2019 (and some newer builds of Windows 10) had not yet been set, so those OS versions will likely not be updated until tomorrow evening.

Normal updates of all Windows versions from 2008 R2 up to 2016 (for non-hypervisors) are occurring now.  Reboots may occur shortly.


Purpose of Work:
Several pre-authentication vulnerabilities targeting Remote Desktop Protocol on servers running Windows Server 2008 R2 or newer have been discovered, all of which allow remote code execution.

Because these vulnerabilities require no authentication, they could be spread rapidly within a network via 'worm'-style malware, at which point the attacker would effectively have full control of all infected hosts.


Due to the ease and impact of exploitation, we will be patching and rebooting all affected, fully managed hosts overnight.

Hypervisors are a general exception to this; customer-owned Windows HVs that host unmanaged VMs should have their maintenance scheduled with us separately.


You can read more about the exploit (and the patches mitigating it) here: https://msrc-blog.microsoft.com/2019/08/13/patch-new-wormable-vulnerabilities-in-remote-desktop-services-cve-2019-1181-1182/

The following patches are among those that will be applied:

https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2019-1226
https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2019-1222
https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2019-1182
https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2019-1181
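
For reference, one quick way to spot-check whether a given host already has its patch is to compare the installed hotfix list against the KB numbers listed in the advisories above. The sketch below is illustrative only (it is not our patching tooling), and the KB ID shown is a placeholder to be filled in from the advisory pages for the host's OS version:

# Illustrative spot-check only; not our patching process.
# Fill REQUIRED_KBS with the KB IDs from the advisories above for this
# host's OS version (the value below is a placeholder).
import subprocess

REQUIRED_KBS = {"KB0000000"}  # placeholder KB ID

# "wmic qfe" lists installed hotfixes on Windows Server 2008 R2 through 2019.
output = subprocess.run(
    ["wmic", "qfe", "get", "HotFixID"],
    capture_output=True, text=True, check=True,
).stdout

installed = {line.strip() for line in output.splitlines() if line.strip().startswith("KB")}
missing = REQUIRED_KBS - installed

print("Missing patches:", ", ".join(sorted(missing)) or "none")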

We will update you as maintenance begins.

Impact of Work:
All affected hosts will be rebooted automatically, as soon as possible, to apply the fixes, starting at 8 PM MDT on Monday the 19th.

Internal systems (such as the management portal) may be temporarily impacted in the time it takes to reboot them.

Hypervisors will be done last. Hypervisors in a failover cluster will be given rolling reboots in order to eliminate VPS downtime on those clusters.
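
As a rough illustration of what a rolling reboot means here: each cluster node is drained of VMs via live migration, rebooted, and resumed before the next node is touched. The sketch below is a minimal example under assumptions (FailoverClusters PowerShell module, WinRM access, placeholder node names), not our exact tooling.

# Minimal sketch of a rolling reboot across failover-cluster nodes.
# Assumes the FailoverClusters PowerShell module and WinRM access;
# node names are placeholders, and this is illustrative only.
import subprocess

NODES = ["hv-node-1", "hv-node-2", "hv-node-3"]  # placeholder hypervisor names

def ps(command: str) -> None:
    """Run a PowerShell command on the management host, raising on failure."""
    subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)

for node in NODES:
    # Pause the node and live-migrate its VMs to the other cluster members.
    ps(f"Suspend-ClusterNode -Name {node} -Drain -Wait")
    # Reboot the node (updates already staged) and wait for WinRM to come back.
    ps(f"Restart-Computer -ComputerName {node} -Wait -For WinRM -Force")
    # Bring the node back into service before moving on to the next one.
    ps(f"Resume-ClusterNode -Name {node} -Failback Immediate")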

Any hosts not on our fully-managed domain (usually because they have their own domain) will not be impacted; the controlling organizations will be notified separately.


Please contact us with any questions / comments / concerns.





Aug 13
Completion: Wed Aug 14 00:55:37 MDT 2019

All post-maintenance checks have completed, 99% of VMs are in normal working order, and the network is stable. 

Maintenance is concluded; any VMs not in normal working order have been determined to have pre-existing issues unrelated to the maintenance, and a relevant ticket will be opened with the customer shortly.




Update: 12:18 AM
All network maintenance is complete. There was a short outage, roughly 3-5 minutes in duration, due to an issue we experienced with the NSSU upgrade. Since we had engineers engaged on site, we were able to address the problem very quickly.

All VMs that were shut down as part of this maintenance have now been restarted. A few are throwing alerts; our helpdesk/NOC team is addressing them now.

Update: 11:23PM
The NSSU update is complete. We are bringing the offline VMs on the WEHOSTVPS2 cluster back online now.

Update: 10:30PM
NSSU update is in progress.

Update: 10:13PM
We are nearly finished re-enabling redundancy on dist3.denver2. Once this is complete, we will perform an NSSU upgrade on the chassis cluster.

Update: 9:09 PM
Our work is beginning.  We are powering off VMs on the WEHOSTVPS2 cluster now.


Date: August 13, 2019
Time: 9:00 PM - 2:00 AM (Mountain Time)

Purpose of Work:
After the network incident on August 12, 2019, we have some critical tasks that we need to carry out this evening to restore redundancy to our network core.
  1. Restore redundancy to our dist3.denver2 switch stack.

  2. Upgrade JunOS on our dist3.denver2 switch stack to the latest JTAC-recommended release.

Impact of Work:
This work will impact our clients in two distinct manners:
  1. If you are hosted in a virtualized environment on the WEHOSTVPS2 Hyper-V cluster, we will be shutting down all virtual machines in this environment at the start of the network maintenance (a rough shutdown sketch follows this list). This cluster is dramatically impacted by even minimal network disruptions, and when things go sideways, it takes hours of manual intervention to bring hundreds of VMs back online, which are generally in an unclean state, require chkdsk/fsck, or have severe file system corruption.

  2. If you are a private cloud, colocation, or dedicated server customer, you will experience network disruptions of up to 15 minutes in duration during this maintenance window. Theoretically, it should be possible to do the work we need to do with very little network impact, but after the events that transpired on Monday, it's hard to say that with any level of confidence.
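
As referenced in item 1, the sketch below shows one way such a pre-maintenance clean-shutdown pass could be scripted. It is illustrative only, under assumptions (Hyper-V PowerShell module, WinRM access, placeholder node names), and is not a statement of our exact procedure.

# Illustrative only: cleanly shut down running VMs before maintenance and
# record them so they can be restarted afterwards. Assumes the Hyper-V
# PowerShell module and WinRM access; node names are placeholders.
import subprocess

NODES = ["wehostvps2-n1", "wehostvps2-n2"]  # placeholder cluster node names

def ps(command: str) -> str:
    """Run a PowerShell command and return its stdout."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Record which VMs are currently running, per node.
running = []
for node in NODES:
    names = ps(
        f"Get-VM -ComputerName {node} | "
        "Where-Object State -eq 'Running' | "
        "Select-Object -ExpandProperty Name"
    )
    running += [(node, name.strip()) for name in names.splitlines() if name.strip()]

# Ask each guest OS to shut down cleanly (rather than a hard power-off).
for node, name in running:
    ps(f"Stop-VM -Name '{name}' -ComputerName {node}")

# After maintenance, Start-VM can be run for each recorded (node, name) pair.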

We will update this post regularly during the course of our maintenance.  Please contact us with any questions.





Aug 12
Network Status
Posted by Tyler Molamphy on 12 August 2019 12:38 PM

Update 1:10 AM MT, August 13, 2019

Revised Timeline


  • At 10:42 am, a customer reported to us that their virtual machine was not responsive. Our technical team investigated the issue and determined the virtual machine required a reboot.

  • At 10:44 am, a virtual machine on the “WEHOST2-VPS” Hyper-V cluster is rebooted because it is non-responsive. Upon reboot, it begins sending anomalous network traffic, which has a cascading impact.

  • At 10:47:43 am, the hypervisor hosting this virtual machine begins throwing memory allocation errors related to vRss network queues. By 10:55:56 am, the hypervisor has over 53GB of memory allocated.

  • At 10:51:30 am, the hypervisor that was providing service to the virtual machine crashed.

  • At 10:55 am, the hypervisor recovers, and continues to fill up vRss network queues.

  • At 10:56:56 am, fpc1 on dist3.denver2 indicates it is in alarm for DDoS protocol violations. (Note: dist3.denver2 is a Juniper QFX 5100 virtual chassis, which should provide N+1 redundancy and high availability. Each node is referred to as an fpcX.)

  • At 10:57:21 am, our primary switch stack, dist3.denver2, reports that interface states between fpc0 and fpc1 are unstable.

  • At 10:57:33 am, fpc0 goes into DDOS alarm as well.

  • At 10:58:15, fpc1 crashes.

  • At 10:59:15, fpc0 crashes. At this time, both nodes of our redundant distribution switch cluster are offline and network access is impacted for all clients.

  • Within 3-4 minutes, we have senior engineers in front of the crashed switch stack, and also remotely connected to other switches via our OOB network to determine their status.

  • Over the next 30 or so minutes, fpc0 and fpc1 crash repeatedly. fpc0 crashed at 11:02 am, 11:12 am, 11:16 am and fpc1 crashed at 11:06 am, 11:15 am, and 11:30 am. During these periods, there was some intermittent network connectivity for some impacted clients.

  • Our review of the logs from both fpc0 and fpc1 during this time indicates that the virtual chassis never fully converged, which resulted in spanning-tree loops on our network.

  • At approximately 11:35 am, we powered down both fpc0 and fpc1 and brought them back online. Unfortunately, they did not come up cleanly.

  • At 11:40 am, we began to physically isolate (both power and network) fpc1.

  • At 11:45 am, once fpc1 was isolated, we rebooted fpc0.

  • fpc0 came back online at 11:55 am, but was stuck in “linecard” mode, requiring us to manually remove low-level config files and restart certain processes.

  • At 12:03 pm we completed this process.

  • At 12:04 pm, we saw interfaces physically coming back online.

  • At 12:06 pm, we began receiving UP alerts from our monitoring system.

At this time, we became cautiously optimistic that the immediate trouble was over, but we very carefully monitored the situation for the next 45 minutes, while also beginning to look into various logs. After deep research into GBs of logs, we were able to definitively determine that the initiating event for this outage was related to rebooting a particular VM. We had actually rebooted this particular VM last week and experienced an isolated network issue which impacted two of our many Hyper-V clusters. Upon rebooting the VM, the same general sequence of events occurred, except that the volume of network traffic involved was significantly higher today, which had a far larger impact.

At the time the outage began this morning, we were actually in the process of reviewing the logs from the initial August 7th event for root cause analysis and the development of a remediation plan.

After Action Plan

We will soon be publishing a series of network maintenance windows to restore redundancy to dist3.denver2 and make other network adjustments that should provide increased reliability going forward. The first maintenance window will likely take place Tuesday, August 13, 2019 starting at 9:00 pm. Further updates will be provided.


Update: 12:49PM MT
Here is an initial timeline. 

10:48AM - We receive an alert that an internal system is down and begin investigating the issue.

10:55AM - Reviewing the logs on one of our primary sets of core/distribution switches, we notice some unusual errors and begin investigating.

11:01AM - We isolate the issue to dist3.denver2, a QFX5100 virtual chassis switch stack. We see physical indications that the chassis themselves are in a cycle of booting, coming online for a few minutes, and then crashing.

11:40 AM - After several of these cycles, we physically power down both of the nodes, and only power a single node back online.

11:50 AM -  A single chassis is online, but fails to take mastership status of the virtual chassis cluster (the switch was stuck in linecard mode).  

12:02 PM - We successfully remove the virtual chassis configuration, and reboot the node in the cluster.

12:10 PM - The node is fully booted and network connectivity is restored.

Our current status is that we have a loss of redundancy, but no clear root cause. We are continuing to review the logs and seek out solutions; possible root causes include intermittently failing hardware or a Juniper bug.



Jul 12
Nimble SAN Firmware Upgrade - July 16, 2019 (COMPLETE)
Posted by Tyler Molamphy on 12 July 2019 02:28 PM
[Completed, 23:21 7-16-19]:  This maintenance is complete on all Nimble Storage Arrays.  As expected, no impact to workloads was observed. 

Date: Tuesday, July 16, 2019
Time: 10:30 PM MT

Purpose of Work:
Nimble Storage has let us know that a couple of our storage arrays are running on a version of NimbleOS that has a software defect which could lead to unexpected storage behavior in some circumstances.  We will be upgrading our storage arrays to newer firmware that resolves this issue.

Impact of Work:
Historically, we have performed many Nimble Storage upgrades that were not noticeable to any workloads consuming storage from our arrays. Accordingly, we do not anticipate any impact, but there is always a slim possibility that connectivity to the storage arrays will be disrupted, causing workloads that rely on the storage to hard reboot.

Please contact us with any questions / comments / concerns.



Jun 19
[Complete] Zero-Day Emergency Security Patching for SolusVM HV - June 27, 2019
Posted by David Cunningham on 19 June 2019 10:28 PM
[Thu Jun 27 3:19:12 MDT 2019] - All VMs have been returned to service; we will be monitoring these VMs to see if anything else needs to be done, beyond syncing server times.


[Thu Jun 27 3:09:40 MDT 2019] - The SolusVM HV is back up, and VMs are being restored from saved states. If your VM in 23.239.222.0/25 is not yet up, it should be shortly. I will update this ticket when all is resolved.


[Thu Jun 27 1:58:51 MDT 2019] - Our Solus HV did not recover as expected on reboot. I have arrived at our DTC location and am troubleshooting it. A conservative ETR, based on what I know now, is approximately 1 hour. I will update you with further developments that affect this estimated time of resolution.

[Thu Jun 27 00:19:52 MDT 2019] - Maintenance is underway. We will proceed with the update and reboot of the SolusVM hypervisor shortly.

[Tue Jun 25 23:03:12 MDT 2019] - SolusVM hypervisor updates will be postponed until tomorrow night, as out-of-band access is not functioning as expected. We will update you as we begin.

[Tue Jun 25 21:31:56 MDT 2019] - We will be initiating maintenance on our SolusVM hypervisor shortly. Again, this will only impact non-"Highly Available" cPanel servers (often immediately identifiable by having only 1-3 IPs in the subnet 23.239.222.0/25); if you do not have such a server, this maintenance will not affect you.


