Update 1:10 AM MT, August 13, 2019
At this time, we became cautiously optimistic that the immediate trouble was over, but we monitored the situation very carefully for the next 45 minutes while also beginning to look into various logs. After deep research into GBs of logs, we were able to definitively determine that the initiating event for this outage was the reboot of a particular VM. We rebooted this same VM last week and experienced an isolated network issue which impacted two of our many Hyper-V clusters. Upon rebooting the VM today, the same general sequence of events occurred, except that the volume of network traffic involved was significantly higher, which had a far larger impact.
- At 10:42 am, a customer reported to us that this virtual machine was not responsive. Our technical team investigated the issue and determined the virtual machine required a reboot.
- At 10:44 am, a virtual machine on the “WEHOST2-VPS” Hyper-V cluster is rebooted because it is non-responsive. Upon reboot, it begins sending anomalous network traffic, which has a cascading impact.
- At 10:47:43 am, the hypervisor hosting this virtual machine begins throwing memory allocation errors related to vRSS network queues. By 10:55:56 am, the hypervisor has over 53GB of memory allocated.
- At 10:51:30 am, the hypervisor that was providing service to the virtual machine crashed.
- At 10:55 am, the hypervisor recovers, and continues to fill up vRSS network queues.
- At 10:56:56 am, fpc1 on dist3.denver2 indicates it is in alarm for DDoS protocol violations. (Note: dist3.denver2 is a Juniper QFX 5100 virtual chassis, which should provide N+1 redundancy and high availability. Each node is referred to as an fpcX. A log-scanning sketch for these DDoS alarms appears after this timeline.)
- At 10:57:21 am, our primary switch stack – dist3.denver2 - reports interface states between fpc0 and fpc1 are unstable.
- At 10:57:33 am, fpc0 goes into DDoS alarm as well.
- At 10:58:15 am, fpc1 crashes.
- At 10:59:15 am, fpc0 crashes. At this time, both nodes of our redundant distribution switch cluster are offline and network access is impacted for all clients.
- Within 3-4 minutes, we have senior engineers in front of the crashed switch stack, while also connecting remotely to other switches via our OOB network to determine their status.
- Over the next 30 or so minutes, fpc0 and fpc1 crash repeatedly: fpc0 crashed at 11:02 am, 11:12 am, and 11:16 am, and fpc1 crashed at 11:06 am, 11:15 am, and 11:30 am. During these periods, some impacted clients had intermittent network connectivity.
- Our review of the logs from both fpc0 and fpc1 during this time indicates that the virtual chassis never fully converged, which resulted in spanning tree loops on our network.
- At approximately 11:35 am, we powered down both fpc0 and fpc1 and brought them back online. Unfortunately, they did not come up cleanly.
- At 11:40 am, we began to physically isolate (both power and network) fpc1.
- At 11:45 am, once fpc1 was isolated, we rebooted fpc0.
- fpc0 came back online at 11:55 am, but was stuck in “linecard” mode, requiring us to manually remove low-level config files and restart certain processes.
- At 12:03 pm, we completed this process.
- At 12:04 pm, we see interfaces physically coming back online.
- At 12:06 pm, we begin receiving UP alerts from our monitoring system.
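For readers reviewing their own logs, here is a minimal sketch of how the DDoS protection alarms referenced above can be pulled out of a switch syslog export. It is illustrative only: it assumes the syslog has been saved to a plain-text file (the file name below is a placeholder), and that jddosd tags these events with DDOS_PROTOCOL_VIOLATION_SET and DDOS_PROTOCOL_VIOLATION_CLEAR, whose surrounding message text can vary by Junos release.

```python
import sys

# Placeholder path to an exported syslog file from dist3.denver2; adjust as needed.
LOG_FILE = "dist3.denver2-messages.log"

# jddosd generally tags these events with the markers below; the surrounding
# message text varies by Junos release, so we key only on the tags themselves.
TAGS = ("DDOS_PROTOCOL_VIOLATION_SET", "DDOS_PROTOCOL_VIOLATION_CLEAR")

def scan(path: str) -> None:
    """Print every DDoS protection alarm line found in the syslog export, in order."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if any(tag in line for tag in TAGS):
                print(line.rstrip())

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else LOG_FILE)
```

Run against a saved copy of the switch logs, this produces an ordered list of alarm set/clear events that can be lined up against the hypervisor timestamps above.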
At the time the outage began this morning, we were actually in the process of reviewing the logs from the initial event on August 7th for root cause analysis and developing a remediation plan.
After Action Plan
We will soon be publishing a series of network maintenance windows to restore redundancy to dist3.denver2 and make other network adjustments that should provide increased reliability going forward. The first maintenance window will likely take place Tuesday, August 13, 2019 starting at 9:00 pm. Further updates will be provided.
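Once these windows are complete, a simple check like the sketch below can confirm that both members of dist3.denver2 are back in the virtual chassis. This is a rough illustration only: it assumes the third-party netmiko library, placeholder hostname and credentials, and that healthy members appear with a "Prsnt" status in the output of "show virtual-chassis status".

```python
from netmiko import ConnectHandler

# Placeholder connection details for dist3.denver2; substitute real values and a
# credential-handling approach appropriate for your environment.
SWITCH = {
    "device_type": "juniper_junos",
    "host": "dist3.denver2.example.net",
    "username": "netops",
    "password": "changeme",
}

EXPECTED_MEMBERS = 2  # fpc0 and fpc1

def redundancy_restored() -> bool:
    """Return True if the expected number of virtual chassis members report Prsnt."""
    conn = ConnectHandler(**SWITCH)
    try:
        output = conn.send_command("show virtual-chassis status")
    finally:
        conn.disconnect()
    # Count members whose status column reads "Prsnt" (and not "NotPrsnt").
    present = sum(
        1
        for line in output.splitlines()
        if "Prsnt" in line and "NotPrsnt" not in line
    )
    return present >= EXPECTED_MEMBERS

if __name__ == "__main__":
    print("redundancy restored" if redundancy_restored() else "still degraded")
```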
Update: 12:49 PM MT
Here is an initial timeline.
10:48 AM - We receive an alert that an internal system is down and begin investigating the issue.
10:55 AM - Reviewing the logs on one of our primary sets of core/distribution switches, we notice some unusual errors and begin investigating.
11:01 AM - We isolate the issue to dist3.denver2, a QFX5100 virtual chassis switch stack. We see physical indications that the chassis are cycling through booting, coming online for a few minutes, and then crashing.
11:40 AM - After several of these cycles, we physically power down both of the nodes and bring only a single node back online.
11:50 AM - A single chassis is online, but fails to take mastership of the virtual chassis cluster (the switch is stuck in linecard mode).
12:02 PM - We successfully remove the virtual chassis configuration, and reboot the node in the cluster.
12:10 PM - The node is fully booted and network connectivity is restored.
Our current status is that we have a loss of redundancy, but no clear root cause yet. We are continuing to review the logs and evaluate solutions; possible root causes include intermittently failing hardware or a Juniper bug.
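To help weigh these two possibilities, the crash times listed in the detailed timeline above can be laid side by side to see whether failures clustered on one member (suggesting failing hardware) or alternated between both (suggesting a shared trigger such as a software bug). The sketch below simply hard-codes the times from this post; it is a back-of-the-envelope heuristic, not a diagnosis.

```python
from datetime import datetime

# Crash times taken from the timeline above (all on the morning of the incident).
CRASHES = {
    "fpc0": ["10:59:15", "11:02", "11:12", "11:16"],
    "fpc1": ["10:58:15", "11:06", "11:15", "11:30"],
}

def parse(stamp: str) -> datetime:
    """Parse an HH:MM or HH:MM:SS time of day (24-hour clock)."""
    fmt = "%H:%M:%S" if stamp.count(":") == 2 else "%H:%M"
    return datetime.strptime(stamp, fmt)

# Merge everything into one time-ordered sequence of (time, member) pairs.
events = sorted(
    (parse(stamp), member) for member, stamps in CRASHES.items() for stamp in stamps
)
sequence = [member for _, member in events]

# Count how often consecutive crashes switch members: a high count points toward a
# shared trigger hitting both nodes, a low count toward one intermittently bad unit.
changes = sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
print("crash order:", " -> ".join(sequence))
print(f"member changes between consecutive crashes: {changes} of {len(sequence) - 1}")
```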