Latest Updates
Feb
22
ONGOING NETWORK ISSUE - INTERMITTENT PACKET LOSS
Posted by Jay Sudowski on 22 February 2017 07:14 AM

UPDATE: Feb 23, 2017 19:00 - Tonight, while performing routine, low-risk network operations (moving physical ports), our dist/core switches experienced near-simultaneous core dumps on their redundant routing engines, which left the switches in a completely broken state. The network recovered when the routing engines completed their boot-up cycle.

UPDATE:  Feb 23, 2017 1:24AM - Late tonight we started having iBGP sessions flap all over the place.  In order to isolate the issue, we logically moved over a number of OSPF interfaces, which caused a bit of trouble for some switches.  At this point, we are done with our work (again) and monitoring the situation.  The diagnostic tool we are using, which is doing a continuous ping / traceroute to each of our switches at 1801, has not detected any packet loss for 30 minutes.
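
For illustration only (this is not the diagnostic tool mentioned above), here is a minimal Python sketch of how the results of a continuous ping could be reduced to a list of outage windows. The function name `outage_windows`, the one-probe-per-second rate, and the sample data are all assumptions made for the example.

```python
from typing import List, Tuple


def outage_windows(probes: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Group consecutive failed probes into (first_failure, last_failure) windows.

    `probes` is a time-ordered list of (timestamp, reply_received) pairs.
    """
    windows = []
    start = last = None
    for ts, ok in probes:
        if not ok:
            if start is None:
                start = ts          # first lost probe of a new window
            last = ts               # extend the current window
        elif start is not None:
            windows.append((start, last))
            start = None            # a reply arrived, so the window is closed
    if start is not None:           # log ended while probes were still failing
        windows.append((start, last))
    return windows


# Hypothetical one-probe-per-second sample: replies stop for t = 3..5.
sample = [(0, True), (1, True), (2, True), (3, False), (4, False), (5, False), (6, True)]
print(outage_windows(sample))  # [(3, 5)]
```

An empty result over the last 30 minutes of probes would correspond to the "no packet loss for 30 minutes" statement above.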

UPDATE:  Feb 22, 2017 6:50PM - We've made significant progress resolving the secondary issue, by making some minor tweaks to our configs.  However, a complete resolution will likely require us to schedule some maintenance windows to upgrade our top of rack switches to new versions of JunOS.  We will send out another notice for that tomorrow.

UPDATE: Feb 22, 2017 12:33PM - The widespread issue has been resolved.  However, it has uncovered another issue with a small subset of our top of rack switches at 1801 California St.  We are investigating this other issue as well.  The switches impacted are:

  • sw2-f4
  • sw2-a4
  • sw2-g3
  • sw2-c2
  • sw2-c3
  • sw2-b2
  • sw2-g3
  • sw1-i3
  • sw2-e7
  • sw2-d2

Similar to the last issue, these switches are suffering from micro outages that last for only very short periods of time, and as such our normal monitoring systems are not detecting these outages.
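
To illustrate why short outages can slip past periodic monitoring, the Python sketch below estimates the chance that at least one check lands inside a single outage. The 10-second outage length and the check intervals are hypothetical values chosen for the example, not a description of our monitoring configuration, and the model assumes evenly spaced, instantaneous checks with the outage starting at a random time.

```python
def detection_probability(outage_seconds: float, check_interval_seconds: float) -> float:
    """Chance that at least one evenly spaced, instantaneous check
    falls inside a single outage that starts at a random time."""
    return min(1.0, outage_seconds / check_interval_seconds)


# Hypothetical check intervals: every 5 minutes, every minute, every second.
for interval in (300, 60, 1):
    p = detection_probability(10, interval)
    print(f"check every {interval:>3}s -> {p:.0%} chance of seeing a 10s outage")
```

Under these assumptions, a check every 5 minutes would see a 10-second outage only about 3% of the time, while a once-per-second probe would always see it.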

UPDATE: Feb 22, 2017 10:02AM - We are no longer detecting packet loss to the numerous end points we are monitoring.  At this time, we continue to monitor the situation.  Once we reach 90 minutes with no packet loss to the end points we are monitoring, we will close this issue out.
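
The 90-minute rule above can be expressed as a simple check. This is a minimal Python sketch; the function name `can_close` and the timestamps are hypothetical, and only the 90-minute threshold comes from this notice.

```python
from datetime import datetime, timedelta


def can_close(last_loss: datetime, now: datetime,
              quiet: timedelta = timedelta(minutes=90)) -> bool:
    """True once `quiet` time has elapsed since the last observed packet loss."""
    return now - last_loss >= quiet


# Hypothetical timestamps for illustration.
last_loss = datetime(2017, 2, 22, 9, 0)
print(can_close(last_loss, datetime(2017, 2, 22, 10, 29)))  # False
print(can_close(last_loss, datetime(2017, 2, 22, 10, 30)))  # True
```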

UPDATE: Feb 22, 2017 9:32AM - We have identified a 10G connection with an excessive number of drops.  We disabled this interface at 9:28AM, which caused a brief network blip.  We will continue to investigate and monitor the situation.

Drops: 116,484,566

 

Date: Feb 22, 2017

Time: 7:12AM

Issue:

We are tracking several reports of small periods of packet loss for certain parts of our network, which generally result in a micro-outage of 8-15 seconds every 1 to 2 hours.  During our maintenance window last night, we made adjustments which we hoped would be effective in resolving these issues, but they have not been.

Consequently, we will be engaging in minimally invasive testing and troubleshooting throughout the day today in an effort to resolve the underlying cause of the issue.
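
To put the reported figures in perspective, the short Python sketch below converts "8-15 seconds every 1 to 2 hours" into a fraction of time affected. It assumes exactly one micro-outage per period, which is a simplification of the reports described above.

```python
def downtime_fraction(outage_seconds: float, period_hours: float) -> float:
    """Fraction of time lost if one outage of the given length occurs per period."""
    return outage_seconds / (period_hours * 3600.0)


best = downtime_fraction(8, 2)    # shortest outage, longest gap between outages
worst = downtime_fraction(15, 1)  # longest outage, shortest gap between outages
print(f"time affected: between {best:.4%} and {worst:.4%}")
```

Under that assumption, the affected time works out to roughly 0.11% to 0.42%, which is small in aggregate but still disruptive for long-lived connections.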





Feb
19
Network Maintenance - Feb 20 8PM - 2AM & Feb 21 8PM - 2AM
Posted by Jay Sudowski on 19 February 2017 06:07 PM

UPDATE: 10:00PM Feb 21, 2017 - Our work for the evening is complete.  We are going to observe the changes we made tonight and we will need to schedule another maintenance window next week to finish up the project.

UPDATE: 1:55AM Feb 21, 2017 - Our work for this evening is complete.  We will be back at it tomorrow.

Dates: Feb 20, 2017 & Feb 21, 2017
Time: 8:00PM - 2:00AM (MST)

Purpose of Work:
Replacement of legacy core1/dist2 switches and routers.  During this work, we will be taking core2 / dist1 devices offline and replacing them with new devices that provide us with additional capacity and flexibility.

Impact of Work:
There will be several short periods (5-10 minutes) of packet loss and latency as we isolate the old devices, remove them from the network, and then bring the new devices online.  





Feb
15
EXTENDED TILL 4AM - Network Maintenance - February 16 8:00PM - 2:00AM
Posted by Jay Sudowski on 15 February 2017 08:49 AM

Due to some unanticipated issues (not service impacting), we are extending this maintenance window until 4AM to allow us to complete the necessary work.

 

Date: Feb 16, 2017
Time: 8:00PM - 2:00AM (MST)

Purpose of Work:
Replacement of core2 / dist1 at 1801 California St.  During this work, we will be taking core2 / dist1 devices offline and replacing them with new devices that provide us with additional capacity and flexibility.

Impact of Work:
There will be several short periods (5-10 minutes) of packet loss and latency as we isolate the old devices, remove them from the network, and then bring the new devices online.  





Feb
6
COMPLETE - Network Maintenance - Feb 9, 2017 9:00PM - 1:00AM
Posted by Jay Sudowski on 06 February 2017 02:01 PM

UPDATE - Feb 9, 2017 11:46PM.  Our maintenance work is complete.  We successfully upgraded the line cards on our MX240 border routers and cut over all circuits to new DWDM optics.  Finally, we added an additional 10Gbps of direct interconnectivity between border1 and border2.

Date: Feb 9, 2017
Time: 9:00PM - 1:00AM (MST)

Purpose of Work:
We have recently installed new Juniper line cards in our border routers, and we need to groom all existing interfaces from the old line card, which supports a maximum of 8 10GE ports, to the new line card, which supports 16 10GE ports.

Impact of Work:
There will be several short periods (5-10 minutes) of packet loss and latency as we turn ports down and move them to the new line cards.





Jan
4
Network Maintenance Windows - January 6,7,8 & January 10,12,14
Posted by Jay Sudowski on 04 January 2017 04:17 PM

Dates: January 6, 7, 8 & January 10, 12, 14

Time:  Maintenance window will be open from 8:00pm mountain - 2:00am mountain time

Expected Impact:  Brief periods of latency and packet loss during the maintenance windows.

Purpose of Work:  We will be performing various network upgrades, including:

  • Bringing optical amplifiers online to increase the capacity on our dark fiber ring
  • Adding new line cards to our MX240 routers and grooming existing circuits over to them
  • Replacing our aging MX80s and EX4500 switches at 1801 California St with Juniper QFX5100 switches

During the maintenance window, we will be providing updates regarding specific activities that will be taking place during these windows.  Unfortunately, due to the scope of the work we need to perform, and the interrelatedness and dependencies of all of this work, we are bulk scheduling maintenance windows. 

 

UPDATES:

2017-01-06 21:10 MST: We are beginning the portion of our maintenance with the greatest risk for impact. If you experience any packet loss or delay it is very likely to be related.

2017-01-07 01:15 MST: We have concluded our maintenance activities for the evening.  All systems are operational.  

2017-01-07 20:34 MST: We are beginning our maintenance for the evening.  Tasks tonight include upgrading border1 to the latest JTAC recommended JunOS and then working to bring the replacements for our EX4500 switches online.

2017-01-07 23:50 MST: Our maintenance is concluded for the evening.  We encountered some odd issues with our border routers during the code upgrades, which took substantial amounts of time to investigate.  As a result, we will be working on the EX4500 replacement tomorrow evening.

2017-01-08 20:36 MST:  We will not be conducting any network maintenance this evening, and we will also be rescheduling the timing of the remaining tasks.  The schedule was not maintainable, and at this point most of our staff working on this project have gotten only 4 hours sleep over the past two nights.  A more official notice will follow, but the next maintenance events will be scheduled for Tuesday, Thursday, and Saturday this coming week, with slightly shorter windows (9PM - 1AM).  

2017-01-10 23:36 MST:  During our network maintenance for the evening, we were operating under reduced redundancy.  Unfortunately, as our luck would have it, one of our border routers decided to coredump, causing a network wide outage for approximately 25 minutes.  During these 25 minutes, we were rolling back our previous changes made to effect the maintenance window, which involved reverting both physical and logical networking configurations.  Connectivity was disrupted at 22:50 and restored at 23:15.  We are still actively working to determine more about the outage.  The maintenance window may be extended to address any issues that remain related to this unexpected problem.

2017-01-11 00:43 MST:  All systems are working as expected at this time.  All further maintenance is canceled until we can work with our vendor to get to the bottom of the issues that occurred this evening.

