
Intermittent Network Outages


Early this evening we began experiencing hardware issues on our new Cisco 6509 switch/router, whereby the primary supervisor card (routing card) began crashing. Although it was failing over to our secondary supervisor card as intended, there were brief outages (<5 minutes) on inbound traffic while the network readjusted the traffic flow. We have subsequently taken the primary supervisor card out of service and are currently running the router/switch off a single supervisor card, which will not impact network performance in any way. Although this removes our immediate "hot" fault tolerance, fear not, as we have a spare 6509 that we just migrated off of, as detailed in a previous maintenance post ( http://www.totalchoicehosting.com/forums/i...showtopic=35453 ).


If the situation begins to worsen at any time, we will simply swap the network distribution cables into our other Cisco 6509 switch/router, which is always powered on and ready to receive load in the event of an emergency. It also has 2 fault tolerant supervisor cards and is a tried & true setup that ran the TCH network for over 6 months without any issue.


We will be ordering a replacement supervisor card for the current production 6509 and it should arrive later this week. When we receive the replacement part, there should only be a brief outage window of 5 minutes or less while we swap the new component into the 6509, which can be done with it still powered up. The outage itself comes from swapping the fiber cable back into the new component, a step that is nearly seamless in terms of downtime.


We will post further updates as the situation progresses. Our apologies for any inconvenience this situation may have caused.


The replacement Cisco gear was overnighted to our facility and arrived this morning at 10AM. Ryan has been working on configuring the new gear since its arrival.


Last night at around 5:15PM EST, the secondary supervisor card on our new router failed, so we moved to our backup "hot spare" Cisco 6509 router, which has run flawlessly since the move back. We have been running fully redundant since then. Our total downtime was around 15 minutes during this unexpected event.


Our plan is to ultimately get moved back to the newer 6509. However, this time around we are going to put the new gear under a full load test for approximately two weeks to make sure we do not run into the same issues again.


We will be announcing another window of maintenance on July 5, 2008 from 11pm till 3am. We do not expect downtime during this window to exceed 15-20 minutes. This window is needed to allow us to move our entire network over to the new routing gear.


Just a note that our router is fully redundant with two routing supervisor cards, two power supplies and multiple network interface cards. We also stock an entire "hot spare" with the same configuration. The hot spare is always ready to take the load in the event that our primary router fails. After looking over the failure of our new router, we were very happy that the hot spare system was in place and prevented a prolonged outage at the TCH NOC.


We will be posting more information as it becomes available and also will be posting about the new window of maintenance on the 5th of July.


