Events Of 5/10/2012


Head Guru

Dear TCH Family,

As some of you may be aware, today we suffered an outage at our Troy, Michigan Data Center. I want to take a few moments to explain what happened and what actions we took to correct the issue.

 

At approximately 1:35PM EST today we lost utility power from Detroit Edison. Normally, when utility power fails, a transfer switch moves our entire load over to our UPS Battery Farm. Once the emergency generator has started, we automatically transfer to the generator (Genset) for power.

When the utility power failed, our systems did attempt to transfer to UPS power. However, the transfer switch failed and we never got backup power. This meant all our equipment in the Data Center went dark, with the exception of a few isolated racks that were on a secondary UPS and transfer switch. Even the racks that did not lose power lost network connectivity, because the network gear was dark.
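For readers who think in code, the power path described above can be modeled roughly as follows. This is only a simplified sketch, not our actual control logic: the function name, the `Source` enum, and every parameter are hypothetical, and it assumes that both the UPS and the generator feed the load through the same transfer switch unless that switch is manually bypassed.

```python
from enum import Enum


class Source(Enum):
    UTILITY = "utility"
    UPS = "ups"
    GENERATOR = "generator"
    NONE = "none"  # the load is dark


def select_power_source(utility_ok: bool,
                        switch_ok: bool,
                        generator_running: bool,
                        ups_charged: bool = True,
                        switch_bypassed: bool = False) -> Source:
    """Return which source carries the load in this simplified model."""
    if utility_ok:
        return Source.UTILITY
    # Assumption: a manual bypass wires the generator straight to the load.
    if switch_bypassed:
        return Source.GENERATOR if generator_running else Source.NONE
    # Assumption: every automatic backup path runs through the transfer
    # switch, so a failed switch leaves the load dark.
    if not switch_ok:
        return Source.NONE
    if generator_running:
        return Source.GENERATOR
    if ups_charged:
        return Source.UPS
    return Source.NONE
```

In this model, the failure described above corresponds to `select_power_source(False, False, True)`, which returns `Source.NONE` even though the generator is available. Passing `switch_bypassed=True` with the generator running returns `Source.GENERATOR`.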

 

At 2:00PM we were able to bypass the transfer switch and move to generator power, and at 2:10PM power was restored to all of the Data Center. The main power outage lasted approximately 45 minutes, and our core network was offline for a total of 35 minutes.

 

Once power was fully restored, our team went into action and started a row-by-row, rack-by-rack inspection of all our equipment. This included power cycling servers that did not come up properly on their own. Several switches also failed to power up properly and were manually rebooted. All our routers and core networking came up without any intervention.

 

We ended up replacing several network components that were damaged during the power issue.

 

Once all the servers were powered up, we turned to the servers that had power but were not responding to network ping. Approximately 60 servers needed to be un-racked and placed on the work bench for manual inspection. We quickly narrowed the number of downed servers to about 20, and by 5PM we were down to just a handful that still needed inspection. These servers needed power supply replacements and/or manual file system checks.
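The "powered up but not answering ping" check can be sketched along these lines. This is a minimal illustration, not the tooling we actually used: the function names are hypothetical, and `ping_once` assumes a Linux-style `ping` where `-c 1` sends one request and `-W 1` waits up to one second for a reply.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str) -> bool:
    """Send one ICMP echo request with the system ping (Linux-style flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_unresponsive(hosts: Iterable[str],
                      probe: Callable[[str], bool] = ping_once) -> List[str]:
    """Return the hosts that fail the probe, in their original order."""
    return [host for host in hosts if not probe(host)]
```

Keeping the probe as a parameter means the same sweep can be run with a different check, or with a stand-in function when no network is available.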

 

We ended up with two shared servers that took some time to get back online and two dedicated servers that were also giving us some issues.

 

At around 5:30PM we rang the all-clear bell, and everything in the data center was online and working as intended.

 

The faulty transfer switch is being replaced, and we will stay on generator power until the replacement is ready. Only then will we switch back to utility power.

 

Most of the TCH family saw only a short 45-minute outage, but some family members were down for a longer period of time. I want to personally thank everyone who reached out to me during this outage. Your kind words and support mean so much and make our job that much easier. I also want to reach out to the couple of very upset clients who called me during the outage and offer my sincere apologies. As most of the TCH family knows, we do not have many outages, and when they do occur we work our tails off to get things back to normal.

 

As always, if you would like to discuss things, please feel free to respond here or call me personally.


Just a follow-up note.

 

For the two clients that had hardware issues on their dedicated servers, we ended up giving both significant free hardware upgrades.

 

Utility power was restored to our data center as promised. The faulty transfer switch has been replaced, and we have been back on utility power since around 12AM last night.

 

Thank you
