TCH-Dick Posted October 3, 2014 We are currently investigating a power issue. This issue is affecting several servers and part of our network. Our data center techs are onsite and reviewing the issue now. Thanks for your patience and we will provide updates shortly.
LeeGoldsmith Posted October 3, 2014 What's up with the unni server? I don't even see it in the server status list. Thanks, Lee
TCH-Alex Posted October 3, 2014 We are sorry for the delay in updating this thread; all the staff were busy working on the servers. We have now restored 95% of the servers and are working hard on the remaining servers.
kweilbacher Posted October 3, 2014 Alex, why is coruscant not showing up in the TCH real-time status list page???
gmml Posted October 3, 2014 (edited) I guess I'm in the 5% then? Any update on when everything will be back up? Edited October 3, 2014 by gmml
TCH-Alex Posted October 3, 2014 We have restored 98% of the servers as of now. Just a few more servers remain, and the entire team is working on them.
Jhaacker Posted October 4, 2014 Please provide an update. Our site is on boblo. It has been down for over 6 hours.
Sub_John Posted October 4, 2014 I'm getting worried and have missed some deadlines. Hope the backup was there.
sabathedog Posted October 4, 2014 Any chance of a status update? It's been 1 1/2 hours since 98% and we're still down.
gmml Posted October 4, 2014 (edited) Approaching 8 hours down now. 4 hours since 95% and 2.5 since 98%. Edited October 4, 2014 by gmml
StuartBridge Posted October 4, 2014 (edited) My server is up, but the website for my reseller site, http://www.npmahome.com/, is missing. My other sites under the reseller account are working, but not the main domain. When will this be fixed? Edited October 4, 2014 by StuartBridge
TCH-Alex Posted October 4, 2014 We are sorry for the inconvenience. We are still working on the last set of servers, which are not up yet. We understand the downtime is painful, but kindly allow us some time to work on the remaining servers. We appreciate your patience and cooperation.
Head Guru Posted October 4, 2014 Hello, All services were restored a few hours ago. We have one pending issue, which is an emergency restoration of the server unni. All other services (shared, reseller, dedicated, VPS, and colocation) have been restored. I will be releasing a full disclosure once all the facts of this incident are compiled. As Alex stated, we are all very sorry for this issue, and we will continue to strive to do our best to handle any issues that arise. Thank you for your business.
jbsquires Posted October 4, 2014 I appreciate the dedication you guys have, I know when things go south it's a battle to get them turned around.
Head Guru Posted October 4, 2014 jbsquires said: "I appreciate the dedication you guys have, I know when things go south it's a battle to get them turned around." You are the reason I love this job so much. Thank you for your kind words and, more importantly, thank you for your support and business.
Blackcat Posted October 5, 2014 Years and years a proud customer from overseas. Thank you guys for all the hard work. Simply the best.
kweilbacher Posted October 6, 2014 When can we expect some type of post-mortem report? I have customers to whom I need to provide a response about this outage. Thanks.
Head Guru Posted October 6, 2014 Blackcat said: "Years and years proud customer from oversea Thank you guys for all the hard work. Simply the best" Thank you so much for your support, it means the world to us.
Head Guru Posted October 6, 2014 kweilbacher said: "When can we expect some type of post-mortem report? I have customers that I need to provide a response to this outage. Thanks." We are waiting for a report from the UPS manufacturer on what went wrong with the UPS unit. I have some preliminary data, but I am holding off until I am confident in it.
Head Guru Posted October 14, 2014 Update concerning the outage that occurred: The incident was due to a tripped 250A circuit breaker in the wraparound bypass cabinet on the output side of our UPS-1 system. Our three other UPS units, UPS-2, UPS-3, and UPS-4, were completely unaffected. This issue caused a power disruption to circuits fed from UPS-1 only. The UPS vendor has investigated the cause of the breaker tripping and has not identified any faulty equipment downstream of the affected breaker at this time. The breaker was replaced as of October 10th, 2014, and we continue to monitor the situation. Our vendor has tested UPS-1, verified that it is producing proper voltages and waveforms, and believes the problem with this system was directly sourced to the breaker. We are still awaiting a final invoice and report from our UPS vendor, and once this is in my hands I will post it directly to the forums. Thank you
Head Guru Posted October 22, 2014 Here is our official report concerning the outage that occurred on UPS-1.
On Friday, October 3, one of four UPS systems in our DC1 facility experienced a fault causing it to drop customer load. We have concluded our analysis of this incident and will be presenting the results here:
At approximately 2:43pm on Friday, October 3, there was a brief utility power interruption. This caused all UPS systems to operate on battery power for a brief period while the generator came online. All systems operated properly and customer load was supported by generator power for a period of approximately 30 minutes.
On re-transfer to utility power, there was an unusually “hard transfer”. This means that the automatic transfer switch (ATS), which normally attempts to transfer when the sine waves of the utility and generator power are at approximately the same levels, transferred with a phase misalignment resulting in a more disruptive transfer event than normal. This is not normally an issue for customers since all critical customer loads are UPS protected and UPS systems will filter this as they would any other power anomaly.
In this past event, UPS-1 did not filter the transfer event in the usual fashion. UPS-1 instead triggered an automatic internal bypass of the inverter. The unit will do this in the case of downstream overload conditions as well as in the case of internal equipment faults. There was an event in the fall of 2013 during which one of the two internal circuit breaker motor-operators that comprise the load transfer system failed. That unit was replaced last year and the new breakers (both units were replaced as a precautionary measure last year) operated properly in this recent event. The combination of the hard transfer and the automatic bypass triggering resulted in one of the isolation breakers in the wraparound bypass cabinet tripping. This caused customer load to be dropped from this UPS.
The UPS was manually bypassed to restore customer load while emergency service was being arranged.
Subsequent tests of the UPS inverter showed proper waveform and voltage levels. No internal problems were found with the inverter or bypass system. We believe the tripped circuit breaker was caused by transformer inrush (magnetizing) current due to the combination of the automatic bypass event and the hard transfer of the ATS. Upon the successful completion of testing, and the corrective actions taken (described below), customer load was retransferred to the UPS on the Saturday immediately following the UPS bypass event.
We believe the bypass event was caused by the UPS triggering the automatic bypass system at the time of the hard transfer due to the action of a TVSS system (which is essentially a very large surge protector). There are multiple TVSS systems in the building to protect against power transients such as those caused by switching large, inductive loads such as motors and transformers.
We have taken the following corrective actions to ensure this problem does not recur:
1- We have installed a new TVSS on the 120/208 side of UPS-1 distribution transformer #2.
2- We have inspected the trip levels for all critical-bus circuit breakers to ensure the instantaneous trip levels are set properly.
3- We have, as a precautionary measure, restrung one of two battery strings on UPS-1 that was approaching end of life.
4- We have checked all other TVSS systems in the building for proper operation.
UPS-1 recently, in the fall of 2013, underwent a full scheduled maintenance which included the replacement of all AC and DC bus capacitors as well as several fans that were not performing satisfactorily. This work was performed simultaneously with the replacement of the faulty circuit breaker motor operator mentioned previously.
There are no wear items within the unit that are near end of life at this time so the UPS is not at increased risk of failure.
We do not anticipate any further issues with our UPS-1 system.
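For readers wondering why a phase misalignment makes a transfer "hard", here is a minimal sketch in Python. It is an illustration only, not part of the vendor's analysis: it assumes two ideal sine waves of equal amplitude and frequency, and a nominal 120 V RMS source, which is an assumed figure (the report does not give the voltage at the ATS or the actual phase offset). The function name `peak_voltage_mismatch` is hypothetical.

```python
import math


def peak_voltage_mismatch(v_rms: float, phase_offset_deg: float) -> float:
    """Peak instantaneous voltage difference between two ideal sine waves
    of equal amplitude and frequency that are phase_offset_deg apart."""
    v_peak = v_rms * math.sqrt(2)
    delta = math.radians(phase_offset_deg)
    # sin(wt + delta) - sin(wt) = 2 * sin(delta / 2) * cos(wt + delta / 2),
    # so the difference between the two sources peaks at:
    return 2 * v_peak * abs(math.sin(delta / 2))


# Assumed nominal source voltage, for illustration only.
ASSUMED_V_RMS = 120.0

for offset_deg in (0, 5, 30, 90, 180):
    mismatch = peak_voltage_mismatch(ASSUMED_V_RMS, offset_deg)
    print(f"{offset_deg:>3} deg offset: up to {mismatch:6.1f} V step at transfer")
```

The step is zero when the two sources are in phase and grows with the offset, reaching twice the peak voltage at 180 degrees. That is why an ATS tries to switch when the waveforms line up. Real transfer behavior also depends on the connected load, including the transformer inrush current mentioned in the report, which this sketch does not model.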