Network Issues Resolved

As some of you are aware, there were some issues last week with the network at the TCH Data Center.


We were having servers fall offline for no apparent reason. At first we assumed it was a bad switch, as all the servers that were going offline were on the same rack and all connected to the same aggregation switch. We replaced the switch with a brand new unit, and all looked great for about 2 hours. Then the issue presented itself once again on the same rack of servers.


At that point I should have researched further, but I assumed the issue was at the core router. So we swapped out the line card in our router and hoped the issue would go away. We looked good for several hours. Then the issue presented itself again, with random servers simply falling off the network. At that point I knew it had to be a configuration problem somewhere in our network.


After running a few checks and watching network traffic, we found the cause and resolved the issue.


Here is what was found.


All of our servers run over two networks. We run a public network for all the public traffic to the servers, and a separate network for our internal backups and such. During a routine server turn-up, we mistakenly connected a backup link to our public network. This caused ARP issues across the network. Once the issue was identified, we corrected it immediately, and since then we have not had one second of downtime due to this issue.
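
For anyone curious what an ARP problem like this can look like in practice: one common symptom when a link lands on the wrong network is the same IP address being answered by more than one MAC address. The snippet below is only a minimal sketch of how that could be spotted, not the check we actually ran. The function name `find_arp_conflicts` and the sample addresses are hypothetical, and it assumes you have already collected (IP, MAC) pairs from ARP tables or a packet capture.

```python
from collections import defaultdict


def find_arp_conflicts(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    `observations` is an iterable of (ip, mac) string pairs, for example
    gathered from ARP tables or a packet capture (assumed input format).
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so "AA:BB" and "aa:bb" count as the same MAC.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data using documentation-range addresses.
sample = [
    ("192.0.2.10", "00:11:22:33:44:55"),
    ("192.0.2.11", "00:11:22:33:44:66"),
    ("192.0.2.10", "00:11:22:33:44:77"),  # same IP, different MAC
]

conflicts = find_arp_conflicts(sample)
for ip, macs in conflicts.items():
    print(f"{ip} is claimed by multiple MACs: {', '.join(macs)}")
```

An IP that shows up here is worth tracing back to its switch ports, since it may point to a cable plugged into the wrong network.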


Things are looking great once again.


