Everything posted by TCH-Ryan
-
Maintenance Announcement
TCH-Ryan replied to Head Guru's topic in TCH Data Centers Network Announcements
We have successfully completed this upgrade with zero downtime. Thank you and have a good night. -
Maintenance Announcement
TCH-Ryan replied to Head Guru's topic in TCH Data Centers Network Announcements
We will begin this maintenance at 11 PM EST sharp. I will keep you updated via this thread. -
Maintenance Announcement
TCH-Ryan replied to Head Guru's topic in TCH Data Centers Network Announcements
We are preparing to start this maintenance item. I will be providing updates throughout the maintenance window. Thanks! -
Maintenance Announcement
TCH-Ryan replied to Head Guru's topic in TCH Data Centers Network Announcements
We regret to inform you that this maintenance has to be postponed due to supply issues outside of our control, caused more or less by the holidays. The new maintenance window will be on Saturday, January 3rd at 11 PM EST (-0500). We will post further updates should this maintenance window require further changes; our apologies for any inconvenience this has caused. -
The reseeding process for all servers that back up to buserver2 has completed. We have re-enabled the regular incremental backup schedules, and data is now available for restores on the server as normal. Thank you, everyone, for your patience while we worked to bring buserver2 back into a production state.
-
We are currently at 30 of 48 systems reseeded onto the backup server, steadily getting there.
-
Data is still steadily repopulating onto buserver2. In total there are 48 servers that store data on this system, so it may take up to 24 hours for this task to complete. We will provide further updates as soon as they become available; thank you again for your patience.
-
It became clear to us that the data set for backup server 2 was in such a corrupt state that the fsck was never going to complete, and even if it did, it would still have resulted in a file system that was both inconsistent and riddled with issues. After 18 hours of struggling to get the data restored in a viable condition, the call was made to simply reformat the partition and repopulate it with data. Although the backup servers all run RAID setups, RAID does not protect against file system corruption, which is what happened on this system when we experienced the power loss at the data center. The process of repopulating the server with backups from all associated servers will take some time, and we will post updates on the progress of this task as best we can. We apologize in advance for any inconvenience this may cause and thank everyone for their patience and understanding.
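For those curious, the recovery path described above boils down to a few standard Linux commands; the sketch below is only illustrative, and the device name /dev/sdb1, the mount point /backup, and the ext3 file system are assumptions rather than our actual layout:

umount /backup              # take the corrupt partition offline
fsck -y /dev/sdb1           # the repair attempt, abandoned after 18 hours
mkfs.ext3 /dev/sdb1         # reformat the partition with a fresh file system
mount /dev/sdb1 /backup     # remount it and begin reseeding from the source servers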
-
The file system check/repair is still in progress; we will provide further updates on this matter as soon as possible.
-
We are currently aware of an ongoing issue with Level3 networks resulting in major packet loss across its US network. A phone call has been placed to our carrier provider to make changes to the routing scheme our network operates on, so as to offload traffic from Level3 until they resolve the issues. We understand the clear inconvenience this is causing and thank you for your continued patience while we work to resolve it. The depth of the issue can be seen and actively monitored using the following public resource: http://www.internetpulse.net/ Further updates will be provided as soon as they become available.
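If you would like to check the path from your own location, a per-hop packet loss report can be generated with the standard mtr utility (a generic diagnostic, not a TCH-specific tool):

mtr --report --report-cycles 50 www.totalchoicehosting.com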
-
Zach, let me first say that you did not come off as rude by any means, and the concerns you have expressed are very valid. I will not beat around the bush on this matter; the bottom line is cut and dry: we screwed up. Yes, that is what I said - we dropped the ball on this one. I wish I had more comforting words to offer, but I am not going to make excuses in this situation, as you and the community expect - and deserve - better. So, on behalf of TCH, I apologize for the inconveniences we have caused you and your business during the migration process to newyork, and I hope we can restore your faith in the standards you hold TCH to and that we strive to maintain.

The newyork migration had a few elements that caused it to stand out from our previous migrations, in that we parted ways with some successful habits and also implemented some new "features" that together caused numerous issues in the overall migration process. Further, there was also some run-of-the-mill human error involved that only compounded the situation, in addition to a long-standing issue of not importing custom MX records on migrations, which only added fuel to an already blazing fire.

The first and foremost issue began at the onset, before the migration was even underway. During the preparation stages for a migration we typically run "fresh" backups on the older system, in this case server319, and then run a script that purges the backups of old accounts that are no longer active on the system. Over time this has been a routine that we were mindful of but never really acted upon, other than manually running said script when required, and in this situation having forgotten to run it led to a number of inactive/previously deleted accounts being restored onto newyork. To see that this issue is corrected going forward and does not recur as a matter of simple absent-mindedness, we have set the script to run on a regular basis on servers (twice a month), in addition to it now being integrated with the manual backup command we use in the pre-migration process. Those two changes combined provide more consistent assurance over the state of the local backups we maintain, along with those that get transferred to new servers during migrations.

Second to the above was a decision to enable by default a feature called MySQL InnoDB support. This is a feature that has caused us issues in the past; namely, it has filled up the database partitions on servers with binary InnoDB files, causing the SQL service to go offline until manual intervention by a system administrator. This decision was one that really separated us from previous migrations, and it took a big bite out of us when it hit newyork, causing the same age-old issue of filling the space on the database partition. In hindsight this issue was rather easy to correct, as we have done in the past: simply disable the option in our MySQL server configuration file and leave well enough alone. That, however, is not the only thing we did; we also researched the issue further and uncovered a number of finer-grained SQL configuration options in MySQL 5.x that allow us to control the binary data files created by InnoDB, options that did not exist in the SQL versions we used earlier in the year. The short of it is that going forward we will be leveraging the new configuration options, along with exercising more discretion about changing configuration standards of new builds "on the fly," which was, on our part, just a really bad decision.
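To give a sense of what those finer-grained options look like, here is a hypothetical my.cnf fragment; the option names are standard MySQL 5.x settings, but the values are illustrative only and are not our production configuration:

[mysqld]
innodb_file_per_table                                   # one tablespace file per table instead of one ever-growing shared ibdata file
innodb_data_file_path = ibdata1:10M:autoextend:max:2G   # cap how large the shared tablespace may grow
expire_logs_days = 7                                    # automatically purge binary logs older than a week
max_binlog_size = 100M                                  # rotate binary logs at a fixed size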
Finally, and perhaps one of the most consistent issues we encounter through almost all migrations, is that we do not traditionally migrate custom MX records or zone file entries. This is something we have always just bitten the bullet on in the past, as it is a very cumbersome process to handle. In addition, we like to start with clean DNS records on the new servers, as it provides a more reliable foundation for syncing into our DNS cluster. I do admit, though, that this is one of those issues where we have long "accepted" the status quo of doing nothing, and the days are long gone where we can sit idle to the inconvenience this causes those who host with TCH. With that said, I have personally undertaken the task of developing a special set of scripts that will, during the migration process, identify accounts with custom DNS entries and import those onto the new server; a rough sketch of the idea follows below.

I hope that this post addresses the concerns you have, in addition to providing you some insight into the problems encountered with this particular migration and how we are working to see that they do not happen in the future. On behalf of everyone at TCH, thank you for hosting with us, and we express our sincere apologies for the inconveniences you have encountered during this migration process.
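Those scripts are still in development, so the following is only a rough shell sketch of the detection half of the idea; the zone file path and the assumption that a stock MX record points back at the domain itself are for illustration only:

for zone in /var/named/*.db; do
    domain=$(basename "$zone" .db)
    # flag zones with any MX record that does not point back at the domain itself
    if grep -E '[[:space:]]MX[[:space:]]' "$zone" | grep -qv "${domain}\.$"; then
        echo "custom MX entries found in ${domain}"
    fi
done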
-
We take all security very seriously here at TCH, and we recently began taking another look at some of the core infrastructure we manage, including DNS, which highlighted our still-excessive use of DNS recursion. Now, although this is not that big a deal in the capacity that TCH uses it, remote reporting services such as DNSreport and intoDNS feel differently on the matter, and it gives clients a conflicting sense of security.

The first thing you should understand is that recursive DNS was created to speed up the performance of DNS on the Internet by allowing subordinate DNS servers to cache results from other DNS servers for the domains those servers host, which reduces traffic and response time for DNS requests. The TCH network generates over 60 million DNS queries every single day, and simply abandoning something that has allowed our networks to maintain very efficient DNS performance over the years was not a simple task. What makes recursive DNS queries dangerous is that, under certain conditions, the cache on the DNS server can be poisoned with forged records, and any domains hosted on that DNS server would then begin reporting those forged records. This, however, is a symptom of broader mismanagement of DNS servers, and recursive queries alone do not make for an insecure DNS server. We feel strongly that security is part of a larger, layered approach; to depend on the status of a single feature or resource (i.e. recursive DNS) as a measure of security is really very misleading. For further reading on recursive DNS and the recent DNS protocol vulnerability, please see: http://www.totalchoicehosting.com/forums/index.php?showtopic=32238&st=0&p=226849&&do=findComment&comment=22684

In any case, we felt the time had come to evolve our DNS infrastructure to the next level and provide public assurance of the integrity of our DNS servers, of which we presently manage 6 public DNS servers. This meant removing any conflicting doubt in the security setup, which in turn translated into disabling recursive DNS queries across the board. The major goals we undertook in this process were deciding what to do with the 60 million+ DNS queries our networks generate every day and how to mitigate the performance impact of removing recursive DNS. In addition, we also proceeded with a full audit of the records on all our DNS servers, rebuilt fresh configuration files (these are massive 125k+ line files), and ensured the proper operation of the synchronization mechanisms between DNS servers.

The first task we went ahead with was the actual auditing of the DNS servers, as we felt this was a building block that would allow us to conduct later tasks with confidence that the DNS servers were operating properly.
- This involved a complete review of all the domain names we presently host and the consistency of the records versus those stored on the actual web servers that generate them.
- We then ensured that ownership permissions on all DNS-related data were set properly, so that all components of our DNS systems can properly access the data.
- This was followed by rebuilding the full configuration files that load the tens of thousands of domain configuration files for the DNS servers. The newly generated configuration files came in at 125,000 lines versus the old configuration files, which were bloated at roughly 175,000 lines, mostly due to lots of erroneous spacing and old record references that no longer existed.
- Finally, we made sure that the records and configuration files matched up between our 3 pairs of DNS servers, indicating proper synchronization of DNS changes.

With the basics of the auditing phase behind us, we then began to implement changes on all our servers in how they perform DNS resolution requests; we altered this setup so that all DNS queries for domains we do not manage are forwarded to a special set of DNS caches. These are high-performance DNS systems that host no domain names directly and as such do not fall victim to the issues of DNS cache poisoning with respect to recursive DNS, similar to how openDNS.com operates. This is a different approach than most web hosts take, as they typically "filter" recursive queries from the world so that only internal web servers and other related systems may perform recursive lookups, which presents the very real risk that any compromised system within your network can potentially poison your DNS servers. These new DNS cache systems now absorb all our external DNS load, and in the process we have increased the performance of DNS resolution on our network twofold. We did not come to this conclusion with a simple test of a couple of DNS requests; we came to it by resolving over 75,000 DNS requests of both valid and invalid domain records against our old and new setups. The timing of requests against our old DNS setup came out to roughly 381 requests per second on average, while our new setup comes out to 736 requests per second on average, which translates to handling the load of almost 3 million requests per hour during peak weekday hours without a single hiccup. These figures are relative to a single DNS server, so taking them into perspective across multiple DNS servers, by increasing performance we have in turn increased the scalability of our DNS infrastructure for the future. Having this system in place and performance tested, we then proceeded to disable recursive lookups on all our DNS servers, which was a quick and painless task that involved no direct downtime of any resources.

In summary, we have increased the consistency and ease of management of our DNS servers, in addition to meeting the end goal of retaining, and even increasing, performance while removing our dependency on recursive DNS. You may use http://www.intodns.com to verify that recursive lookups against our DNS servers are in fact disabled for your domain. The only exception is resellers using vanity (custom) DNS; these DNS servers are separate from our core infrastructure (26 smaller-scale DNS servers) and are being updated over the next 24 hours, so please check back if reseller-controlled domains still report recursion as enabled.
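For anyone curious what this looks like in configuration terms, below is a minimal sketch in BIND named.conf syntax; it illustrates the end state described above and is not a copy of our actual files (the cache addresses shown are placeholders):

// on the 6 public authoritative servers: serve only hosted zones, never recurse
options {
    recursion no;
    allow-recursion { none; };
};

// meanwhile the web servers resolve through the dedicated cache layer
// via /etc/resolv.conf, e.g.:
//   nameserver 10.0.0.10
//   nameserver 10.0.0.11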
-
Scheduled Network Maintenance :: Aug. 16th
TCH-Ryan replied to TCH-Ryan's topic in TCH Data Centers Network Announcements
The maintenance for the VLAN switch went over completely flawlessly, with the DNS servers that mainly live on the VLAN in question never experiencing even the slightest of hiccups. The fiber failover, on the other hand, identified some issues with our upstream BGP (Border Gateway Protocol) setup that caused the first failover test to fail and resulted in <2 minutes of downtime. Once the issue was identified and corrected on the applicable router, we conducted 3 subsequent failover tests in which we took down the fiber feeds in random succession, making sure traffic continued to flow over the network on the alternate paths as intended - which it did. In summary, this was a very productive maintenance that allowed us to validate our redundant network infrastructure and to identify and correct issues before they impact our network availability in the event of a serious failure. We have enacted a policy of conducting quarterly failover maintenance windows to continually test our redundancy state, making sure both the router configurations we control and those of our upstream providers still operate together as intended to provide the best fault tolerance possible. -
Scheduled Network Maintenance :: Aug. 16th
TCH-Ryan replied to TCH-Ryan's topic in TCH Data Centers Network Announcements
We are still on schedule to begin this maintenance in 25 minutes; further updates will be posted as the situation progresses. -
We will be conducting scheduled maintenance on our core network hardware on Saturday, Aug. 16th at 11 PM EST. We do not expect any outages of more than 5-10 minutes, and the maintenance window will wrap up at 11:30 PM EST. This maintenance is to conduct a failover test on our dual fiber circuits to ensure proper fault tolerance and, in addition, to move a small IP address block assigned to some of our core servers (such as reverse DNS) to a new VLAN device. Further updates will be provided as the maintenance window approaches.
-
Although we do strive to maintain our servers at peak performance and security, sometimes it is a matter of prioritizing which components must be updated first relative to the risks presented. At the end of the day, the risks mitigated by the PHP 5.2.6 update rank far lower than those addressed by kernel updates and server software updates (such as BIND DNS, Exim, etc.). We already have a plan of action in place to roll out PHP 5.2.6 on our servers over the coming week, and in addition we will be deploying a number of enhancements such as mysqli, pdo, soap, and much more as standard components on all TCH servers. If you have any further concerns, please feel free to detail them and we will address them for you.
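Once the rollout reaches your server, a quick way to confirm those extensions are present in a PHP build is the standard PHP CLI module listing (a generic check, not a TCH-specific tool):

php -m | grep -Ei 'mysqli|pdo|soap'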
-
Let me try to address all the concerns in this topic, and I will begin with the most recent DNS vulnerability disclosed by security researcher Dan Kaminsky. This vulnerability deals with a fundamental protocol flaw in DNS and how the protocol randomizes, or fails to randomize, certain values during the communication process, such as the source port of DNS queries and special sequencing ID values. The vulnerability exposes what we call a potential DNS cache poisoning attack: since an attacker can potentially predict certain values in the communication between client and server (you and the DNS server), they could potentially inject unwanted values into an already-established DNS query, or forge DNS results to the server, which would typically cache those results.

For example, let us assume we were vulnerable (which we are not, and I will show you that in a moment) and you were requesting DNS for totalchoicehosting.com (as you do every time you load this site). An attacker could cause the returned IP results to point to a site of malicious intent, which we will for this example call something.com. So when your browser is returned the DNS results, instead of totalchoicehosting.com appearing, you would find yourself sitting at something.com. The real sting of this attack lies in the way caching works: your client DNS servers (typically your ISP's) would cache this result and then hand that same malicious result out to any subsequent requests (until it times out). That is the most simplified example I can put forward of this issue, and as much knowledge as I do have about DNS, I reserve the right to be slightly incorrect in the above scenario =)

Now, on to the actual vulnerability status of the TCH DNS servers. At present we have 6 deployed public DNS servers; all of these servers run on high-end hardware, are subject to thousands of queries per minute, and maintain that request load with great success and incredibly low resource usage (on the order of less than 0.15% CPU and 40% memory usage day to day). The way we can test for the above-mentioned vulnerability is with the dig command, which shows us a standard deviation value for the sequencing/port values in a DNS query. A higher deviation per query means a value is that much harder to predict, whereas a low or zero deviation value means it is easier to predict and the server is therefore vulnerable. The command executed is: dig +short @IP porttest.dns-oarc.net txt

"64.246.50.105 is GOOD: 26 queries in 1.4 seconds from 26 ports with std dev 15462.95"
"65.254.32.122 is GOOD: 26 queries in 1.8 seconds from 26 ports with std dev 19500.80"
"207.44.236.88 is GOOD: 26 queries in 1.4 seconds from 26 ports with std dev 18339.32"
"72.9.224.186 is GOOD: 26 queries in 1.8 seconds from 26 ports with std dev 20698.03"
"204.11.34.66 is GOOD: 26 queries in 1.7 seconds from 25 ports with std dev 18767.98"
"204.11.34.67 is GOOD: 26 queries in 1.7 seconds from 26 ports with std dev 18929.89"

As you can see above, we are running query tests against our 6 DNS servers (the IPs at the beginning of the results) using a 3rd-party DNS testing facility (porttest.dns-oarc.net), and the results all show very high deviation values across multiple source ports, meaning we are not vulnerable. Furthermore, you do not need to take my word for it, as anyone can run the same test.
Now, recursive DNS queries - this is a touchy subject and one we have long discussed here at TCH, but at the end of the day we find recursive queries to be an advantageous feature to have enabled on our DNS servers, as it affords certain luxuries that would otherwise not be available to our servers running on a very diverse set of networks, not to mention huge performance benefits. Although there are people out there who claim the mere fact of having recursive queries enabled is a sign of a poorly administered DNS server, I would argue that recursion is simply a feature that facilitates attacks on other underlying administrative issues, such as allowing public DNS zone transfers (AXFR). We go to great lengths to ensure that our DNS servers have all the appropriate restrictions in place, not only to safeguard the domain names you host with us but also to preserve the integrity of the TCH image by not allowing our DNS servers to be used to facilitate external attacks. All the TCH DNS servers only allow zone transfer requests over the local IP address (127.0.0.1) and from trusted clustered client servers (the servers that host your sites) using special RSA-based access hashes. I would really enjoy indulging further in this topic, but this reply has grown much larger than I had initially expected, so I am going to wrap it up by saying that we take great care in addressing the concerns of our clients, and despite all I have said above, as soon as we have completed migrations from all partner data centers into our Troy, Michigan facility, DNS recursion will be a thing of the past at TCH. I hope I have adequately addressed the concerns in this topic, and if you have any further questions or concerns, please go ahead and reply so I can take a crack at them.
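For the technically inclined, zone transfer restrictions of the sort described above are commonly expressed in BIND's named.conf; the fragment below is a generic illustration using a standard TSIG key rather than our actual clustering setup with its RSA-based access hashes, so treat the key name and secret as placeholders:

key "cluster-xfer" {
    algorithm hmac-md5;
    secret "<base64 secret here>";
};
options {
    // transfers allowed only from localhost and holders of the shared key
    allow-transfer { 127.0.0.1; key "cluster-xfer"; };
};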
-
Scheduled Network Maintenance :: July 5th
TCH-Ryan replied to TCH-Ryan's topic in TCH Data Centers Network Announcements
The switchover has been completed without issue. We will continue to monitor the new hardware in the coming days to ensure there are no recurring issues. Thank you for your patience while we have worked to resolve this situation. -
Scheduled Network Maintenance :: July 5th
TCH-Ryan replied to TCH-Ryan's topic in TCH Data Centers Network Announcements
We will be proceeding as scheduled with the maintenance in 5 minutes; further updates will be posted as they become available. -
We will be conducting scheduled maintenance from Saturday the 5th of July through the morning of Sunday the 6th, beginning at 11:00 PM and ending at 03:00 AM EST (-0500). Although we do not expect any prolonged downtime, there will likely be roughly 5 minutes of visible downtime. This maintenance window is to put our new Catalyst 6509 router/switch back into production following a 2-week burn-in phase to adequately test the hardware that previously failed on it and was replaced ( http://www.totalchoicehosting.com/forums/i...mp;#entry225805 ). Although we are confident in the new hardware, it is never easy to reproduce production loads in a controlled environment. As such, despite the extended burn-in phase we have put it through, the real trial will be when we flip it back into production. We will provide more updates as soon as they become available.
-
Early this evening we began experiencing hardware issues on our new Cisco 6509 switch/router, whereby the primary supervisor card (routing card) began crashing. Although it was failing over as intended to our secondary supervisor card, there were brief outages (<5 minutes) on inbound traffic while the network readjusted the traffic flow. We have subsequently taken the primary supervisor card out of service and are running the router/switch off a single supervisor card at the moment, which will not impact the network's performance in any fashion. Although this removes immediate fault tolerance in a "hot" fashion, fear not, as we have a spare 6509 that we just migrated off of, as detailed in a previous maintenance post ( http://www.totalchoicehosting.com/forums/i...showtopic=35453 ). If the situation begins to worsen at any time, we will simply swap the network distribution cables into our other Cisco 6509 switch/router, which is always powered on and ready to receive load in the event of an emergency - it also has 2 fault-tolerant supervisor cards and is a tried-and-true setup that ran the TCH network for over 6 months without any issue. We will be ordering a replacement supervisor card for the current production 6509, and it should arrive later this week. When we receive the replacement part, there should only be a brief outage window of 5 minutes or less while we swap the new component into the 6509, which can be done with the unit still powered up. The outage itself comes from swapping the fiber cable back into the new component, which is nearly seamless relative to the amount of downtime. We will post further updates as the situation progresses; our apologies for any inconvenience this situation may have caused.
-
June 7-8th: Scheduled Network Maintenance
TCH-Ryan replied to TCH-Ryan's topic in TCH Data Centers Network Announcements
The changeover was very seamless; the preparation leading up to the migration to the new hardware paid off in that we had less than 15 minutes of actual downtime and encountered zero issues (knock on wood!). We will continue to monitor things as a precaution through the rest of the scheduled maintenance window, and should there be any issues of concern, an update will be posted promptly. -
We will be conducting scheduled maintenance from Saturday the 7th of June through the morning of Sunday the 8th, beginning at 11:00 PM and ending at 03:00 AM EST (-0500). Although we do not expect any prolonged downtime, there will likely be multiple 10-20 minute outages. This maintenance is to conduct a switchover to a new Cisco 6509 core switch/router with the exact same configuration as our current 6509, except for more RAM being available in the new unit. This should be a seamless transition, and the previous 6509 unit will be retained as a hot spare for any failures we may encounter in the future. We will keep you updated as the situation progresses on the evening of the 7th, or if there are any changes in schedule for this maintenance window. If you have any questions or concerns regarding this event, please feel free to open a ticket on our help desk. Thank you in advance for your patience.
-
Congrats Alex, you have earned it.
-
I voted #1 as it is just a cleaner, more professional look, whereas #2 is a bit too over the top, too glitzy.
