jhollin1138 Posted February 14, 2005

I am not sure if this is the right place or not, but I need some help. The Googlebot and MSNbot have gone crazy lately. They have both been crawling my forum nonstop all day. At one point, the Googlebot was hitting my forum about 12 times a minute for about 5 minutes. As far as I can tell, I am not the only one having this problem.

While doing some investigation on stopping them, I found this post over at phpBB.com. I am not an .htaccess expert, but I figured I would give it a try. After applying the changes, all I did was lock out my whole website. I removed the changes and got everything back to normal, but I still want to block these evil bots. Any suggestions on what I need to do?
Deverill Posted February 14, 2005

What you want is called robots.txt. It is a file you put in your site's root directory that tells the bots to leave you alone. There is a tutorial about it that will explain what to put in there. Remember, if you tell Googlebot to stay away then you won't get a good (or any) listing on Google for your site.
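As a rough sketch of what such a file could look like, the fragment below asks Googlebot and msnbot to stay out of a forum directory while leaving the rest of the site crawlable. The `/forum/` path is an assumption for illustration only; substitute the directory your forum actually lives in.

```text
# robots.txt (sketch; /forum/ is a hypothetical path)
User-agent: Googlebot
Disallow: /forum/

User-agent: msnbot
Disallow: /forum/
```

Crawlers fetch robots.txt on their own schedule, so a change like this may not take effect immediately.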
jhollin1138 (Author) Posted February 14, 2005

Quoting Deverill: "What you want is called robots.txt and is a file you put out in your directory that tells the bots to leave you alone - There is a tutorial about it that will explain what to put in there. Remember, if you tell googlebot to stay away then you won't get good (or any) listing on Google for your site."

I haven't tried to change my robots.txt file yet. However, if you read the next post after the one I linked to over at phpBB, it appears that robots.txt files are being ignored. I'll try to make a change to my robots.txt tonight and see if it helps.

I plan on allowing Googlebot and MSNbot back in at a later date. At this point they are killing my forum. My members are getting "page not found" errors often enough to complain about it.
curtis Posted February 14, 2005

Googlebot was hitting my forum really heavily for several days. It was on my forum all day long, hitting every page over and over. I used the forum's ban IP feature to stop the bot on the forum only. It hasn't returned since the ban.
Deverill Posted February 15, 2005

Sorry, don't have time to follow every link. From what you are saying I have to wonder if it really is Google or if it is just someone's bot calling itself Googlebot. I can't imagine Google ignoring robots files - they have a lot riding on their reputation.

It also surprises me that banning a single IP could keep Googlebot out - I would think they use several bots for the millions of pages they crawl. Anyway, it will be interesting to see how this turns out.
curtis Posted February 15, 2005

An IP lookup of the IP I banned. It's just the top section; I'm not going to list the phone numbers and such.

OrgName: Google Inc.
OrgID: GOGL
Address: 2400 E. Bayshore Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US
annie Posted February 15, 2005

Googlebot has trouble with all the links on a forum. My cat forum site especially is getting hit severely. I'm running an early version of IPB. I'm looking for a way to make Googlebot not hit all those links, but I'm not sure how.

You know, with that trailing alphabet soup they've got going on in that version of IPB, it gets really bad. Especially since there's a link to virtually everything. I mean, one link for every post, which means one thread gets hit over and over.
stoneage Posted February 16, 2005

Thanks for bringing this issue up. I thought it was happening only on my site.

On Feb 8th and 9th I noticed the number of visits jump from 5 to 680. My CMS showed the number of pageviews increased from 500 to 5500. This used a LOT of bandwidth (and processing power, I guess): 82% of all traffic was generated by this one IP gone wild.

On Feb 9th I had to block the IP: xxx.249.66.107

I checked it and it is a Google IP. This is their NetRange: 66.249.64.0 - 66.249.95.255

I do not know if it is possible to forge an IP.

I noticed this issue just a few days after we received the two e-mails saying that we should keep our scripts updated. The servers were under increasing pressure, and the AwStats stats pages did not autoupdate. I wonder if these two issues were related, on my server or on a wider scale.

Anyway, by blocking this one "Googlebot" IP I managed to cut out the 82% of bandwidth that was not generated by normal site visitors. The issue was solved and has not repeated itself. As a permanent solution, I will have to use robots.txt to limit its access.
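For anyone who wants to check a logged address against that NetRange, here is a minimal Python sketch. It only tests whether an IPv4 address falls inside the range quoted in the post; the function name and the sample addresses are made up for illustration, and the range itself is simply what the lookup reported in 2005, not an authoritative list of Google's addresses.

```python
import ipaddress

# NetRange quoted in the post above: 66.249.64.0 - 66.249.95.255
RANGE_START = ipaddress.ip_address("66.249.64.0")
RANGE_END = ipaddress.ip_address("66.249.95.255")


def in_quoted_google_range(ip: str) -> bool:
    """Return True if the IPv4 address lies inside the quoted NetRange."""
    addr = ipaddress.ip_address(ip)
    return RANGE_START <= addr <= RANGE_END


# Hypothetical sample addresses, not taken from any real log.
for sample in ("66.249.70.1", "192.0.2.10"):
    print(sample, in_quoted_google_range(sample))
```

A check like this looks at the connecting address rather than the User-Agent header, which any client can set to whatever it likes.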
TCH-Don Posted February 16, 2005

I am just wondering how much Googlebot bandwidth we are talking about here. For me it is less than two hundred megs this month so far, so I am not worried.
stoneage Posted February 16, 2005

For me it was just over 200 MB on Feb 9th before I blocked the Googlebot IP. In the two previous months it was around 50 MB for the whole month. Before that it topped out at 20 MB at most, if I recall correctly, though I did not make many changes to my site content at that time.

Considering it is around 200 megs for you, I feel much less alarmed. It is possible that Googlebot just crawls slightly deeper (better?) and this is causing the increase.

I believe a better robots.txt may fix this issue. My current file denies robots only the admin folder and allows everything else. I might try a different approach: allow robots into some folders and deny them all others.
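The "allow some folders, deny the rest" idea could be sketched roughly as below. The folder names are hypothetical, and the `Allow` line is an assumption worth verifying for each crawler: Googlebot honors it, but it was not part of the original robots.txt convention and not every bot understands it.

```text
# robots.txt (sketch; /articles/ and /gallery/ are hypothetical folders)
User-agent: *
Allow: /articles/
Allow: /gallery/
Disallow: /
```

Note that with these rules the home page itself (`/`) is also disallowed, since only the two listed folders are opened up.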
TCH-Don Posted February 17, 2005

I think part of the increase in daily visits to my site is due to reorganizing it to make it easier for Google to spider. Of course this has been my goal for years. I have been trying to add more text content to get Google to crawl deeper, and it seems to be paying off. In the last few weeks Google has been visiting my site 12-16 times a day.

A few hundred megs of bandwidth is so little compared to what we have. Unlike others, I see this as a good thing.
stoneage Posted February 17, 2005

Thanks for the larger picture. I just removed the IP block - it is in my interest that Googlebot is active on my site. I will monitor the bandwidth it uses, though.
annie Posted February 17, 2005

Google has started crawling more often. That's on public record. I've got 90 megs worth of Google love this month on the site where I have a forum. Haven't checked the other - where I've outlawed Google from the forum.
jhollin1138 (Author) Posted February 18, 2005

I did some further research on the Googlebot and MSNbot problem I was having. It seems that the problem is due to session IDs. At least in the case of the Googlebot, the session IDs seem to cause it to interpret each visit to the forum as unique, so it continually crawls the forum. The key, then, is to disable session IDs for search engine bots.

I did find a few solutions out there, but they don't seem to be compatible with the latest version of phpBB, v2.0.11. While doing some further searching, I found a solution that appears to work. Note that this Session ID MOD is only for phpBB v2.0.11.

In your forum folder, edit "/includes/sessions.php" (be sure to save a backup just in case). Look for the section that looks like:

    function append_sid($url, $non_html_amp = false)
    ...
    ...
    }

Replace the entire section with the following:

    function append_sid($url, $non_html_amp = false)
    {
        global $SID;

        if ( !empty($SID) && !preg_match('#sid=#', $url) )
        {
            // Known crawlers get the URL back without a session ID.
            // strpos() is case-sensitive, so each name must match the
            // casing used in the bot's User-Agent string.
            $agents = array('Googlebot', 'Yahoo', 'Msnbot');
            $ref = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
            foreach ( $agents as $agent )
            {
                if ( strpos($ref, $agent) !== false )
                {
                    return $url;
                }
            }

            // Everyone else: append the session ID after '?' or an ampersand.
            $url .= ( ( strpos($url, '?') !== false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
        }

        return $url;
    }

Since applying this MOD, I no longer see the 75 guests I had before (68 of them being the Googlebot, 5 of them being the MSNbot and the remaining 2 being real guests).
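To illustrate why session IDs can make one page look like many to a crawler, here is a small Python sketch. It is not part of the MOD; the function name `strip_sid`, the example URLs and the `sid` values are all made up for illustration. Three links to the same topic differ only in their `sid` parameter, so they count as three distinct URLs until the parameter is removed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url: str, param: str = "sid") -> str:
    """Return the URL with the session ID query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical links a crawler might collect on three separate visits.
crawled = [
    "http://example.com/forum/viewtopic.php?t=42&sid=aaa111",
    "http://example.com/forum/viewtopic.php?t=42&sid=bbb222",
    "http://example.com/forum/viewtopic.php?t=42&sid=ccc333",
]

distinct_with_sid = len(set(crawled))
distinct_without_sid = len({strip_sid(url) for url in crawled})
print(distinct_with_sid, distinct_without_sid)  # 3 1
```

The MOD takes the complementary approach: instead of cleaning URLs afterwards, it never adds the session ID to links served to the listed bots.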
annie Posted February 18, 2005

My two forums are both earlier versions. Did you find anything for them?
thehemi Posted February 22, 2005

Sign up with Google AdSense and have Google banner ads cover any extra bandwidth costs.
jhollin1138 (Author) Posted February 22, 2005

Quoting thehemi: "Sign-up with Google Adsense and have Google banner ads cover any extra bandwidth costs."

For me the problem isn't the bandwidth Google is using. It is the fact that Google was appearing as a "guest" on my forum 70+ times at the same time. All that activity was causing my forum and the rest of my website to run very slowly.
thehemi Posted February 24, 2005 (edited February 24, 2005 by thehemi)

I see that Googlebot is hounding one of my websites, too. 10x as much Googlebot traffic as -all traffic- combined.
annie Posted April 20, 2005

I just managed to fix my old IPB. I entered the cookie information in the System Settings of the admin panel. I had a problem with not staying logged in between sessions, so it's clear something wasn't quite right. Now, after setting those paths correctly, I lose the session IDs even when surfing without being logged in.

You add the domain name with a dot in front of it, and the path with a / in front of it.
annie Posted April 20, 2005

I also added this to my robots.txt:

Disallow: /forums/index.php?act=
Disallow: /forums/index.php?s=

Remember to get the path right. If I've understood correctly, the search engines will then avoid anything equal to those paths, or files beginning with them. So no session IDs, and I get rid of all those printout versions etc.
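The "equal to those paths, or beginning with them" behaviour can be sketched in a few lines of Python. This is a simplified illustration of plain prefix matching, not a full robots.txt parser; the function name and the sample URLs are made up, and real crawlers may apply extra rules (wildcards, `Allow` lines, per-bot sections).

```python
# The two Disallow values from the post above.
DISALLOW_PREFIXES = [
    "/forums/index.php?act=",
    "/forums/index.php?s=",
]


def is_disallowed(path_and_query: str, prefixes=DISALLOW_PREFIXES) -> bool:
    """Return True if the path (with query string) starts with any prefix."""
    return any(path_and_query.startswith(prefix) for prefix in prefixes)


# Hypothetical forum URLs, path and query string only.
samples = [
    "/forums/index.php?act=Print&t=10",        # blocked: starts with ?act=
    "/forums/index.php?s=abc123&showtopic=10", # blocked: starts with ?s=
    "/forums/index.php?showtopic=10",          # not blocked
    "/forums/index.php?showtopic=10&s=abc123", # not blocked: s= is not first
]
for sample in samples:
    print(sample, is_disallowed(sample))
```

The last sample shows the limit of prefix matching: a session ID that is not the first query parameter would slip past these two rules.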
ictus Posted July 30, 2007

If Google is using 1 GB of bandwidth, do you think that is cause for concern?
TCH-Bruce Posted July 30, 2007

Depends on how much is being indexed, but that seems excessive to me.
ictus Posted July 30, 2007

It's a small test site using Joomla. My main site, which is quite a big forum, gets about 400 MB a month. They are very similar and have the same plugins, but one has a variety of BBC news feeds. Could this be an issue? At the end of the day I need Google hits, but is there a way to make things less intense?

Main site: rpgrm.com, 400 MB
New site: lnwa.net, over 1 GB
ictus Posted August 6, 2007

Found the problem, and it wasn't Google as such. I had not deleted a Joomlaboard forum, and as I had no link to it anywhere I thought nothing of it. It was a test board; as the client wanted something simple, I got them to go with SMF in the end.

Anyway, there were 600 posts about all kinds of iffy stuff, as there was no restriction on membership. So Google was hitting lots and lots of links. They have now been deleted along with the extension, so I'll see how it goes.