Googlebot & Msnbot Have Gone Crazy!


I am not sure if this is the right place or not but I need some help.

 

Googlebot and MSNbot have gone crazy lately. They have both been crawling my forum nonstop all day. At one point, Googlebot was hitting my forum about 12 times a minute for about 5 minutes. As far as I can tell, I am not the only one having this problem.

 

While investigating how to stop them, I found this post over at phpBB.com. I am not an .htaccess expert, but I figured I would give it a try. After applying the changes, all I did was lock out my whole website. I removed the changes and got everything back to normal, but I still want to block these evil bots. Any suggestions on what I need to do?


What you want is called robots.txt. It is a file you put in your site's root directory that tells the bots to leave you alone. There is a tutorial about it that will explain what to put in there.

 

Remember, if you tell googlebot to stay away then you won't get good (or any) listing on Google for your site.
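
For reference, a minimal robots.txt along those lines might look like the sketch below. The /forum/ path is an assumption, so use whatever folder your forum actually lives in. Blocking only that folder keeps the rest of the site in the index.

```
# Assumption: the forum is installed under /forum/
User-agent: Googlebot
Disallow: /forum/

User-agent: msnbot
Disallow: /forum/
```

If the goal is only to slow a bot down rather than shut it out, msnbot also understands a non-standard `Crawl-delay:` line in its section. Googlebot does not honor that directive, so it would not help with Google.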


I haven't tried to change my robots.txt file yet. However, if you read the next post after the one I linked to over at phpBB, it appears that robots.txt files are being ignored. I'll try to make a change to my robots.txt tonight and see if it helps.

 

I plan on allowing Googlebot and MSNbot back in at a later date. At this point they are killing my forum. My members are getting "page not found" errors often enough to complain about it.


Sorry, don't have time to follow every link.

 

From what you are saying I have to wonder if it really is Google or if it is just someone's bot calling itself Googlebot. I can't imagine Google ignoring robots files - they have a lot riding on their reputation.

 

It also surprises me that banning a single IP could keep googlebot out - I would think they use several bots for the millions of pages they crawl.

 

Anyway, it will be interesting to see how this turns out.
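
One way to tell a real Googlebot from an impostor is the reverse-then-forward DNS check that Google itself describes: look up the host name for the IP, confirm it ends in googlebot.com or google.com, then resolve that name and make sure it points back to the same IP. Here is a rough PHP sketch of that idea. The function names are made up for illustration and are not part of phpBB or IPB, and the code only handles IPv4.

```php
<?php
// Sketch only: both helper names are hypothetical.

// True when a reverse-DNS host name belongs to one of Google's crawler domains.
function hostname_is_google(string $host): bool
{
    $host = strtolower(rtrim($host, '.'));
    foreach (['.googlebot.com', '.google.com'] as $suffix) {
        // Require the leading dot so "fakegooglebot.com" does not pass.
        if (substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Reverse lookup, domain check, then a forward lookup that must return the same IP.
function is_verified_googlebot(string $ip): bool
{
    $host = gethostbyaddr($ip); // returns the IP unchanged when there is no PTR record
    if ($host === false || $host === $ip || !hostname_is_google($host)) {
        return false;
    }
    $forward = gethostbynamel($host); // list of IPv4 addresses, or false on failure
    return $forward !== false && in_array($ip, $forward, true);
}
```

The user agent string alone proves nothing, since any client can send "Googlebot" in it. The DNS round trip is what ties the request to Google.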


Here is an IP lookup of the IP I banned. It's just the top section; I'm not going to list the phone numbers and such.

OrgName:    Google Inc.
OrgID:      GOGL
Address:    2400 E. Bayshore Parkway
City:       Mountain View
StateProv:  CA
PostalCode: 94043
Country:    US


Googlebot has trouble with all the links on a forum. My cat forum site in particular is getting hit hard. I'm running an early version of IPB. I'm looking for a way to keep Googlebot from hitting all those links, but I'm not sure how.

 

You know, with that trailing alphabet soup they've got going on in the URLs of that version of IPB, it gets really bad, especially since there's a link to virtually everything. I mean, there's one link for every post, which means one thread gets hit over and over.


Thanks for bringing this issue up :unsure: . I thought it was happening only on my site.

 

On Feb 8th and 9th I noticed the number of visits jump from 5 to 680. My CMS showed the number of pageviews increased from 500 to 5500. This used a LOT of bandwidth (and processing power, I guess): 82% of all traffic was generated by this one IP gone wild.

 

On Feb 9th I had to block the ip: xxx.249.66.107

I checked it and it is a Google IP. This is their NetRange: 66.249.64.0 - 66.249.95.255

I do not know if it is possible to forge an IP.
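
For anyone who wants to check an address from their logs against that NetRange, here is a small PHP sketch. `ip_in_range()` is a hypothetical helper, and it assumes IPv4 addresses and 64-bit PHP, where `ip2long()` returns a non-negative integer. The sample addresses are made up for the demonstration.

```php
<?php
// Sketch only: ip_in_range() is a hypothetical helper, IPv4 only, 64-bit PHP assumed.
function ip_in_range(string $ip, string $start, string $end): bool
{
    $n  = ip2long($ip);
    $lo = ip2long($start);
    $hi = ip2long($end);
    if ($n === false || $lo === false || $hi === false) {
        return false; // one of the inputs is not a valid dotted-quad address
    }
    return $n >= $lo && $n <= $hi;
}

// NetRange quoted in the post above.
$start = '66.249.64.0';
$end   = '66.249.95.255';

var_dump(ip_in_range('66.249.70.1', $start, $end)); // bool(true)
var_dump(ip_in_range('66.249.96.0', $start, $end)); // bool(false)
```

A match only says the address sits inside a block registered to Google. It does not say which Google service made the request.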

 

I noticed this issue just a few days after we received the two e-mails saying that we should keep our scripts updated. The servers were under increasing pressure, and the AwStats pages did not auto-update. I wonder if these two issues were related, either on my server alone or on a wider scale.

 

Anyway, by blocking this one "Googlebot" IP I managed to cut out the 82% of bandwidth that was not generated by normal site visitors.

 

The issue was solved. The issue has not repeated itself. :)

 

As a permanent solution, I will have to use robots.txt to limit its access.


For me it was just over 200 MB on Feb 9th before I blocked the Googlebot IP. In the two previous months it was around 50 MB for the whole month. Before that it topped out at 20 MB, if I recall correctly, though I did not change much of my site content at that time.

 

Considering it is around 200 megs for you, I feel much less alarmed :) . It is possible that Googlebot just crawls slightly deeper (better?) and this is causing the increase :clapping: .

 

I believe a better robots.txt may fix this issue. My current file denies robots only the admin folder and allows everything else. I might try a different approach: allow robots into some folders and deny them everything else.
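
That allow-list approach could be sketched like this. The folder names are placeholders, not taken from anyone's actual site.

```
# Placeholder folder names; replace with the folders you want crawled
User-agent: *
Allow: /articles/
Allow: /gallery/
Disallow: /
```

Two caveats. `Allow:` was not part of the original robots.txt convention, so a crawler that does not understand it would see only `Disallow: /` and stay away from the whole site. Also, as written, the home page itself ("/") is not covered by either Allow line, so it would be blocked too.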


I think part of the increase in my daily visits is due to reorganizing my site to make it easier for Google to spider. Of course, this has been my goal for years. I have been trying to add more text content to get Google to crawl deeper.

It seems to be paying off. In the last few weeks Google has been visiting my site 12-16 times a day.

And a few hundred megs of bandwidth is so little compared to what we have.

Unlike others I see this as a good thing. :clapping:


I did some further research on the Googlebot and MSNbot problem I was having. It seems that the problem is due to session IDs. At least in the case of Googlebot, the session IDs seem to cause it to interpret each visit to the forum as unique, so it crawls the forum continually. The key is to disable session IDs for search engine bots. I did find a few solutions out there, but they don't seem to be compatible with the latest version of phpBB, v2.0.11. While doing some further searching, I found a solution that appears to work.

 

Note this Session ID MOD is only for phpBB v2.0.11.

 

In your forum folder, edit "/includes/session.php". (Be sure to save a backup first, just in case.)

 

Look for the section that looks like:

function append_sid($url, $non_html_amp = false)
...
...
}

Replace the entire section with the following:

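function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if (!empty($SID) && !preg_match('#sid=#', $url))
    {
        // Search engine bots get the plain URL, so every crawl sees the same address.
        $agents = array('Googlebot', 'Yahoo', 'Msnbot');
        $ref = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
        foreach ($agents as $agent)
        {
            // Compare in lower case, since some bots send an all-lowercase name.
            if (strpos($ref, strtolower($agent)) !== false)
            {
                return $url;
            }
        }
        // Everyone else gets the session ID appended. Use !== so a '?' at position 0 still counts.
        $url .= ((strpos($url, '?') !== false) ? (($non_html_amp) ? '&' : '&amp;') : '?') . $SID;
    }
    return $url;
}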

Since applying this MOD, I no longer see 75 guests online at once (68 of them were Googlebot, 5 were MSNbot, and the remaining 2 were real guests).
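
To illustrate why the session ID makes every visit look like a new page, here is a small stand-alone PHP sketch. `strip_sid()` is a hypothetical helper, not part of phpBB, and the URLs and session IDs are made up. It only handles a plain `sid=` parameter made of hex digits, not the `&amp;` form used inside HTML.

```php
<?php
// Sketch only: strip_sid() is a hypothetical helper, not a phpBB function.
function strip_sid(string $url): string
{
    // Remove a "sid=<hex>" parameter wherever it sits in the query string.
    $url = preg_replace('/([?&])sid=[0-9a-f]+(&|$)/i', '$1', $url);
    // Tidy up a trailing "?" or "&" left behind by the removal.
    return rtrim($url, '?&');
}

// The same topic as seen on two different crawler visits (made-up session IDs).
$visit1 = 'viewtopic.php?t=42&sid=0123456789abcdef0123456789abcdef';
$visit2 = 'viewtopic.php?sid=fedcba9876543210fedcba9876543210&t=42';

var_dump($visit1 === $visit2);                       // bool(false): two "different" pages to a bot
var_dump(strip_sid($visit1) === strip_sid($visit2)); // bool(true): one page once the sid is gone
```

A crawler compares URLs as text, so each new session ID gives it what looks like a fresh set of links to follow. Leaving the ID off for bots, as the MOD does, removes that effect at the source.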


Sign-up with Google Adsense and have Google banner ads cover any extra bandwidth costs.  :clapping:

For me the problem isn't the bandwidth Google is using. It is the fact that Google was appearing as a "guest" on my forum about 70+ times, at the same time. All that activity was causing my forum and the rest of my website to run very slow.


  • 1 month later...

I just managed to fix my old IPB. I entered the cookie information in the System Settings of the admin panel. I had a problem with not staying logged in between sessions, so it's clear something wasn't quite right. Now, after setting those paths correctly, the session IDs disappear from the URLs even when I browse without being logged in.

 

You add the domain name with a dot in front of it, and the path with a / in front of it.


I also added this to my robots.txt:

 

Disallow: /forums/index.php?act=

Disallow: /forums/index.php?s=

 

Remember to get the path right.

 

If I've understood correctly, the search engines will then avoid any URL that begins with one of those paths. So no session IDs, and I get rid of all those printable versions etc.
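
Here is a rough PHP sketch of that "begins with" matching, using the two Disallow lines above. `is_disallowed()` is a hypothetical helper that does plain prefix matching only, and the sample URLs are illustrative.

```php
<?php
// Sketch only: is_disallowed() is a hypothetical helper doing plain prefix
// matching, with no wildcard support.
function is_disallowed(string $pathAndQuery, array $disallowRules): bool
{
    foreach ($disallowRules as $rule) {
        // A rule applies when the URL starts with it (an empty rule blocks nothing).
        if ($rule !== '' && strpos($pathAndQuery, $rule) === 0) {
            return true;
        }
    }
    return false;
}

$rules = ['/forums/index.php?act=', '/forums/index.php?s='];

var_dump(is_disallowed('/forums/index.php?act=Print&t=12', $rules));        // bool(true)
var_dump(is_disallowed('/forums/index.php?s=abc123&showtopic=12', $rules)); // bool(true)
var_dump(is_disallowed('/forums/index.php?showtopic=12', $rules));          // bool(false)
var_dump(is_disallowed('/forums/index.php?showtopic=12&s=abc123', $rules)); // bool(false)
```

Note the last case: with prefix matching, a session ID is only caught when `s=` is the first parameter after the question mark. Some crawlers, Googlebot among them, also accept `*` wildcards in Disallow lines, which could cover the other positions, but that goes beyond plain prefix matching.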


  • 2 years later...

It's a small test site using Joomla. My main site, which is quite a big forum, gets about 400 MB a month.

They are very similar and have the same plugins, but one has a variety of BBC news feeds. Could this be an issue?

At the end of the day I need Google hits, but is there a way to make things less intense?

Main site: rpgrm.com, 400 MB
New site: lnwa.net, over 1 GB


Found the problem. It wasn't Google as such.

I'd not deleted a Joomlaboard forum, and as I had no link to it anywhere I thought nothing of it. It was a test board; as the client wanted something simple, I got them to go with SMF in the end.

Anyway, there were 600 posts about all kinds of iffy stuff, as there was no restriction on membership.

So Google was hitting lots and lots of links. They have now been deleted along with the extension, so I'll see how it goes.

