Posted

A while back I used a little program that was posted by Don (I think) that created a robots.txt file for me. I have added several folders to the disallow list since then, but according to my Latest Visitors, bots like MSNbot are still spidering folders that I want them to stay out of. :surrender:

 

Would someone take a look at my robots.txt file please and see if I've done something wrong? Do I need to put one of these in each subdomain? :blush:

 

thanks!

 

Kristi

Posted

You only need the one file.

Try this:

 

># Disallow directories
User-agent: *
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/

# Disallow BecomeBot
User-agent: BecomeBot
Disallow: /

Posted

Keep in mind that some bots simply ignore the file no matter what you put in it.

Posted

For the record, I've disallowed some of my directories shortly after they were indexed by googlebot and slurp. It's been about 5 months, and they are still indexed. :) I'm guessing it takes a really long time for those indexes to expire.

  • 2 weeks later...
Posted

I can understand the pages still being in their search results, but even after I made the changes above, Slurp, Googlebot, and even BecomeBot, which I've blocked from my site completely, are spidering the disallowed directories almost daily. :thumbup1: I finally put BecomeBot's IP in my IP Deny. I'm just not sure what else to do. I want most parts of my site to be indexed, just not my blogs.

Posted

This is what I'm seeing in your current robots.txt file:

># Disallow directories
User-agent: *
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /shrinkydink/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/

User-agent: msnbot
Disallow: / *.php

User-agent: Slurp
Disallow: / *.php

User-agent: googlebot
Disallow: / *.php
Disallow: / *.jpg
Disallow: /myblog/

# Disallow BecomeBot
User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

 

This is what I would suggest changing:

 

1) I don't know if it is case-sensitive, but from what I've read, the user-agent for Google's bot is Googlebot, not googlebot, and probably should be specified that way in your robots.txt file:

>User-agent: Googlebot

 

2) To block Googlebot from spidering php and jpg files, you shouldn't have a space between the "/" and the "*". I'd suggest using the following (taken from Google's web site):

>Disallow: /*.php$
Disallow: /*.jpg$

The above is an extension to the robots.txt standard. Google understands it, and I believe msnbot does also, so I'd change the msnbot section to the following:

>User-agent: msnbot
Disallow: /*.php$

I do not know whether Yahoo's bot (Slurp) understands this or not. Assuming it does, you'd need to change its Disallow line like the above examples. If it does not, the line will probably have no effect, as wildcards are not part of the robots.txt specification for Disallow directives.

 

I'm not sure exactly which directories you want to ban which bots from spidering. As your robots.txt file currently stands:

 

- Yahoo's Slurp bot is not prevented from spidering any files or directories. If the Slurp bot understands "Disallow: /*.php$", it would not spider files ending in .php, but it would still not be prevented from spidering any specific directories.

 

- Googlebot is currently prevented from spidering only the /myblog/ directory, and files ending in .php and .jpg (after you've made the changes I've specified above).

 

- BecomeBot is correctly prevented from spidering any files or directories.

 

- msnbot is prevented from spidering only files ending in .php (after you've made the change I specified above). msnbot is not prevented from spidering any specific directories.

 

When a bot reads your robots.txt, it only looks at the section that applies to it if there is one there; otherwise, it looks at the "User-agent: *" section. A bot won't look at two sections of the robots.txt to figure out what it can and cannot spider. If you want specific bots to stay out of directories that you've specified in the "User-agent: *" section, these directories must be specified again in each bot's separate section.

 

For example, msnbot, Slurp, and Googlebot are all free to spider the /blog/ directory of your web site.
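
As a rough sketch (assuming, for example, that you want Slurp kept out of the same directories as everyone else in addition to its .php rule), its section would have to repeat the whole list:

>User-agent: Slurp
# Repeat the directories from the "User-agent: *" section
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /shrinkydink/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/
# Keep the wildcard line too, if Slurp understands it
Disallow: /*.php$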

 

Do I need to put one of these in each subdomain?

Yes, you do. If you don't have one in them, and the bots access your pages through the subdomain, there's nothing to prevent them from crawling the entire subdomain.

 

Take your /myblog/ directory, for example, which you have set up as a subdomain. If a bot wants to crawl your weblog through your main domain, it would do so with this URL:

>http://www.gryfalia.com/myblog/

...and it would try to read a robots.txt file from here, at the root of the domain:

>http://www.gryfalia.com/robots.txt

If the bot crawls your weblog through your subdomain instead, it would do so with this URL:

>http://myblog.gryfalia.com/

...and it would try to read a robots.txt file from here, at the root of the subdomain:

>http://myblog.gryfalia.com/robots.txt

To block Googlebot from spidering your /myblog/ directory through your main domain, you'd want to have the following in the robots.txt file in your public_html directory (which you currently have):

>User-agent: Googlebot
Disallow: /myblog/

And to block Googlebot from spidering your /myblog/ directory through your subdomain, you need to have the following in the robots.txt file in the /myblog/ subdomain directory:

>User-agent: Googlebot
Disallow: /

Hope this helps...

Posted

Everything other than what Dick had listed above was added today using the formats the search engines suggested for blocking them, but hey, I'll try anything right now.

 

I had:

 

# Disallow directories

User-agent: *

 

at the top before the list of directories so, in theory, that should've blocked ALL bots from those directories.

 

Thanks for your help - I'll mess around with this tonight and see how it goes over the next week. :eek:

Posted

Even though you had the "User-agent: *" section at the top, a bot will first look through the file for an entry specific to it, and will read the "User-agent: *" section only if there is no specific entry for it. My advice would be to put everything under "User-agent: *" and not have specifics for other bots (unless you have a specific reason for letting one bot read your .jpg's and not another bot).
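
For example, here's a rough sketch of what a consolidated file might look like, using the directory names from your current file (BecomeBot and psbot keep their own sections only because you want those two blocked from everything):

># Disallow directories for all bots
User-agent: *
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /shrinkydink/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/

# Block these bots from the whole site
User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

Any bot that obeys robots.txt and has no section of its own will fall back to the "User-agent: *" list.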

Posted (edited)
I had:

># Disallow directories
User-agent: *

at the top before the list of directories so, in theory, that should've blocked ALL bots from those directories.

It does block all bots from reading those directories (if the bot does not have a specific section elsewhere in the robots.txt that applies to it), but only if they access those pages through your main domain. It won't block any bots going through a subdomain if you also have a subdomain set up for those directories.

 

For each directory in the first list above that is also a subdomain, you should add a robots.txt file in the subdomain's directory that contains just the following:

># Disallow spidering of subdomain
User-agent: *
Disallow: /

Edited by TCH-David
