bellringr Posted May 11, 2005

A while back I used a little program that was posted by Don (I think) that created a robots.txt file for me. I have added several folders to the disallow list since then, but according to my Latest Visitors, bots like MSNbot are still spidering folders that I want them to stay out of. Would someone take a look at my robots.txt file please and see if I've done something wrong? Do I need to put one of these in each subdomain? Thanks! Kristi
TCH-Dick Posted May 11, 2005

You only need the one file. Try this:

# Disallow directories
User-agent: *
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/

# Disallow BecomeBot
User-agent: BecomeBot
Disallow: /
Pendragon Posted May 12, 2005

Keep in mind that some bots simply ignore the file no matter what you put in it.
section31 Posted May 12, 2005

For the record, I disallowed some of my directories shortly after they were indexed by Googlebot and Slurp. It's been about 5 months, and they are still indexed. I'm guessing it takes a really long time for those indexes to expire.
bellringr Posted May 25, 2005

I can understand the pages still being in their search results, but even after I made the changes above, Slurp, Googlebot, and even BecomeBot, which I've blocked from my site completely, are still spidering the disallowed directories almost daily. I finally put BecomeBot's IP in my IP Deny list. I'm just not sure what else to do. I want most parts of my site to be indexed, just not my blogs.
TweezerMan Posted May 25, 2005

This is what I'm seeing in your current robots.txt file:

# Disallow directories
User-agent: *
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /shrinkydink/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/

User-agent: msnbot
Disallow: / *.php

User-agent: Slurp
Disallow: / *.php

User-agent: googlebot
Disallow: / *.php
Disallow: / *.jpg
Disallow: /myblog/

# Disallow BecomeBot
User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

This is what I would suggest changing:

1) I don't know if it is case-sensitive, but from what I've read, the user-agent for Google's bot is Googlebot, not googlebot, and it probably should be specified that way in your robots.txt file:

User-agent: Googlebot

2) To block Googlebot from spidering php and jpg files, you shouldn't have a space between the "/" and the "*". I'd suggest using the following (taken from Google's web site):

Disallow: /*.php$
Disallow: /*.jpg$

The above is an extension to the robots.txt standard. Google understands it, and I believe msnbot does also, so I'd change the msnbot section to the following:

User-agent: msnbot
Disallow: /*.php$

I do not know whether Yahoo's bot (Slurp) understands this or not. Assuming it does, you'd need to change its Disallow line like the examples above. If it does not, the line will probably have no effect, as wildcards are not part of the robots.txt specification for Disallow directives.

I'm not sure exactly which directories you want to ban which bots from spidering. As your robots.txt file currently stands:

- Yahoo's Slurp bot is not prevented from spidering any files or directories. If Slurp understands "Disallow: /*.php$", it would not spider files ending in .php, but it would still not be prevented from spidering any specific directories.
- Googlebot is prevented from spidering only the /myblog/ directory, and files ending in .php and .jpg (after you've made the changes specified above).
- BecomeBot is correctly prevented from spidering any files or directories.
- msnbot is prevented from spidering only files ending in .php (after you've made the change specified above). msnbot is not prevented from spidering any specific directories.

When a bot reads your robots.txt, it looks only at the section that applies to it, if there is one; otherwise, it looks at the "User-agent: *" section. A bot won't combine two sections of the robots.txt to figure out what it can and cannot spider. If you want specific bots to stay out of the directories you've listed in the "User-agent: *" section, those directories must be repeated in each bot's separate section. For example, msnbot, Slurp, and Googlebot are currently all free to spider the /blog/ directory of your web site.

"Do I need to put one of these in each subdomain?"

Yes, you do. If you don't have one in them, and the bots access your pages through the subdomain, there's nothing to prevent them from crawling the entire subdomain. Take your /myblog/ directory, for example, which you have set up as a subdomain.
If a bot wants to crawl your weblog through your main domain, it would do so with this URL:

http://www.gryfalia.com/myblog/

...and it would try to read a robots.txt file from here, at the root of the domain:

http://www.gryfalia.com/robots.txt

If the bot crawls your weblog through your subdomain instead, it would do so with this URL:

http://myblog.gryfalia.com/

...and it would try to read a robots.txt file from here, at the root of the subdomain:

http://myblog.gryfalia.com/robots.txt

To block Googlebot from spidering your /myblog/ directory through your main domain, you'd want to have the following in the robots.txt file in your public_html directory (which you currently have):

User-agent: Googlebot
Disallow: /myblog/

And to block Googlebot from spidering your /myblog/ directory through your subdomain, you need to have the following in the robots.txt file in the /myblog/ subdomain directory:

User-agent: Googlebot
Disallow: /

Hope this helps...
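To make the point about per-bot sections concrete, here is a minimal sketch of a robots.txt where the shared directory list is repeated under each named bot, since a bot reads only the one section that matches it. Only a few of the directory names from this thread are shown, and treating the /*.php$ wildcard as understood by msnbot and Googlebot is an assumption, not part of the base robots.txt standard:

# Default rules for any bot without its own section
User-agent: *
Disallow: /blog/
Disallow: /myblog/
Disallow: /momblog/

# msnbot: repeat the shared directories, then add the wildcard rule
# (wildcard support is an assumption)
User-agent: msnbot
Disallow: /blog/
Disallow: /myblog/
Disallow: /momblog/
Disallow: /*.php$

# Googlebot: the same directories again, plus its extra rules
User-agent: Googlebot
Disallow: /blog/
Disallow: /myblog/
Disallow: /momblog/
Disallow: /*.php$
Disallow: /*.jpg$

# BecomeBot: blocked from the whole site
User-agent: BecomeBot
Disallow: /

The repetition is deliberate: once msnbot finds its own section, it never falls back to the "User-agent: *" rules, so any directory left out of its section is fair game for it.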
bellringr Posted May 26, 2005

Everything other than what Dick had listed above was added today, using the formats the search engines suggested for blocking them, but hey, I'll try anything right now. I had:

# Disallow directories
User-agent: *

at the top before the list of directories, so in theory that should've blocked ALL bots from those directories. Thanks for your help - I'll mess around with this tonight and see how it goes over the next week.
cajunman4life Posted May 26, 2005

Even though you had "User-agent: *" at the top, a bot will first look through the file for an entry specific to it, and read the "User-agent: *" section only if there is no specific entry for it. My advice would be to put everything under "User-agent: *" and not have specifics for other bots (unless you have a specific reason for letting one bot read your .jpg files and not another bot).
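A sketch of the single-section approach cajunman4life describes, using the directory names already mentioned in this thread; BecomeBot keeps its own section only because it is meant to be shut out entirely:

# One default section covers every bot that has no section of its own
User-agent: *
Disallow: /blog/
Disallow: /guestbook/
Disallow: /momblog/
Disallow: /kathieblog/
Disallow: /shrinkydink/
Disallow: /webcam/
Disallow: /christmas/
Disallow: /myblog/
Disallow: /mompics/

# BecomeBot still gets its own section so it is blocked from everything
User-agent: BecomeBot
Disallow: /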
bellringr Posted May 26, 2005

That's what I did first and it didn't work, so I tried the specific ones. I'll just have to experiment till it works, I guess.
TweezerMan Posted May 26, 2005

The "User-agent: *" section at the top does block all bots from those directories (as long as a bot doesn't have a specific section elsewhere in the robots.txt that applies to it), but only if they access those pages through your main domain. It won't block any bots going through a subdomain if you also have a subdomain set up for those directories. For each directory in the first list above that is also a subdomain, you should add a robots.txt file in the subdomain's directory that contains just the following:

# Disallow spidering of subdomain
User-agent: *
Disallow: /
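Putting the two pieces together, a rough sketch of what the two robots.txt files might contain, assuming the /myblog/ subdomain's document root is the /myblog/ folder under public_html (the exact paths are an assumption based on this thread, not confirmed):

# --- public_html/robots.txt  (read via http://www.gryfalia.com/robots.txt) ---
# Per-directory Disallow list for bots coming in through the main domain
User-agent: *
Disallow: /blog/
Disallow: /myblog/
Disallow: /momblog/

# --- public_html/myblog/robots.txt  (read via http://myblog.gryfalia.com/robots.txt) ---
# Disallow spidering of the whole subdomain
User-agent: *
Disallow: /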