webgyrl, posted February 6, 2005

Hey gang,

I remember waaaaay back there was a discussion about the robots.txt file that we include in our public_html folder, but I can't seem to find it. Could someone point me to instructions on how to exclude certain folders from the bots so that those folders aren't indexed by the web search engines?

Basically, there are three folders on a website I host that the owners don't want web bots to crawl and index, as they are private folders. They will be protected with amember, but I am not sure whether that is enough to disallow the bots from indexing them.

Thanks,
Nat
TCH-Andy, posted February 6, 2005

Hi Nat,

You can read the robots.txt specification, but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

    # /robots.txt file
    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /tmp
    Disallow: /logs

The first line, starting with '#', is a comment.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note that '*' is a special token meaning "any other User-agent"; the original specification does not allow wildcard patterns or regular expressions in either User-agent or Disallow lines.
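A quick way to check how a compliant robot would read the example above is Python's standard `urllib.robotparser` module. This is only a sketch: the helper name `allowed`, the agent name `SomeOtherBot`, and the sample paths are made up for illustration, and a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The example file from the post above, one record per paragraph.
ROBOTS_TXT = """\
# /robots.txt file
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a robots.txt-compliant robot named `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical agents and paths, chosen to exercise each record.
checks = [
    ("webcrawler", "/tmp/scratch.html"),   # empty Disallow: nothing is off-limits
    ("lycra", "/index.html"),              # Disallow: / closes the whole site
    ("SomeOtherBot", "/logs/access.log"),  # falls through to the '*' record
    ("SomeOtherBot", "/index.html"),
]
for agent, path in checks:
    print(agent, path, "allowed" if allowed(agent, path) else "disallowed")
```

Each Disallow value is matched as a plain prefix of the URL path, which is why `/logs/access.log` is covered by `Disallow: /logs`.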
webgyrl (Author), posted February 7, 2005

Andy,

So if I want all user agents to exclude certain folders, I would put the following into a text file named robots.txt and upload it to the public_html directory?

    # /robots.txt file
    User-agent: *
    Disallow: /folder1
    Disallow: /folder2

Nat
TCH-Rob, posted February 7, 2005

Nat,

Yes, and all robots that follow the robots.txt file will not search those folders.
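Before uploading a file like Nat's, it can help to generate it and test a few paths locally. The sketch below uses Python's standard `urllib.robotparser`; the helper names `build_robots_txt` and `blocked`, the agent name `AnyBot`, and the test paths are hypothetical, and the folder names are the placeholders from the post.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(folders):
    """Return robots.txt text asking every robot to skip the given folders."""
    lines = ["# /robots.txt file", "User-agent: *"]
    lines += [f"Disallow: {folder}" for folder in folders]
    return "\n".join(lines) + "\n"


def blocked(robots_txt, path, agent="AnyBot"):
    """Return True if a compliant robot would be told to stay away from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


text = build_robots_txt(["/folder1", "/folder2"])
print(text)

# Disallow values are prefixes, so /folder10 is caught by /folder1 as well.
for path in ("/folder1/page.html", "/folder2/", "/folder10/page.html", "/index.html"):
    print(path, "blocked" if blocked(text, path) else "allowed")
```

Because matching is by prefix, `Disallow: /folder1` also covers any path that merely begins with `/folder1`, such as `/folder10/`. Writing the rule with a trailing slash (`Disallow: /folder1/`) limits it to the contents of that one directory.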
webgyrl (Author), posted February 7, 2005

OK, great. Just as long as they can't get in there... then it's all good.
Etanisla, posted February 7, 2005

webgyrl wrote: "OK great. Just as long as they can't get in there... then it's all good"

Keep in mind that robots.txt will only stop those bots that actually pay attention to it. All the big search-engine bots (MSN, Yahoo, Google) will abide by robots.txt. However, some bots will not, and will sometimes _start_ their crawl in the very places you just made off-limits.

So keep an eye on your logs, and be prepared to ban by user-agent (and/or IP) in your .htaccess should a rogue bot come your way.

(Sorry if I've reignited a worry. But I've learned this the hard way.)
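Watching the logs for bots that ignore robots.txt can be partly automated. The following is a minimal sketch, assuming the server writes Apache's "combined" log format; the function name `find_offenders`, the sample log lines, the IP addresses, and the bot names are all invented for illustration, and the folder names are the placeholders used earlier in the thread.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, e.g.
# IP - - [date] "METHOD /path HTTP/1.1" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Path prefixes listed under Disallow in robots.txt (placeholders).
DISALLOWED = ("/folder1", "/folder2")


def find_offenders(lines, disallowed=DISALLOWED):
    """Count requests per (IP, user-agent) that hit a disallowed path prefix."""
    prefixes = tuple(disallowed)
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(prefixes):
            hits[(match.group("ip"), match.group("agent"))] += 1
    return hits


# Hypothetical log lines for demonstration only.
sample = [
    '203.0.113.9 - - [07/Feb/2005:10:00:00 -0500] "GET /folder1/members.html HTTP/1.1" 200 512 "-" "BadBot/1.0"',
    '203.0.113.9 - - [07/Feb/2005:10:00:02 -0500] "GET /folder2/ HTTP/1.1" 403 0 "-" "BadBot/1.0"',
    '198.51.100.4 - - [07/Feb/2005:10:01:00 -0500] "GET /index.html HTTP/1.1" 200 1024 "-" "GoodBot/2.1"',
]

for (ip, agent), count in find_offenders(sample).most_common():
    print(ip, agent, count)
```

The script counts every request to those folders, so legitimate members' browsers will appear in the output alongside any bots. Whatever remains after ruling those out is a candidate for the user-agent or IP ban in .htaccess described in the post above.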
webgyrl (Author), posted February 8, 2005

Thanks for the warning! Will keep an eye on the logs for sure.

Nat