

Posted

Hey gang,

 

I remember waaaaay back there was a discussion about the robots.txt file that we include in our public_html folder. I can't seem to find it.

 

I was wondering if someone could point me to some instructions on how to exclude certain folders from the bots so that those folders aren't indexed by the web search engines.

 

Basically, there are three folders on a website I host that the owners don't want web bots to crawl and index, as they are private folders. They will be protected with aMember, but I am not sure if this is enough to disallow the bots from indexing them.

 

Thanks,

 

Nat

Posted

Hi Nat,

 

You can read the robots.txt specification, but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

 

# /robots.txt file

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

 

The first line, starting with '#', is a comment.

 

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

 

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URL's on a server start with '/', this means the entire site is closed off.

 

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
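If you want to convince yourself the rules behave as described, Python's standard library ships a robots.txt parser. A quick sketch (the bot name "somebot" is just a stand-in for any crawler not named earlier):

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, parsed directly from a string.
rules = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# 'webcrawler' has nothing disallowed: it may go anywhere.
print(rp.can_fetch("webcrawler", "/tmp/file.html"))   # True

# 'lycra' has the whole site closed off.
print(rp.can_fetch("lycra", "/index.html"))           # False

# Any other robot is barred only from /tmp and /logs.
print(rp.can_fetch("somebot", "/logs/access.log"))    # False
print(rp.can_fetch("somebot", "/index.html"))         # True
```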

Posted

Andy,

 

So if I want all user agents to exclude certain folders, I would put the following into a plain-text file named robots.txt and upload it to the public_html directory?:

 

# /robots.txt file

User-agent: *
Disallow: /folder1
Disallow: /folder2
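One way to sanity-check a file like that before uploading is the same standard-library parser (a sketch; /folder1 and /folder2 are just your placeholder names):

```python
from urllib.robotparser import RobotFileParser

# The proposed robots.txt, parsed from a string.
rules = """\
User-agent: *
Disallow: /folder1
Disallow: /folder2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The private folders are blocked for every compliant bot...
print(rp.can_fetch("Googlebot", "/folder1/private.html"))  # False

# ...while the rest of the site stays crawlable.
print(rp.can_fetch("Googlebot", "/index.html"))            # True
```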

 

Nat

Posted

OK great. Just as long as they can't get in there... then it's all good :)

Posted

> OK great. Just as long as they can't get in there... then it's all good :)

 

Keep in mind that robots.txt will only stop those bots that actually pay attention to it. All the big search-engine bots (Microsoft, Yahoo, Google) will abide by robots.txt.

 

However, there are some that will not abide by robots.txt, and will sometimes _start_ their crawl in the very places you just made off limits. So keep an eye on your logs, and be prepared to ban by user-agent (and/or IP) in your .htaccess should a rogue bot come your way.

 

(Sorry if I've reignited a worry. But I've learned this the hard way.)
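For what it's worth, a ban like that in .htaccess might look something like this on Apache (the bot name and IP are made-up examples, not real offenders):

```apache
# Flag a rogue bot by its User-Agent string (case-insensitive match).
SetEnvIfNoCase User-Agent "BadBot" bad_bot

# Deny the flagged bot and a specific offending IP; allow everyone else.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 203.0.113.42
```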

Posted

Thanks for the warning!

 

Will keep an eye on the logs for sure.

 

Nat
