

Posted

Hey gang,

 

I remember waaaaay back there was a discussion about the robots.txt file that we include in our public_html folder. I can't seem to find it.

 

I was wondering if someone could point me to some instructions on how to exclude certain folders from the bots so that those folders aren't indexed by the web search engines.

 

Basically, there are three folders on a website I host that the owners don't want web bots to crawl and index, as they are private folders. They will be protected with aMember, but I am not sure if this is enough to disallow the bots from indexing them.

 

Thanks,

 

Nat

Posted

Hi Nat,

 

You can read the robots.txt specification, but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

 

# /robots.txt file

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

 

The first line, starting with '#', is a comment.

 

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

 

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URL's on a server start with '/', this means the entire site is closed off.

 

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
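If you want to convince yourself the rules behave as described, Python's standard library ships a robots.txt parser. A quick sketch (the bot name "somebot" is just a stand-in for any crawler not named earlier):

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, parsed directly from a string.
rules = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# 'webcrawler' has nothing disallowed: it may go anywhere.
print(rp.can_fetch("webcrawler", "/tmp/file.html"))   # True

# 'lycra' has the whole site closed off.
print(rp.can_fetch("lycra", "/index.html"))           # False

# Any other robot is barred only from /tmp and /logs.
print(rp.can_fetch("somebot", "/logs/access.log"))    # False
print(rp.can_fetch("somebot", "/index.html"))         # True
```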

Posted

Andy,

 

So if I want all user agents to exclude certain folders, I would put the following into a plain-text file named robots.txt and upload it to the public_html directory?:

 

# /robots.txt file

User-agent: *
Disallow: /folder1
Disallow: /folder2
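One way to sanity-check a file like that before uploading is the same standard-library parser (a sketch; /folder1 and /folder2 are just your placeholder names):

```python
from urllib.robotparser import RobotFileParser

# The proposed robots.txt, parsed from a string.
rules = """\
User-agent: *
Disallow: /folder1
Disallow: /folder2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The private folders are blocked for every compliant bot...
print(rp.can_fetch("Googlebot", "/folder1/private.html"))  # False

# ...while the rest of the site stays crawlable.
print(rp.can_fetch("Googlebot", "/index.html"))            # True
```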

 

Nat

Posted

OK great. Just as long as they can't get in there... then it's all good :)

Posted

> OK great. Just as long as they can't get in there... then it's all good :)

 

Keep in mind that robots.txt will only stop those bots that actually pay attention to it. All the big search-engine bots (Microsoft, Yahoo, Google) will abide by robots.txt.

 

However, there are some that will not abide by robots.txt, and will sometimes _start_ their crawl in the very places you just made off limits. So keep an eye on your logs, and be prepared to ban by user-agent (and/or IP) in your .htaccess should a rogue bot come your way.

 

(Sorry if I've reignited a worry. But I've learned this the hard way.)
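For what it's worth, a ban like that in .htaccess might look something like this on Apache (the bot name and IP are made-up examples, not real offenders):

```apache
# Flag a rogue bot by its User-Agent string (case-insensitive match).
SetEnvIfNoCase User-Agent "BadBot" bad_bot

# Deny the flagged bot and a specific offending IP; allow everyone else.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 203.0.113.42
```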

Posted

Thanks for the warning!

 

Will keep an eye on the logs for sure.

 

Nat
