How to Control Search Engine Access and Indexing of Your Website
Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results.
The key is a simple file called robots.txt.
A simple example
Here is a simple example of a robots.txt file.
User-Agent: Googlebot
Disallow: /logs
The User-Agent line specifies that the next section is a set of instructions just for the Googlebot. All the major search engines read and obey the instructions you put in robots.txt, and you can specify different rules for different search engines if you want to. The Disallow line tells Googlebot not to access files in the logs sub-directory of your site. The contents of the pages you put into the logs directory will not show up in Google search results.
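To see how a crawler might interpret the example above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the example.com URLs, and the "OtherBot" user agent are illustrative assumptions, not part of the original article, and real search engines may apply their own matching rules.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the article, one directive per line.
ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow: /logs
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules.

    Hypothetical helper for illustration; it wraps the standard-library parser.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "http://example.com/logs/access.html"),
        ("Googlebot", "http://example.com/index.html"),
        ("OtherBot", "http://example.com/logs/access.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Because the rules name only Googlebot, a crawler with a different user agent is not restricted by this section, which is why separate rules can be written per search engine.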
Preventing access to a file
If you have a news article on your site that is only accessible to registered users, you'll want it excluded from Google's results. To do this, simply add a META tag to the HTML file, so that it starts something like:
<html>
<head>
<meta name="googlebot" content="noindex">
…
This stops Google from indexing this file. META tags are particularly useful if you have permission to edit the individual files but not the site-wide robots.txt.
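As a rough way to check that a page carries the tag, the sketch below scans HTML for a META element like the one shown above, using Python's standard html.parser module. The class name RobotsMetaParser and the function is_noindexed are hypothetical names chosen for illustration, and the check covers only the bot-specific tag described here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="<bot>" content="noindex"> tag is present."""

    def __init__(self, bot_name: str = "googlebot"):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        # The content attribute may hold several comma-separated directives.
        directives = {d.strip().lower() for d in attr_map.get("content", "").split(",")}
        if name == self.bot_name and "noindex" in directives:
            self.noindex = True


def is_noindexed(html: str, bot_name: str = "googlebot") -> bool:
    """Return True if the page asks bot_name not to index it (illustrative helper)."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return parser.noindex


if __name__ == "__main__":
    page = '<html><head><meta name="googlebot" content="noindex"></head></html>'
    print(is_noindexed(page))
```

A check like this could be run over the files you are able to edit to confirm the tag is in place before relying on it.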
Learn more
You can find out more about robots.txt at http://www.robotstxt.org and at Google's Webmaster help center, which contains lots of helpful information, including:
- How to create a robots.txt file
- Descriptions of each user-agent that Google uses
- How to use pattern matching
- How often we recrawl your robots.txt file
Here is a useful list of the bots used by the major search engines: http://www.robotstxt.org/wc/active/html/index.html
via [Google blog]