Google's crawler is very greedy. Sometimes the content of a web page is too sensitive to be exposed on Google, or you would rather not make it easy for attackers to find a server to exploit. In those cases (and many others) the robots.txt file can help. Google's bots look for a specific file called robots.txt on every domain they crawl. It is placed in the root (/) of the server. The file tells search engines which pages may or may not be crawled. Keep in mind that it is only a request: well-behaved robots honor it, but the file is publicly readable and does not stop anyone from fetching the paths it lists. Let's try to understand the basic structure of this special file.
The easiest way to create a robots.txt file is to use the Generate robots.txt tool in Webmaster Tools. Once you've created the file, you can use the Analyze robots.txt tool to make sure that it's behaving as you expect.
Once you've created your robots.txt file, save it to the root of your domain with the name robots.txt. This is where robots will check for your file. If it's saved elsewhere, they won't find it.
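Because robots only look in the root of the domain, the location of the file can be derived from any page address. A minimal sketch in Python, assuming the standard urllib.parse module; the function name robots_url and the example.com addresses are placeholders chosen for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address where a robot would look for robots.txt.

    Only the scheme and host of the page are kept; the path is always
    /robots.txt, and any query string or fragment is dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical page address, used only as an example.
    print(robots_url("http://www.example.com/shop/item.html?id=3"))
```

A file saved as /shop/robots.txt would therefore never be consulted, since the path part of the page address plays no role in the lookup.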
You can also create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file. The filename should be lowercase.
Syntax
The simplest robots.txt file uses two rules:
- User-agent: the robot the following rule applies to
- Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
What should be listed on the User-agent line?
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
User-agent: *
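To see how a named entry and the asterisk entry interact, the rules can be fed to Python's standard urllib.robotparser module. This is only a sketch of how one parser reads the file, not a statement about how every crawler behaves. The bot names ExampleBot and OtherBot, the paths /private/ and /tmp/, and the example.com domain are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one entry for a named bot, one for everyone else.
RULES = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if this parser lets `agent` fetch `url` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())  # parse from memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    site = "http://www.example.com"  # placeholder domain
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/private/a.html", "/tmp/x.html"):
            print(agent, path, allowed(agent, site + path))
```

In this parser a bot that matches a named entry follows that entry only, so ExampleBot is kept out of /private/ but not out of /tmp/, while any other bot falls back to the asterisk entry.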
As you can see, the options in this special file are quite self-explanatory. Let's move
on to the most important part, the rules. The idea is that all links and pages are
allowed (this means they can be indexed by the engines), unless stated
otherwise. In other words, we only need to list the paths and files that are
disallowed for the engines. For example, if we don't want the engines to crawl
the /cgi-bin directory, our robots.txt file could look like this:
User-agent: *
Disallow: /cgi-bin/
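This bans all robots from crawling the /cgi-bin/ directory; the rest of the domain
stays open. (To ban robots from the whole domain, you would use Disallow: / instead.)
If you want to disallow more than one path, or a path and a file, use something like this: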
User-agent: *
Disallow: /cgi-bin/
Disallow: /secrets/
Disallow: /confidential.html
This will allow the robots to crawl the domain, except for the cgi-bin and secrets folders and one file called confidential.html in the root of the site.
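One way to sanity-check rules like these before uploading them is Python's standard urllib.robotparser module. The sketch below assumes the file name is written with a leading slash (/confidential.html), since Disallow values are matched against the start of the URL path. The helper build_parser, the example.com domain and the test paths are placeholders for illustration, and other robots may interpret edge cases differently from this parser:

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /secrets/
Disallow: /confidential.html
"""

# Same rules, but with the file name written without a leading slash.
RULES_NO_SLASH = RULES.replace("/confidential.html", "confidential.html")

SITE = "http://www.example.com"  # placeholder domain

if __name__ == "__main__":
    parser = build_parser(RULES)
    paths = ("/index.html", "/cgi-bin/search.pl",
             "/secrets/plan.txt", "/confidential.html")
    for path in paths:
        verdict = "allowed" if parser.can_fetch("*", SITE + path) else "blocked"
        print(f"{path}: {verdict}")
```

With RULES_NO_SLASH this parser does not block /confidential.html at all, because the rule no longer matches the beginning of the path. Starting every Disallow value with a slash avoids that surprise.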
This is what basic robots.txt files look like. For more examples, use Google (yes,
again!). You could use this query to find a lot of advanced robots.txt files:
filetype:txt robots