GOOGLE TRUTHS - HOW GOOGLE WORKS - How Robots.txt Works
How Robots.txt Works


The Google crawler is greedy: it tries to fetch every page it can reach. Sometimes the content of a web page is too sensitive to be exposed on Google, or you want to make it harder for hackers and crackers to find a server they could exploit. In those cases (and many others) the robots.txt file can help. For every domain it crawls, Googlebot looks for a specific file called robots.txt, normally placed in the root directory (/) of the server. This file tells search engines which pages they may or may not crawl. Let's look at the basic structure of this special file.

The easiest way to create a robots.txt file is to use the Generate robots.txt tool in Webmaster Tools. Once you've created the file, you can use the Analyze robots.txt tool to make sure that it's behaving as you expect.

Once you've created your robots.txt file, save it to the root of your domain with the name robots.txt. This is where robots will check for your file. If it's saved elsewhere, they won't find it.
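Because robots only look at the root of the domain, the location of the file can be derived from any page URL on the site. The following is a minimal Python sketch of that rule; the function name robots_url and the example.com address are illustrative assumptions, not part of any Google tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/items/index.html?page=2"))
# prints: http://www.example.com/robots.txt
```

A file saved at http://www.example.com/shop/robots.txt would never be produced by this rule, which is why robots do not find it there.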

You can also create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file. The filename should be lowercase.
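If you write the file by hand, a quick sanity check before uploading can catch the mistakes mentioned above. This is only a rough Python sketch: the helper name looks_like_robots_file is hypothetical, and the HTML test is a crude heuristic that only looks at the start of the file.

```python
def looks_like_robots_file(filename: str, content: bytes) -> bool:
    """Rough check: lowercase name, ASCII-only bytes, not an HTML page."""
    # The file must be named exactly "robots.txt", in lowercase.
    if filename != "robots.txt":
        return False
    # Reject anything containing non-ASCII bytes.
    if not content.isascii():
        return False
    # Crude heuristic: an HTML document usually starts with one of these.
    start = content.lstrip().lower()
    return not start.startswith((b"<!doctype", b"<html"))


print(looks_like_robots_file("robots.txt", b"User-agent: *\nDisallow: /cgi-bin/\n"))
print(looks_like_robots_file("Robots.TXT", b"User-agent: *\n"))
```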

Syntax
The simplest robots.txt file uses two rules:

  • User-agent: the robot the following rule applies to
  • Disallow: the URL you want to block

These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
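To illustrate how lines group into entries, here is a simplified Python sketch of a parser. It is an assumption-laden toy, not how any real crawler is implemented: it only understands User-agent and Disallow, skips blank lines and # comments, and starts a new entry whenever a User-agent line follows a rule line. The name parse_entries and the bot names in the sample are made up for the example.

```python
def parse_entries(text: str) -> list[dict]:
    """Group User-agent and Disallow lines into entries (simplified)."""
    entries = []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        # Strip comments and surrounding whitespace; skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share one entry.
            if not last_was_agent:
                current = {"user_agents": [], "disallow": []}
                entries.append(current)
            current["user_agents"].append(value)
            last_was_agent = True
        else:
            last_was_agent = False
            if field == "disallow" and current is not None:
                current["disallow"].append(value)
    return entries


sample = """\
User-agent: BotA
User-agent: BotB
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: *
Disallow: /private/
"""
for entry in parse_entries(sample):
    print(entry)
```

The sample yields two entries: one shared by the hypothetical BotA and BotB with two Disallow lines, and one for all other robots.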

What should be listed on the User-agent line?

A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:

User-agent: *
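The difference between a named entry and the asterisk entry can be tried out with Python's standard urllib.robotparser module. The sketch below assumes a hypothetical robot called ExampleBot and made-up paths; it shows how that one parser chooses between the entries, and other robots may resolve overlapping entries differently.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# ExampleBot matches its own entry, so only /private/ is blocked for it.
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/page.html"))  # False
print(parser.can_fetch("ExampleBot", "http://www.example.com/tmp/cache.html"))     # True

# Any other robot falls back to the asterisk entry.
print(parser.can_fetch("OtherBot", "http://www.example.com/tmp/cache.html"))       # False
print(parser.can_fetch("OtherBot", "http://www.example.com/private/page.html"))    # True
```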

As you can see, the options in this file are largely self-explanatory. Let's move on to the most important part, the rules. Everything is allowed by default (that is, every link and page can be crawled and indexed by the engines) unless stated otherwise. In other words, we only need to list the paths and files that are disallowed. For example, if we don't want the engines to crawl the /cgi-bin directory, our robots.txt file could look like this:

User-agent: *
Disallow: /cgi-bin/


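This bans all robots from crawling anything under /cgi-bin/; the rest of the domain stays crawlable. (To ban robots from the whole domain, the rule would be Disallow: / instead.) If you want to disallow more than one path, or a path plus one or more files, use something like this: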

User-agent: *
Disallow: /cgi-bin/
Disallow: /secrets/
Disallow: /confidential.html


This allows robots to crawl the domain except for the cgi-bin and secrets folders and one file called confidential.html. This is what basic robots.txt files look like. For more examples, use Google (yes, again!). The following query finds a lot of more advanced robots.txt files:

filetype:txt robots
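To make the matching in the examples above concrete, here is a minimal Python sketch of the "allowed unless disallowed" idea, treating each Disallow value as a plain path prefix. It is a simplification under stated assumptions: the function name is_allowed is made up, every rule is written with a leading slash, and the extra features found in advanced files (wildcards, Allow lines) are ignored.

```python
def is_allowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True unless some non-empty Disallow value is a prefix of path."""
    for rule in disallow_rules:
        # An empty Disallow value blocks nothing.
        if rule and path.startswith(rule):
            return False
    # No rule matched, so the path is allowed by default.
    return True


rules = ["/cgi-bin/", "/secrets/", "/confidential.html"]
for path in ["/index.html", "/cgi-bin/search.pl", "/secrets/plan.txt", "/confidential.html"]:
    print(path, is_allowed(path, rules))
```

With these rules only /index.html is reported as allowed. Passing ["/"] as the rule list blocks every path, and an empty list blocks none.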
