A retrieved page is itself cached, or stored, in the Google document servers (called, in Googlese, the doc servers or doc server farm), along with its PageRank. (The PageRank is a measure used to sort documents by importance.) With the text of a document stored in the doc servers, the post-analysis keyword content of a Web page is used to populate the Google index servers. Each keyword stored in the index servers points to every document in the doc server farm that contains the term.
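The relationship between the doc servers and the index servers can be pictured as an inverted index. The following is only a minimal sketch under simplifying assumptions: the `doc_store` dictionary, the document IDs, the sample text, and the PageRank values are all invented for illustration, and nothing here reflects Google's actual data structures.

```python
from collections import defaultdict

# Hypothetical stand-in for the doc servers: document ID -> cached text
# plus a made-up PageRank value.
doc_store = {
    "doc1": {"text": "Google crawls the Web and caches pages", "pagerank": 0.8},
    "doc2": {"text": "The index maps keywords to cached pages", "pagerank": 0.5},
}

def build_index(store):
    """Map each lowercase keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, doc in store.items():
        for word in doc["text"].lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical stand-in for the index servers.
index = build_index(doc_store)
print(sorted(index["pages"]))  # prints ['doc1', 'doc2']
```

Looking up a keyword in `index` yields the IDs of the documents that contain it, which can then be fetched from `doc_store`.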
The Google query processor has several parts, including the user interface (search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.
Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a Google query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. Google closely guards the formulas it uses to calculate relevance; they're tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.
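Because the real formulas are secret, the following can only be a toy illustration of how two of the factors named above, popularity and term proximity, might be combined. The functions `min_distance` and `toy_score` and the formula itself are assumptions invented for this sketch, not Google's method.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap, in word positions, between the two terms, or None if either is missing."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def toy_score(text, term_a, term_b, popularity):
    """Invented formula: popularity scaled down as the two terms drift apart."""
    words = text.lower().split()
    gap = min_distance(words, term_a, term_b)
    if gap is None:
        return 0.0  # a page missing either term gets no score
    return popularity / (1 + gap)

near = toy_score("fast engine for search", "fast", "engine", popularity=1.0)
far = toy_score("fast search engine", "fast", "engine", popularity=1.0)
print(near > far)  # prints True
```

With equal popularity, the page where the terms sit next to each other scores higher than the page where another word separates them.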

When a user makes a search request, the Google Web server sends it on to software that analyzes the request to strip out words that are not indexed (mostly articles and prepositions). It then sends the keywords in the request, with a proximity rating, on to the index server farm. The index servers, along with the doc servers:
- Determine the documents pointed to by the keywords
- Sort these documents using each one's PageRank
- Provide links to these documents on the Web
- Provide a link to view the cached version of the document in the doc server farm
- Pull an excerpt from the page, using the cached version of the page, to give a quick idea of what it is about
- Return an initial result set of document excerpts and links, with links to retrieve further result sets of matches, rendered as HTML
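The steps above can be sketched as a small pipeline. This is a simplified illustration, not Google's implementation: the stop-word list, the sample documents, the URLs, the PageRank values, and the rule that a document must contain every keyword are all assumptions made for the example, and the proximity rating and HTML rendering are left out.

```python
# Assumed list of words that are not indexed.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for"}

# Hypothetical cached documents with made-up URLs and PageRank values.
DOCS = {
    "d1": {"url": "http://example.com/a", "pagerank": 0.4,
           "text": "A guide to the search engine index and how crawlers fill it"},
    "d2": {"url": "http://example.com/b", "pagerank": 0.9,
           "text": "How a search engine ranks pages in the index"},
    "d3": {"url": "http://example.com/c", "pagerank": 0.7,
           "text": "Recipes for baking bread"},
}

def keywords(query):
    """Strip out words that are not indexed."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = {}
    for doc_id, doc in docs.items():
        for word in doc["text"].lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def excerpt(text, term, width=3):
    """Pull a few words around the first occurrence of term from the cached text."""
    words = text.split()
    i = [w.lower() for w in words].index(term)
    return " ".join(words[max(0, i - width): i + width + 1])

def search(query, docs, index, page_size=10):
    terms = keywords(query)
    if not terms:
        return []
    # Assumption: keep only documents that every keyword points to.
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    # Sort the matching documents by PageRank, highest first.
    ranked = sorted(matches, key=lambda d: docs[d]["pagerank"], reverse=True)
    # Return the first result set as links plus excerpts.
    return [{"url": docs[d]["url"], "excerpt": excerpt(docs[d]["text"], terms[0])}
            for d in ranked[:page_size]]

results = search("the search engine", DOCS, build_index(DOCS))
for r in results:
    print(r["url"], "|", r["excerpt"])
```

Here the query keeps only the keywords "search" and "engine", and the page with the higher made-up PageRank is listed first.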
Google prides itself on the fact that most queries are answered in less than half a second. Considering the number of steps involved in answering a query, you can see that this is quite a technological feat.