Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google's index database. Google index is sorted alphabetically by Google search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
To improve search performance, Google ignores (doesn't Google index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a Google search, and therefore they can safely be discarded. The Google indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google's performance.
|
| Indexing the Web |
 |
Parsing - Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work. |
 |
Indexing Documents into Barrels - After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the Google indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer. |
 |
Sorting - In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordID and docID. Then the sorter, loads each basket into memory, sorts it and writes its contents into the short inverted barrel and the full inverted barrel. |
 |
URL Resolver - The Url Resolver read the anchors file and converts relative urls into absolute urls and in turn into docids. It puts the anchor text into the forward index, associated with the docid that the anchor points to. It also generates a Google database of links which are pairs of docids. The links Google database is used to compute Google pageranks for all the documents. |
 |
Sorter - The sorter takes the forward index, which is sorted by docID, and resorts it by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordids and offsets into the inverted index. A program called dumplexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. |
 |
Searcher - The searcher is run by a web server and uses the lexicon built by dumplexicon together with the inverted index and the pageranks to answer queries. |
|
|