A search engine is composed of the data structures that form the index, and the processes that interact with the index components. Several works have explored the architecture of efficient search engines [Arasu et al., 2001; Brin and Page, 1998; Zobel and Moffat, 2006]. Figure 2.5 models the interaction of the components within a search system.
The index is composed of four key structures. First, the vocabulary structure provides access to the set of terms that appear in the collection of documents that are indexed. At a minimum, the vocabulary maintains a pointer to the inverted list of each term in the collection. The vocabulary also maintains statistics about each term such as the number of documents in which the term appears, or the total number of occurrences of the term in the collection. Table 2.4 provides an example of the data that would be found in a vocabulary. For example, we can see that the term “swords” occurs a total of three times within two distinct documents.
Table 2.4: Vocabulary entries for selected terms in Tempest collection. For each term the vocabulary records the document- and collection-frequency of the term, the location of the term’s inverted list on disk, and the size of the list.
Term ft Ft Offset List Size
fairy 2 3 12718 8
farewell 3 6 12828 13
fish 6 12 13537 26
state 4 5 37739 15
storm 3 6 38128 14
swords 2 3 38858 8
sycorax 5 7 38878 19
Table 2.5: Document mapping table entries for selected documents in the Tempest collection. For each document, the table records the length of the document in bytes and terms; the count of unique terms in the document; the location of the document on disk; and the weight assigned to the document by the document ranking function.
Doc. ID Length Words Unique Words Offset Weight
611 294 45 39 84050 7.055118
612 125 18 18 84344 4.220472
613 117 17 17 84469 4.094488
614 116 18 17 84586 4.283464
615 69 9 8 84702 3.118110
Second, for each term in the vocabulary, an inverted list of postings is stored that records where that term appears in the collection. Inverted lists were discussed in Section 2.4.1.
Third, a document mapping table records statistics for each document such as document length, the number of distinct terms in the document, and the location of the document on disk. Table 2.5 shows sample entries in a document mapping table. For example, in the table we see that document number 612 contains 18 words, and has a length of 125 bytes.
Finally, some form of the collection is optionally maintained to generate the document snippets that are often presented with the query results.
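As a minimal sketch, the vocabulary and the document mapping table described above could be modelled in Python as follows. The class and field names are hypothetical and chosen for illustration only; the values are copied from Tables 2.4 and 2.5. The unit of the list size is not stated in the table, so the sketch leaves it unspecified.

```python
from dataclasses import dataclass


@dataclass
class VocabEntry:
    """One vocabulary record (hypothetical field names)."""
    doc_freq: int     # f_t: number of documents containing the term
    coll_freq: int    # F_t: total occurrences of the term in the collection
    list_offset: int  # location of the term's inverted list on disk
    list_size: int    # size of the inverted list, as recorded in Table 2.4


@dataclass
class DocEntry:
    """One document mapping table record (hypothetical field names)."""
    length_bytes: int      # document length in bytes
    num_terms: int         # document length in terms
    num_unique_terms: int  # count of distinct terms in the document
    offset: int            # location of the document on disk
    weight: float          # weight assigned by the ranking function


# A few rows from Table 2.4, keyed by term.
vocabulary = {
    "swords": VocabEntry(2, 3, 38858, 8),
    "sycorax": VocabEntry(5, 7, 38878, 19),
}

# A few rows from Table 2.5, keyed by document identifier.
doc_map = {
    612: DocEntry(125, 18, 18, 84344, 4.220472),
    613: DocEntry(117, 17, 17, 84469, 4.094488),
}

entry = vocabulary["swords"]
print(entry.coll_freq, "occurrences in", entry.doc_freq, "documents")
```

Keying the vocabulary by term and the mapping table by document identifier mirrors the two lookups used at query time: term to inverted list, and document identifier to document statistics.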
Several processes interact with the index for different purposes. At index construction time a document parser tokenizes documents into terms. As each new document is encountered, it is assigned a document identifier, and an entry is added to the document mapping table. If required, a copy of each document is stored in the collection archive. Tokens and document identifiers are passed to the indexer. For each token, if the term has not been previously encountered, an entry is added to the vocabulary structure. If the term is not new, the vocabulary structure is updated to record another occurrence of the term. Then a posting is appended to the corresponding inverted list.
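The construction steps above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the implementation used in this work: the function name `build_index`, the regular-expression tokenizer, the dictionary layouts, and the use of word-level postings of the form (document identifier, position) are all hypothetical choices.

```python
import re


def build_index(documents):
    """Build a toy in-memory index from a list of document strings."""
    vocabulary = {}      # term -> {"doc_freq": f_t, "coll_freq": F_t}
    inverted_lists = {}  # term -> [(doc_id, position), ...]
    doc_map = {}         # doc_id -> per-document statistics
    archive = {}         # doc_id -> stored copy of the document text

    for doc_id, text in enumerate(documents):
        # Document parser: lowercase and split into alphanumeric tokens
        # (an assumed tokenization rule).
        tokens = re.findall(r"[a-z0-9]+", text.lower())

        # Assign an identifier and add a document mapping table entry.
        doc_map[doc_id] = {
            "length_bytes": len(text.encode("utf-8")),
            "num_terms": len(tokens),
            "num_unique_terms": len(set(tokens)),
        }
        archive[doc_id] = text  # optional copy for snippet generation

        for position, term in enumerate(tokens):
            if term not in vocabulary:
                # New term: add a vocabulary entry and an empty list.
                vocabulary[term] = {"doc_freq": 0, "coll_freq": 0}
                inverted_lists[term] = []
            stats = vocabulary[term]
            postings = inverted_lists[term]
            # First occurrence of the term in this document.
            if not postings or postings[-1][0] != doc_id:
                stats["doc_freq"] += 1
            stats["coll_freq"] += 1
            # Append a posting to the term's inverted list.
            postings.append((doc_id, position))

    return vocabulary, inverted_lists, doc_map, archive


docs = ["The storm and the swords", "Swords, swords!"]
vocab, lists, doc_map, archive = build_index(docs)
print(vocab["swords"], lists["swords"])
```

Because documents are processed in identifier order, each inverted list is built by appending only, which is why the sketch can detect a new document for a term by inspecting the last posting.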
At search time, a query parser splits the user query into terms that match the format of those in the index. The query processor makes use of a document ranking metric to produce a result set. Regardless of the ranking metric employed, the following process is observed. First, for each query term the associated entry is retrieved from the vocabulary to obtain term-specific statistics, then its inverted list is loaded from disk. For each posting in the inverted list a partial similarity score is calculated between the document to which the posting refers and the query. If the document that the posting refers to has not been previously encountered, an accumulator is initialised with the partial score for the document. If the document has been previously encountered, its accumulator is updated with the partial score. After all query terms have been processed, the set of accumulators is partially sorted to obtain the top ranked results, and document summaries are optionally produced by fetching the best matching documents from the collection archive. Result pages are then presented to the user. The results are typically ranked by predicted similarity, and often a summary or snippet of each document is presented with the results [Tombros and Sanderson, 1998].
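The accumulator-based process above can be sketched as follows. The sketch makes several assumptions that are not taken from this work: the function name `rank`, document-level postings of the form (document identifier, in-document frequency), and a simple TF-IDF style partial score that stands in for whichever ranking metric is employed. The partial sort is performed with `heapq.nlargest`, which selects the top k accumulators without sorting the full set.

```python
import heapq
import math


def rank(query_terms, vocabulary, inverted_lists, num_docs, k=10):
    """Return the k highest scoring (doc_id, score) pairs for a query."""
    accumulators = {}  # doc_id -> running similarity score

    for term in query_terms:
        entry = vocabulary.get(term)
        if entry is None:
            continue  # term does not occur in the collection

        # Term-specific statistic from the vocabulary (assumed weighting).
        idf = math.log(1 + num_docs / entry["doc_freq"])

        # In a real system the list would be loaded from disk here.
        for doc_id, f_dt in inverted_lists[term]:
            partial = (1 + math.log(f_dt)) * idf
            if doc_id not in accumulators:
                accumulators[doc_id] = partial   # initialise accumulator
            else:
                accumulators[doc_id] += partial  # update accumulator

    # Partial sort: keep only the k best accumulators.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Toy data: three documents, two indexed terms.
vocabulary = {"storm": {"doc_freq": 2}, "swords": {"doc_freq": 1}}
inverted_lists = {"storm": [(0, 1), (2, 3)], "swords": [(2, 1)]}
results = rank(["storm", "swords", "missing"], vocabulary,
               inverted_lists, num_docs=3, k=2)
print(results)
```

Snippet generation would follow as a separate step, fetching only the k returned documents from the collection archive.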
It is clear that the evaluation of a query requires access to a wide selection of structures. Often these structures are located on disk, but they can also be held in memory, or split between the two. The size of the collection affects such design decisions. For example, for small collections, the vocabulary structure can reside in main memory. However, as it has been shown that the rate at which new terms are encountered remains almost constant as collection size increases [Williams and Zobel, 2005], for larger collections it is likely that the vocabulary, or at least part of it, will reside on disk. The lists for individual terms vary in size, with the lists for common terms such as “the”, “and”, and “as” becoming extremely long. As the inverted lists form the largest part of the index, they are typically stored on disk. The size of the document mapping table is directly proportional to the number of documents in the collection, and the table can reside either on disk or in main memory. Finally, result summaries of matching documents require access to the document collection itself. The document collection is significantly larger than the index and is most often stored on disk [Zobel and Moffat, 2006]. Fast access to frequently accessed on-disk data can be achieved through the use of a cache. We discuss caching further in Chapter 6.
As new pages are regularly added to the web, and existing pages change frequently [Fetterly et al., 2004], the ability to update the search engine index is crucial. Index update strategies fall into one of two broad categories: index rebuilds and dynamic updates. An index rebuild builds a new index for the entire collection, then replaces the existing index, while dynamic updates attempt to modify the existing index [Lester et al., 2005b; Büttcher and Clarke, 2005a]. To limit the complexity of the experimental environment in this work, we only consider static document collections requiring a single index build, and note that index updates for the techniques presented in this work are possible via index rebuild approaches.