Contenido temático de las boletas censales 1920-2010

In this chapter of the thesis, we investigated the effectiveness and efficiency of cluster-based retrieval for various clustering scenarios and using several parameters. We showed that CBR is a worthwhile retrieval technique as an alternative or complementary approach to FS. To improve the efficiency of typical-CBR, we first proposed some alternative query processing techniques. Next, as the most essential contribution of this chapter, we introduced a cluster-skipping inverted index structure (CS-IIS) that is shown to be superior to the other CBR strategies that use an ordinary document-level index. We presented a wide range of experiments involving automatically clustered and manually categorized datasets, and automatically and manually determined best-cluster sets. In all cases, CS-IIS provides significant improvements for the in-memory (CPU) time efficiency. Fur- thermore, under the realistic assumption, we showed that the slightly larger disk access cost of CS-IIS can also be compensated by the aforementioned gains.

Finally, we emphasize that typical-CBR with CS-IIS can be even more efficient than FS, if the best-cluster computation time is not involved. In the next chapter, we further improve our CS-IIS data structure so that the above restriction can be relaxed. We will also provide efficiency results in a framework where all index files are stored in a compressed form, a practice which is possibly adapted by all large scale IR systems and Web search engines.

Search Using Document Groups:

Incremental Cluster-Based

Retrieval

In this chapter, we propose a modified version of our cluster-skipping inverted index structure (CS-IIS) and a new, incremental, cluster-based retrieval (CBR) approach. In Section 4.1, we discuss the motivation for this part of our re- search and list our major contributions. Next, in Section 4.2, we introduce the incremental-CBR strategy that operates on top of the CS-IIS, which is enriched to include cluster centroid information. In Section 4.3, we discuss the compression of the CS-IIS with an emphasis on the benefits of document id reassignment in our framework. Sections 4.4 and 4.5 are devoted to experimental setup and results, respectively. We extensively evaluate the proposed strategy and compare to an enhanced FS implementation based on dynamic pruning and skips [79]. In Section 4.6, we show that our new strategy coupled with CS-IIS can be used in a dynamic pruning framework for Web search engines, where the documents are simply clustered according to their Web sites. Finally, we conclude and point to future work directions in Section 4.7.

CHAPTER 4. SEARCH USING DOCUMENT GROUPS: INCR.-CBR 87

4.1 Introduction

In Chapter 3, we have introduced the CS-IIS to improve the efficiency of second stage of typical-CBR; i.e., selecting the best-documents that belong to the best- clusters. In this chapter, we attempt to optimize both stages of typical-CBR so that almost no redundant work is done. That is, our goal is to design a CBR strategy that can overcome the efficiency weaknesses of typical-CBR and be as efficient as FS while still providing comparable effectiveness with FS and typical-CBR. We envision that our new CBR approach can either be used in environments that are inherently clustered/categorized due its own application requirements (such as the Web directories or digital libraries), or in the cases where the collection is clustered essentially for the purposes of search efficiency; i.e., as another dynamic pruning technique for FS (see Section 3.2.1.2 for others). We propose some modifications on CS-IIS and based on this structure introduce a new CBR strategy. In the modified CS-IIS file, in addition to the cluster membership information, within-cluster term frequency information is also embedded into the inverted index. By this extension, centroids are now stored along with the original term posting lists. This enhanced inverted file eliminates the need for accessing separate posting lists for centroid terms (recall that typical-CBR uses a separate centroid IIS for best-cluster selection, as shown in Figure 3.1). In the new CBR method, the computations required for selecting the best-clusters and the computations required for selecting the best-documents of such clusters are performed together in an incremental and interleaved fash- ion. The query terms are processed in a discrete manner in non-increasing term weight order. That is, we envision a term-at-a-time query processing mode in this work; whereas another highly efficient alternative, document-at-a-time is out of scope [15]. As we switch from the current query term to the next, the set of best-clusters is re-computed and can dynamically change. In the document matching stage of CBR only the portions of the current query term posting list corresponding to the latest best-clusters set are considered. The rest is skipped, hence is not involved in document matching. During document ranking, only the members of the most recent best-clusters set with a non-zero similarity to the

query are considered.

In the literature, it is observed that the size of an inverted index file can be very large [118]. As a remedy, several efficient compression techniques are proposed that significantly reduce the file size. In this chapter, we concentrate on the IR strategies with compression where the performance gains of our approach become more emphasized. Indeed, our incremental-CBR strategy with the new inverted file is tailored to be most beneficial in such a compressed environment. That is, skipping irrelevant portions of the posting lists during query processing eliminates the substantial decompression overhead (as in [79]) and provides further improvement in efficiency. In compression, we exploit the use of multiple posting list compression parameters and reassign document ids of individual cluster members to increase the compression rate, as recently proposed in the literature [23, 102].

The proposed approach promises significant efficiency improvements: If the memory is scarce (say, for digital libraries and proprietary organizations) and the index files have to be kept on disk, the incremental-CBR algorithm with modified CS-IIS allows the queries to be processed by only one direct disk access per query term (assuming that a posting list is read entirely at once). Furthermore, even if the centroid and/or document index is stored in memory, which is probable with the recent advances in hardware (see [105], as an example), the CS-IIS saves decoding and processing the document postings that are not from best-clusters, a non-trivial cost. We show that, the most important overhead of CS-IIS, longer posting lists, is reduced to an affordable overhead by our compression heuristics and even with a moderate disk, the gains in efficiency can compensate for the slightly longer disk transfer times (given that the number of clusters tends to be much smaller than the number of documents, as discussed in the previous section).

Our comparative efficiency experiments cover various query lengths and both storage size and execution time issues in a compressed environment. The results even with lengthy queries demonstrate the robustness of our approach. We show that our approach scales well with the collection size. In the experiments,

CHAPTER 4. SEARCH USING DOCUMENT GROUPS: INCR.-CBR 89

we use multiple query sets and three datasets of sizes 564MB, 3GB and 27GB, corresponding to 210,158, 1,033,461 and 4,293,638 documents, respectively.

4.1.1 Contributions

Our contributions in this chapter are:

• Introducing a pioneering CBR strategy: we introduce an original CBR method using an enriched cluster-skipping inverted index structure and refer to it as incremental-CBR. The proposed strategy interleaves query- cluster and query-document matching stages of typical-CBR for the first time in the literature.

• Embedding the centroid information in document inverted indexes: For memory-scarce environments (e.g., private networks, digital libraries, etc.) where the index files should be kept on disk, we eliminate disk accesses needed for centroid inverted index posting lists by embedding the centroid information in document posting lists. This embedded information enables best-cluster selection by only accessing the document inverted index. By this way during query processing, each query term requires only one direct disk access rather than separate disk accesses for centroid and document posting lists. (We assume that a posting list for a term is entirely fetched once it is located on the disk. It is also possible to read a posting list in a block-by-block manner for some index organizations and early pruning purposes. Even in this case, embedding centroid information may allow one less direct disk access. In such a setup, alternative organizations of CS-IIS can also be possible, e.g., by sorting the cluster blocks in each list according to some importance score and then reading the list in a blockwise manner. These directions are left as a future work.)

• Outperforming full search (FS) efficiency: we show that for large datasets incremental-CBR outperforms FS (and the IAE strategy, which usually

serves as a baseline typical-CBR strategy as discussed in the previous chapter) in efficiency while yielding comparable (or sometimes better) effectiveness figures. We also show that efficiency of our approach scale well with the collection size. The proposed approach is also superior to an enhanced implementation of FS approach that employs the “continue” pruning strategy accompanied with a skipping IIS, as described in [79].

• Adapting the compression concepts to a CBR environment: we adapt multiple posting list compression parameters and specify a cluster-based document id reassignment technique that best fits the features of CS-IIS. • CBR experiments using a realistic corpus size with no user behavior assump-

tion: we use the largest corpora reported in the CBR literature, assume no user interaction, and perform all decisions in an automatic manner. Only a few studies on CBR use collections as large as ours (e.g., [73]).

In document República Dominicana: Boletas utilizadas en los censos de población y vivienda (página 10-94)