
The CH and MCR methods both require random samples of the collection to produce correct results. In practice, however, generating random samples by random queries is subject to biases; some documents are more likely to be retrieved for a wide range of queries, and some might never appear in the results [Garcia et al., 2004]. Moreover, long documents are more likely to be retrieved, and there could be other biases in the collection-ranking functions.

We tested the algorithms on collections with different ranking functions and found similar estimations; longer documents are more likely to be returned, with a similar skew for both the Okapi BM25 [Robertson et al., 1992] and Cosine [Baeza-Yates and Ribeiro-Neto, 1999] ranking functions. Figure 6.3 illustrates the performance of the CH algorithm for estimating the size of the same collection discussed in Figure 6.2. As can be seen, the size of the collection is significantly underestimated across a range of ranking parameters.

[Figure 6.2 plot. x-axis: number of samples (10 to 160). y-axes: number of duplicate documents (0 to 400) and size of collection (0 to 450 000). Series: duplicate documents, multiple capture-recapture, capture-history, real size.]

Figure 6.2: Performance of the CH and MCR algorithms (T = 160, k = 100, N = 301 681) when documents are selected at random.

query; however, as previously noted by Agichtein et al. [2003], this would require thousands of queries, and is therefore impractical.

6.4.1 Data collections

We develop our approach and evaluate competing techniques using distinct training and test sets. Each set comprises different collections of varying sizes as detailed in Table 6.2. The first three training collections each represent a subset of the TREC WT10G dataset [Bailey et al., 2003]. GOV-7 is a two-gigabyte collection extracted from the TREC crawl of the gov domain. DATELINE 509 is a subset of TREC newswire data created from Associated Press articles [D’Souza et al., 2004a]. Other training collections are different subsets of WT10G.

In the test set, GOV-123456 is a 12 GB subset of the TREC GOV collection, and LATimes consists of news articles extracted from TREC Disk 5. GOV-4 was also extracted from the TREC GOV dataset. The rest are different subsets of the TREC WT10G collection. The largest collections contain more than eight hundred thousand documents. Considering that the largest crawled server in the TREC GOV2 dataset has fewer than 720 000 documents, this upper limit seems reasonable for our experiments.

[Figure 6.3 plot. x-axis: number of samples, each of 100 documents (10 to 410). y-axis: estimated size of collection (0 to 120 000). Series: Cosine (pivot = 0.3), Cosine (pivot = 0.5), Okapi (k1 = 0.3).]

Figure 6.3: Effect of search engine type on estimates of collection size using the CH method. The actual size is 301 681 documents.

6.4.2 Compensating for selection bias

We propose that the capture methods be modified to compensate for the biases discussed earlier. To calculate the amount of bias, we compare the estimated values and actual collection sizes using the training set. We create random samples by passing 5 000 single-term queries to each collection and collecting the top n answers for each query. The choice of 5 000 is dictated by the daily limit of the Yahoo! search engine developer kit (http://developer.yahoo.net) and seems a practical number of queries for sampling common collections. We selected 10 as a suitable value for n. The query terms themselves should be chosen with care. Since it is hard to choose specific queries for each collection, query terms should be general enough to return answers for almost all types of collection. The queries should ideally be independent of each other. In the interest of efficiency, we should also avoid query terms that are unlikely to return any answers.
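The query-based sampling step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the `search` callable, the function names, and the toy index are all hypothetical stand-ins for a real search interface, and how the per-query result lists are grouped into capture samples is left open.

```python
from typing import Callable, Dict, Iterable, List


def sample_by_queries(
    search: Callable[[str, int], List[str]],
    terms: Iterable[str],
    top_n: int = 10,
) -> Dict[str, List[str]]:
    """Send each term as a one-word query and keep its top_n document ids.

    `search` is a hypothetical callable (term, n) -> ranked list of doc ids.
    Terms that return no answers are skipped, since they add no information.
    """
    results: Dict[str, List[str]] = {}
    for term in terms:
        hits = list(search(term, top_n))[:top_n]
        if hits:
            results[term] = hits
    return results


def make_toy_search(index: Dict[str, List[str]]) -> Callable[[str, int], List[str]]:
    """Build a stand-in search function over an in-memory ranked index."""
    def search(term: str, n: int) -> List[str]:
        return index.get(term, [])[:n]
    return search


# Toy data, for illustration only.
toy_index = {"water": ["d1", "d2", "d3"], "report": ["d2", "d4"]}
search = make_toy_search(toy_index)
samples = sample_by_queries(search, ["water", "report", "zyzzyva"], top_n=2)
```

In the setting described in the text, `terms` would hold 5 000 single-term queries and `top_n` would be 10. The third toy query returns nothing and is therefore dropped from `samples`.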

Table 6.2: Properties of data collections used for training and testing.

Training collection   Size (# documents)   Test collection   Size (# documents)
WT10G-456             817 025              GOV-123456        807 774
WT10G-4               304 035              WT10G-12          589 094
WT10G-6               218 489              WT10G-1           301 681
GOV-7                 133 834              WT10G-3           290 175
WT10G5-125k           127 375              LATimes           138 896
WT10G5-75k             75 227              GOV-4             136 176
DATELINE 509           30 507              WT10G1-55k         55 658

We experimented with selecting query terms at random from the Excite search engine query logs [Jansen et al., 2000], and from an index of 290 175 web pages extracted from TREC WT10G. We found that using the query log led to poor results; we conjecture that this is due to the limited breadth of popular query topics. Therefore, terms selected at random from query logs are less likely to be independent. In preliminary experiments, we found that index terms with a low document frequency failed to match any documents in some collections. To avoid this, we eliminate terms occurring in fewer than 20 documents.

In all experiments, our initial results suggested that the CH and MCR algorithms underestimate the actual collection sizes at roughly predictable rates, due to the biases discussed earlier. To achieve a better estimation, we approximated the correlation between the estimated and actual collection size by using regression techniques on the training collections of Table 6.2; we refer to these as MCR-Reg and CH-Reg respectively:

$$\log(\hat{|C|}_{MCR}) = 0.5911 \times \log(|C|) + 1.5767, \qquad R^2 = 0.8226 \tag{6.7}$$

$$\log(\hat{|C|}_{CH}) = 0.6429 \times \log(|C|) + 1.4208, \qquad R^2 = 0.9428 \tag{6.8}$$

where $\hat{|C|}$ is the estimated size obtained by the methods, and $|C|$ is the actual collection size. The $R^2$ values indicate how well the regression fits the data points (actual and estimated collection sizes). To achieve accurate regression equations, the training collections may have to be reasonably similar to the target collections.² We used 25 single-word resample queries for SRS to estimate the size of collections:

$$\hat{|C|}_{SRS} = |S| \, \frac{\sum_i df_{i,c}}{\sum_i df_{i,S}} \tag{6.9}$$

where $|S|$ is the sample size, and $df_{i,c}$ and $df_{i,S}$ are the document frequencies of the query term $i$ in the collection and the sample respectively.

² Recently, Bar-Yossef and Gurevich [2006], and Thomas and Hawking [2007] proposed methods for obtaining semi-random samples from collections. In such cases, compensating for bias is not necessary.

[Figure 6.4 plot. x-axis: collections GOV-123456, WT10g-12, WT10g-1, WT10g-3, LA-Times, GOV-4, WT10g1-55k. y-axis: collection size in documents, logarithmic scale (10 000 to 10 000 000). Series: real size, CH-Reg, MCR-Reg, SRS.]

Figure 6.4: Performance of size estimation algorithms.
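The estimators in Equations 6.7 to 6.9 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the logarithm is assumed to be base 10 (the text does not state the base, and the result depends on it), the bias correction is assumed to work by solving the fitted line for $|C|$, and the sample-resample form is the ratio of summed document frequencies as read from Equation 6.9.

```python
import math
from typing import Sequence, Tuple

# (slope, intercept) pairs taken from Equations 6.7 and 6.8.
MCR_REG: Tuple[float, float] = (0.5911, 1.5767)
CH_REG: Tuple[float, float] = (0.6429, 1.4208)


def invert_regression(
    raw_estimate: float,
    slope: float,
    intercept: float,
    base: float = 10.0,  # assumption: the source does not state the log base
) -> float:
    """Solve log(raw) = slope * log(|C|) + intercept for |C|."""
    if raw_estimate <= 0:
        raise ValueError("raw_estimate must be positive")
    log_raw = math.log(raw_estimate, base)
    return base ** ((log_raw - intercept) / slope)


def srs_estimate(
    sample_size: int,
    df_collection: Sequence[int],
    df_sample: Sequence[int],
) -> float:
    """Sample-resample estimate: |S| * sum(df_i,c) / sum(df_i,S)."""
    total_in_sample = sum(df_sample)
    if total_in_sample == 0:
        raise ValueError("resample terms never occur in the sample")
    return sample_size * sum(df_collection) / total_in_sample


# Toy numbers, for illustration only.
corrected_ch = invert_regression(50_000.0, *CH_REG)
srs_size = srs_estimate(300, [1000, 2000], [10, 20])
```

Because both fitted slopes are below 1, solving for $|C|$ scales a raw underestimate upward, which matches the direction of the bias described in the text.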
