Since even randomly generated data can be clustered, it is important to determine whether the clusters produced when a given clustering method is applied to a given collection are meaningful. It is even more important to determine whether the clusters produced contribute to effective information retrieval. In other words, are the clusters produced likely to satisfy the cluster hypothesis? If a query or browsing method locates and retrieves a cluster of appropriate size, is it likely that many or most of the documents in that cluster will be relevant to the query, or of interest to the browsing user? If the user relaxes the cluster threshold, retrieving documents that were close to the boundary of the original cluster, are these new documents likely to be at least partially relevant to the user’s need?

Several approaches to cluster evaluation with specific applicability to document retrieval have been tried. These approaches try to determine whether a given collection is a good candidate for clustering, i.e., whether clustering will promote retrieval effectiveness. One approach, due to van Rijsbergen and his associates [van Rijsbergen et al., 1973], is to compare the average interdocument similarity among relevant documents to the average similarity among relevant-nonrelevant document pairs. These averages can be computed for a given query or over a set of queries. If the cluster hypothesis holds, the average similarity among relevant documents should be substantially larger than the average over relevant-nonrelevant pairs. A second approach, due to Voorhees, is to determine, for each document relevant to a given query, how many of its nearest neighbors are also relevant to the query. In her experiments, Voorhees [TR 85-658] considered the five nearest neighbors of each relevant document. Both of these methods require that a query or set of queries be applied to the collection and that relevance judgments be made for the documents retrieved by these queries. The assumption is that the results for the given queries characterize the collection, in the sense that other queries applied to the collection will give similar results.

A third approach, due to El-Hamdouchi and Willett [JIS, 1987], depends entirely on properties of the collection itself, or more precisely on the terms that index the documents in the collection. They calculate a term density, defined as the number of occurrences of all index terms in the collection (the number of postings) divided by the product of the number of documents in the collection and the number of unique index terms. This density is a measure of how densely populated the term-document matrix is. The theory is that the greater the term density, the more frequently documents will share terms, and hence the better a clustering can represent degrees of similarity between documents. In a reported comparison of these methods, the term density measure correlated best with the effectiveness of cluster searching. [Willett, IP&M, 1988]
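The three measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementations. It assumes each document is a dictionary mapping index terms to weights, and it uses cosine similarity as the interdocument similarity, a choice the text does not specify. The function names and the toy collection are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def separation_test(docs, relevant):
    """Average relevant-relevant similarity and average relevant-nonrelevant similarity."""
    rel = [d for d in docs if d in relevant]
    non = [d for d in docs if d not in relevant]
    rel_rel = mean(cosine(docs[a], docs[b]) for a, b in combinations(rel, 2))
    rel_non = mean(cosine(docs[a], docs[b]) for a in rel for b in non)
    return rel_rel, rel_non


def nearest_neighbor_test(docs, relevant, k=5):
    """For each relevant document, count how many of its k nearest neighbors are relevant."""
    counts = {}
    for d in docs:
        if d not in relevant:
            continue
        neighbors = sorted((o for o in docs if o != d),
                           key=lambda o: cosine(docs[d], docs[o]),
                           reverse=True)
        counts[d] = sum(1 for o in neighbors[:k] if o in relevant)
    return counts


def term_density(docs):
    """Postings divided by (number of documents x number of unique index terms)."""
    vocabulary = {t for terms in docs.values() for t in terms}
    if not docs or not vocabulary:
        return 0.0
    postings = sum(len(terms) for terms in docs.values())
    return postings / (len(docs) * len(vocabulary))


# Toy collection (hypothetical data): d1 and d2 are judged relevant to some query.
docs = {
    "d1": {"cluster": 1, "retrieval": 1},
    "d2": {"cluster": 1, "retrieval": 1, "query": 1},
    "d3": {"energy": 1, "switch": 1},
    "d4": {"cluster": 1, "energy": 1},
}
relevant = {"d1", "d2"}

print(separation_test(docs, relevant))             # (relevant-relevant, relevant-nonrelevant)
print(nearest_neighbor_test(docs, relevant, k=1))
print(term_density(docs))                          # 9 postings / (4 docs * 5 terms) = 0.45
```

The first two functions need relevance judgments for at least one query, whereas `term_density` needs only the indexing of the collection, which is the practical distinction the text draws between the approaches.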

As the size of distributed collections and the corresponding size of retrieval sets grow, the application of clustering to interactive browsing also grows in importance. A number of the clustering methods described above are specifically aimed at this application domain. Hence, some evaluations of clustering effectiveness have also been aimed at this application. Browsing experiments have been conducted to evaluate the effectiveness of clustering for this purpose. These experiments are discussed further in the section below on User Interaction. However, one test of retrieval effectiveness [Zamir et al., SIGIR ’98] that simulates browsing will be discussed here. This test compared the STC clustering method (described in the preceding section) against several heuristic clustering methods (described in the section on Heuristic Methods) and O(N²) methods (discussed in the section on Complete Methods). Specifically, it compared STC against four linear-time heuristic methods (Single-Pass, K-means (this is the Rocchio method), Buckshot, and Fractionation) and one O(N²) method, Group-Average Hierarchical Clustering (GAHC).

The strategy adopted by Zamir et al. is based on results reported by researchers who conducted actual browsing experiments. These experiments indicate that a user is usually (about 80% of the time) able to select the cluster containing the highest proportion of documents relevant to her need, on the basis of the cluster labels or summaries provided to her. Hence, Zamir et al. generated 10 queries, retrieved documents from the Web for each of those queries, and then manually generated relevance judgments for each of the 10 retrieval sets, relative to the query for which it was retrieved. Then, they clustered each of the retrieval sets using each of the cluster methods, setting parameters as appropriate so that 10 clusters were generated for each retrieval set/cluster method pair.

Next, for each retrieval set and cluster method, they automatically selected the “best” cluster, i.e., the cluster containing the highest proportion of relevant documents, then the next best, and so on, until they had selected clusters containing 10% of the documents in the “collection,” i.e., in the given retrieval set. This was based on the assumption, noted above and borne out to some extent in practice, that users can select the best clusters on the basis of their labels or summaries. In all cases, the cutoff was 10% of the documents in the given set; this meant that the cutoff might occur in the middle of a cluster, even in the middle of the first cluster, if that cluster was large for a given cluster method. The resulting 10% of the documents were then ranked, and the average precision was computed and averaged over all 10 collections. (Since STC supports document overlap, a given document might appear in two or more selected clusters. For purposes of ranking, such duplicates were discarded.) Equalizing the number of clusters generated, and the number of documents ranked, across methods and collections allowed for a fair comparison of cluster methods.

Note that Zamir et al. ranked documents by cluster, i.e., the documents in the best cluster were ranked higher than the documents in the next best cluster, and so on. Zamir et al. do not specify how they ranked documents within a given cluster. However, Hearst et al. rank documents within a cluster using two different criteria: closeness to the cluster centroid, and similarity to the original query.
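The selection and ranking procedure just described can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used by Zamir et al. Documents within a cluster are kept in the order in which the cluster lists them, since the source does not say how they were ordered. Average precision is normalized by the number of relevant documents that appear in the truncated ranking, which is also an assumption. The function and parameter names are hypothetical.

```python
def select_and_rank(clusters, relevant, n_docs, percent=10):
    """Take clusters best-first (highest proportion of relevant documents),
    concatenating their documents until `percent` of the retrieval set is reached.
    Duplicates arising from overlapping clusters are discarded."""
    def proportion_relevant(cluster):
        if not cluster:
            return 0.0
        return sum(1 for d in cluster if d in relevant) / len(cluster)

    cutoff = max(1, n_docs * percent // 100)
    ranked, seen = [], set()
    for cluster in sorted(clusters, key=proportion_relevant, reverse=True):
        for doc in cluster:
            if doc in seen:          # document already ranked via an earlier cluster
                continue
            seen.add(doc)
            ranked.append(doc)
            if len(ranked) == cutoff:  # the cutoff may fall in the middle of a cluster
                return ranked
    return ranked


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs
    (assumed normalization: relevant documents found in the ranking)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical example: three overlapping clusters drawn from a 40-document retrieval set.
clusters = [["c", "a", "b"], ["b", "d"], ["e", "f", "g"]]
relevant = {"a", "b", "d"}
ranking = select_and_rank(clusters, relevant, n_docs=40)
print(ranking, average_precision(ranking, relevant))
```

In the full experiment this computation would be repeated for each retrieval set and each cluster method, and the resulting average precision values averaged over the 10 retrieval sets.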

The results reported by Zamir et al. show that STC out-performed the other methods, even the GAHC method, by a significant margin. GAHC in turn out-performed the remaining methods, which is not surprising considering that it is an O(N²) method (and consequently far too slow for interactive use, and even for off-line use with large collections). It is striking that STC, an O(N) and incremental method, out-performed GAHC, which is neither! Zamir et al. concede that their results are preliminary; indeed, the title of their paper refers to the reported study as a “feasibility demonstration.” The results are preliminary for (at least) four reasons. First, the results were obtained from non-standard, e.g., non-TREC, collections, retrieved from the Web by 10 arbitrary queries. (This was deliberate, since Web retrieval is the intended application of STC.) Second, the resulting collections were (relatively) small, e.g., 200 documents each. (On average, there were about forty relevant documents for each query.) Third, the relevance judgments were generated by the researchers rather than by independent judges, as in TREC. Fourth, the study did not use actual human users performing actual interactive browsing. However, their system, MetaCrawler STC, has been fielded on the Web, so that statistics can be gathered from actual users. The data set employed is also being published on the Web, so that other researchers can replicate and validate these experiments.

11 Query Expansion and Refinement

A query or information need submitted by an end-user is ordinarily a short statement or an even shorter list of terms. This is only to be expected. The normal user is not necessarily an expert on all the term usages in a large collection of documents he wishes to query. Nor does he want to spend his time consulting thesauri and other reference works in an attempt to generate an ideal query. A sophisticated user may in fact do some of these things. But the approach taken in both some commercial IR engines and much IR research is to refine and expand the original query automatically, based on the documents retrieved by the original query. [Salton et al., JASIS, 1990] Query refinement and expansion may involve adding terms, removing “poor” terms, and refining the weights assigned to the query terms. It is possible to recompute the weights without expanding the query, or to expand the query without recomputing the weights, but experiments indicate that both expansion and re-weighting improve retrieval performance. [Harman, SIGIR ’92] The process of query expansion and re-weighting can be applied to either vector space queries or extended boolean queries. [Salton, ATP, 1989] [Salton, IP&M, 1988] The process may be wholly automatic or may involve a combination of automatic processes and user interaction.
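To make the idea of automatic expansion and re-weighting concrete, the following is a minimal sketch for a vector space query. It is not taken from the cited papers. It assumes that the top documents retrieved by the original query are treated as relevant, and it uses a simple Rocchio-style positive-feedback update chosen here only for illustration. The parameters `alpha`, `beta`, and `n_new_terms` are hypothetical, and removal of “poor” terms is omitted.

```python
from collections import Counter


def expand_and_reweight(query, top_docs, n_new_terms=2, alpha=1.0, beta=0.5):
    """query: dict term -> weight; top_docs: list of dicts term -> weight,
    assumed relevant for the purposes of this sketch."""
    # Centroid (average term weight) of the documents used for feedback.
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)

    # Re-weighting: original terms gain weight if the feedback documents use them.
    new_query = {t: alpha * w + beta * centroid.get(t, 0.0) for t, w in query.items()}

    # Expansion: add the strongest centroid terms not already in the query.
    candidates = sorted((t for t in centroid if t not in query),
                        key=lambda t: (-centroid[t], t))
    for term in candidates[:n_new_terms]:
        new_query[term] = beta * centroid[term]
    return new_query


# Hypothetical one-term query and two retrieved documents.
query = {"cluster": 1.0}
top_docs = [
    {"cluster": 1.0, "document": 1.0, "browse": 0.5},
    {"cluster": 1.0, "document": 1.0},
]
print(expand_and_reweight(query, top_docs, n_new_terms=1))
```

Setting `n_new_terms=0` gives re-weighting without expansion, and setting `beta=0` for the original terms would give expansion without re-weighting, the two partial strategies the text mentions.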
