EML techniques are commonly used to improve the effectiveness or accuracy of various IR problem areas, such as the document indexing problem (Cordon et al., 2003; Liu, 2009). The objective functions used in Evolutionary Computation (EC) and Machine Learning (ML) techniques usually rely on relevance judgements to determine the quality of the evolved candidate solutions. The following sections outline previous research on the document indexing problem (Zobel and Moffat, 2006) using EC techniques. The document indexing problem refers to the process of assigning a weight to each term in every document in the collection. Approaches to this problem can be divided into: 1) evolving term-weighting schemes (TWS), and 2) evolving term weights.
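As a point of reference for the indexing problem described above, the following is a minimal sketch of a conventional TF-IDF weighting, the baseline that several of the studies below compare against. The function name `tf_idf_weights`, the unsmoothed logarithmic IDF variant and the toy documents are assumptions made for illustration; they are not taken from the cited studies.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Assign a TF-IDF weight to every term of every document.

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical two-document collection.
docs = [["genetic", "programming", "term"], ["term", "weight", "term"]]
index = tf_idf_weights(docs)
```

In this variant a term that occurs in every document receives a weight of zero, which shows how a fixed mathematical TWS encodes assumptions about the collection.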

Evolving Term-Weighting Schemes

In this category, researchers have tried to evolve the best TWS for improving IR effectiveness using Genetic Programming (GP). However, these TWS can be considered collection-based functions, because each test collection has different characteristics. Furthermore, all the test collections are partially judged, to simulate real test collections at the beginning of IR systems. As a result, most of the index terms in a test collection do not occur in the training queries or their relevant documents (see section 3.1.3).
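To make the idea of an evolved TWS concrete, the sketch below evaluates a weighting function represented as a GP expression tree. The terminal names (`tf`, `df`, `N`), the function set and the protected operators are assumptions chosen for illustration; the terminal and function sets actually used in the cited studies are not reproduced here.

```python
import math

# Hypothetical function set; division and logarithm are "protected"
# so that any randomly generated tree can be evaluated without errors.
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b if b != 0 else 1.0,
    "log": lambda a: math.log(a) if a > 0 else 0.0,
    "sqrt": lambda a: math.sqrt(abs(a)),
}

def evaluate(tree, terminals):
    """Recursively evaluate a nested-tuple expression tree."""
    if isinstance(tree, str):           # terminal such as "tf"
        return terminals[tree]
    if isinstance(tree, (int, float)):  # numeric constant
        return float(tree)
    op, *args = tree
    return FUNCTIONS[op](*(evaluate(a, terminals) for a in args))

# TF-IDF written as one individual in the GP search space.
tf_idf_tree = ("mul", "tf", ("log", ("div", "N", "df")))
weight = evaluate(tf_idf_tree, {"tf": 3, "df": 10, "N": 1000})
```

GP searches over trees of this kind, so a hand-designed scheme such as TF-IDF is simply one candidate among those that can be expressed with the chosen terminals and functions.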

Furthermore, a TWS is needed to collect the relevance judgement values before EC techniques can be applied, and it is impractical to evolve a TWS without relevance judgement values. Moreover, an evolved TWS is not guaranteed to give better IR effectiveness than a mathematical TWS on test collections other than those used in the evolving procedure. The research on evolving TWS in (Fan et al., 2000; Oren, 2002; Cummins and O’Riordan, 2006) did not consider these issues.

The first approach for evolving a weighting function using GP was developed by (Fan et al., 2000) using two test collections. One was the Cranfield collection, containing 1,400 documents and 225 queries. The other was the Federal Register (FR) text collection from TREC 4, which contains a large number of documents (55,554) relative to its number of queries (50). Fan et al. argued that few documents were relevant to these queries, so they chose a training set (2,200 documents) larger than the number of relevant documents. Their fitness function was precision, computed from the collection's relevance judgements with a threshold. The TWS evolved by their GP approach was then tested with the same training queries on the whole test collection. Their results outperformed TF-IDF. However, no results for the Cranfield collection were reported for this approach (Fan et al., 2000). This technique is a population-based EML method, which requires a large amount of memory and consequently a long evolving time. These limitations do not arise in (1+1)-Evolutionary Algorithms.
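A precision-based fitness of the kind mentioned above can be sketched as follows. The source does not state how the threshold was applied, so this sketch assumes a cut-off on the similarity score; the function name `precision_fitness` and the example scores are hypothetical.

```python
def precision_fitness(scores, relevant, threshold):
    """Precision of the documents whose score reaches the threshold.

    scores:   {doc_id: similarity score under a candidate TWS}
    relevant: set of doc_ids judged relevant for the query
    """
    # Assumption: a document counts as retrieved when its score is at
    # least the threshold (a rank cut-off would be an alternative).
    retrieved = {d for d, s in scores.items() if s >= threshold}
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

# Hypothetical scores for one query; d1 and d4 are judged relevant.
scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7, "d4": 0.1}
fitness = precision_fitness(scores, relevant={"d1", "d4"}, threshold=0.5)
```

Because the fitness depends entirely on the set of judged documents, any document without a judgement is treated as non-relevant, which is where partial judgement affects the evolved scheme.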

(Oren, 2002) proposed employing GP to evolve the term-weighting function using a terminal set similar to the one used by Fan et al. discussed above, but with an additional function operator (square root). Oren used the Cystic Fibrosis database (Shaw et al., 1991), which consists of 1,239 documents and 100 queries, and compared his approach to the TF-IDF term-weighting scheme. His method outperformed TF-IDF with regard to recall-precision values. In that experiment, a cluster of computers was used due to the problem size. Thus, the computational cost of Oren’s approach was very high, even for the small collection used.

(Cummins and O’Riordan, 2006) evolved global term-weighting schemes from small test collections. They showed that their global weighting function evolved on small collections also increased the average precision on larger test collections. However, their local weighting function evolved on small collections did not perform well on large collections. They conducted experiments on five test collections: Medline, Cranfield, CISI, NPL and Ohsumed. The computational runtime required by their approach on the smallest training set, from the Medline collection, was significant: 6 hours on a standard PC. Thus, the main limitations of their approach are: 1) long computational time and large problem size on medium and large test collections, 2) the issue of test collections being partially judged, and 3) evolved local, and possibly global, TWS cannot be generalised from one test collection to another, which leads to poor performance on collections other than the training set. Cummins and O’Riordan found that the full term-weighting scheme evolved on small test collections did not outperform Okapi-BM25 on large test collections (the Ohsumed88, Ohsumed89, Ohsumed90-91 and NPL collections).

Evolving Term Weights

Genetic Algorithms (GA) have been used for evolving term weights to produce better document representations for whole test collections. These approaches are also based on relevance judgements. The same drawbacks noted previously arise: the reliance on partial relevance judgements for the collection and the need to run the GA again after the collection changes (for example, when more documents are added). This is because documents added to the collection require a document-weight representation that must be assigned by the GA rather than by a traditional TWS.

(Gordon, 1988) proposed the first approach to applying a GA to IR, adapting the term weights for every document in the corpus. He demonstrated the value of using a GA for adapting term weights instead of using probabilistic models. He also highlighted some issues with probabilistic models, such as dependencies among index terms, dependency on the estimation of probabilities, relevance judgements based on a small set of queries, and the high computational cost of automated probabilistic models. In this research, the GA used a crossover probability of 1 with no mutation, and relevance feedback adaptation as the fitness function. Findings showed that the GA improved document representation to distinguish between relevant and non-relevant queries. The problem size was very large, exceeding the document space, because it consisted of multiple representations for each document.
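The GA configuration described above (crossover applied to every selected pair, no mutation) can be illustrated with the sketch below. One-point crossover, fitness-proportional selection and the toy fitness function are assumptions made for this illustration; they stand in for, and should not be read as, Gordon's actual operators or his relevance-feedback fitness.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length weight vectors at a random cut."""
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def next_generation(population, fitness, rng):
    """One GA generation with crossover probability 1 and no mutation."""
    scores = [fitness(ind) for ind in population]
    children = []
    while len(children) < len(population):
        # Fitness-proportional selection of two parents (an assumption).
        mother, father = rng.choices(population, weights=scores, k=2)
        children.extend(one_point_crossover(mother, father, rng))
    return children[:len(population)]

# Hypothetical setup: several competing weight vectors for one document.
rng = random.Random(0)
target = [1.0, 0.0, 0.5, 0.2]  # stand-in for an ideal description
population = [[rng.random() for _ in target] for _ in range(6)]

def toy_fitness(weights):
    # Always positive, higher when the vector is closer to the target.
    return 1.0 / (1.0 + sum((w - t) ** 2 for w, t in zip(weights, target)))

new_population = next_generation(population, toy_fitness, rng)
```

Without mutation, every weight in the new population is copied from some parent at the same position, so the search can only recombine values already present. Keeping several representations per document also multiplies the memory required, which reflects the problem-size concern raised above.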

(Vrajitoru, 1998) also applied a GA to adapt term weights. The approach used a new dissociated crossover and tested different ways to generate the initial document descriptions. Vrajitoru conducted experiments using two test collections (CISI and CACM), which were both larger than the collection chosen by Gordon (Gordon, 1988). However, this approach had the same limitation related to relevance judgements, due to the nature of the test collections.

The limitations identified above for evolving TWS and evolving document-term weights motivated a new perspective: evolving global term weights rather than evolving TWS or document-weight representations. Chapter 6 presents a new method for optimising document-weight representations by evolving global term weights using a (1+1)-Evolutionary Gradient Strategy. This technique has a smaller problem size than the techniques above, and it takes into account the limited relevance judgement values available at the beginning of IR systems.
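To illustrate why a (1+1) scheme keeps the problem size small, the following is a minimal sketch of a generic (1+1) evolution strategy over a single vector of global term weights. It is not the (1+1)-Evolutionary Gradient Strategy of Chapter 6, whose details are not given in this section; the fixed step size `sigma`, the toy objective and all names are assumptions for illustration.

```python
import random

def one_plus_one_es(objective, start, steps, sigma, rng):
    """Generic (1+1)-ES: one parent, one Gaussian-perturbed child per step."""
    parent = list(start)
    parent_score = objective(parent)
    for _ in range(steps):
        child = [w + rng.gauss(0.0, sigma) for w in parent]
        child_score = objective(child)
        # Keep the child only if it is at least as good as the parent.
        if child_score >= parent_score:
            parent, parent_score = child, child_score
    return parent, parent_score

# Hypothetical objective: closeness to an arbitrary target weight vector.
target = [0.8, 0.1, 0.4]

def objective(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

rng = random.Random(1)
start = [0.0, 0.0, 0.0]
best, best_score = one_plus_one_es(objective, start, 200, 0.1, rng)
```

Only two weight vectors are held in memory at any time, in contrast to the full populations maintained by the GP and GA approaches reviewed above.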