

Germane document identification efficiently finds germane documents under the bag-of-words assumption, while answer entity extraction extracts the answer entities from the germane documents by considering the semantic relations between words. The TREPM model delineates the whole entity retrieval problem. Chapter 3 demonstrates theoretically that entity retrieval can be interpreted as the TREPM model, which uses a probability model to decompose the problem into document retrieval and entity extraction. That is, $p(e \mid q, t) = \sum_{d} p(d \mid q, t)\, p(e \mid d, q, t)$. The TREPM model provides a method to retrieve information at a finer granularity but with a low system workload.
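The decomposition above can be illustrated with a minimal sketch that marginalizes over documents. The function name `entity_scores`, the dictionary layout, and the toy probabilities are assumptions made for illustration only; they are not the implementation or the estimates used in this thesis.

```python
from collections import defaultdict


def entity_scores(doc_probs, entity_probs_by_doc):
    """Compute p(e|q,t) = sum over d of p(d|q,t) * p(e|d,q,t).

    doc_probs:            {doc_id: p(d|q,t)}           (hypothetical input layout)
    entity_probs_by_doc:  {doc_id: {entity: p(e|d,q,t)}}
    """
    scores = defaultdict(float)
    for doc_id, p_doc in doc_probs.items():
        for entity, p_entity in entity_probs_by_doc.get(doc_id, {}).items():
            # Each document contributes its retrieval probability times
            # the probability of extracting the entity from it.
            scores[entity] += p_doc * p_entity
    return dict(scores)


# Toy values, chosen only to make the arithmetic easy to follow.
doc_probs = {"d1": 0.5, "d2": 0.5}
entity_probs_by_doc = {
    "d1": {"A": 0.5, "B": 0.5},
    "d2": {"A": 1.0},
}

scores = entity_scores(doc_probs, entity_probs_by_doc)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting, entity A is supported by both documents and therefore outranks entity B, which appears in only one.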

This decomposition helps to break the black box of entity retrieval into document retrieval and entity extraction. It not only allows each layer to be evaluated individually, which helps to improve the overall system performance, but also brings state-of-the-art techniques from document retrieval and entity extraction into the entity retrieval task.

answers, this study focuses on the entity retrieval task itself and treats the entities themselves, rather than the entities’ URLs/URIs, as answers. This model summarizes the general problems of entity retrieval and can also be applied to the TREC and INEX tasks.

7.2 GERMANE DOCUMENT IDENTIFICATION

Germane document identification (Chapter 5) discusses how to effectively locate the highly relevant germane documents, which contain as many answer entities as possible.

Several methods are investigated for germane document identification. First, we study how to generate proper queries for collecting germane documents and how to set the threshold for choosing germane documents. Both the narratives and the topic entities can serve as the source of queries. The experiment indicates that in most cases the narratives are the better source. However, when the relation between the topic entity and the target entity hinges on subtle wording in the narrative (e.g., “organizations awarded Nobel Prizes” vs “organizations that award Nobel Prizes”), the topic entity is the better source of queries.

Second, the entity type language model is investigated to evaluate whether the similarity between entity types and document categories can improve germane document identification. Documents with associated categories are widespread on the Web. Knowledge base entries assigned to categories, or social network posts with tags, can be viewed as documents of this type. The experiment indicates that entity types and document categories are helpful for germane document identification. The entity type language model can significantly improve entity search results on documents that carry categories.

Last, we investigate the “learning to rank” method for germane document identification. The learning to rank approach treats germane document identification as a binary classification problem. Twenty-eight features, generated from the queries, the hits, and linguistic features, are used for the classification. The evaluation indicates that the learning to rank method can achieve high accuracy on germane document identification. The analysis of the germane document annotations shows that, for most topics, a single germane document includes all answers to the topic. There are still some topics, however, whose answers are scattered across several documents. Wikipedia is an important source for the answer sets: for about half of the topics, the germane documents come from Wikipedia.

The current evaluation of germane documents is based on comparison with the ground-truth germane document sets, so precision and recall are also computed over germane documents. However, this does not accurately estimate how well the germane documents cover the answer entities. If a single germane document contains the entire answer entity set for a topic, all answers can still be extracted even though the recall measured over germane documents may be very low. In future work, therefore, the number of answer entities a document contains should be considered as a weight when evaluating germane documents.
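The gap between document-level recall and answer coverage can be made concrete with a small sketch. The two functions and the toy topic below are hypothetical and only illustrate the kind of entity-based measure suggested above; they are not a measure defined or evaluated in this thesis.

```python
def document_recall(retrieved_docs, germane_docs):
    """Share of ground-truth germane documents that were retrieved."""
    germane = set(germane_docs)
    if not germane:
        return 0.0
    return len(germane & set(retrieved_docs)) / len(germane)


def entity_coverage(retrieved_docs, doc_entities, answer_entities):
    """Share of answer entities found in at least one retrieved document."""
    answers = set(answer_entities)
    if not answers:
        return 0.0
    covered = set()
    for doc_id in retrieved_docs:
        covered |= answers & set(doc_entities.get(doc_id, ()))
    return len(covered) / len(answers)


# Toy topic: d1 lists every answer, d2..d4 each mention a single answer.
doc_entities = {
    "d1": {"A", "B", "C", "D"},
    "d2": {"A"},
    "d3": {"B"},
    "d4": {"C"},
}
germane_docs = ["d1", "d2", "d3", "d4"]
answer_entities = {"A", "B", "C", "D"}
retrieved_docs = ["d1"]

doc_recall = document_recall(retrieved_docs, germane_docs)
coverage = entity_coverage(retrieved_docs, doc_entities, answer_entities)
```

Here retrieving only `d1` gives a document recall of one quarter, yet every answer entity is reachable, which is the situation the paragraph above describes.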

7.3 ANSWER ENTITY EXTRACTION

Answer entity extraction (Chapter 6) discusses different approaches to answer entity detection in the entity extraction task. Entities in the germane documents can appear in various contexts, which can be interpreted in multiple ways. From the physical view, the context may be an HTML page, plain text, a PDF file, or an image file. In this study, we focus only on plain text and HTML pages. From the logical view, the answer entities occur either in tables/lists or in sentences. Therefore, in this thesis, I focus on answer entity extraction from these two resources.

Most current work on entity retrieval relies on NER tools to extract entities of the target types. This answer entity extraction method does not consider the contexts and treats the extraction as query-independent. In our study, we find that the precision of this method is low. Because the corpus consists of noisy web pages while the NER tools are trained on grammatical corpora, NER cannot correctly identify the entities in this corpus, and the results of this method are not promising.

The second method uses a knowledge base (Wikipedia) for entity extraction, aiming to extract the answer entities from ungrammatical documents with the aid of the knowledge base. An algorithm for Wikipedia Infobox extraction is proposed, and the use of Wikipedia entry category information for entity type filtering is discussed. Although Wikipedia Infobox extraction achieves high accuracy, its recall is rather low. Entity type filtering using Wikipedia information is limited by the knowledge available in Wikipedia. Analysis of the contexts of answer entities in the germane documents shows that most entities come from tables and lists, which call for efficient detection methods.
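Since most answer entities sit in tables and lists, a first step toward detecting them is to collect the text of table cells and list items from an HTML page. The sketch below does this with Python's standard `html.parser` module. The class `CellExtractor`, the function `extract_candidates`, and the sample page are hypothetical; the extraction system used in this thesis is not reproduced here, and a real page would need handling for nested tables, header rows, and layout tables.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell and <li> item as a candidate."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # holds text pieces while inside a cell or item

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "li"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "li") and self._buffer is not None:
            # Join the pieces and normalise runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(text)
            self._buffer = None


def extract_candidates(html):
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment, for illustration only.
sample_page = (
    "<table><tr><th>Laureate</th><th>Year</th></tr>"
    "<tr><td><a href='#'>UNICEF</a></td><td>1965</td></tr></table>"
    "<ul><li>Amnesty   International</li></ul>"
)
candidates = extract_candidates(sample_page)
```

The candidates still mix entities with other cell values such as years, so a type filter or a column selection step would have to follow.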

The answer entities are scattered across several HTML pages with symbolic contexts. Therefore, answer entity extraction with wrappers is introduced to extract the entities from tables or lists. This wrapper method works for some topics but fails for others. One reason for the failure is that some answers are embedded in pictures, which text mining cannot extract. Another is that complicated table/list structures and representations cannot be handled well by the current system. A semi-supervised learning method, bootstrapping, is also applied to entity extraction. The idea of bootstrapping is that, by identifying reliable patterns from good seeds, the model can extract more result entities with these patterns. Although the precision of bootstrapping is high, the recall is still low because the method is limited by the quality of the seeds and patterns. For topics whose answers occur only once on the Web, it is difficult to find good-quality seeds and patterns.
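The bootstrapping idea described above can be sketched as a loop that learns context patterns from the known entities, keeps the patterns supported by enough distinct entities, and applies them to find new entities. Everything in this sketch is an assumption made for illustration: the pattern form (two words to the left, one to the right), the `min_support` threshold, the function names, and the toy sentences. It is not the bootstrapping system evaluated in this thesis.

```python
import re
from collections import defaultdict


def context_pattern(sentence, entity, width=2):
    """Build a regex from the words around `entity`, with the entity as a slot."""
    left, found, right = sentence.partition(entity)
    left_words = left.split()[-width:]
    right_words = right.split()[:1]
    if not found or not left_words or not right_words:
        return None  # no usable context on one side
    return (
        r"(?<!\w)" + r"\s+".join(map(re.escape, left_words))
        + r"\s+(.+?)\s+"
        + re.escape(right_words[0]) + r"(?!\w)"
    )


def bootstrap(sentences, seeds, min_support=2, max_rounds=5):
    entities = set(seeds)
    for _ in range(max_rounds):
        # 1. Learn patterns and record which known entities support each one.
        support = defaultdict(set)
        for sentence in sentences:
            for entity in entities:
                pattern = context_pattern(sentence, entity)
                if pattern:
                    support[pattern].add(entity)
        # 2. Keep only patterns backed by enough distinct entities.
        reliable = [p for p, ents in support.items() if len(ents) >= min_support]
        # 3. Apply the reliable patterns to extract candidate entities.
        found = set()
        for pattern in reliable:
            for sentence in sentences:
                found.update(m.group(1) for m in re.finditer(pattern, sentence))
        if found <= entities:
            break  # nothing new was extracted
        entities |= found
    return entities


# Toy corpus, for illustration only.
SENTENCES = [
    "The prize was awarded to UNICEF in 1965.",
    "The prize was awarded to Amnesty International in 1977.",
    "The prize was awarded to Doctors Without Borders in 1999.",
    "UNICEF is based in New York.",
    "Amnesty International published a report in 2020.",
]
result = bootstrap(SENTENCES, {"UNICEF", "Amnesty International"})
```

With only one seed, no pattern would reach the support threshold and nothing new would be extracted, which mirrors the dependence on seed quality noted above.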

Building on the extraction methods above, the last method treats answer entity extraction as a learning problem in which the outputs of those methods are used as features for entity extraction. The results show that the learning-based method performs significantly better than each of the above methods individually.
