Chapter III. Methodological aspects of the research
3.7 Design of the learning sequence
3.7.2 THE LEARNING SEQUENCE AND THE PRELIMINARY ANALYSIS
Chapter 3. Geographical Information Access Tasks - State-of-the-art
Given a textual corpus and a user query (a question, a set of keywords, …), Passage Retrieval (PR) can be defined as the task of retrieving a set of passages from the textual corpus that are relevant to the user query. A passage is a portion of a whole document. It can have a fixed size (measured in words, bytes or sentences) or a dynamic size (a paragraph, a sentence, …). In QA the aim of Passage Retrieval is to obtain small fragments of text (with enough context) that probably contain the answer to the question.
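The fixed-size case can be illustrated with a sliding window over words. This is a minimal sketch, not the splitting strategy of any particular PR system: the function name `split_into_passages` and the parameters `size` and `overlap` are assumptions introduced here for illustration.

```python
def split_into_passages(text, size=5, overlap=2):
    """Split text into fixed-size word windows that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return passages


passages = split_into_passages("a b c d e f g h", size=4, overlap=2)
```

Overlapping windows reduce the risk that an answer is cut in half at a passage boundary, at the cost of indexing some words more than once.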
A typical PR system has two phases: an indexing phase and a searching phase. The indexing phase consists of processing the whole collection and extracting its essential information. In a following step of the same phase, this information is stored in a structure that allows the underlying data to be recovered easily by querying for some features. The core of every PR system is an Information Retrieval (IR) algorithm. IR techniques can be sub-classified into three classes depending on their mathematical model:
• Set Models. These models represent documents by sets. The Standard Boolean model is the most popular.
• Algebraic Models. These models usually represent documents and queries as vectors, matrices or tuples. The Vector Space model is the most widely used algebraic model in the IR community. In the vector space model, all the documents are mapped into an N-dimensional space in which each term represents a dimension. Each document and each query is represented as a vector in this space. Document relevance with respect to a query is computed using a distance or similarity measure between the document vector and the query vector. Term weighting is usually performed with the TF-IDF (Salton and Buckley, 1988) or Okapi BM25 (Robertson and Walker, 1994) schemes.
• Probabilistic Models. These models represent similarities as probabilities. In probabilistic models the estimated relevance of a document to a query is a function of the estimated probabilities that each of the terms in the document occurs in at least one relevant document but in no irrelevant documents. Currently, Language Models (LM) and Divergence From Randomness (DFR) models (Amati, 2003) are among the most established probabilistic models.
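As an illustration of the vector space model described in the list above, the following sketch builds sparse TF-IDF vectors and ranks documents by cosine similarity to a query. It assumes raw term frequency and a plain logarithmic IDF, which is only one of several TF-IDF variants; the toy documents and the helper names `tfidf_vectors` and `cosine` are introduced here for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vectors, idf): one sparse TF-IDF dict per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "the river crosses the city".split(),
    "the city has a large port".split(),
    "mountains surround the valley".split(),
]
vectors, idf = tfidf_vectors(docs)

# The query is weighted with the same IDF table as the documents.
query = "river city".split()
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
```

In this toy collection the word "the" occurs in every document, so its IDF is zero and it does not contribute to the ranking.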
Information Retrieval engines are the core of most text-based QA and GeoQA systems. The following list describes some of the most relevant existing IR engines.
• Lucene. The Lucene16 IR system uses the standard tf.idf weighting scheme with the cosine similarity measure, and it supports both ranked and boolean queries.
• Terrier17. Terrier, which has performed very well at TREC, includes parameter-free probabilistic retrieval approaches such as Divergence from Randomness (DFR) models (Ounis et al., 2006), the TF-IDF weighting scheme (with Robertson’s TF), other recent language modelling approaches, and the well-established Okapi BM25 probabilistic ranking formula.
• Indri (Lemur project). Indri18 (an IR component of the Lemur toolkit) is an Information Retrieval system that supports retrieval algorithms based on Language Modelling (Ogilvie and Callan, 2001).
16 Lucene. http://lucene.apache.org/java/docs/
17 Terrier. http://ir.dcs.gla.ac.uk/terrier/
18 Indri. http://www.lemurproject.org/
3.2. Geographical Question Answering - State-of-the-art
• JIRS. The JAVA Information Retrieval System (JIRS) software (Soriano et al., 2005) is used to retrieve relevant passages related to a question. JIRS19 was specially designed for Question Answering (QA). This system retrieves passages with a high similarity between the largest n-grams of the question and those in the passage. It has three modes: simple n-gram model, term weight n-gram model, and distance n-gram model.
• Sphinx. Sphinx20 is a full-text search engine that provides fast, size-efficient and relevant full-text search functions to other applications. Sphinx has two types of weighting functions: phrase rank and statistical rank. Phrase rank is based on the length of the longest common subsequence (LCS) of search words between the document body and the query phrase. Statistical rank is based on the classic BM25 function, which only takes word frequencies into account.
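Several of the engines above rely on the Okapi BM25 ranking formula. The sketch below shows a common textbook form of BM25; it is not claimed to be the exact variant implemented by Terrier or Sphinx. The defaults `k1=1.2` and `b=0.75` and the smoothed IDF are conventional choices assumed here, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalised through this factor.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["port", "city"],
    ["port", "port", "harbour"],
    ["mountain", "valley"],
]
scores = bm25_scores(["port"], docs)
```

Only term frequencies, document frequencies and document lengths enter the score, which matches the remark that the statistical rank ignores word positions.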
Indexing
Rijsbergen (1979) defined an index language as the language used to describe documents and requests. The elements of the index language are index terms, which may be derived from the text of the document to be described, or attached to it. Usually, documents are indexed using their words as index terms. In the indexing phase some dimensionality reduction techniques (Term Normalization) are applied. The most popular indexing technique is the Inverted Index, which consists of an inverted list for each index term. Pre-processing steps applied to the terms before indexing include:
• stopwords removal: avoids indexing irrelevant information by filtering out words that occur so frequently in text that they lose their utility as search keywords, and/or words without semantic importance such as articles, prepositions, pronouns, etc.
• stemming: a stemmer is an algorithm that, given a word form, determines its stem. The stem is not necessarily identical to the root of the word. For example, an English stemmer will possibly identify the string “build” as the stem of the word forms “building” and “builders”. The Porter algorithm is very widely used as a standard stemmer for English (Porter, 1997). This method removes the commoner morphological and inflexional endings from English words.
• lemmatization: a lemmatizer is an algorithm that, given a word form, determines its lemma by using the part of speech of the word in the sentence. It requires a lexicon that stores the necessary knowledge of the language (i.e. a lemma and its associated lexeme, the pair <word form, part-of-speech>). Lemmatization differs from stemming in that it requires the POS tag of the word in the sentence and a knowledge base of lexemes. Stemming does not take into account the function of the word in the sentence, does not require extensive knowledge of the language, and normally works by stripping morphological and inflexional endings from words. As an example, the word “went” has “go” as its lemma, but its stem is the word form itself.
19 JIRS. http://sourceforge.net/projects/jirs/
20 Sphinx. http://www.sphinxsearch.com
• Named Entity indexing: indexing Named Entities as a multi-word class can improve recall and avoid noise in the retrieval. However, a high-precision NERC is required in order not to lose recall. Prager et al. (2000) started this approach by indexing Named Entities and their class (predictive annotation). This method identifies potential answers in the text and then indexes their corresponding Named Entity class or Expected Answer Type.
• semantic indexing: using WordNet synsets to index collections can improve the recall of IR systems with respect to word-based indexing. Gonzalo et al. (1998) used the SMART IR system and SemCor (a disambiguated collection) to index by synsets, with dubious results. In fact, the increase in recall (29%) is accompanied by a decrease in precision due to polysemy. With an accurate WSD module (currently not available) the results could be good. Mihalcea and Moldovan (2000) also reported an improvement in IR effectiveness when indexing by synsets on the Cranfield collection. Liu et al. (2004) effectively used WordNet to disambiguate the word senses of query terms.
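The indexing steps described in this section (stopword removal, stemming, and the construction of an inverted index) can be sketched as follows. The stopword list is a tiny illustrative sample, and `naive_stem` is a deliberately crude suffix stripper introduced here as an assumption; it is not the Porter algorithm.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}  # tiny illustrative list
SUFFIXES = ("ing", "ers", "er", "s")  # naive suffix stripping, NOT the Porter algorithm


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, strip punctuation, drop stopwords, and stem the remaining tokens."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return [naive_stem(tok) for tok in tokens if tok and tok not in STOPWORDS]


def build_inverted_index(docs):
    """Map each index term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


docs = {
    1: "The builders of the port",
    2: "Building a port in the city",
    3: "The city is growing",
}
index = build_inverted_index(docs)
```

Here “builders” and “Building” both end up under the index term “build”, while a form such as “went” would be left unchanged, in line with the contrast between stemming and lemmatization drawn above.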
Searching
Searching documents in IR systems implies using a textual query in a boolean or ranked manner to obtain an unordered or ordered set of relevant documents. Boolean searches apply logical operators such as AND, OR, and NOT to the query terms to find the set of documents that satisfy the logical expression. Ranked retrieval, on the other hand, ranks a set of documents based on their keyword similarity to the query.
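The difference between the two search modes can be sketched over a small hand-written inverted index. The index contents and the scoring rule (counting matching query terms) are assumptions chosen for brevity; real engines use weighting schemes such as TF-IDF or BM25 for the ranked case.

```python
# Hypothetical inverted index: term -> set of document ids.
index = {
    "port": {1, 2},
    "city": {2, 3},
    "river": {3},
}


def postings(term):
    """Return the documents containing `term` (empty set if the term is unknown)."""
    return index.get(term, set())


# Boolean retrieval: logical operators become set operations, results are unordered.
port_and_city = postings("port") & postings("city")
port_or_river = postings("port") | postings("river")
city_not_port = postings("city") - postings("port")


def ranked(query_terms):
    """Order documents by the number of query terms they contain (ties by document id)."""
    counts = {}
    for term in query_terms:
        for doc_id in postings(term):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(counts, key=lambda d: (-counts[d], d))


ranking = ranked(["port", "city"])
```

The boolean query either accepts or rejects a document, whereas the ranked query also returns partial matches, placed after the documents that match more terms.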
IR systems sometimes offer capabilities such as phrasal search (searching for a phrase or a specific sequence of words, e.g. “Tom Cruise”), fuzzy matches (e.g. “*at” will match “Pat” or “rat”), regular expression (regexp) matches, or term boosting (i.e. weighting search terms). A frequent approach in searching is Query Expansion (QE). QE is often used to increase the recall of the system by adding terms similar to the ones in the original query. WordNet has been used for this purpose by expanding terms with their synonyms, hyponyms, and hypernyms21. Gazetteers, encyclopedic knowledge, and abbreviations can also be used in certain domains to perform QE.
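Query Expansion can be sketched with a lookup table of related terms. The table `RELATED_TERMS` below is a hypothetical, hand-made stand-in for a resource such as WordNet or a gazetteer, and the cap `max_per_term` is an assumption added to limit the noise that uncontrolled expansion can introduce.

```python
RELATED_TERMS = {  # hypothetical hand-made table, not real WordNet output
    "river": ["stream", "watercourse"],
    "uk": ["united kingdom"],
}


def expand_query(terms, related=RELATED_TERMS, max_per_term=2):
    """Return the original terms followed by up to `max_per_term` related terms each."""
    expanded = []
    for term in terms:
        for candidate in [term] + related.get(term, [])[:max_per_term]:
            if candidate not in expanded:  # keep order, avoid duplicates
                expanded.append(candidate)
    return expanded


expanded = expand_query(["river", "uk", "longest"])
```

Terms without an entry in the table, such as “longest” here, are passed through unchanged.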
The number of documents to retrieve depends on the task. In QA, it normally depends on the document processing capability of the system, which in turn depends on the available computational resources and on the computational cost of the algorithms designed to process the documents. Deep NLP approaches may require expensive computational resources and processing time and therefore use only a few documents (and/or passages), whereas simpler approaches with lower requirements can cope with more data.22
In the Information Retrieval field, for research purposes the top 1,000 documents are taken into account to evaluate the systems (e.g. the TREC and CLEF ad hoc IR tasks). In the real world, the user normally wants no more than 50 documents from a search engine. For QA, usually only a few documents or passages are used to extract the answer. In PR the searching process retrieves passages that sometimes overlap and sometimes have a fixed size. Jorg Tiedemann (2004) compares different IR systems for QA; Zettair and Lucene obtained the best results.
21 Without a good WSD this kind of expansion has to be done very carefully to avoid the introduction of noisy terms.
22 In online-QA the response time is a critical constraint while in TREC or CLEF contests time process
An often used approach to improve searching is Relevance Feedback (RF) for IR/PR. RF consists of using the most relevant terms collected from the top-ranked documents of an initial query to compose, manually or automatically, a second query with more information.
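The automatic variant of Relevance Feedback can be sketched as follows. This is a simplified illustration under stated assumptions: term relevance is approximated by raw frequency in the top-ranked documents, and the function name `feedback_query` and the parameters `top_docs` and `new_terms` are hypothetical.

```python
from collections import Counter


def feedback_query(query, ranked_docs, top_docs=2, new_terms=2, stopwords=frozenset()):
    """Extend `query` with the most frequent new terms from the top-ranked documents."""
    counts = Counter(
        term
        for doc in ranked_docs[:top_docs]
        for term in doc
        if term not in query and term not in stopwords
    )
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    best = sorted(counts, key=lambda t: (-counts[t], t))[:new_terms]
    return list(query) + best


# Documents are assumed to be tokenized and already ordered by an initial search.
ranked_docs = [
    ["thames", "river", "london", "england"],
    ["thames", "flows", "through", "london"],
    ["amazon", "river", "brazil"],
]
second_query = feedback_query(["river"], ranked_docs)
```

The second query is then submitted to the same engine; in the manual variant a user would choose the added terms instead of the frequency count.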