In the previous section, the basic principles of document retrieval were outlined: documents and queries are represented as vectors with a tf‐idf weighting scheme, and their correspondence is calculated using the vector similarity measure. However, simply using every word of a document as an index term is not a good approach because not all words are equally significant. Terms with low discriminative power (idf value near 0), i.e. terms which occur in nearly every document of the collection, such as “the”, “or”, “and”, and “a”, are not useful index terms. Therefore, a document is processed by one or more analyzers before being indexed. An analyzer is a combination of several text operations such as tokenization, stopword removal, and stemming.
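The remark about discriminative power can be made concrete with a small calculation. The sketch below assumes the common definition idf = log(N / df), where N is the number of documents and df the number of documents containing the term; the exact variant used in the previous section may differ. The function name `idf` and the three toy documents are illustrative assumptions.

```python
import math

def idf(term, documents):
    """Inverse document frequency, assuming idf = log(N / df)."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    # Assumption: a term that occurs in no document gets an idf of 0.0 here.
    return math.log(n_docs / df) if df else 0.0

# Toy collection: each document is represented as a set of its words.
docs = [
    {"the", "enzyme", "is", "a", "protease"},
    {"the", "company", "builds", "cars"},
    {"the", "query", "returns", "a", "document"},
]

print(idf("the", docs))     # occurs in every document: log(3/3) = 0.0
print(idf("enzyme", docs))  # occurs in one document: log(3/1)
```

A term that appears in every document receives an idf of exactly 0 under this definition, so it contributes nothing to the similarity score regardless of its term frequency.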
Next, the frequently used tokenization, stopword removal, and stemming operations are explained in more detail.
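Before the individual operations are discussed, the way an analyzer chains them can be sketched as a simple function composition. This is a minimal sketch, not the design of any particular search library: `make_analyzer` and the three deliberately crude stand-in operations are hypothetical names introduced for illustration only.

```python
def make_analyzer(tokenizer, *token_filters):
    """Build an analyzer: one tokenizer followed by any number of token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

# Crude stand-ins for the operations explained in the following subsections.
def lowercase_split(text):
    return text.lower().split()

def drop_stopwords(tokens):
    return [t for t in tokens if t not in {"the", "a", "is"}]

def strip_plural_s(tokens):
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

analyzer = make_analyzer(lowercase_split, drop_stopwords, strip_plural_s)
print(analyzer("The enzyme cleaves proteins"))  # ['enzyme', 'cleave', 'protein']
```

Because documents and queries should pass through the same analyzer, building it once and reusing it on both sides keeps the index terms and the query terms comparable.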
2.3.1 Tokenization
Tokenization is the process of splitting the character stream of a text document into separate tokens. A trivial approach would be to split the character stream at every space character. However, considering only space characters does not yield optimal results: punctuation marks, quotation marks, hyphens, and many other characters must also be considered when processing the character stream. According to [Fox 1992], special attention has to be given to punctuation characters, hyphens, digits, and letter case. A punctuation mark might indicate the end of a sentence, or it might be an integral part of a word. As an example, consider the term “EC3.4.21.69”, which refers to a serine protease enzyme. Treating the dots as token boundaries would take the digits out of context. On the one hand, a query for “EC3.4.21.69” will still return the document, as the query is processed by the same tokenizer. On the other hand, a query for the number “21” will also return the document, which is a false positive hit. A common approach for dealing with this issue is to add special rules to the tokenizer. In the example, “EC” numbers would be treated differently, i.e. they would not be split at dot characters. In the same manner, hyphens (e.g. in department names like “PH‐RT”) and digits (e.g. in times like “10a.m.”) might be an integral part of a word, leading to similar contextual problems if tokenized. The last point mentioned by Fox is letter case, e.g. “General Motors” vs. “general motors”. Ignoring it might result in the loss of the word’s true semantics: in the example, we would lose the knowledge that “General Motors” refers to a company. Nevertheless, this problem is generally ignored, i.e. the character stream is converted entirely to either lower case or upper case.
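The effect of such special rules can be illustrated with a small regular-expression tokenizer. The sketch below is one possible rule set, not the one proposed by Fox: the pattern, the function names `tokenize` and `naive_tokenize`, and the choice to lowercase every token are assumptions made for illustration. It uses the ASCII hyphen only.

```python
import re

# Alternatives are tried in order, so the special rules come first.
TOKEN_PATTERN = re.compile(
    r"""
      EC\d+(?:\.\d+)+      # rule 1: EC numbers such as EC3.4.21.69 stay whole
    | \d+\s?[ap]\.m\.      # rule 2: times such as 10a.m.
    | \w+(?:-\w+)*         # default: words, keeping inner hyphens (PH-RT)
    """,
    re.VERBOSE | re.IGNORECASE,
)

def tokenize(text):
    """Rule-based tokenizer; letter case is discarded, as is common practice."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]

def naive_tokenize(text):
    """Baseline that splits at every non-alphanumeric character."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(naive_tokenize("EC3.4.21.69"))  # ['ec3', '4', '21', '69']
print(tokenize("EC3.4.21.69"))        # ['ec3.4.21.69']
print(tokenize("Meet PH-RT at 10a.m. today"))
# ['meet', 'ph-rt', 'at', '10a.m.', 'today']
```

With the baseline, the token “21” ends up in the index and produces the false positive described above; with the special rule, the EC number is indexed as a single term.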
2.3.2 Stopword removal
In this process, words which have a high frequency across the document corpus are removed, i.e. they are not considered as index terms. High‐frequency words like “a”, “the”, and “is” are not good discriminators, as they occur in almost all documents. Another benefit of removing stopwords is that the size of the index structure is reduced by 40% or more. While removing stopwords has clear benefits, it can as a side effect also reduce recall (i.e. the fraction of the relevant documents that are actually returned; cf. Chapter 2.7). For example, a query that consists only of stopwords, such as “to be or not to be”, can no longer be matched. Deciding which words to include in the stopword list is thus a crucial task. Many different stopword lists exist, and the inclusion or exclusion of a word often depends on the targeted corpus: in a corpus about logic, words like “and”, “or”, and “not” would be considered relevant. A list of general stopwords for the English language can be found e.g. in [Fox 1989].
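Stopword removal itself is a simple filter over the token stream. The following sketch uses a tiny hand-picked list; it is not the list from [Fox 1989], and the names `GENERAL_STOPWORDS`, `LOGIC_STOPWORDS`, and `remove_stopwords` are assumptions made for illustration. The second list shows how the choice can be adapted to a corpus about logic.

```python
# Tiny illustrative list; real stopword lists contain several hundred words.
GENERAL_STOPWORDS = frozenset({"a", "an", "and", "is", "not", "of", "or", "the"})

# For a corpus about logic, the connectives are kept as index terms.
LOGIC_STOPWORDS = GENERAL_STOPWORDS - {"and", "or", "not"}

def remove_stopwords(tokens, stopwords=GENERAL_STOPWORDS):
    """Drop every token that appears in the stopword list."""
    return [token for token in tokens if token not in stopwords]

tokens = ["p", "and", "q", "or", "not", "r"]
print(remove_stopwords(tokens))                   # ['p', 'q', 'r']
print(remove_stopwords(tokens, LOGIC_STOPWORDS))  # ['p', 'and', 'q', 'or', 'not', 'r']
```

Using a set for the list keeps the membership test constant-time per token, which matters when every token of every document is checked.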
2.3.3 Stemming and lemmatization
Stemming is the process of removing prefixes and/or suffixes from a word. Consider for instance the words “connect”, “connected”, “connecting”, “connection”, and “connections”. These words have a similar meaning and can thus be conflated into a single term by removing the suffixes “‐ed”, “‐ing”, “‐ion”, and “‐ions”, yielding the stem “connect”. Stemming thus reduces the number of indexed terms and hence the size of the index structure. Another advantage is that relevant documents can be found regardless of which variant of a word is used in the query (singular, plural, past tense, ...).
Despite its advantages, stemming can also raise new problems. There are cases where words with distinct meanings are conflated, i.e. they receive the same stem. As an example, consider the words “wand” and “wander”, which obviously have different meanings. They are conflated, both receiving “wand” as a stem. Another example is the pair “new” (adjective) and “news” (announcement), which are both conflated to “new”. Between these two extremes of similarity and dissimilarity, there is a continuum of cases where one can argue for or against conflating.
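Both the benefit and the risk of stemming can be reproduced with a deliberately naive suffix stripper. The sketch below is a toy and not one of the published stemming algorithms: the suffix list, the minimum stem length, and the name `naive_stem` are assumptions chosen so that the examples from the text can be traced.

```python
# Toy suffix list, longest first so that "-ions" is tried before "-ion" and "-s".
SUFFIXES = ("ions", "ion", "ing", "ed", "er", "s")
MIN_STEM_LENGTH = 3  # hypothetical guard so that very short words are left alone

def naive_stem(word):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

variants = ["connect", "connected", "connecting", "connection", "connections"]
print({naive_stem(w) for w in variants})       # {'connect'}

# Over-conflation: words with distinct meanings end up with the same stem.
print(naive_stem("wand"), naive_stem("wander"))  # wand wand
print(naive_stem("new"), naive_stem("news"))     # new new
```

The same rule that usefully merges the five “connect” variants also merges “wand” with “wander”, which is why published algorithms attach additional conditions to each suffix rule.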
The different stemming algorithms described in the literature [Smirnov 2008] focus mostly on suffix removal because most word variations are introduced through suffixes. The most popular suffix removal algorithm is the one developed by [Porter 1980]. It is simple, fast, and elegant, and its performance is similar to that of more complex algorithms. An example of the Porter stemmer is given in Table 2‐1.
Lemmatization is closely related to stemming. While stemming uses an algorithmic approach based on heuristics, lemmatization is based on vocabularies and a morphological analysis of words. Lemmatization returns only the base form of a word as given in the dictionary, namely the lemma. For instance, lemmatizing the word “saw” yields either “see” or “saw”, depending on whether the token was used as a verb or as a noun. In contrast, the heuristics used in stemming algorithms might reduce the word to “s”. Lemmatization therefore retains a word’s semantics more reliably. This improvement comes at the cost of a higher implementation effort as well as a slower runtime.
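The dependence of the lemma on the part of speech can be sketched as a dictionary lookup keyed by word form and tag. This is a minimal illustration under strong assumptions: the four‐entry `LEMMA_TABLE`, the tag names “VERB” and “NOUN”, and the function `lemmatize` are hypothetical, and the part of speech is supplied by the caller, whereas a real lemmatizer would rely on a full vocabulary and a morphological analysis.

```python
# Hypothetical mini-vocabulary: (word form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("connected", "VERB"): "connect",
    ("connections", "NOUN"): "connection",
}

def lemmatize(token, pos):
    """Look up the lemma; unknown forms are returned unchanged (lowercased)."""
    token = token.lower()
    return LEMMA_TABLE.get((token, pos), token)

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

The lookup is trivial; the implementation effort and runtime cost mentioned above arise from building the vocabulary and from determining the part of speech of each token in context.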
The use of a stemming algorithm is not obligatory. In fact, due to the reduced cost of storage space and the disputes about the benefits of stemming for IR [Frakes 1992], many search engines omit stemming completely. Of course, if morphological word variations are still to be matched, other methods such as query expansion must be applied. Query expansion simply means adding additional terms to the query. Each word is expanded by its variants, achieving an effect similar to stemming. The query “connect”, for instance, is expanded to “connect OR connected OR connecting OR connection OR connections”, so that all variants are covered. However, this approach can become expensive in terms of computation time when long queries are processed.
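Query expansion as described above can be sketched with a table of known word variants. The table `VARIANTS`, the function `expand_query`, and the decision to join the expanded groups of a multi‐word query with AND are assumptions made for illustration; the text only specifies the OR expansion of a single word.

```python
# Hypothetical variant table; in practice it would be derived from a vocabulary.
VARIANTS = {
    "connect": ["connect", "connected", "connecting", "connection", "connections"],
}

def expand_query(query, variants=VARIANTS):
    """Replace each query word by the OR-combination of its known variants."""
    groups = [variants.get(term, [term]) for term in query.lower().split()]
    if len(groups) == 1:
        return " OR ".join(groups[0])
    # Assumption: the words of a multi-word query are combined with AND.
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)

print(expand_query("connect"))
# connect OR connected OR connecting OR connection OR connections
print(expand_query("connect printer"))
# (connect OR connected OR connecting OR connection OR connections) AND (printer)
```

The single query word already turns into five terms, which illustrates why the computation time grows quickly for long queries.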