Rapp (1999) and Koehn and Knight (2000) suggest that commonly occurring strings that do not help in processing natural language data should be removed, and Fung (1995) suggests removing function words from the texts to increase the values of many nouns. A similar approach is found in the information retrieval (IR) field.
In an IR setting, it is customary to divide the text vocabulary into two classes: stop words and content-bearing words. A stop list, or negative dictionary, is a device usually used in an automatic indexing system to filter out words that would make poor index terms. For example, English words such as "a," "and," "is," and "the" are left out of the full-text index because they are deemed unlikely to be useful for searching. The advantage of a stop list is that it can reduce the size of an inverted index by up to half, which makes indexing more efficient.
However, once a stop list is applied in a system, phrases containing stop words are removed entirely and can no longer be searched for in that system.
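The trade-off described above can be illustrated with a minimal sketch. The tiny stop list, the three documents and the helper names `build_inverted_index` and `count_postings` are assumptions made for illustration only; they do not come from any particular IR system.

```python
from collections import defaultdict

# Illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "and", "is", "the", "to", "be", "or", "not"}


def build_inverted_index(docs, stop_words=frozenset()):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return dict(index)


def count_postings(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


docs = {
    1: "the cat is on the mat",
    2: "a dog and a cat",
    3: "to be or not to be",
}

full_index = build_inverted_index(docs)
filtered_index = build_inverted_index(docs, STOP_WORDS)

print(count_postings(full_index), count_postings(filtered_index))
```

In this toy collection the filtered index is much smaller, but document 3 consists only of stop words, so it no longer appears under any term and the phrase it contains cannot be found through the filtered index.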
Hans Peter Luhn, one of the pioneers of IR, is credited with coining the phrase and with introducing the concept in his design. Stop lists have been
3.3 Related approaches
constructed for English and most of the major European languages. English stop lists, developed from frequency statistics of a large corpus (Zipf, 1932), can easily be retrieved online.
Stop words in text collections can generally be divided into two types: (1) generic stop words and (2) domain stop words. A generic stop list includes words that can be eliminated in any circumstances, whereas a domain stop list includes stop words that are effective only in a certain domain.
An English generic stop list typically consists of about 200-400 words, including articles, prepositions, conjunctions and some high-frequency words. A domain stop list contains words that recur throughout domain-specific documents. For example, words such as states, system and government appear too frequently as candidates for translation when a bilingual lexicon is learnt from the Europarl corpus, which consists of parliamentary proceedings. To compensate for high frequency, it has been suggested that the dispersion weight of the distributional criteria be reduced as follows:
w'(t) = w(t) / d(t)
where w(t) is the weight that t had as a candidate for some term, and d(t) is the number of times t has been proposed as a candidate.
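A small worked sketch of this adjustment follows. It reads the formula as the quotient of w(t) and d(t), which is an assumption based on the stated aim of reducing the weight of over-proposed candidates. The function name `damp_weights`, the source words and all weights are made up for illustration.

```python
from collections import Counter


def damp_weights(proposals):
    """Divide each candidate's weight by d(t), the number of times it was proposed.

    proposals: list of (source_term, candidate, weight) triples.
    Returns {(source_term, candidate): weight / d(candidate)}.
    """
    d = Counter(candidate for _, candidate, _ in proposals)
    return {(source, cand): weight / d[cand] for source, cand, weight in proposals}


# Hypothetical proposals: "states" is offered as a translation three times.
proposals = [
    ("Staat", "states", 0.8),
    ("Regierung", "states", 0.6),
    ("System", "states", 0.6),
    ("Fisch", "fish", 0.5),
]

adjusted = damp_weights(proposals)
for pair, weight in adjusted.items():
    print(pair, round(weight, 3))
```

Under this reading, a candidate proposed only once keeps its weight, while a candidate proposed for many different terms is penalised in proportion to how often it is proposed.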
A stop word is often associated with low variance and comparatively high frequency across the whole corpus. Conventionally, stop lists are supposed to include the most frequently occurring words. In practice, however, a stop list may also include infrequent words and may omit some of the most frequent ones. A classic method by Christopher Fox (1990), a manual procedure aided by frequency statistics of the Brown corpus (which contains 1,014,000 words drawn from a broad range of English literature), generated a stop list of 421 stop words that might differ from other lists available today. The method started with a list of tokens occurring more than 300 times in the Brown
corpus. From this list of 278 words, 32 were culled because they were considered too important as potential index terms. Then 26 words were added because they occur very frequently in certain kinds of literature, and 149 more were added because the finite-state-machine-based filter in which the list was intended to be used could filter them at almost no cost. The final product was a list of 421 stop words, which was used in the past to filter the most frequent words occurring in English literature.
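The three steps of this procedure (threshold, cull, add) can be sketched as set operations. This is only an outline under stated assumptions: the function `build_stop_list`, its parameter names and the toy word lists are hypothetical, and the manual judgments in the original method are represented here by hand-supplied sets.

```python
from collections import Counter


def build_stop_list(tokens, min_count, keep_as_index_terms, extra_words):
    """Threshold by frequency, cull useful index terms, then add extra words."""
    counts = Counter(tokens)
    # Step 1: tokens occurring more than min_count times.
    frequent = {term for term, count in counts.items() if count > min_count}
    # Step 2: remove frequent words judged too valuable as index terms.
    # Step 3: add words chosen for other reasons (genre-specific or cheap to filter).
    return (frequent - set(keep_as_index_terms)) | set(extra_words)


# Toy data (an assumption): "business" is frequent but worth indexing.
tokens = ["the"] * 5 + ["of"] * 4 + ["business"] * 3 + ["cat"]
stop_list = build_stop_list(
    tokens,
    min_count=2,
    keep_as_index_terms={"business"},
    extra_words={"thou", "thereof"},
)
print(sorted(stop_list))

# The counts reported for the original list are consistent,
# provided the added words were not already on the list.
fox_total = 278 - 32 + 26 + 149
print(fox_total)
```

The arithmetic check confirms that the figures quoted above add up to the 421 words of the final list.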
Previous studies on constructing stop lists automatically or semi-automatically are available. One such study is based on term frequency, a careful manual elimination process, and the assumption that not every highly frequent word should be included in the stop list. The study focuses on eliminating from the list terms that carry significant information even though they occur quite frequently in the corpus. In the study, a stop list was created from document collections restricted to a specific political domain. Certain words carrying significant information (such as "President" and "France") were found to be highly ranked, so they were removed manually from the stop list.
Other methods use automated statistical testing based on the inverse document frequency (IDF) to identify stop words in a collection. In these methods, a stop word is seen as a word that is as likely to occur in documents that are not relevant to a query as in documents that are relevant to the same query. What is measured is the strength of a term, that is, how strongly the term's occurrences correlate with the subjects of documents in the database. If term occurrences are random, there is no correlation and the strength is zero; if, for any subject, the term is either always present or never present, the strength is one.
Although the IDF provides a useful global weight for terms, the frequency of a term in the database is not the only factor bearing on its usefulness as a key term for document retrieval. Infrequently used terms might also not relate to the specific content of documents. A statistical method of judging the function of a term might be needed.
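To make the IDF weight concrete, the sketch below computes it for a toy collection. It assumes the plain variant log(N / df), one of several formulations in use, and the function name and documents are illustrative only. It does not implement the term strength measure described earlier.

```python
import math


def inverse_document_frequency(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    document_frequency = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return {
        term: math.log(n_docs / df) for term, df in document_frequency.items()
    }


docs = [
    "the election result".split(),
    "the parliament vote".split(),
    "the election debate".split(),
    "the budget vote".split(),
]

idf = inverse_document_frequency(docs)
for term in sorted(idf, key=idf.get):
    print(term, round(idf[term], 3))
```

A word that appears in every document receives an IDF of zero, which makes it a natural stop word candidate. A word that appears in a single document receives the highest IDF whether or not it says anything about that document's content, which is the limitation noted above.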
There is also another automatic approach, based on a complex statistical model that assigns a weight to each term using the Kullback-Leibler divergence measure (TszWai et al., 2005). A stop list constructed with this term-based random sampling approach requires less computational effort; however, its quality is slightly worse than that of classical stop lists constructed from term frequency. The study suggests merging this stop list with Fox's classical stop list.
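A rough sketch of a Kullback-Leibler style term weight is given below. It assumes the pointwise form P_s(t) * log2(P_s(t) / P_c(t)), where P_s and P_c are the relative frequencies of t in a sampled subset and in the whole collection. Whether this matches the cited study in every detail is an assumption, as are the function name and the toy data. The sample is assumed to be drawn from the collection, so every sampled term has a non-zero collection frequency.

```python
import math
from collections import Counter


def kl_term_weights(sample_tokens, collection_tokens):
    """Weight each sampled term by P_s * log2(P_s / P_c)."""
    sample_counts = Counter(sample_tokens)
    collection_counts = Counter(collection_tokens)
    n_sample = len(sample_tokens)
    n_collection = len(collection_tokens)
    weights = {}
    for term, count in sample_counts.items():
        p_sample = count / n_sample
        p_collection = collection_counts[term] / n_collection
        weights[term] = p_sample * math.log2(p_sample / p_collection)
    return weights


collection = "the vote the budget the vote the treaty".split()
sample = "the budget the treaty".split()  # a subset of the collection

weights = kl_term_weights(sample, collection)
for term in sorted(weights, key=weights.get):
    print(term, weights[term])
```

Terms whose distribution in the sample is close to their distribution in the collection receive weights near zero, so the lowest-weighted terms are the least informative and can be treated as stop word candidates.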
Stop word identification for languages other than English is also discussed in many previous studies, including Hao (2008) and Alajmi et al. (2012).
For Chinese, tokenizing text is more difficult than in other natural languages because word boundaries are not well defined. Therefore, a segmentation algorithm has to be applied before a statistical model can be built for engineering the stop list. Generally, this statistical model is also based on term frequencies, but the term frequencies are normalized by document length before the probability that each candidate is a stop word is calculated.
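The normalization step can be sketched as follows. The sketch assumes that the documents have already been segmented into words by some external algorithm and that normalization means dividing each term count by the document length and averaging over documents. Both are assumptions, as are the function name and the example documents. The later probability estimate is not modelled here.

```python
from collections import Counter


def mean_normalized_frequency(segmented_docs):
    """Average, over documents, of each term's count divided by document length."""
    docs = [tokens for tokens in segmented_docs if tokens]  # skip empty documents
    if not docs:
        return {}
    totals = {}
    for tokens in docs:
        for term, count in Counter(tokens).items():
            totals[term] = totals.get(term, 0.0) + count / len(tokens)
    return {term: total / len(docs) for term, total in totals.items()}


# Pre-segmented toy documents (the segmentation itself is assumed).
segmented_docs = [
    ["我", "的", "书"],
    ["他", "的", "猫", "的", "家"],
]

scores = mean_normalized_frequency(segmented_docs)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 3))
```

Dividing by document length keeps long documents from dominating the statistics, so a word ranks highly only if it takes up a large share of many documents.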
In summary, removing stop words is a common practice for reducing index size without affecting the accuracy of an IR system. A similar practice is found in the field of bilingual lexicon extraction. A stop list can be generated automatically, and a stop list for generic use is best learnt from a very large corpus. Standard stop lists for English and some other major languages are already available; for most languages, however, a stop list still has to be constructed, either manually or automatically.