Many language processing and text analysis methods use stemming to remove affixes (prefixes and suffixes) from words (Frakes and Fox, 2003). Stemming reduces the number of distinct word forms and maps each word to a form that closely approximates its root morpheme, so that related words match exactly (Ramasubramanian and Ramya, 2013). For example, the word “children” is reduced to the stem “child”.
A stemming technique is a computational process that reduces the variant forms of a word to a common root morpheme by stripping the word of its derivational and inflectional suffixes (Smirnov, 2008). Stemming is therefore employed to maximise the usefulness of the subject terms.
Several stemming techniques are available, including Snowball, Porter, Lovins, German, Arabic, and dictionary-based stemmers. Snowball is a small string-processing language designed for constructing stemming algorithms for information retrieval. Stemmers written in Snowball can be precisely specified, and fast stemmer programs, for example in Java, can be generated from them (Porter, 2001).
Porter's algorithm was developed for stemming English-language texts. Information retrieval has grown considerably in importance since the 1990s, creating widespread interest in conflation techniques, including techniques that support the retrieval of texts written in other languages. The Porter algorithm is currently the standard technique for stemming English and is therefore a natural model within language processing (Porter, 2005; Porter, 2006).
The Lovins stemmer was the first stemmer to be proposed, particularly for retrieval applications. It strips endings according to a dictionary of common suffixes, for instance DES, ING or TION (Porter, 2005). The Lovins stemmer prompted the improvement of later algorithms and remains a basic, general-purpose tool for information retrieval (Frakes and Fox, 2003).
In addition, a dictionary-based stemmer relies on a dictionary of suffixes. When a word is stemmed, the algorithm searches the right-hand end of the word for any suffix listed in the dictionary. When a suffix is found, it is removed, subject to a range of context-sensitive rules. These rules prevent, for example, the removal of *ABLE from TABLE or of *S from GAS. Additionally, a set of recoding rules may be supplied to facilitate the conflation of variants such as “Forgetting” and “Forget” or “Absorb” and “Absorption” (Porter, 2006).
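The suffix-stripping process described above can be sketched in a few lines of Python. This is a minimal illustration and not any published stemmer: the suffix list, the minimum stem length used as the context-sensitive rule, and the recoding endings are all assumptions chosen so that the examples from the text behave as described.

```python
# Hypothetical suffix dictionary, checked in the order listed.
SUFFIXES = ["ation", "tion", "able", "ing", "s"]

# Assumed context-sensitive rule: a suffix is removed only if at least
# this many letters remain, which keeps TABLE and GAS intact.
MIN_STEM_LEN = 3

# Assumed recoding rules: (ending of the stripped stem, replacement).
RECODE_ENDINGS = [("tt", "t"), ("rp", "rb")]


def recode(stem):
    """Repair the end of a stem so that variants conflate to one form."""
    for ending, replacement in RECODE_ENDINGS:
        if stem.endswith(ending):
            return stem[: -len(ending)] + replacement
    return stem


def strip_suffix(word):
    """Remove one dictionary suffix from the right-hand end of the word."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM_LEN:
            return recode(w[: -len(suffix)])
    return w


for example in ["table", "gas", "forgetting", "absorption"]:
    print(example, "->", strip_suffix(example))
```

Under these assumptions, “table” and “gas” are left unchanged, while “forgetting” and “absorption” are recoded to the same forms as “forget” and “absorb”.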
Among the stemming techniques developed for the Indonesian language is the approach of Nazief and Adriani (Nazief and Adriani, 1996; Adriani et al., 2007). Their approach is based on comprehensive morphological rules that group and encapsulate the allowed and disallowed affixes, including prefixes, suffixes, infixes (insertions) and confixes (combinations of prefixes and suffixes) (Asian et al., 2005; Adriani et al., 2007). The basic affixes are grouped as follows. The first group is the inflection suffix, which does not alter the root word. For instance, “makan” (to eat) may be suffixed with “-lah” to give “makanlah” (please eat). The second group is the derivation suffix, which is added directly to the root word; a word can carry only one derivation suffix. An example is the word “baca” (to read), which can take the derivation suffix “-kan” to become “bacakan” (please read). The third group is the derivation prefix, which is attached either directly to a root word or to a word that already has up to two other derivation prefixes. To illustrate, the derivation prefixes “mem-” and “per-” may be prepended to “jelek” (bad) to produce “memperjelek” (the act of ill-favouring) (Asian et al., 2005). The algorithm also provides recoding, which locates and restores an initial letter that was dropped from a root word when a prefix was prepended. Moreover, the algorithm uses an auxiliary dictionary of root words in most of its steps to determine whether stemming has reached a root word (Nazief and Adriani, 1996; Adriani et al., 2007).
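The three affix groups and the dictionary check can be illustrated with a heavily simplified Python sketch. It is not the published Nazief and Adriani algorithm: recoding and the rules for disallowed affix combinations are omitted, and the root dictionary and affix lists below are small assumed samples.

```python
# Toy root dictionary and affix lists (assumptions for illustration only).
ROOTS = {"makan", "baca", "jelek", "sama"}
INFLECTION_SUFFIXES = ("lah", "kah", "ku", "mu", "nya")
DERIVATION_SUFFIXES = ("kan", "an", "i")
DERIVATION_PREFIXES = ("mem", "per", "ber", "di", "ke", "se")
MAX_PREFIXES = 3  # one prefix plus up to two others, as described in the text


def strip_one_suffix(word, suffixes):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem(word):
    word = word.lower()
    if word in ROOTS:  # dictionary check before any stripping
        return word
    candidate = strip_one_suffix(word, INFLECTION_SUFFIXES)
    if candidate in ROOTS:
        return candidate
    candidate = strip_one_suffix(candidate, DERIVATION_SUFFIXES)
    if candidate in ROOTS:
        return candidate
    for _ in range(MAX_PREFIXES):
        for prefix in DERIVATION_PREFIXES:
            if candidate.startswith(prefix):
                candidate = candidate[len(prefix):]
                break
        else:
            break  # no prefix matched, stop stripping
        if candidate in ROOTS:
            return candidate
    return word  # no root found: return the input unchanged


for example in ["makanlah", "bacakan", "memperjelek", "makan"]:
    print(example, "->", stem(example))
```

The initial dictionary check matters: without it, the root “makan” would lose its final “-an” as if it were a derivation suffix.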
A stemming technique developed by Arifin and Setiono (2002) employs a dictionary to remove affixes progressively and to manage the recoding process. The purpose of their approach is to eliminate all prefixes and suffixes; the process halts when the stemmed word is found in a dictionary, or when the number of removed affixes has reached a maximum of two prefixes and three suffixes. If the stemmed word still cannot be found once the prefixes and suffixes have been removed, the affixes are restored to the word in every possible combination, so as to minimise the possibility of stemming errors (Arifin and Setiono, 2002; Asian et al., 2005). For instance, the word “disamakan” (to be equal) contains the root word “sama” (equal). Following the first step of eliminating the prefix “di-”, the word that remains is “samakan” (to equalise).
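The strip-then-restore idea can be sketched as follows. This is a simplified reading of the description above, not the authors' implementation: the root dictionary, the affix lists and the minimum core length are assumptions, and recoding is left out. The word “dikenalkan” is an added example in which the prefix “ke-” is stripped wrongly and must be restored.

```python
# Toy dictionary and affix lists (assumptions for illustration only).
ROOTS = {"sama", "kenal"}
PREFIXES = ("di", "ke", "me", "ber")
SUFFIXES = ("kan", "an", "i")
MAX_PREFIXES, MAX_SUFFIXES = 2, 3
MIN_CORE = 3  # assumed guard: never strip a word below three letters


def stem(word):
    word = word.lower()
    if word in ROOTS:
        return word
    core, prefixes, suffixes = word, [], []

    # Strip up to two prefixes, checking the dictionary after each removal.
    for _ in range(MAX_PREFIXES):
        p = next((p for p in PREFIXES
                  if core.startswith(p) and len(core) - len(p) >= MIN_CORE), None)
        if p is None:
            break
        core = core[len(p):]
        prefixes.append(p)
        if core in ROOTS:
            return core

    # Strip up to three suffixes, again checking after each removal.
    for _ in range(MAX_SUFFIXES):
        s = next((s for s in SUFFIXES
                  if core.endswith(s) and len(core) - len(s) >= MIN_CORE), None)
        if s is None:
            break
        core = core[: -len(s)]
        suffixes.insert(0, s)  # keep suffixes in left-to-right word order
        if core in ROOTS:
            return core

    # Restore the innermost k prefixes and m suffixes in every combination.
    for k in range(len(prefixes) + 1):
        for m in range(len(suffixes) + 1):
            candidate = ("".join(prefixes[len(prefixes) - k:])
                         + core + "".join(suffixes[:m]))
            if candidate in ROOTS:
                return candidate
    return word


print(stem("disamakan"))   # "di-" is removed first, then "-kan"
print(stem("dikenalkan"))  # "ke" is stripped, then restored to reach the root
```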
A different stemming technique by Vega (2001), unlike the other approaches, does not require a dictionary. Rules are applied to each word in order to divide it into smaller units. For instance, the word “didudukkan”, which translates as ‘to be seated’, might be described by the following rule: (di) + stem (root) + (i | kan) (Vega, 2001; Asian et al., 2005).
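A rule of this shape can be written as a regular expression. The sketch below encodes only the single rule quoted above and assumes that both the prefix and one of the two suffixes must be present; it is an illustration, not Vega's rule set.

```python
import re

# (di) + stem (root) + (i | kan), read here as: prefix and suffix both required.
RULE = re.compile(r"^(?P<prefix>di)(?P<root>.+?)(?P<suffix>kan|i)$")


def segment(word):
    """Split a word into (prefix, root, suffix), or return None if no match."""
    match = RULE.match(word.lower())
    if match is None:
        return None
    return match.group("prefix"), match.group("root"), match.group("suffix")


print(segment("didudukkan"))
print(segment("duduk"))
```

Because no dictionary is consulted, the rule cannot tell whether the extracted root is a real word, which is the trade-off of a purely rule-based approach.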
Additionally, a stemming technique by Ahmad et al. (1996) has two distinctive characteristics: first, the approach was designed for the Malaysian language rather than Indonesian; and second, it is a straightforward approach. An ordered list of all valid prefixes, suffixes, infixes and confixes is maintained. Before stemming, the algorithm looks the word up in the dictionary; if the word is found, it is returned unchanged as a successful result. If the word cannot be located in the dictionary, the next rule in the rule list is applied to the word (Ahmad et al., 1996; Asian et al., 2005). The order of the rules therefore matters. For example, when the infix rules are applied before the prefix rules, “berasal” (to originate), for which the correct stem is “asal” (an origin or source), is stemmed to “basal” (basalt or dropsy), because the infix “-er-” is removed before the prefix “ber-” is considered.
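The effect of rule order can be reproduced with a small sketch. It assumes that the result of each rule is checked against the dictionary and accepted if found there; the dictionary and the two rule functions are hypothetical stand-ins, not the rule list of Ahmad et al.

```python
# Toy dictionary (assumption): both words are valid dictionary entries.
ROOTS = {"asal", "basal"}


def strip_prefix_ber(word):
    """Prefix rule: remove a leading "ber-"."""
    return word[3:] if word.startswith("ber") else word


def strip_infix_er(word):
    """Infix rule: remove "-er-" found directly after the first letter."""
    return word[0] + word[3:] if word[1:3] == "er" else word


def stem(word, rules):
    """Apply the rules in order; accept the first result found in ROOTS."""
    if word in ROOTS:
        return word
    for rule in rules:
        candidate = rule(word)
        if candidate != word and candidate in ROOTS:
            return candidate
    return word


print(stem("berasal", [strip_infix_er, strip_prefix_ber]))  # infix rule first
print(stem("berasal", [strip_prefix_ber, strip_infix_er]))  # prefix rule first
```

With the infix rule first the sketch returns “basal”; with the prefix rule first it returns the correct stem “asal”.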
Of the four approaches explained above, Nazief and Adriani’s (1996) approach has been adopted in this research because it can both remove and restore prefixes and suffixes, which brings the stemmed words close to their true meaning. This thesis therefore extends the Indonesian stem dictionary to create and implement a stem dictionary of Indonesian insulting words. The purpose of this dictionary is to identify and remove the suffixes and prefixes of Indonesian insulting words in particular. Our stem dictionary will help to identify Indonesian insulting words that commonly appear in social network messages.