The Ling-Spam Corpus

With the increased use of email over regular mail, the opportunities for advertising via email have increased dramatically. When such email advertising is unsolicited, it is commonly referred to as SPAM. Since advertising through media such as television, newspapers or magazines is much more expensive, the popularity of SPAM has increased dramatically in recent times. In fact, the number of SPAM emails is starting to overwhelm the number of legitimate emails, to such a degree that it is feared that SPAM may cause the demise of email, as users find it too cumbersome to sort legitimate messages from SPAM. Most users find SPAM at least annoying, if not blatantly offensive, especially since a large proportion of SPAM contains (graphic) advertisements for pornographic sites.

Different strategies to combat this threat exist. On the one hand there is legislation, e.g. the “Controlling the Assault of Non-Solicited Pornography and Marketing Act of 2003,” as passed on December 16, 2003, by the United States Congress. In some cases the law can be applied effectively, e.g. an Internet service provider, CIS Internet Servers, won a lawsuit against SPAM senders who were sending up to 10 million SPAM messages per day to their server. The law dictates that SPAM senders be fined $10 per message, and the total damages amounted to one billion dollars [CNET News.com, 20 December 2004]. However, with the email protocols in use today the enforcement of such laws is often undermined, as it is difficult or often impossible to identify where the SPAM was sent from. Thus, another approach is the proposal of new email standards that remove the anonymity of email. A third approach is identifying SPAM and automatically deleting it. One common approach to distinguishing between different classes of text documents is the use of Bayesian classifiers such as Naive Bayesian [Mitchell, 1997]. This method also proved relatively successful at separating SPAM from HAM (a term for legitimate email) [Sahami et al., 1998; Schneider, 2003], outperforming advanced rule learners such as RIPPER [Pantel and Lin, 1998].

Table 11.5: Performance comparison of different classifiers on the LingSpam corpus.

| Classification Method | Classification Accuracy | SPAM Recall | SPAM Precision |
|---|---|---|---|
| FCF | 98.17 | 92.95 | 95.42 |
| Naive Bayesian | 96.93 | 82.35 | 99.02 |
| TiMBL(1) | 96.89 | 85.27 | 95.92 |
| TiMBL(2) | 96.75 | 83.19 | 97.10 |
| Outlook Patterns | 90.98 | 53.01 | 87.93 |
| TiMBL(10) | 89.08 | 34.54 | 99.64 |
| No Filter | 83.37 | 0 | ∞ |

The Ling-Spam corpus is a publicly available corpus of SPAM and legitimate messages⁴. The corpus contains 2893 messages sent via the mailing list Linguist. Linguist is a moderated mailing list about the science of linguistics⁵. Approximately 16% of the messages in the corpus are SPAM, and the labelling was done by hand to minimize noise. Although the corpus mainly covers the domain of linguistics, legitimate messages also include, for example, job postings and software announcements.

Experiments

To induce a fuzzy rule set capable of distinguishing between SPAM and HAM, the different text documents are first preprocessed into feature vectors. We used the freely available software FeatureFinder⁶ for feature extraction. FeatureFinder uses mutual information to select a user-defined number of features. The feature types that can be created include TF (term frequency) and TF-IDF (Term Frequency / Inverse Document Frequency). Let the $i$th feature have the $i$th greatest mutual information, let $TF_i$ be the number of occurrences of the $i$th feature in a given document, and let $|D|$ be the total number of documents; then $f_i$, the TF-IDF of the $i$th feature for a given document, is calculated as

$$f_i = \log \frac{|D|}{TF_i} \qquad (11.2)$$

To create a fuzzy training set for our experiments we first extracted 500 TF-IDF type features for each document. We then extracted membership functions from each feature using the approach described in Appendix C, where we allowed up to four linguistic terms per linguistic variable. However, in general the extraction process suggested the use of three membership functions.
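As an illustration of Equation (11.2) only, the following minimal Python sketch computes the feature value for a small, made-up set of tokenized documents. It is not FeatureFinder's implementation: the function name `tfidf_features`, the toy documents, and the hand-picked vocabulary (standing in for the mutual-information ranking) are all hypothetical. Mapping an absent term to 0.0 is also an assumption, since the logarithm is undefined when the term count is zero.

```python
import math
from collections import Counter


def tfidf_features(documents, vocabulary):
    """Apply Eq. (11.2), f_i = log(|D| / TF_i), to every document.

    documents:  list of token lists (one list per document)
    vocabulary: selected terms; a hypothetical stand-in for the features
                ranked by mutual information
    """
    n_docs = len(documents)  # |D|, the total number of documents
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        row = []
        for term in vocabulary:
            tf = counts[term]  # TF_i: occurrences of the term in this document
            # Assumption: an absent term (TF_i = 0) is mapped to 0.0,
            # because log(|D| / 0) is undefined.
            row.append(math.log(n_docs / tf) if tf > 0 else 0.0)
        vectors.append(row)
    return vectors


# Toy corpus of three tokenized messages (made up for illustration).
docs = [
    "free money free offer".split(),
    "syntax workshop call for papers".split(),
    "free syntax software".split(),
]
features = tfidf_features(docs, ["free", "syntax"])
```

Here `features[0][0]` equals log(3/2), because "free" occurs twice in the first of three documents, while `features[0][1]` is 0.0 because "syntax" does not occur in that document.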

⁴ The Ling-Spam corpus can be downloaded at http://www.dcs.ex.ac.uk/corpora/

⁵ An archive of the Linguist mailing list is available at http://listserv.linguistlist.org/archives/linguist.html

⁶

Sakkis et al. compared the performance of an adapted k-nearest neighbour classifier called TiMBL [Daelemans et al., 2000] for values of k = 1, 2, 10 with that of Naive Bayesian and Microsoft Outlook patterns on the Ling-Spam corpus [Sakkis et al., 2003; Androutsopoulos et al., 2000]. We repeat their results along with those obtained using FCF in Table 11.5. FCF was configured as follows: αc = 0.5, αa = 0.2, beamwidth = 1, θp = 0, simultaneous concept learning, Laplace evaluation, and FUZZCONRI as specialization model. SPAM recall is the percentage of SPAM documents correctly identified with respect to all SPAM documents, and is thus equivalent to the True Positive measurement. SPAM precision measures the accuracy of a prediction, and is computed as the percentage of correctly identified SPAM documents with respect to all documents identified as SPAM; it is thus equivalent to (1 − False Positive Ratio) × 100%. Classification accuracy measures the percentage of correctly classified documents, where the classification is either SPAM or HAM. The No-Filter classifier classifies all documents as legitimate, and accordingly has zero recall. Its classification accuracy is 83.37%.
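The three measures defined above can be sketched from confusion counts as follows. This is a minimal illustration, not the evaluation code used in the experiments. The function name `spam_metrics` is hypothetical, and the split of 481 SPAM and 2412 legitimate messages is an assumption: it is the commonly reported composition of Ling-Spam, consistent with the 2893 messages and roughly 16% SPAM mentioned earlier, but the split itself is not stated in the text.

```python
def spam_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, precision) in percent from confusion counts.

    tp: SPAM flagged as SPAM    fp: HAM flagged as SPAM
    tn: HAM passed as HAM       fn: SPAM passed as HAM
    Recall or precision is None when its denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else None
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else None
    return accuracy, recall, precision


# No-Filter baseline: every message is passed as legitimate.
# Assumption: 481 SPAM and 2412 legitimate messages (not stated in the text).
n_spam, n_ham = 481, 2412
accuracy, recall, precision = spam_metrics(tp=0, fp=0, tn=n_ham, fn=n_spam)
print(round(accuracy, 2), recall, precision)  # prints: 83.37 0.0 None
```

Under the assumed split, the sketch gives the accuracy and recall listed in the No Filter row of Table 11.5. Precision comes out as `None` because no message is flagged as SPAM, so the ratio is 0/0.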

FCF outperformed all the other methods with respect to classification accuracy. It obtained a classification accuracy of 98.17%, while the second best classifier, Naive Bayesian, obtained a classification accuracy of 96.93%. FCF also significantly outperformed all other classifiers on SPAM recall. It obtained 92.95% recall, while the three next best classifiers, TiMBL(1), TiMBL(2), and Naive Bayesian, obtained recall percentages of 85.27%, 83.19%, and 82.35%, respectively. FCF was thus much more successful at identifying SPAM messages than any of the other learners. FCF’s SPAM precision was slightly worse, but still comparable to that of the other classifiers. In order of best recall, the precisions of FCF, TiMBL(1), TiMBL(2), and Naive Bayesian were 95.42%, 95.92%, 97.10%, and 99.02%, respectively. We provide the rule set induced by FCF during one fold of the 10-fold cross validation in Appendix E.