Dimensionality reduction for the improvement of anti-spam filters


DOCTORAL THESIS

DIMENSIONALITY REDUCTION FOR THE IMPROVEMENT OF ANTI-SPAM FILTERS

IÑAKI VÉLEZ DE MENDIZABAL GONZÁLEZ | Dimensionality reduction for the improvement of anti-spam filters

IÑAKI VÉLEZ DE MENDIZABAL GONZÁLEZ | Arrasate-Mondragón, 2022


THESIS

Dimensionality reduction for the improvement of anti-spam filters

Author:

Iñaki VÉLEZ DE MENDIZABAL GONZÁLEZ

Supervisors:

PhD. Urko ZURUTUZA ORTEGA

PhD. Enaitz EZPELETA GALLASTEGI

PhD. Vitor BASTO FERNANDES

PhD Program in Applied Engineering

Electronics and Computing Department

Faculty of Engineering

Mondragon Unibertsitatea

Arrasate


"To all my family and friends."

Thank you very much

To Maribel, Maialen and Unai.

Iñaki


Acknowledgments

I suspected that journeys like this one are not made alone, and now I am completely sure of it.

This work is the fruit of all of you who have been around me. Thank you to those who shed a little light inside the tunnel, thank you to those who helped me carry the weight of this backpack and, of course, heartfelt thanks to those who kept pushing me forward.

Among them, especially Urko Zurutuza and Enaitz Ezpeleta, who have been by my side the whole time. Optimistic in bad moments, brave in moments of fear, firm in moments of doubt and friends at all times. It has been wonderful to walk this path with you; or rather, behind you, since you have always been a couple of steps ahead of me. You cannot imagine the peace of mind that having supervisors like you brings. Thank you so much, Vitor Manuel Basto Fernandes, for your help.

Thank you for receiving me in Lisbon and for helping me in ISCTE-IUL.

Thanks also to J. R. Méndez, Moncho. My guardian angel. A companion for so many years on this road that seemed never to end. To the whole SING group of the Universidade de Vigo in Ourense, thank you.

Thank you to all my colleagues at Mondragon Goi Eskola Politeknikoa for giving me the opportunity and the means to make this journey. A big hug, in particular, to everyone in the Data Analysis and Cybersecurity group. Now it is my turn to return the favour. To Antton Rodriguez, who made the experiments of this research possible, and to Ekhi Zugasti, Xabier Vidriales and Iban Barrutia, who helped me over the final metres, my heartfelt thanks as well. Friends at ISTAR ISCTE-IUL, I will always carry you with me.

And to those closest to me: father and mother, wherever you are, it is a pity that I could not finish sooner to celebrate with you. Maribel, Maialen and Unai, thank you for your patience, for your support and for helping with this thing that you still do not understand.

Without all of you this would have been impossible.


Originality Statement

I declare that I am the sole author of this work. This is a true copy of the original document, including any revision which may have been ordered by my examiners. I understand that my work may be available to the public, either in the school library or in electronic format.

Iñaki Vélez de Mendizabal González

Arrasate, July 2022


Abstract

Nowadays, spam represents more than 45% of the world’s email traffic. Filtering techniques to combat the problem of spam distribution have been the subject of many research studies in recent years. Several combinations of legal, administrative and technical perspectives have been tested. The combination of technical approaches, namely the widely exploited content-based and token-based filtering techniques, has yielded improvements of low significance in spam classification performance. Due to the limited performance of token-based strategies, new knowledge representation schemes (such as those based on word embeddings, topics, or synsets) have been developed. The use of synsets to represent the meaning of words guides the community towards the identification of the intentionality of a message, allowing the classification of messages that want to sell products, obtain information about us, etc. The advantage of this kind of synset representation lies in its capability to group concepts taxonomically, handling polysemy and synonymy. These properties have been successfully exploited in this research work to design novel Machine Learning (ML) based lossless feature reduction schemes that rely on concept grouping strategies. This type of reduction scheme has achieved a reduction in the dimensionality of the classification problem (number of features) while improving the classification performance. In a second step we introduce and demonstrate the effectiveness of a new feature reduction scheme that combines the strengths of lossless and lossy strategies. Finally, in order to make use of words obfuscated with Leetspeak, a decoder has been designed and tested. The proposed system considerably reduces the number of unprocessed words, improving the classification rates of spam messages.

Keywords:

Spam, Synset-based representation, Semantic information, Multi-objective evolutionary algorithms, Leetspeak, deobfuscation


Laburpena

Nowadays, spam messages account for 45% of global email traffic. In recent years, a great deal of research has been carried out on techniques to solve the spam problem. Different solutions have been tested, mixing legal, administrative and technical aspects. From a technical point of view, content-based and token-based techniques have achieved only small improvements. To improve on the results obtained by the latter, new ways of representing the information inside messages have been developed (vector representations, topics or synsets). Using the meanings of words leads us towards working out the intention with which a message was written, classifying it as a message that wants to sell products, as a message that wants to obtain information, and so on. These new methods of representing information are able to group concepts, analysing taxonomically words with different meanings and words with the same meaning. Building on these properties, this research work has developed a system that reduces the number of features without loss of information and that relies on Machine Learning to group concepts. As a result, the dimension (size) of the problem has been reduced while improving message classification performance. In addition, a second system that builds on the advantages of this work has been developed, in which the robustness of the lossless system is combined with a small loss of information.

Finally, a decoder has been developed to recover the information in words encoded in Leetspeak. Thanks to the word information recovered by this decoder, classification results improve.


Resumen

Currently, spam represents about 45% of worldwide email traffic. In recent years, filtering techniques to combat spam have been the subject of countless studies. Different solutions have been tested combining legal, administrative and technical aspects. From a technical point of view, the combination of token-based and content-based filtering techniques has brought insignificant improvements in spam classification rates. Because of the limited improvements achieved with these strategies, new knowledge representation schemes have been developed (such as vector representations, topics or synsets). Using synsets to represent the meaning of words guides us towards identifying the intentionality of a message, making it possible to classify messages as ones that want to sell products, obtain information about us, etc. The advantage of this type of representation lies in its ability to group concepts taxonomically, resolving polysemy and synonymy.

These properties have been used successfully in this research work to design a new lossless feature reduction scheme that groups concepts using Machine Learning techniques.

Thanks to this reduction scheme, the dimensionality of the classification problem (number of features) has been reduced, improving performance.

In a second step, we present and demonstrate the effectiveness of a new feature reduction scheme that combines the strengths of the lossless strategy with a slight loss of information. Finally, to recover the information in words encrypted with Leetspeak, a decoder has been designed and tested. The system presented considerably reduces the number of encrypted (obfuscated) words that are left unprocessed, improving the classification rates of spam messages.


List of Figures

2.1 Disambiguation problem for the word "bank".

2.2 Genetic algorithm iteration.

2.3 Examples of images attached to spam messages. These images are part of publicly available "Image Spam Dataset".

4.1 Experimental protocol.

4.2 Multiple line chart representing solutions.

4.3 TCR benchmarking results.

4.4 3D Pareto front.

4.5 Performance comparative of different feature-reduction schemes.

4.6 Batting average using tokens and synsets.

5.1 Low-loss approach experimental protocol.

5.2 Pareto front of low-loss scheme.

5.3 Pareto front of lossless scheme.

5.4 Low-loss approach solutions sorted by DIMr.

5.5 Lossless approach solutions sorted by DIMr.

5.6 Top five configurations achieved by low-loss and lossless reduction schemes.

6.1 Images that are part of the set of the character A.

6.2 Training and validation accuracy and loss.

6.3 Experimental protocol.

6.4 CNN confusion matrix.

6.5 Experimental protocol achieved accuracy.

6.6 Experimental protocol achieved f-score.


List of Tables

2.1 Simple count feature vector representation.

2.2 AOI manufacturing process variable generalization step 0.

2.5 Values for parts 2, 3, 4, and 5 are stored in a cluster.

2.6 New cluster is generated for parts 8 and 9.

2.7 Final state of the parts manufacturing table.

2.8 Clusters generated final state of the parts manufacturing table.

2.9 Word "viagra" written in Leetspeak.

2.10 Examples of possible Leetspeak substitutions.

3.1 Bahgat et al. feature reduction proposal, based on the same synonyms.

3.2 Feature reduction matching top level synsets of Wordnet by Mendez et al.

4.1 BabelNet synset based tokenized dataset (D) feature vector.

4.2 Transformation and reduction of dataset D applying the chromosome C = {1, 0, 2, 1, 0}.

4.3 Time required in days to complete the experiment.

4.4 Public corpora with ham/spam texts.

4.5 Geometric distance in relation with the value of γ.

4.6 Tokens and synsets classification results without feature selection.

4.7 Token based classification results applying Information Gain for feature selection.

5.1 Minimum Euclidean distance depending on the gamma value.

5.2 Top 10 synset marked for removal, Information Gain (IG) and meaning.

5.3 Top 10 synsets that are maintained and not removed, IG and meaning.

5.4 Part Of Speech (POS) analysis of results achieved by the low-loss approach.

6.1 Obfuscated characters examples.

6.2 CNN layer details for obfuscated character recognition.

6.3 Precision and recall values for YouTube Comments Dataset.

6.4 Precision and recall values for email datasets.


Acronyms

ANN Artificial Neural Network

AOI Attribute Oriented Induction

CNN Convolutional Neural Network

DIMr Dimensionality Reduction ratio

DL Deep Learning

FP False Positive

FPr False Positive ratio

FN False Negative

FNr False Negative ratio

GA Genetic Algorithm

IG Information Gain

IM Instant Messaging

IP Internet Protocol

IR Information Retrieval

ML Machine Learning

MOEA Multi-Objective Evolutionary Algorithms

NLP Natural Language Processing

NSGA-II Non-Dominated Sorting Genetic Algorithm II

OCR Optical Character Recognition

OSN Online Social Network

PCA Principal Component Analysis

POS Part Of Speech

SMS Short Message Service

SVM Support Vector Machines

TCR Total Cost Ratio

TREC Text Retrieval Conference

UCI University of California, Irvine

URL Uniform Resource Locator

WSD Word Sense Disambiguation

XAI Explainable Artificial Intelligence


Chapter 1

Introduction

"Spam is an irrelevant or unsolicited message sent typically to a large number of users, for the purposes of advertising, phishing, spreading malware, etc." - Oxford Dictionaries

This chapter describes the problem that this research addresses. The main objective of the research is described, as well as the technical objectives that guide the achievement of the research statements. It also presents the starting hypothesis, the work developed to accomplish the objectives and, finally, the contributions of this work, enumerating the publications that have been produced.

To better understand the problem that needs to be solved, it is necessary to take into account that in just a few years the Internet has changed the way people communicate, get information and do business, transforming economic and social interactions and relations. The number of connected users has grown from 1.1 billion in 2005 to 4.95 billion in 2022¹, especially as a result of the use of smartphones with an Internet connection. Gartner reported that smartphone sales surpassed² sales of feature phones for the first time in 2013. Many of these users use the Internet legitimately and take advantage of its benefits. However, other users, such as spammers, use the Internet for their own benefit, delivering their content through instant messaging services [17,92], email [19,96] and social networks [19,113].

Many technologies such as collaborative solutions [100], content-based schemes [8, 52, 68, 109] and even network standards, like Request for Comments 6376³ or 7208⁴, have been developed to combat spam, but it has not been completely eliminated. The European Parliament has addressed the problem from a legal point of view with the European Directive⁵ on Privacy and Electronic Communications, but this has not solved the problem either.

¹ Available at https://www.statista.com/statistics/617136/digital-population-worldwide/

² Available at https://www.gartner.com/en/newsroom/press-releases/2014-02-13-gartner-says-annual-smartphone-sales-surpassed-sales-of-feature-phones-for-the-first-time-in-2013

³ Available at https://tools.ietf.org/html/rfc6376

⁴ Available at https://tools.ietf.org/html/rfc7208

In the work of Bhuiyan et al. [14], a review of anti-spam systems, the authors describe their evolution from the most primitive ones, which consist of a simple filter that identifies the sender’s address in order to block it, through the first content filters, which discarded messages that had a specific word in the subject, up to the most recent ones, based on ML and Natural Language Processing (NLP), on which this research work focuses.

ML and NLP techniques have been widely used to classify spam messages and have demonstrated high classification performance [25]. In [74] a graphical summary of the different techniques used is presented, showing the results obtained in terms of accuracy, false positive and false negative ratios. In order to be processed by these ML algorithms, texts have to be represented in a specific way. This form of representation is key to reducing the computational requirements of running the classifier as well as to achieving high classification performance. In [74] we can observe that the classification ratio is more than 70% in all cases, revealing that these techniques have high performance.

Despite all these efforts, spam is still present in about 45% of the messages on the Internet⁶. Spam has invaded not only email but also social media, which is used by many users to keep in touch with their friends and family, by companies to engage potential customers, and in many other cases. With a large number of users ready to access the content shared or posted by their contacts, it is not surprising that social media is a usual target for spammers. Moreover, spam has changed from being simply inconvenient to becoming a cyber threat. Spam messages may include malicious Uniform Resource Locators (URLs) that can redirect the user to malware download pages or phishing sites.

1.1 Research statement

This research work aims to address the spam problem from a novel approach. Currently, most anti-spam filters are based on statistical classifiers that compute the probability that a message is spam based on the words it contains. To make this estimation, the classifier has to be previously trained with a set of labelled legitimate and spam messages.

This thesis has the goal of extracting the meaning of the messages received and reacting upon it. Using word meanings and semantic dictionaries ("free" means something different in "free people" and in "free drugs"), this work improves the identification of message types (ham or spam).

⁵ Available at https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A32002L0058

⁶ Available at https://securelist.com/spam-and-phishing-in-q3-2021/104741/

The use of the meaning of the words guides the community towards the identification of the intentionality of a message, or of the author of the message, allowing the classification of messages that want to sell products, messages that want to obtain information about us, etc. In order to work in this direction, we first dissect the texts making use of semantic dictionaries, disambiguation and grouping (text generalization) processes, to classify messages with reference to a set of defined categories.

1.1.1 Research objectives and hypothesis

This thesis has one research hypothesis and three objectives, which are described next:

Hypothesis: The semantic processing of messages towards intentionality detection, by means of semantic dictionaries, allows the improvement of spam message classification performance.

Objective 1: Perform a reduction in the number of tokens in a message by means of semantic generalization and message meaning simplification.

As an example, it will make it possible to join the tokens "Viagra", "Cialis" and "Tadalafil" into a single token such as "anti_impotence_drug". This will speed up the training process and reduce the memory requirements for classification, increasing the performance.

Objective 2: Identification and the reduction of irrelevant tokens to improve the classifier performance and computing time.

The aim is to focus on the core of the message, which conveys its intention. Stop words (such as "at", "in", "the") are words that do not provide information to separate legitimate messages from spam messages. The same may be true of some meanings, in which case they could also be removed.

Objective 3: Identification and decoding of obfuscated words, to correct message content and enhance syntactic and semantic analysis.

Obfuscated words such as "g00d" do not exist in semantic dictionaries and consequently must be deobfuscated before searching for their meaning.
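The three objectives can be illustrated together with a small sketch. It is only a toy under stated assumptions: the substitution table `LEET`, the `HYPERNYM` mapping, the `STOP_WORDS` list and the `VOCABULARY` set are hand-written, hypothetical stand-ins for the semantic dictionary and the decoder described later in this thesis, and the lookup-based `deobfuscate` function is a deliberately simple replacement for those components.

```python
from collections import Counter
from itertools import product

# All tables below are hypothetical stand-ins written for this example.
LEET = {"0": "o", "1": "il", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
HYPERNYM = {
    "viagra": "anti_impotence_drug",
    "cialis": "anti_impotence_drug",
    "tadalafil": "anti_impotence_drug",
}
STOP_WORDS = {"at", "in", "the"}
VOCABULARY = set(HYPERNYM) | STOP_WORDS | {"good", "offer"}


def deobfuscate(token):
    """Objective 3: undo character substitutions until a known word appears."""
    token = token.lower()
    # Each character maps to one or more candidate letters ("1" may be i or l).
    options = [LEET.get(ch, ch) for ch in token]
    for candidate in product(*options):
        word = "".join(candidate)
        if word in VOCABULARY:
            return word
    return token  # unknown tokens are left untouched


def reduce_message(tokens):
    """Objectives 1 and 2: drop stop words and merge related tokens."""
    counts = Counter()
    for token in tokens:
        word = deobfuscate(token)
        if word in STOP_WORDS:
            continue
        counts[HYPERNYM.get(word, word)] += 1
    return counts


tokens = ["V1agr4", "Cialis", "Tadalafil", "offer", "at", "the", "g00d"]
reduced = reduce_message(tokens)
print(dict(reduced))  # {'anti_impotence_drug': 3, 'offer': 1, 'good': 1}
```

In this toy run, seven input tokens collapse into three features. Note that enumerating every substitution grows exponentially with the number of ambiguous characters, so the approach is only reasonable for short tokens.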


1.2 Contributions

The main contributions of this thesis and the corresponding scientific publications are described as follows:

A system that classifies messages using an approach based on the semantics of words has been developed. The meaning of each word is contextualized using the Babelfy [78] services, and then semantic relations are extracted and analyzed using the BabelNet [80,83] semantic network.

The time and computational resources for the classifier training phase have been reduced (the number of tokens has been reduced by a factor of more than four with very close classification results). This has been achieved by reducing the number of redundant words by grouping taxonomically related words. Moreover, from a theoretical point of view, when combining highly dependent features, intermediate features are reduced, which increases the performance of techniques such as Naïve Bayes.

Clustering words with similar meanings leads to the identification of the topic of a message. This provides the background for intentionality identification in terms of the subject of the message. It has been demonstrated that texts about "anti-impotence drugs" could be connected in the same group, enabling the identification of the subject of various spam-related messages.

The experiments performed allow us to conclude that the use of a synset⁷ representation reduces the number of False Positive (FP) errors while leading to a slight increase in False Negative (FN) errors (see the results in Chapter 4). This shows that ham contents, usually without obfuscation and spelling errors, allow a higher number of words to be successfully translated into synsets and a better ranking in this type of instance. In contrast, most of the words included in spam contents (with many misspelled words, obfuscated tokens or URLs) cannot be successfully represented in synset-based representations, resulting in lower information collection for this type of text, as well as lower classification performance.

Three different feature vector element reduction (dimensionality reduction) strategies have been formulated and tested to be used when texts are represented using synsets. The first one involves an information lossless strategy, the second one involves a low loss of information and the third one involves loss of information.

⁷ A synset is a group of data elements that are considered semantically equivalent for the purposes of information retrieval.


The first two strategies have been experimentally tested and the third one has only been developed and analyzed on a theoretical level. The results obtained allow us to conclude that lossless feature reduction schemes can be successfully complemented with low loss approaches for the identification and removal of irrelevant or noisy features, in order to reduce the computational costs of classification.

A system based on Convolutional Neural Networks (CNNs) has been developed to identify and decode Leetspeak obfuscated characters, allowing the recovery of tokens that were previously unusable for classification purposes. At the same time, and in order to carry out this research work, several datasets have been developed (an image database to train the CNN and four datasets for evaluating Leetspeak decoding processes) and are now publicly available.

1.3 Publications

The following is a list of national and international conferences and journals in which sections of this thesis have been published.

JCR journals

de Mendizabal, I. V., Vidriales, X., Basto-Fernandes, V., Ezpeleta, E., Mendez, J. R., Zurutuza, U. (2022). Deobfuscating Leetspeak with Deep Learning to Improve Spam Filtering. International Journal of Interactive Multimedia and Artificial Intelligence. [In Review]

de Mendizabal, I. V., Basto-Fernandes, V., Ezpeleta, E., Mendez, J. R., Gómez-Meire, S., Zurutuza, U. (2022). Multiobjective Evolutionary Optimization for Dimensionality Reduction of Texts Represented by Synsets. Knowledge and Information Systems. [In Review]

de Mendizabal, I. V., Basto-Fernandes, V., Ezpeleta, E., Mendez, J. R., Zurutuza, U. (2020). SDRS: A new lossless dimensionality reduction for text corpora. Information Processing & Management, 57(4), 102249. [In Press]

Conference papers

de Mendizabal, I. V., Ezpeleta, E., Zurutuza, U. (2021). Reducción de dimensionalidad sin pérdida en representaciones semánticas de texto. JNIC 2021. VI Jornadas Nacionales de Investigación en Ciberseguridad. Ciudad Real, Spain. 9-10 June. [In Press]


de Mendizabal, I. V., Ezpeleta, E., Ortega, U. Z., Ordás, D. R. (2018). La intención hace el agravio: técnicas de clustering conceptual para la generalización y especialización de intencionalidades en el spear phishing. In Actas de las Cuartas Jornadas Nacionales de Investigación en Ciberseguridad (pp. 41-42). Mondragon Unibertsitatea. [In Press]

1.4 Document Structure

The rest of this thesis document is organised as follows:

First, in Chapter 2, the technical background of thesis-related topics is described to introduce the reader to the concepts and terminology used throughout the document.

Next, in Chapter 3, the state of the art in spam filtering is presented.

In Chapter 4, the first contribution is presented: a new proposal for reducing the number of features by clustering words/tokens. Next, a new proposal for feature reduction with low loss of information, resulting from the elimination of low-significance words/tokens, is presented in Chapter 5. In Chapter 6, the problem of text obfuscation in spam messages is discussed, as well as its effect on syntactic and semantic text processing and classification. A system for solving this problem using neural network-based computer vision is presented.

Finally, Chapter 7 synthesises the contributions of this thesis and describes possible future work.


Chapter 2

Technical Background

The aim of this chapter is to present the technical background in the spam filtering domain and to introduce the reader to the core concepts and terminology used throughout the document. The chapter consists of four sections. The first section introduces the types of spam and the threats they represent. The second section introduces spam filtering techniques, feature extraction and representation. The third section addresses the feature dimensionality problem and dimensionality reduction strategies. As a consequence of this reduction, the most relevant words, which can help to determine the intentionality of the message, are identified and used for classification purposes. Finally, the problems resulting from obfuscated words embedded in text messages are described.

2.1 Spam

One of the earliest definitions of e-mail spam is provided by the Text Retrieval Conference (TREC)¹. Spam was defined as: "Unsolicited and unwanted e-mail that was sent indiscriminately, directly or indirectly, by a sender who has no current relationship with the recipient".

This definition is still valid today, but needs to be adapted to new media, new platforms and Online Social Networks (OSNs). Nowadays spam can arrive not only by email but also through social media, Short Message Service (SMS), Instant Messaging (IM) and other new platforms. The terms "unsolicited" and "unwanted" in the TREC definition are applicable to all of these spam messages.

Unfortunately, spam distribution is currently a serious problem that spreads through a wide variety of channels and services. Some services commonly used to distribute junk content are web 2.0 applications (which gave rise to the concept of spam 2.0) [19], search engines [20, 40] (Webspam), email [86], SMS [49,104] and IM applications [92].

¹ Available at https://trec.nist.gov

There are some manual spam-fighting techniques that work well and require little effort from a system administrator. Examples of these manual techniques are whitelisting and blacklisting [67]. It is very easy for an administrator to establish the directive that everything that comes from a specific Internet Protocol (IP) address is spam (blacklist) and everything that comes from other specific addresses is ham, that is, legitimate (whitelist). However, in this chapter we will not explore these manual spam filtering systems, because they have maintenance costs and are not robust against spam techniques that use dynamic origins/sources of spam.

A variety of automatic techniques to combat the problem of spam are also available. Some of them are used to filter spam in more than one environment, such as SMS messages, Twitter messages, e-mail or IM text messages.

Other spam filtering systems are more specific, such as Webspam detectors. However, most of them are based on the same operating principles and use the basics of ML.

In order to distinguish legitimate messages from spam, automatic classifiers need to "learn" the features of spam messages and the features of legitimate messages. This phase is called training, and it is at this stage that the classifier configures itself to distinguish legitimate messages from spam messages. To understand how this works, consider that some of the messages that a user receives contain the word "viagra". The presence of the word "viagra" in a text significantly increases the probability that it is a spam message. If it is also combined with the words "offer" or "cheap", it is probably a spam message. However, the appearance of the word "at" does not provide relevant information, since it appears indistinctly in any type of message. The selection of the word "viagra" as a relevant word and the exclusion of the word "at" is a very important part of the learning process and is known as "feature engineering". The following section describes this process in detail.
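The contrast between "viagra" and "at" can be made concrete with a minimal sketch. It assumes a tiny invented training set, equal class priors and add-one smoothing; the `train` and `spamminess` functions are hypothetical helpers written for this illustration and do not correspond to any specific classifier evaluated in this thesis.

```python
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each word."""
    doc_counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        totals[label] += 1
        doc_counts[label].update(set(text.lower().split()))
    return doc_counts, totals


def spamminess(word, doc_counts, totals):
    """Estimate P(spam | word) with add-one smoothing and equal priors."""
    p_word_spam = (doc_counts["spam"][word] + 1) / (totals["spam"] + 2)
    p_word_ham = (doc_counts["ham"][word] + 1) / (totals["ham"] + 2)
    return p_word_spam / (p_word_spam + p_word_ham)


# Toy training set, invented for illustration only.
training = [
    ("cheap viagra offer at our shop", "spam"),
    ("viagra at half price", "spam"),
    ("meeting at noon", "ham"),
    ("see you at the office", "ham"),
]
counts, totals = train(training)
print(spamminess("viagra", counts, totals))  # 0.75
print(spamminess("at", counts, totals))      # 0.5
```

A value near 0.5 means the word does not help to separate the two classes, which is why a word such as "at" is a natural candidate for exclusion.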

2.2 Feature Selection

Based on the properties of the language in which the messages to be classified are written, it is necessary to discriminate between the words that can be used to improve the classification and the words that will not contribute to it. The identification of the words that must be considered message features has to be done before the classifier training process. A review of the literature has shown that there are statistical indicators to measure the quality of words for distinguishing between spam and ham text messages. The techniques that make it possible to process text and extract information from messages are known as NLP and have been widely used in the field of anti-spam [45]. The classification ability is strongly related to the characteristics or words used to identify each class. A set of text preparation and text processing steps needs to be carried out before the feature extraction takes place. These steps are presented next.

1. Pre-processing

Pre-processing comprises the following tasks:

Removing punctuation marks

In real life, texts include punctuation marks, which make sentences easier for the reader to understand but have little or no value for machine processing purposes. Punctuation marks such as colons and semicolons, as well as special symbols, emojis, etc., are not helpful in automatic classification. When working in NLP, it has been demonstrated [98] that the cleaning and elimination of these characters is a mandatory step.

Removing Stop Words

Text messages often contain words that are not useful in the classification process because they appear equally in spam and legitimate messages. To reduce the classifier complexity and processing time, it is convenient to remove this kind of word [47]. The classifier can also do this itself, usually by using a dictionary/list or by detecting words that do not add value to the classification process and relegating them as not relevant, but this may take quite some time. Stop Words are usually removed from texts at this step. Words like "in", "the" and "a" can be removed before the tokenization process, resulting in a reduction in the number of tokens. Taking as an example the sentence "This is a very long text", we can see that after removing the stop words the sentence is reduced to "long text" while keeping its meaning.

Stemming

In the stemming process, the different forms of a word are converted into a single recognized form, avoiding duplicated concepts and the problem of handling one concept as several different words. As an example, a stemmer would reduce the three words "cleans", "cleaning" and "cleaned" to the word "clean".


Once the first phase of pre-processing has been completed and the text is free of punctuation marks and Stop Words, the text is converted into input elements for the tokenisation phase.
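The three pre-processing tasks can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions and not the pipeline used in this work: the stop-word list STOP_WORDS and the suffix rules in naive_stem are hypothetical and deliberately tiny, and only ASCII punctuation is removed (emojis and other symbols would need additional handling).

```python
import string

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"this", "is", "a", "very", "in", "the"}
# Hypothetical suffix rules for a toy stemmer; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")


def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(words):
    """Keep only the words that are not in the stop-word list."""
    return [word for word in words if word not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, remove punctuation, remove stop words and stem."""
    words = remove_punctuation(text.lower()).split()
    return [naive_stem(word) for word in remove_stop_words(words)]


print(preprocess("This is a very long text."))
print([naive_stem(w) for w in ("cleans", "cleaning", "cleaned")])
```

With these assumptions, the example sentence is reduced to the words "long" and "text", and the three forms of "clean" collapse into a single stem.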

2. Tokenization

Tokenization [50] splits a piece of text into individual words based on a certain delimiter (such as a blank space) for further text processing. These elements are represented in a feature vector, like the one shown in Table 2.1. Tokens can be formed by a single word, by two words, by three words or even more. Continuing with the previous example, we could have two unigrams ["long", "text"], one bigram ["long text"], or a combination of unigrams and bigrams such as ["long", "text", "long text"].
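A minimal sketch of delimiter-based tokenization with n-grams is given below. The function names tokenize and ngrams and the parameter sizes are hypothetical and chosen only for this illustration.

```python
def ngrams(words, n):
    """Return every run of n consecutive words joined by a blank space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tokenize(text, sizes=(1,), delimiter=" "):
    """Split text on the delimiter and build tokens of the requested sizes."""
    words = [word for word in text.split(delimiter) if word]
    tokens = []
    for n in sizes:
        tokens.extend(ngrams(words, n))
    return tokens


print(tokenize("long text"))                # unigrams
print(tokenize("long text", sizes=(2,)))    # bigrams
print(tokenize("long text", sizes=(1, 2)))  # unigrams and bigrams
```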

3. Representation

The processed texts must be represented in a specific format that can be read by the classification algorithm. As a general rule, this information can be stored in any format as long as a data transformer can convert it into the form required by the classifier. However, in order to visualize how the algorithm works, the data must be represented in a way that makes sense to humans. The two representation methods most commonly used in text classification are described next.

Simple count

This is the simplest way to represent text. In this case the feature vector (one row per message) represents the tokens contained in a message and how many times each of them appears in it. The collection of feature vectors forms a matrix in which the rows correspond to the messages and the columns to the tokens. The intersection between a row and a column shows the number of occurrences of the token in the message. The representation in Table 2.1 shows the occurrences of each word in the three messages below. If a token appears twice in a message, this is indicated by the number 2 in the corresponding cell. If the token is not contained in the message, it is shown with a 0 in the corresponding cell.

Message 1: The text has few words and is very easy to understand.

Message 2: The text is written in English.

Message 3: It is a fragment of a book.


           the  text  has  few  words  and  is  very  easy  to  understand  written  in  english  it  a  fragment  of  book
Message 1    1     1    1    1      1    1   1     1     1   1           1        0   0        0   0  0         0   0     0
Message 2    1     1    0    0      0    0   1     0     0   0           0        1   1        1   0  0         0   0     0
Message 3    0     0    0    0      0    0   1     0     0   0           0        0   0        0   1  2         1   1     1

Table 2.1: Simple count feature vector representation.
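The simple count representation of the three example messages can be reproduced with the following sketch. The helper names tokens and count_matrix are hypothetical; the vocabulary is built directly from the messages in order of first appearance, so every word that occurs in them gets a column.

```python
import string
from collections import Counter

messages = [
    "The text has few words and is very easy to understand.",
    "The text is written in English.",
    "It is a fragment of a book.",
]


def tokens(message):
    """Lower-case the message, drop ASCII punctuation and split on blanks."""
    cleaned = message.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def count_matrix(messages):
    """Return the vocabulary and one row of token counts per message."""
    tokenised = [tokens(message) for message in messages]
    vocabulary = []
    for message_tokens in tokenised:
        for token in message_tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    rows = []
    for message_tokens in tokenised:
        counts = Counter(message_tokens)
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_matrix(messages)
print(vocabulary)
for row in rows:
    print(row)
```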

Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical indicator that expresses how relevant a word is for identifying a message type (ham or spam) within a collection of messages (the corpus or dataset).

The value increases when a term appears more often in a message, because this makes it possible to associate the word with the message and, in this way, with the class ham or spam. This indicator helps to identify the words that appear more often in spam or in legitimate messages. A word may appear many times in a message but, being a very common word, also appear in many other messages, in which case its value for identifying the class (ham or spam) of a message is reduced. TF-IDF thus handles the problem of words that are very common to all messages. It provides a better measure of the quality of a term than the raw number of occurrences, which makes TF-IDF one of the most widely used metrics to show the quality of terms. Term Frequency is mathematically represented in equation 2.1 and Inverse Document Frequency in equation 2.2.

$$TF(t, D) = \frac{f_{t,D}}{\sum_{t' \in D} f_{t',D}} \tag{2.1}$$

In this equation, $f_{t,D}$ is the frequency of the term $t$ in the document $D$ and $\sum_{t' \in D} f_{t',D}$ is the total number of terms in document $D$.

$$IDF(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2.2}$$

In the case of Inverse Document Frequency, $D$ denotes the collection of documents, $N$ is the total number of documents in the collection and $|\{d \in D : t \in d\}|$ is the number of documents that contain the term $t$.
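Equations 2.1 and 2.2 can be sketched as follows. Several choices here are assumptions made for the illustration: documents are lists of tokens, the logarithm is the natural logarithm (the base is not fixed above), the final score is the usual product TF x IDF, and a term that appears in no document gets an IDF of 0 to avoid a division by zero.

```python
import math


def term_frequency(term, document):
    """Equation 2.1: occurrences of the term over the number of terms."""
    return document.count(term) / len(document)


def inverse_document_frequency(term, corpus):
    """Equation 2.2: log of total documents over documents with the term."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return math.log(len(corpus) / containing)


def tf_idf(term, document, corpus):
    """Usual TF-IDF score: the product of both indicators."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


corpus = [
    "the text has few words and is very easy to understand".split(),
    "the text is written in english".split(),
    "it is a fragment of a book".split(),
]
print(tf_idf("is", corpus[2], corpus))    # "is" occurs in every message
print(tf_idf("book", corpus[2], corpus))  # "book" occurs in one message only
```

The word "is" appears in all three messages, so its IDF (and therefore its TF-IDF) is zero, while "book" receives a positive score in the third message.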

4. Feature selection

Feature selection is the most important step before classification. Once the stop words have been removed, tokenization is completed and the terms are represented in a feature vector, it is necessary to select the terms to be used by the classifier.

The size of the feature set has an impact on the speed at which the classifier learns. The more information needs to be processed, the more time is spent in the learning or training phase. On the other hand, if there are few terms, the training phase will be very short, but the classification may not be good. This section describes some metrics that can help to select the best terms and discard those that do not contribute to the classification.

Information Gain (IG)

Before describing the Information Gain calculation, it is necessary to introduce the concept of entropy, which is mathematically represented by equation 2.3.

$$\mathrm{entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \tag{2.3}$$

Where $S$ is a set of events, $i$ is an event with $i \in [1, n]$ and $p_i$ is the probability of event $i$; here an event is a word/term appearing in a message. As an example, equation 2.4 shows the calculation of the entropy of obtaining a specific number on a six-sided die.

$$\mathrm{entropy}(S) = -6 \left( \frac{1}{6} \log_2 \frac{1}{6} \right) \approx 2.58 \tag{2.4}$$

Entropy is a metric of uncertainty or disorder, and is used to help to decide which feature should be selected in the next step. In general, an attribute that helps to discriminate more objects has a tendency to reduce entropy, so it should be used in the next division of the dataset.
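The entropy of equation 2.3 and the die example of equation 2.4 can be checked with a short sketch. The function name entropy and its argument (a list of probabilities) are choices made for this illustration.

```python
import math


def entropy(probabilities):
    """Equation 2.3: sum of -p * log2(p) over all events with p > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_die = [1 / 6] * 6
print(round(entropy(fair_die), 2))  # six equally likely outcomes
print(entropy([0.5, 0.5]))          # a fair coin
```

A fair die gives log2(6), approximately 2.58 bits, while an event that is certain has an entropy of zero, which matches the reading of entropy as a measure of uncertainty.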

Once the entropy concept has been described, the next step is to describe the Information Gain, which is mathematically represented by equation 2.5.

$$\mathrm{InfoGain}(S, F) = \mathrm{entropy}(S) - \sum_{v \in V(F)} \frac{|S_v|}{|S|} \, \mathrm{entropy}(S_v) \tag{2.5}$$

Where $S$ is a set of objects, $F$ is a feature of those objects, $V(F)$ is the set of values that the feature can take and $S_v$ is the subset of $S$ for which $F$ takes the value $v$. This expression shows that the higher the Information Gain of a feature, the higher its discriminative power. Thus, in order to identify the features that best divide the messages into the two classes (ham and spam), the elements with the highest Information Gain are selected.
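As an illustration of equation 2.5, the sketch below computes the Information Gain of a binary feature (whether a message contains a given word) over four labelled messages. The data, the feature names and the function names are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, using their relative frequencies."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Equation 2.5: entropy(S) minus the weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["spam", "spam", "ham", "ham"]
contains_free = [1, 1, 0, 0]  # hypothetical feature that separates the classes
contains_the = [1, 0, 1, 0]   # hypothetical feature unrelated to the class

print(information_gain(contains_free, labels))
print(information_gain(contains_the, labels))
```

The first feature splits the messages perfectly and obtains the maximum gain of 1 bit, whereas the second one leaves the uncertainty unchanged and obtains a gain of 0.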


Gini index

The Gini index helps to select features based on the degree of purity of each feature in relation to the class. Purity measures how well a feature discriminates between different classes. For a selected feature, the Gini index is calculated as shown in equation 2.6.

$$GI(t_i) = \sum_{j=1}^{m} p(t_i|C_j)^2 \, p(C_j|t_i)^2 \tag{2.6}$$

Where $m$ is the number of classes, $p(t_i|C_j)$ is the probability of the term $t_i$ given the class $C_j$ and $p(C_j|t_i)$ is the probability of the class $C_j$ given the term $t_i$.
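Equation 2.6 can be sketched as follows, under the assumption that both probabilities are estimated as fractions of messages: p(t|C) as the share of messages of class C that contain the term, and p(C|t) as the share of messages containing the term that belong to class C. The messages, labels and function name are hypothetical.

```python
def gini_index(term, messages, labels):
    """Equation 2.6 with probabilities estimated as message fractions."""
    with_term = [label for tokens, label in zip(messages, labels) if term in tokens]
    if not with_term:
        return 0.0  # assumption: a term that never occurs has no purity
    score = 0.0
    for cls in sorted(set(labels)):
        in_class = [tokens for tokens, label in zip(messages, labels) if label == cls]
        p_term_given_class = sum(term in tokens for tokens in in_class) / len(in_class)
        p_class_given_term = with_term.count(cls) / len(with_term)
        score += p_term_given_class ** 2 * p_class_given_term ** 2
    return score


messages = [
    {"free", "prize"},
    {"free", "offer"},
    {"meeting", "today"},
    {"free", "lunch", "today"},
]
labels = ["spam", "spam", "ham", "ham"]

for term in ("today", "free", "prize"):
    print(term, gini_index(term, messages, labels))
```

In this toy data "today" appears in every ham message and in no spam message, so it reaches the maximum value of 1, while "free" is spread over both classes and scores lower.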

Chi-Square

The Chi-Square indicator is used to determine whether or not two variables are independent. When two variables are independent, they are not related, so neither depends on the other. In this case, we can measure the lack of independence between a word (w) and a class C. The test verifies whether the frequencies observed in each category are compatible with independence between both variables. To assess this, the values that would indicate absolute independence, known as "expected frequencies", are calculated and compared with the frequencies observed in the sample. The calculation of the lack of independence between a feature or characteristic (c) and class i can be seen in equation 2.7.

$$\chi_i^2 = \frac{n \, F(c)^2 \, (p_i(c) - P_i)^2}{F(c) \, (1 - F(c)) \, P_i \, (1 - P_i)} \tag{2.7}$$

In this equation $n$ is the total number of messages in the dataset, $p_i(c)$ is the probability of class $i$ for messages that contain the feature $c$, $P_i$ is the global fraction of messages belonging to class $i$ and $F(c)$ is the global fraction of messages that contain the feature $c$. This measure returns the normalized value of $\chi_i^2$ and makes it possible to identify the relevant features for the different classes.
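A direct transcription of equation 2.7 is sketched below. The ten labelled messages are hypothetical, and returning 0 when the feature or the class covers none or all of the messages is an assumption made to avoid a division by zero.

```python
def chi_square(feature, target_class, messages, labels):
    """Equation 2.7 for one feature and one class."""
    n = len(messages)
    has_feature = [feature in tokens for tokens in messages]
    F = sum(has_feature) / n             # fraction of messages with the feature
    P = labels.count(target_class) / n   # fraction of messages in the class
    if F in (0.0, 1.0) or P in (0.0, 1.0):
        return 0.0  # assumption: no evidence of dependence in degenerate cases
    in_both = sum(
        1 for present, label in zip(has_feature, labels)
        if present and label == target_class
    )
    p = in_both / sum(has_feature)       # class probability given the feature
    return n * F ** 2 * (p - P) ** 2 / (F * (1 - F) * P * (1 - P))


# Hypothetical dataset: "free" occurs in 4 of 5 spam and 1 of 5 ham messages.
messages = [{"free"}] * 4 + [{"free"}] + [{"hello"}] + [{"hello"}] * 4
labels = ["spam"] * 4 + ["ham"] + ["spam"] + ["ham"] * 4

print(chi_square("free", "spam", messages, labels))
```

For this dataset the result is approximately 3.6, the same value given by the classical chi-square statistic of the corresponding 2x2 contingency table.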

With the features filtered, cleaned and represented, and with their contribution to the classification process calculated, it is time for feature selection. The objective is to use features that improve the quality of the classification results of a machine learning process compared with supplying the raw data. The selection of those features, which do not always improve the classification process despite their high value, is known as feature engineering.

2.3 How to distinguish legitimate messages from spam messages

From the first email sent by Ray Tomlinson in 1971 [94,95] to the arrival of social networks and Instant Messaging, email has been the most common communication tool on the Internet and has consequently become the one most affected² by the problem of spam, with spam shares close to 45%. A variety of systems have been developed to try to filter unwanted messages, some of them based on manual techniques, as can be seen in Chapter 2 of the book by Gordon V. Cormack et al. [24], and others based entirely on ML techniques [91].

Today’s spam filters can be divided into two categories [67]: (i) non-ML based systems [24] and (ii) ML based systems [91]. This research work addresses only ML based techniques.

2.3.1 Machine Learning Approaches

ML is one of the better performing approaches to spam filtering. ML techniques have the ability to learn what identifies spam emails by parsing many spam messages from a large collection of previously collected emails. Classifiers developed with these techniques can adapt to changing conditions, as they generate their own rules based on what they have learned. ML techniques can be divided into two classes [63], one based on "supervised learning" [70] and the other based on "unsupervised learning" [46].

In supervised learning models, algorithms work with previously classified and labelled datasets, looking for a function that assigns the input data to the correct class. The algorithm must infer this classification function, so input data called the training set is used to create it; the resulting function will be able to predict, with greater or lesser accuracy, the appropriate output class for new input data. Several types of ML based classifiers have been used in spam filtering. Some of the most popular are:

Decision trees

A decision tree is a technique used to represent the relationship between the elements of a dataset based on a series of conditions. For the classification of spam messages, Boosting algorithms are usually used, which work by sequentially fitting simple models that predict only slightly better than random chance; each new model uses the information from the previous one, improving iteration by iteration. There are studies that claim that Boosting methods outperform Naïve Bayes classifiers on specific email corpora [99]. In the case of spam filtering, the most widely used Boosting algorithm has been AdaBoost [42].

² Available at https://securelist.com/spam-and-phishing-in-2021/105713/

Naïve Bayes classifiers

Naïve Bayes is one of the best known algorithms for text classification and is also regarded as a linear classifier. In the case of spam filtering, this type of probabilistic classifier based on Bayes’ theorem was one of the first proposals and has been widely analyzed in different works [32] [9].

Support Vector Machines (SVM)

SVMs are a set of supervised learning algorithms developed by Vladimir Vapnik [105] and his team. These algorithms can be used for classification as well as regression tasks. An SVM is a model that separates the points to be classified into two regions by means of a separating hyperplane, chosen so that the margin between the hyperplane and the closest points of each class is as wide as possible; these closest points are called the support vectors. SVMs are very useful for text categorization problems, and are used for spam email filtering [64].

Unsupervised learning algorithms are used when there is no labelled data for training. Therefore, only the structure of the data can be described in order to find some type of organization that simplifies the analysis. Clustering tasks try to make groups based on similarities, but there is no guarantee that these groups have any useful meaning. Some of the most commonly used algorithms for unsupervised learning are the following:

Clustering algorithms

Clustering [114] is the grouping of data into groups of similar items. Representing the data as a series of clusters involves a loss of detail, but achieves simplification. From a practical point of view, clustering plays a very important role in data mining applications such as information retrieval and text mining, and is a technique used in this thesis. Given a suitable representation, most clustering algorithms can separate spam from legitimate email in a dataset, as demonstrated by Whisshell et al. [111].

Principal Component Analysis (PCA)

PCA [35] is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

2.4 Conceptual clustering

The Attribute Oriented Induction (AOI) algorithm is a hierarchical clustering algorithm based on the generalization concept. It was first used by Jiawei Han, Yandong Cai, and Nick Cercone in 1992 as a method for knowledge discovery in databases [54].

The AOI algorithm follows an iterative process in which each attribute or variable of the system has its own hierarchical concept tree. When a variable moves from one level of generalisation to another, that generalisation is applied to all the data in the set. This step is called concept tree ascension [23].

The use of more general elements makes it possible to find clusters that otherwise could not be generated. To illustrate the concept of generalization, an example with the variable values of a manufacturing process is proposed. These variables are shown in Table 2.2: the first column identifies the part produced, the second indicates whether the part is defective (0) or good (1), the third is the pressure applied to the part, the fourth shows the r.p.m. of the pressure pump and the last one is the temperature of the fluid. In Table 2.2 all the parts have been produced with different values.

Id  Ok  Pressure  RPM  Temperature
 1   0      0       0          125
 2   1      0.50    0          126
 3   1      0.72    0          127
 4   1      3.25    0          133
 5   1      3.75    0          137
 6   1      4.50  100          147
 7   1      5.00  300          137
 8   1      5.00  390          147
 9   1      3.25  380          141

Table 2.2: AOI manufacturing process variable generalization step 0.

In the first iteration of the algorithm, the values of the temperature variable are generalised. Values below 130 are generalised to the value "low", values between 131 and 140 to the value "medium" and values above 140 to the value "high". The result of these changes can be seen in Table 2.3.

Id  Ok  Pressure  RPM  Temperature
1   0   0         0    low
2   1   0.50      0    low
3   1   0.72      0    low
4   1   3.25      0    medium
5   1   3.75      0    medium
6   1   4.50      100  high
7   1   5.00      300  medium
8   1   5.00      390  high
9   1   3.25      380  high

Table 2.3

In the second iteration of the algorithm, the values of the pressure variable are generalised. Values between 0 and 1.00 are generalised to the value "low", values between 1.01 and 2.99 to the value "medium" and, finally, values above 3.00 to the value "high". The result of these changes can be seen in Table 2.4.

Id  Ok  Pressure  RPM  Temperature
1   0   low       0    low
2   1   low       0    low
3   1   low       0    low
4   1   high      0    medium
5   1   high      0    medium
6   1   high      100  high
7   1   high      300  medium
8   1   high      390  high
9   1   high      380  high

Table 2.4

At this stage of generalization, after several iterations of the AOI algorithm, it is possible to generate clusters. In this example, the data for parts 2 and 3 are the same and the data for parts 4 and 5 are the same. These identical rows are used to create two clusters, and the application of the algorithm continues with the data of Table 2.5.

Id  Ok  Pressure  RPM  Temperature
1   0   low       0    low
6   1   high      100  high
7   1   high      300  medium
8   1   high      390  high
9   1   high      380  high

Table 2.5: Values for parts 2, 3, 4 and 5 are stored in clusters.

For the next iteration of the AOI algorithm, additional data needs to be generalised, so we work with the column of revolutions per minute of the pump. In this case values below 100 RPM are generalised to "low", values from 100 to 299 RPM to "medium", values from 300 to 349 RPM to "high" and values of 350 RPM or more to "very high". In this way there are two other parts with the same manufacturing conditions, parts 8 and 9, which will be grouped into a cluster. The results of this step can be seen in Table 2.6.

Id  Ok  Pressure  RPM        Temperature
1   0   low       low        low
6   1   high      medium     high
7   1   high      high       medium
8   1   high      very high  high
9   1   high      very high  high

Table 2.6: New cluster is generated for parts 8 and 9.

The iteration of the algorithm can be executed as many times as required and the generalization of the data can be as broad as necessary. It would even be possible to group all available data into a single cluster under very general conditions. However, applying generalisations that group two or more tuples is enough for this example.

The stopping condition of the algorithm is determined by the number of clusters to be generated. For this example, having reached three clusters, it has been decided to stop the execution of the algorithm. This leaves some data out of the clusters, in particular the rows shown in Table 2.7.

Id  Ok  Pressure  RPM     Temperature
1   0   low       low     low
6   1   high      medium  high
7   1   high      high    medium

Table 2.7: Final state of the parts manufacturing table.

The application of the AOI algorithm to the manufactured parts in the example makes it possible to identify the three clusters of data shown in Table 2.8. These clusters make it possible to group manufacturing conditions that were initially completely different from each other.

Cluster  Ok  Pressure  RPM        Temperature
1        1   low       0          low
2        1   high      0          medium
3        1   high      very high  high

Table 2.8: Clusters generated in the final state of the parts manufacturing table.
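The worked example can be reproduced with the following simplified sketch. It is not a full AOI implementation: the concept trees are reduced to one threshold function per attribute, tuples are clustered as soon as two or more of them become identical, and the handling of the exact boundary values (for instance, 100 RPM being treated as "medium") is an assumption taken from the tables of the example. All function and variable names are hypothetical.

```python
from collections import defaultdict

# Rows of Table 2.2: id -> (ok, pressure, rpm, temperature)
PARTS = {
    1: (0, 0.00, 0, 125),
    2: (1, 0.50, 0, 126),
    3: (1, 0.72, 0, 127),
    4: (1, 3.25, 0, 133),
    5: (1, 3.75, 0, 137),
    6: (1, 4.50, 100, 147),
    7: (1, 5.00, 300, 137),
    8: (1, 5.00, 390, 147),
    9: (1, 3.25, 380, 141),
}


def temperature_level(value):
    if value < 130:
        return "low"
    return "medium" if value <= 140 else "high"


def pressure_level(value):
    if value <= 1.00:
        return "low"
    return "medium" if value < 3.00 else "high"


def rpm_level(value):
    if value < 100:
        return "low"
    if value < 300:
        return "medium"
    return "high" if value < 350 else "very high"


# (column index, generalisation function), in the order used in the example.
STEPS = [(3, temperature_level), (1, pressure_level), (2, rpm_level)]


def aoi(parts, steps, target_clusters):
    """Generalise one attribute per iteration and cluster identical tuples."""
    remaining = dict(parts)
    clusters = []
    for column, generalise in steps:
        # Concept tree ascension: generalise the attribute for all remaining tuples.
        remaining = {
            part_id: row[:column] + (generalise(row[column]),) + row[column + 1:]
            for part_id, row in remaining.items()
        }
        # Tuples that have become identical are moved into a new cluster.
        groups = defaultdict(list)
        for part_id, row in remaining.items():
            groups[row].append(part_id)
        for row, ids in groups.items():
            if len(ids) >= 2:
                clusters.append((row, ids))
                for part_id in ids:
                    del remaining[part_id]
        if len(clusters) >= target_clusters:
            break  # stopping condition: requested number of clusters reached
    return clusters, remaining


clusters, leftover = aoi(PARTS, STEPS, target_clusters=3)
for row, ids in clusters:
    print(ids, row)
print(leftover)
```

With the thresholds above, the sketch groups parts 2 and 3, parts 4 and 5, and parts 8 and 9, and leaves parts 1, 6 and 7 outside the clusters.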

Figure 2.1: Disambiguation problem for the word "bank".

2.5 Disambiguation

The process of semantic disambiguation of words, also known as Word Sense Disambiguation (WSD), consists of identifying the particular meaning of a polysemous word [79] within a sentence or a context. WSD is necessary to improve tasks such as machine translation [18], syntactic analysis [4] or Information Retrieval (IR) [118], and in our case it also helps to improve the classification of messages. The assignment of the correct meaning to each word within a context is complex and has been studied in conjunction with NLP, so there are many works on this topic, such as the article by Ide et al. [58], chapters in generic NLP books such as Chapter 7 of the book Foundations of Statistical Natural Language Processing [73], books focused on the state of the art such as Navigli et al. [79] and specific books and works [3,103,13].

In the sentence shown in Figure 2.1 it can be seen that the word "bank" has multiple meanings. It is necessary to disambiguate this word within its context because, depending on the words that follow "bank", its meaning can be different. If it is followed by "blood" it will have one meaning, if it is followed by "to sit down" it will have another, and if it is followed by "to take out money" it will have yet another. The ability to distinguish between the different meanings is key to improving the filtering and classification of messages as spam or ham.

Currently, in NLP tasks that require disambiguation, the trend is to use databases that connect synsets (groups of meaning) with each other through semantic relations represented in the form of graphs, such as WordNet³ or BabelNet⁴, as can be seen in the works [72], [82], [81]. The techniques used by these databases explore the existing relationships between the senses of a word in a specific context.

2.6 Optimization with Genetic Algorithms

The principles of the Genetic Algorithm (GA) were established by Holland [55] in 1975. GAs are adaptive methods used to solve search or optimization problems by imitating the genetic selection process of living organisms. GAs start working with a population of candidate solutions (called individuals), each one representing a possible solution to the problem. GAs are able to scan the solution space and evaluate how valid each proposed solution is by measuring how close it is to the goal.

With GAs it is mandatory to have a genetic representation of the solution domain. Each individual is assigned a value that indicates how good it is at solving the problem; this value is computed by a fitness function, which evaluates the solution domain and identifies the best individuals. These top individuals will be the ones crossed with other individuals, generating offspring that inherit some of the characteristics of their parents. In this way, a new population of possible solutions is produced, which replaces the previous one as long as it has better characteristics to solve the problem.

In GAs, the possible solutions to the problem are represented as a set of parameters, known as genes, which are grouped together in a string to form the chromosome. These chromosomes are selected in pairs to generate two offspring, which are other possible solutions. Each offspring carries a combination of parts of its parents’ chromosomes. A mutation operation can also act on the offspring, which can randomly generate "super-individuals" that are closer to the optimal solution. Figure 2.2 shows the process of combining the genes of two parents and mutating one of the children.

Genetic algorithms are widely and successfully used in complex optimization processes, including spam detection works [62,93,12]. It is also common that the optimization process is performed to improve more than one simultaneous objective, known as multi-objective optimization [90, 56]. It is in these cases where genetic algorithms can explore different combinations of the solution space in a targeted way using multiple fitness functions.
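The selection, crossover and mutation cycle described above can be sketched with a toy problem. Everything specific in this example is an assumption made for illustration: the chromosome is a list of bits, the fitness function simply counts the genes set to 1, the best half of the population acts as parents, crossover uses a single cut point, and the best individual is copied unchanged to the next generation (elitism). It is not the optimization set-up used in this thesis.

```python
import random


def fitness(chromosome):
    """Toy fitness function: number of genes set to 1."""
    return sum(chromosome)


def crossover(mother, father, rng):
    """Single-point crossover producing two offspring."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:], father[:point] + mother[point:]


def mutate(chromosome, rate, rng):
    """Flip each gene independently with the given probability."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


def run_ga(n_genes=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)
    ]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # the top individuals reproduce
        offspring = [population[0]]            # elitism: keep the best one
        while len(offspring) < pop_size:
            mother, father = rng.sample(parents, 2)
            for child in crossover(mother, father, rng):
                offspring.append(mutate(child, mutation_rate, rng))
        population = offspring[:pop_size]
    return max(population, key=fitness), history


best, history = run_ga()
print(fitness(best), history[0], history[-1])
```

Because the best individual always survives, the best fitness recorded in history never decreases from one generation to the next.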

³ Available at https://wordnet.princeton.edu/

⁴ Available at https://babelnet.org


Figure 2.2: Genetic algorithm iteration.

Figure 2.3: Examples of images attached to spam messages. These images are part of publicly available "Image Spam Dataset".

2.7 Text obfuscation

In the fight against spam, spammers are constantly looking for new ways to evade anti-spam filters. One way to prevent content filters from analyzing the text of messages consists of embedding the text inside images [15]. This trick became very popular in 2006 and 2007 [10,15] and is based on the fact that anti-spam filters cannot analyze the content of images, whereas humans can easily read and understand it. Figure 2.3 shows several images containing embedded text, which cannot be processed by content-based classifiers [6] but are easily identifiable as spam by anyone.

In order to detect this type of spam, some researchers have used Optical Character Recognition (OCR) techniques [22], but they are quite computationally demanding and vulnerable to image artefacts [44], even when post-correction is applied [60]. Moreover, Fumera et al. published in their work [43] "Spam filtering based on the analysis of text information embedded into images" an empirical experiment demonstrating that text can easily be hidden from the OCR system, making spam filters unable to identify this type of spam message. In addition, spammers also add noise [36], as in the image on the right-hand side of Figure 2.3. Recent techniques such as CAPTCHA [16] distort images in order to make them impossible to interpret by an automatic text recognition system. However, we may be close to Artificial Neural Network (ANN)-based systems being able to recognize the texts [110,34] embedded in these captchas.

viagra   viagra  viagra   viagra   viagra   viagra
\/iagra  v1agra  vi4gra   via6ra   viag12a  viagr/\
|/iagra  v¡agra  vi/\gra  via(_-a  viag/2a  viagr4

Table 2.9: Word "viagra" written in Leetspeak.

Another important obfuscation technique is Leetspeak slang, also known as leet, leet text or 1337. This type of writing has been used since 1980 [34,41] and consists of replacing some characters with symbols that are visually similar to them, so that a human can still read the text without any lexical, syntactical or semantic loss. This type of encoding has two effects: (i) it prevents the classifier from identifying, tokenising and processing the word, and (ii) it produces a Bayesian poisoning attack [116] by inserting random, and apparently harmless, words into the texts of spam messages, causing the spam message to be incorrectly classified. Table 2.9 shows 12 Leetspeak representations of the word "viagra". Each column of Table 2.9 shows possible substitutions of a single character in the word.

As can be seen in Table 2.9, Leetspeak exploits the similarity of certain characters to a punctuation mark (or a combination of them) to substitute the characters and prevent the spam filter from correctly processing the word and classifying the message. This character substitution causes misrecognition and misrepresentation of the word that contains the character during the classification phase, which allows spammers to bypass filters. In Leetspeak, a character can in some cases be replaced by a single symbol, as in the case of "i", which can be replaced by "¡", and in other cases by a combination of several symbols, as in the case of "A", which can be written with two ("/\") or even three ("/-\") symbols. Since Leetspeak does not use a limited and predefined set of characters, it is not possible to enumerate all possible word transformations in a dictionary. Table 2.10 shows examples of possible substitutions used when writing texts in Leetspeak.


Character  Substitutions
a          4, /\, /-\, |-\
b          8, |3, ß, ]3, ]8, |8, !3
c          (, <, [, c, ç, {
d          [), |), |], [>, ])
e          3, [-, £, e
f          =, /=, |#
g          6, (_+, (_-
h          #, /-/, [-], ]-[, )-(, (-), :-:, |∼|, |-|, ]∼[, }{
i          1, !, |, ], :
j          ¿, _/, _), 7
k          |{, |<, |(, }<
l          |_, []_, [_, 1_
m          |\/|, /\/\, (\/), [\/], //\\//\\, /^^\
n          |\|, /\/, [\], (\), //\\//
o          0, (), [], ø
p          |^o, |?
q          "(_,)", "(),"
r          /2, |2, 12
s          5, $, §
t          +, †
u          |_|, _/, (_), /_/, ]_[
v          \/, |/
w          \/\/, \^/, \_|_/, \_:_/, |/\|, ‘//
x          )(, %, ><
y          ’/
z          7_, 2, >_

Table 2.10: Examples of possible Leetspeak substitutions.


Chapter 3

State of the art

The aim of this chapter is to describe the state of the art in the areas to which this research has contributed. In order to establish the current state of the art, the literature related to each of the topics covered in this thesis has been reviewed, such as topic and concept identification, a subject that has been developed in several previous works, and the deobfuscation of characters, an area in which very few research works have been found.

3.1 Topic and concept identification

ML has experienced an incredible expansion as it has been used to solve a multitude of problems. Thanks to its ability to use past experiences and related information as input data, it is being widely used in problems where there is a large volume of data. For the particular problem of spam filtering, a multitude of binary classifiers are available in popular tools such as Weka [53]. The good evaluation metrics (Precision and Recall) obtained with these classifiers for spam filtering in any kind of environment have been demonstrated in several works and are therefore beyond doubt [109,66,5]. However, these classifiers are highly dependent on the input data, which means that if the input data is not relevant or not clean, the results in terms of Precision and Recall can get worse. A key aspect of applying ML techniques to a dataset is its prior preparation [117], because the performance achieved by the classifiers depends directly on it.

Token-based information mining has provided the basis for what is known as text mining [97]. This has emerged as the way to exploit token information to solve problems such as information retrieval [65] and text classification [59] for instance.

From the first known ML proposal for spam filtering by Paul Graham¹, this technique has evolved to others [77] [87] that have been introduced afterwards and that take

¹ Available at http://paulgraham.com/spam.html
