A case study of Spanish text transformations for twitter sentiment analysis

(1)

ContentslistsavailableatScienceDirect

Expert Systems With Applications

journalhomepage:www.elsevier.com/locate/eswa

A case study of Spanish text transformations for twitter sentiment analysis

Eric S. Tellez

^a^,^b^,^∗

, Sabino Miranda-Jiménez

^a^,^b

, Mario Graff

^a^,^b

, Daniela Moctezuma

^a^,^c

, Oscar S. Siordia

^c^,¹

, Elio A. Villaseñor

^b

aCONACyT Consejo Nacional de Ciencia y Tecnologá, Dirección de Cátedras, México

bINFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación Circuito Tecnopolo Sur 112, Fracc. Tecnopolo Pocitos, CP 20313, Aguascalientes, Ags, México

cCentro de Investigación en Geografía y Geomática “Ing. Jorge L. Tamayo”, A.C. Circuito Tecnopolo Norte No. 117, Col. Tecnopolo Pocitos II, C.P. 20313, Aguascalientes, Ags, México

a rt i c l e i n f o

Article history:

Received 1 June 2016 Revised 11 March 2017 Accepted 30 March 2017 Available online 31 March 2017 Keywords:

Sentiment analysis

Error-robust text representations Opinion mining

a b s t r a c t

Sentimentanalysis isatextmining taskthat determinesthepolarityofagiventext,i.e., itspositive- nessornegativeness. Recently, ithas receivedalot ofattention giventhe interestinopinion mining inmicro-bloggingplatforms.Thesenewformsoftextualexpressionspresentnewchallengestoanalyze textbecauseoftheuseofslang,orthographicandgrammaticalerrors,amongothers.Alongwiththese challenges,apracticalsentimentclassiﬁershouldbeabletohandleeﬃcientlylargeworkloads.

The aimof thisresearch isto identifyin alarge set ofcombinations which text transformations (lemmatization,stemming, entity removal,amongothers), tokenizers (e.g.,wordn-grams), and token- weighting schemes make the most impact onthe accuracy of a classifier(Support Vector Machine) trainedontwo Spanishdatasets. Themethodologyusedistoexhaustivelyanalyzeallcombinationsof texttransformationsand theirrespective parameterstofindoutwhatcommoncharacteristicsthebest performingclassifiers have.Furthermore,weintroduceanovelapproachbasedonthecombinationof word-basedn-gramsandcharacter-basedq-grams.Theresultsshowthatthisnovelcombinationofwords andcharactersproducesaclassifierthatoutperformsthetraditionalword-basedcombinationby11.17%

and5.62%ontheINEGIandTASS’15dataset,respectively.

1. Introduction

In recentyears,the productionof textual documents insocial media hasincreasedexponentially; forinstance,up toApril2016, Twitterhas320million active users,andFacebook has1590mil- lion users.² Insocial media, people share commentsaboutmany disparate topics, i.e., events, persons, and organizations, among others. Thesefactshave hadthe resultof seeing social media as agoldmine ofhumanopinions,andconsequently, thereisanin-

∗ Corresponding author.

E-mail addresses: [email protected] , [email protected] (E.S. Tellez), [email protected] (S. Miranda-Jiménez), [email protected] (M. Graff), [email protected] (D. Moctezuma), [email protected] (O.S. Siordia), [email protected] (E.A. Villaseñor).

1Partially funded by CONACyT project “Sistema inteligente de extracción, análisis y visualización de información en social media” 247356

2http://www.statista.com/statistics/272014/global- social- networks- ranked- by- number- of- users/

creased interest indoing research and business activities around opinionminingandsentimentanalysisﬁelds.

Automaticsentimentanalysisoftextsisoneofthemostimpor- tanttasksintextmining,wherethegoalistodeterminewhethera particulardocumenthaseitherapositive,negativeorneutralopin- ion³.Determiningwhetheratextdocumenthasapositiveornega- tiveopinionisbecomingan essentialtoolforbothpublicandpri- vatecompanies, (Liu,2015; Peng, Zuo,&He, 2008). Giventhat it isa useful tool to knowwhat people think aboutanything; so, it representsamajorsupportfordecision-makingprocesses(forany levelofgovernment,marketing,etc.)(Pang&Lee,2008).

Sentimentanalysis hasbeentraditionally tackledas aclassiﬁ- cation taskwhere two major problems need to be faced. Firstly, oneneedstotransformthetextintoasuitablerepresentation,this isknown astext modeling. Secondly,one needs to decide which classiﬁcationalgorithmtouse;oneofthemostwidelyusedisSup-

3Albeit, there are other variations considering intermediate levels for sentiments, e.g. more positive or less positive

(2)

portVectorMachines(SVM).Thiscontributionfocusontheformer problem,i.e., we are interestedinimproving theclassiﬁcation by ﬁndingasuitabletextrepresentation.

Speciﬁcally, thecontributionofthisresearch istwofold.Firstly, we parametrize our text modelling technique that uses different text transformations (lemmatization, stemming, entity removal,amongothers),tokenizers(e.g.,wordn-grams),andtoken- weightingschemes(Table3containsalltheparametersexplored).

This parametrization is used to exhaustively evaluate the entire configurations space to know those techniques that produce the best SVM classifier on two sentiment analysis corpus written in Spanish.Counterintuitively,wefoundthatthecomplexityoftech- niquesused inthepre-processing stepisnot correlated withthe finalperformance oftheclassifier,e.g., aclassifierusinglemmati- zation,which is one ofthe pre-processingtechniques havingthe greatestcomplexity, mightnot be one of thesystems havingthe highestperformance.

Secondly, we propose a novel approach based on the combi- nationofword-based n-gramsand character-based q-grams. This novel combination of words and characters produces a classiﬁer thatoutperformsthetraditionalword-basedcombinationby11.17%

and5.62% on the INEGI and TASS’15 dataset, respectively. Here- after,we willusen-wordstorefer toword-basedn-grams,andq- gramstocharacter-based q-gramsjustto makeacleardistinction betweenthesetechniques.

The rest of the manuscript is organized as follows.

Section 2 deals with literature review. The text transformations aredescribedinSection3,meanwhiletheparameterssettingsand deﬁnitionoftheproblemarepresentedonSection4.Section5de- scribesourexperimentalresults.Finally,Sections6and7present thediscussion andconclusionsofourresults alongwithpossible directionsforfuturework.

2. Relatedwork

The sentiment analysistask has been widely studying dueto the interest to know the people’s opinions and feelings about something, particularly in social media. This task is commonly tackled in two different ways. The ﬁrst one involves the use of staticresourcesthatsummarizethesenseorsemanticofthetask;

theseknowledgedatabasescontainmostlyaffectivelexicons.These lexicons are created by experts, in psychology or by automated processes, that perform the selection of features (words) along withacorpusoflabeledtextasdoneinGhiassi,Skinner,andZim- bra(2013).Consequently,thetaskissolvedbytryingtodetecthow theaffectivefeaturesareusedinatext,andhowthesefeaturescan beusedtopredictthepolarityofagiventext.

The alternativeapproachstatesthetaskasa textclassification problem. This includes several distinguished parts like the pre- processingstep, the selection of the vectorization and weighting schemes,andalsotheclassifieralgorithm.So,theproblemconsists ofselectingthecorrecttechniquesineachsteptocreatethesen- timent classifier.Under this approach, the idea is to process the text in a way that the classifier can take advantage of the features to solve the problem. Our contribution focus in this later approach;we describethebestwaytopre-process, tokenize,and vectorizethetext,basedonafixedsetoftext-transformationfunc- tions.Forsimplicity,wefixourclassifiertobeSupportVectorMa- chines (SVM).SVM is a classifier that excels inhighdimensional datasetsasisthecaseoftext classification,(Joachims, 1998).This sectionreviewstherelatedliterature.

There are several works in the sentiment analysis literature whichuseseveralrepresentations;suchasdictionaries(Alam,Ryu,

&Lee,2016;Khan,Qamar,&Bashir,2016);textcontentandsocial relationsbetweenusers(Wu,Huang,& Song,2016);relationships betweenmeaningsofawordinacorpus(Razavi,Matwin,Koninck,

&Amini,2014);co-occurrencepatternsofwords(Saif,He,Fernan- dez,&Alani,2016),amongothers.

Focusing on the n-grams technique, a method that considers the local context of the word sequence andthe semantic of the whole sentenceisproposed inCui, Shi,andChen(2015).The lo- calcontextisgeneratedviathe“bag-of-n-words” method,andthe sentence’s sentiment isdetermined based on the individual contribution of each word. The word embedding is learned from a largemonolingualcorpusthroughadeepneuralnetwork,andthe n-wordsfeatures areobtainedfromthewordembedding incom- binationwithaslidingwindowprocedure.

Ahybridapproachthatusesn-gramanalysisforfeatureextrac- tion together witha dynamicartiﬁcial neural network for senti- mentanalysisisproposed inGhiassietal.(2013).Here,adataset over10,000,000oftweets,relatedtoJustinBiebertopic,wasused.

As a result, a Twitter-speciﬁc lexicon witha reduced feature set wasobtained.

TheworkpresentedinHan,Guo,andSchuetze(2013)proposes an approach for sentiment analysiswhich combines a SVM clas- siﬁeranda wide rangeof featureslike bag-of-word(1-words, 2- words)andpart-of-speech(POS)features,etc.,aswellasvotesde- rived fromcharacter n-words language models toachieve the ﬁ- nalresult.Theauthorsconcludedthatlexicalfeatures(1-words,2- words)producethebettercontributions.

In Tripathy, Agrawal, and Rath (2016) differentclassifiers and representationswereappliedtodeterminethesentimentinmovie reviews, taken from internet blogs. The classifiers tested were Naive Bayes, maximum entropy, stochastic gradient descent, and SVM.These algorithmsuse n-words,fornin{1,2,3}andall the combinations.Here,theresultsshowthatthevalueofnincreases the classification accuracy decreases, i.e., using 1-words and 2- words theresultachieved isbetter than using3-words,4-words, and5-words.

Regarding theuseofq-grams;inAisopos, Papadakis,andVar- varigou (2011) a method that captures textual patterns is introduced.Thismethodcreatesagraph,whosenodescorrespondtoq- gramsofadocumentandtheiredgesdenotedtheaveragedistance betweenthem.AcomparativeanalysisondatafromTwitterisper- formedbetweenthreerepresentationmodels: termvectormodel, q-grams,andq-gramsgraphs.Theauthorsfoundthatvectormod- els are faster,butq-grams(especially4-grams) performbetter in termsofclassiﬁcationquality.

With the purpose to attend sentiment analysis in Spanish tweets,a numberof works hasbeen presented in the literature, e.g.severalsizesofn-gramsandsomepolaritylexiconscombined withaSupportVectorMachine(SVM)wasusedinAlmeida(2015). Anotherapproach whichusespolarity lexiconswitha numberof features related to n-words, part-of-speech tag, hashtags, emoti- conandlexiconresourcesisdescribedinAraque,Corcuera,Román, Iglesias,andSánchez-Rada(2015).

Features related to lexiconsand syntactic structures are com- monlyused,forexample,(Alvarez-Lópezetal.,2015;Borda&Sal- adich, 2015; Cámara, Cumbreras, Martín-Valdivia, & López, 2015;

delaVega&Vea-Murguía,2015;Deas,Biran,McKeown,&Rosen- thal,2015).In theother hand,features relatedtowordvectoriza- tion,e.g. Word2Vec andDoc2Vec,are alsousedinseveralworks, such as Díaz-Galiano and Montejo-Ráez (2015); Valverde, Tejada, andCuadros(2015).

Following with the Spanish language, in the most recent TASS (Taller de Análisis de Sentimientos ’16) competition, some works presented still using polarity dictionaries and vectorization approach; such is the case of Casasola Murillo and Marín Raventós (2016), where an adaptation of Turney’s dictionary (Turney,2002) over 5 million of Spanish tweets were gen- erated.Furthermore,(CasasolaMurillo&MarínRaventós,2016)in the stepof vectorization usesn-grams andskip-grams in combi-

(3)

Fig. 1. Generic treatment of input text to obtain the input vectors for the classiﬁer.

nationwiththispolarity dictionary. Quirós, Segura-Bedmar,, and Martınez (2016) proposesthe use of wordembedding with SVM classiﬁer.Despitetheexplosionofwordsusingwordembeddings, the classical word vectorization is still in use, (Montejo-Ráez &

Dıaz-Galiano,2016).

A newapproach is using ensembles ora combination ofsev- eraltechniques andclassiﬁers,e.g.,the workpresented inCerón- Guzmán and de Cali (2016) proposes an ensemble built on the combinationofsystemswiththelowestcorrelationbetweenthem.

HurtadoandPla(2016)presentsanotherensemblemethodwhere the Tweetmotif’s tokenizer, (O’Connor, Krieger, & Ahn, 2010), is used in conjunction with Freeling (Padró & Stanilovsky, 2012a).

ThesetoolscreateavectorspacethatistheinputforanSVMclas- siﬁer.

Itcanbeseenthatoneoftheobjectivesoftherelatedworkis tooptimizethenumberofn-wordsorq-grams(almosttackledas independentapproaches),toincreaseperformance;clearly,thereis not a consensus. Thislack ofagreement motivatesusto perform an extensiveexperimental analysisoftheeffectoftheparameters (includingnandqvalues),andso,wedeterminedthebestparam- etersontheTwitterdatabasesemployed.

3. Textrepresentation

Natural Language Processing (NLP) is a broad and complex area of knowledge having many ways to represent an input text (Giannakopoulos, Mavridi, Paliouras, Papadakis, & Tserpes, 2012; Sammut & Webb, 2011). In this research, we select the widelyusedvectorrepresentationofatextgivenitssimplicityand powerfulrepresentation.Fig.1depictstheprocedureusedtotrans- forma text input intoa vector. There arethree main blocks:the firstonetransformsthetextintoanothertextrepresentation,then the textis tokenized,and,finally,the vectoris calculatedusinga weighting scheme.Theresultingvectorsaretheinputoftheclas- sifier.

Inthefollowingsubsections,wedescribedthetexttransforma- tiontechniquesusedwhichhaveacounterpartinmanylanguages, the proper implementation of them rely heavily on the targeted language, in ourcase studythe Spanish language. The interested readerlookingforsolutionsinaparticularlanguageisencouraged tofollowtherelevantlinguisticliteratureforitsobjectivelanguage, in addition to thegeneral literature in NLP(Bird, Klein, & Loper, 2009;Jurafsky&Martin,2009;Sammut&Webb,2011).

3.1. Texttransformationpipeline

One ofthecontributions ofthismanuscript isto measurethe effects that each different text transformation hason the perfor- manceofaclassiﬁer.Thissubsectiondescribesthetexttransforma- tionsexploredwhereastheparticularparametersofthesetransfor- mationscanbeseeninTable3.

3.1.1. TFIDF(tfidf)

Inthevectorrepresentation,eachword,inthecollection,isas- sociated witha coordinate ina high dimensionalspace. The nu- mericvalue ofeach coordinate issometimes calledtheweightof the word. Here, tf×idf (Term Frequency-Inverse Document Fre- quency) (Baeza-Yates& Ribeiro-Neto, 2011) is used as bootstrap- ping weightingprocedure. Moreprecisely,letD=

{

^D1,D₂,...,D_N

}

bethesetofalldocumentsinthecorpus,and f_wⁱ bethefrequency ofthe word win documentD_i. tfⁱ_w isdeﬁned asthe normalized frequencyofwinD_i

tfⁱ_w= f_wⁱ maxu∈Di

{

^fuⁱ

}

^.

Insomeway,tfdescribestheimportanceofw,locallyinD_i.Onthe otherhand,idfgivesaglobalmeasureoftheimportanceofw;

idf_w=log N

{

^Di

|

^fwⁱ >0

}

^.

The ﬁnal product,tf×idf,tries toﬁnd a balancebetween the local andthe globalimportance of a term. It is common to use variants of tf and idf instead of the original ones, depending in theapplicationdomain(Sammut&Webb,2011).Letv_ibethevec- torofD_i,a weightedmatrixTFIDF ofthe collectionD iscreated by concatenating all individual vectors, insome consistent order.

Using this representation, a number of machine learning methods can be applied; however, theplain transformation oftext to TFIDFposessomeproblems.Ononehand,alldocumentswillcon- tain commonterms having a smallsemantic content such asar- ticlesand determiners,among others.These terms areknown as stopwords.The bad effectsof stopwords are controlled by TFIDF, butmostofthemcanbedirectlyremovedsincetheyareﬁxedfor a given language. On the other hand, after removing stopwords, TFIDF will produce a very high dimensional vector space, O(N) in Twitter, since new terms are commonly introduced (e.g. misspellings,URLs,hashtags).ThiswillrapidlyyieldtotheCurseofDi- mensionality,whichmakeshard tolearnfromexamplessinceany tworandomvectorswillbeorthogonalwithhighprobability.From amorepractical pointofview,ahighdimensionalrepresentation willalsoimposehugememoryrequirements,atthepointofbeing impossibletotrainatypicalimplementationofamachinelearning algorithm(notbeingdesignedtousesparsevectors).

3.1.2. Stopwords(del-sw)

In many languages, like Spanish, there is a set of extremely commonwords such asdeterminers orconjunctions(theorand) whichhelptobuildsentencesbutdonotcarryanymeaningthem- selves.ThesewordsareknownasStopwords,andtheyareremoved fromthe text before any attemptto classify them. Astop list is builtusingthemostfrequenttermsfromahugedocumentcollec- tion.WeusedtheSpanishstoplistincludedinNLTKPythonpack- age(Birdetal.,2009).

3.1.3. Spelling

Twitter messages are full of slang, misspelling, typographical andgrammatical errorsamongothers; however,inthisstudy,we focusonlyonthefollowingtransformations:

Punctuation (del-punc). This parameter considers the use of symbolssuch as questionmark, period,exclamation point, commas,amongotherspellingmarks.

Diacritic(del-diac).TheSpanishlanguageisfullofdiacriticsym- bols, and its wrong usage is one of the main sources of orthographic errors in informal texts. Thus, this parameter considerstheuseorabsenceofdiacriticalmarks.

(4)

Fig. 2. Expansion of colloquial words and abbreviations.

Symbolreduction(del-d1,del-d2).Usually,twittermessagesuse repeated charactersto emphasize parts of the word to at- tractuser’sattention.Thisaspect makesthevocabulary ex- plodes. Thus, we applied two strategies to deal withthese phenomena:theﬁrstone replacestherepeatedsymbolsby one occurrenceofthesymbol, andthesecond onereplaces therepeatedsymbolsbytwo occurrencestokeeptheword emphasizeattheminimallevel.

Case sensitive (lc). Thisparameter considers letters tobe nor- malizedinlowercaseortokeeptheoriginaltext.Theaimis tocutthewordsthatarethesameinuppercaseandlower- case.

3.1.4. Stemming(stem)

Stemming is a heuristic process inInformation Retrieval field that chops off the end ofwords and often includes the removal ofderivationalaffixes. Thistechniqueusesthemorphology ofthe languagecoded ina setofrules; to findout wordstems andre- ducethevocabularycollapsingderivationallyrelatedwords.Inour study,weusetheSnowballStemmerfortheSpanishlanguageim- plementedinNLTKpackage(Birdetal.,2009).

3.1.5. Lemmatization(lem)

LemmatizationprocessisacomplextaskfromNaturalLanguage Processingthat determines thelemma ofa groupof wordforms, i.e., thedictionary form ofa word. Forexample, thewords went andgoesaretheverbconjugationsoftheverbgo;andthesewords aregroupedunderthesamelemmago.Toapplythisprocess,we useFreelingtool (Padró & Stanilovsky,2012b) asSpanishlemma- tizer.All textsarepreparedby theErrorcorrection processbefore applyinglemmatizationtoobtainthebestresultsofpart-of-speech identiﬁcation.

3.1.5.1.Errorcorrection. Freelingisatoolfortext analysis,butthe assumption is that text is well-written. However, language used inTwitter is very informal, withslang, misspellings, new words, creativespelling,URLs,speciﬁcabbreviations,hashtags (whichare especial words for tagging in Twitter messages), and emoticons (whichare shortstrings andsymbolsthatexpressdifferentemo- tions). These problems are treated to prepare and standardize tweets for the lemmatization stage to get the best results. All words in each tweet are checkedto be a valid Spanish word or arereducedaccordingtotherulesforSpanishwordformation.

In general, words ortokens withinvalid duplicated vowels or consonants are reduced to validor standard Spanish words, e.g., (ruiiidoooo→ruido(noise);jajajaaa→jaja;jijijji→jaja).Weused an approach based on Spanish dictionary, a statistical model for commondoubleletters, andheuristic rules forcommoninterjec- tions.Ingeneral,theduplicatedvowelsorconsonantsareremoved fromthetarget word;theresultingwordislooked upina Span- ish dictionary (approximately 550,000 entries)to be validated,it is included in Freeling. For words that are not in the dictionary arereducedatleastwithvalidrules forSpanish wordformation.

Also, colloquialwords andabbreviations are transformed usinga regular expression based on a dictionary of those sort of words, Fig.2 illustrates some rules. The text on the left side of the ar- rowisreplacedby thetext oftherightside.Twittertags suchas usernames,hashtags(topics),URLs,andemoticonsarehandledas specialtagsinourrepresentationtokeepthestructureofthesen- tence.

InFig.3,wecanseethelemmatizedtext afterapplyingFreel- ing. As we mentioned, the text is prepared with the Error cor- rection step (see Fig. 3(a)), then Freeling is applied to normal- ize words. Fig. 3(b) shows Freeling’s output where each token has the original word followed by the slashsymbol andits lexical information. The lexical information can be read as follows;

for instance, the token

orgulloso/AQ0MS0

^(proud) ^stands ^for

adjective as partof speech(AQ), masculine gender (M), andsin- gular number(S); thetoken

querer/VMIP1S0

(to want)stands for lemmatized main verb as part of speech (VM), indicative mood(I), presenttime(P), singularformoftheﬁrst person(1S);

positive_tag/NTE0000

standsfornountagaspartofspeech, andsoon.

Lexicalinformationisusedtoidentifyentities,stopwords,content wordsamongothers,itdependsonthesettingsofthe other parameters. The words are ﬁltered based on heuristic rules that takeintoaccountthelexicalinformationshowninFig.3(b).Finally, lexicalinformationisremovedinordertogetthelemmatizedtext depictedonFig.3(c).

3.1.6. Negation(neg)

Spanish negation markers might change the polarity of the message.Thus,weattachedthenegationcluetothenearestword, similar to the approaches used in Sidorov etal. (2013).A set of rules wasdesignedfor commonSpanish negationstructures that involvenegation markers, namely, no(not), nunca, jamás(never), andsin(without).Therulesareprocessedinorder,and,whenone ofthemmatches,theremainingrulesarediscarded.Wehavetwo sortsofrules;itdependsontheinputtext.Ifthetextisnotparsed byFreeling,afewrules(regularexpressions)areappliedtonegate thenearest wordtothenegation markerusingonlytheinforma- tion onthe text, e.g.,avoiding pronouns andarticles. Thesecond approachusesasetofﬁne-grainedrulestotakeadvantageofthe lexicalinformation,approximately50ruleswere designedconsid- eringthenegationmarkers.Thenegationmarkerisattachedtothe closestwordtothenegationmarker.Thesetofnegationrulesare available⁴.

Intheboxbelow,Pattern1andPattern2areexamplesofnega- tion rules (regularexpressions).A ruleconsists oftwo parts: the left sideofthe arrowrepresentsthetext tobe matched,andthe rightsideofthearrowisthestructuretobereplaced.Allrulesare basedon alinguisticmotivation takingintoaccount lexicalinformation.

Forexample,inthesentenceElcochenoesnibonitoniespacioso (Thecarisneithernicenorspacious),thenegationmarkernoisat- tachedtoits twoadjectivesno_bonito(notnice) andno_espacioso (notspacious),asitisshowedinPattern1,thenegationmarkeris attachedtogroup 3(\3) andgroup 4(\4) thatstand foradjective positionsbecauseofthe coordinatingconjunctionni.Thenumber ofgroupisidentiﬁedbyparenthesisintherulefromleft toright.

Negationmarkersareattachedtocontentwords(nouns,verbs,ad- jectives, andsome adverbs), e.g., ‘no seguir’(do not follow) is replaced by ‘no_seguir’, ‘no esbueno’ (itis not good) is replaced by

‘esno_bueno’,‘sincomida’(withoutfood)isreplacedby‘no_comida’. Fig.4exempliﬁesapairofthesenegationrules.

3.1.7. Emoticon(emo)

In the case of emotions, we classify more than 500 popular emoticons,includingtext emoticons,andthewhole setofemoti- cons(closeto1600)deﬁnedbyUnicode(2016)intothreeclasses:

positive,negativeorneutral,whicharereplacedbyapolarityword ordeﬁnitionassociatedtotheemoticonaccordingtotheUnicode standard.Theemoticonsconsideredaspositivearereplacedbythe

4http://ws.ingeotec.mx/ ^∼sadit/

(5)

Fig. 3. A step-by-step lemmatization of a tweet.

Fig. 4. An example of negation rules.

Table 1

An excerpt of the mapping table from Emoticons to its polarity words.

:) :D :P → positive

:( :-( :’( → negative

:-| U_U -.- → neutral

emoticonwithoutpolarity → unicode-text

wordpositive,negativeemoticons arereplaced bythewordnega- tive,neutralemotionsarereplacedbythewordneutral.Emoticons thatdonothaveapolarity, orareambiguous,arereplacedbythe associatedUnicodetext.Table1showsanexcerptofthedictionary thatmapsemoticonstotheircorrespondingpolarityclass.

3.1.8. Entity(del-ent)

We consider entities to be proper names, hashtags, urls or nicknames. However, nicknames (see usrparameter, Table3) is a particular feature in Twitter messages; thus, user names is an- otherparametertoseetheeffectontheclassificationsystem.User names,urlsandnumbers(seeurl,numparameters,Table3)could be groupedunderanespecialgeneric name.Entitiessuch asuser names and hashtags are identified directly by its corresponding especial symbol @ and#, andproper names are identifiedusing Freeling,thelexicalinformationusedtoidentifyapropernameis

“NP0000”.

3.1.9. Word-basedn-grams(n-words)

N-words are widely used in many NLP tasks, and they have also been used in sentiment analysis by Cui et al. (2015); Sidorov et al. (2013). N-words are word sequences. To compute the n-words, the text is tokenized and n-word are calculated from tokens. NLTK Tokenizer is used to identiﬁed word tokens. For example, let T=

‘‘the lights and shadows of your future’’

, its 1-words (unigrams) are each word alone, and its 2-words (bigrams) set are the sequences of two words, the set (W₂^T), and so on. For example, let W₂^T=

{

^the ^lights, ^lights ând, ând ^shadows, ^shadows ôf, ôf ^your,

your future

}

, then, given a text of m words, we obtain a set withatmostm−n+1elements. Generally,n-words areusedup to 2 or 3-words because it is uncommon to ﬁnd good matches of word sequences greater than three or four words (Jurafsky &

Martin,2009).

3.1.10. Character-basedq-grams(q-grams)

Inadditiontothetraditionallyn-wordsrepresentation,werep- resenttheresultingtextasq-grams.Aq-gramsisanagnosticlan- guagetransformationthatconsistsinrepresentingadocumentby allitssubstringoflengthq.Forexample,letT=abra_cadabra,its 3-gramssetare

Q₃^T=

{

âbr, ^bra, ^ra_, â_c, ^_ca, âca, ^cad, âda, ^dab

}

^,

(6)

Fig. 5. Examples of text representation.

so,giventext ofsize m characters, we obtaina set withatmost m−q+1elements.Noticethat thistransformationhandlewhite- spacesaspartofthetext.Sincetherewillbe q-gramsconnecting words,insomesense,applyingq-gramstotheentiretextcancap- turepartofthesyntacticinformationinthesentence.Therationale ofq-gramsistotacklemisspelledsentencesfromtheapproximate patternmatchingperspective (Navarro &Raﬃnot,2002), whereit isusedforeﬃcientsearchingoftextwithsomedegreeoferror.

A more elaborated example shows why the q-gram transformation is more robust to variations of the text. Let T= I_like_vanilla and T=I_lik3_vanila, clearly, both texts are differentand a plain algorithm will simply associate a low sim- ilaritybetween both texts.However, afterextracting its 3-grams, theresultingobjectsaremoresimilar:

Q₃^T=

{

Î_l, ^_li, ^lik, îke, ^ke_, ê_v, ^_va, ^van, âni, ^nil,

ill, lla

}

Q₃^T =

{

Î_l, ^_li, ^lik, îk3, ^k3_, ^3_v, ^_va, ^van, âni,

nil, ila

}

Just to ﬁxideas, let these two setsto be compared using the Jaccard’scoeﬃcientassimilarity,i.e.

|

^Q3^T∩Q₃^T

|

^Q3^T∪Q₃^T

|

⁼⁰^.⁴⁴⁸^.

Thesesetsaremoresimilarthantheonesresultingfromtheorig- inaltextsplitaswords

|{

^I, ^like, ^vanilla

}

^∩

{

^I, ^lik3, ^vanila

}|

|{

^I, ^like, ^vanilla

}

^∪

{

^I, ^lik3, ^vanila

}|

⁼⁰^.²

Theassumptionisthatamachinelearningalgorithmknowinghow toclassifyTwilldoabetterjobclassifyingTusingq-gramsthana plainrepresentation. Thisfact is used to createa robust method against misspelled words and other deliberated modiﬁcations to thetext.

3.2.Examplesoftexttransformationstage

Inordertoillustratethetexttransformationpipeline,weshow theexamples inFig.5(a) and(b).In Fig.5(a)we can seethe re- sulting text representation for a conﬁguration for words on IN- EGIbechmark,i.e., theparameters used totransform theoriginal textintothenewrepresentationarestemming(stem),reducedre- peatedsymbolsuptoonesymbol(del-d1),theremovalofdiacritic

Table 2

Datasets details from each competition tested in this work.

Benchmark Classes Total

Name Part Positive Neutral Negative None

INEGI Train 2908 986 1110 409 5413

Test 26,911 8868 9571 3361 48,711 54,124 TASS’15 Train 2884 670 2182 1482 7218

Test 22,233 1305 15,844 21,416 60,798 68,016

(del-diac),andcoarseningusers(usr),andnegations(neg).Theﬁnal textrepresentationisbasedon1-word.

Theotherexample,Fig.5(b),isaconﬁgurationforcharacter4- gram representationon the samebenchmark usingthe following parameters:the removal of diacritic(del-diac), coarsening emoticons(emo),coarseningusers(usr),changingwordsintolowercase (lc),negations(neg),andTFIDFisusedtoweightthetokens,ithas notextrepresentation.Theﬁnalrepresentationisbasedoncharac- ter4-grams,andtheunderscoresymbolisusedasspacecharacter toseparatewordsanditispartofthetokeninwhichitappears.

4. Benchmarksandparametersettings

Atthispoint,weareinthepositiontoanalyzetheperformance of described text representations on sentiment analysis benchmarks.Inparticular,wetestourrepresentationsinthetaskofde- terminingthe globalpolarity —fourpolarity levels:positive,neutral, negative, and none (no sentiment)— of each tweet in two benchmarks.

Table2 describesourbenchmarks. The INEGI benchmarkcon- sists ontweets geo-referencedto Mexico; the datawascollected and labeled between2014 and2015 by the Mexican Institute of Statistics and Geography (INEGI). The INEGI’s tweets come from thegeneralpopulationwithoutanyﬁlteringbeyonditsgeographic location. INEGI benchmark has a total of 54,124 tweets (in the Spanishlanguage).Thetagging processofINEGI datasetwascon- ducted througha web application (called pioanalisis⁵, itwas de- signedbythepersonneloftheInstitute).Eachtweetwasdisplayed

5http://cienciadedatos.inegi.org.mx/pioanalisis/#/login

(7)

Table 3

Parameter list and a brief description of their functionality.

Weighting schemes / Removing common words

Name Values Description

tﬁdf y es, no After the text is represented as a bag of words, it determines if the vectors are weighted using the TFIDF scheme. If it is no then the term frequency in the text is used as weight.

del-sw y es, no Determines if the stopwords are removed. It is related to TFIDF in the sense that a proper weighting scheme assigns a low weight for common words.

Morphological reductions

lem y es, no Determines if words sharing a common root are replaced by its root.

stem y es, no Determines if words are stemmed.

Transformations based on removing or replacing substrings

del-punc y es, no The punctuation symbols are removed if del-punc is yes , they are left untouched otherwise.

del-ent y es, no Determines if entities are removed in order to generalize the content of the text.

del-d1 y es, no If it is enabled then the sequences of repeated symbols are replaced by a single occurrence of the symbol.

del-d2 y es, no If it is enabled then the repeated sequences of two symbols are replaced by a single occurrence of the sequence.

del-diac y es, no Determines if diacritic symbols, e.g., accent symbols, should be removed from the text.

Coarsening transformations

emo y es, no Emoticons are replaced by its expressed emotion if it is enabled.

num y es, no Determines if numeric words are replaced by a common identiﬁer.

url y es, no Determines if URLs are left untouched or replaced by a unique url identiﬁer.

usr y es, no Determines if users mentions are replaced by a unique user identiﬁer.

lc yes, no Letters are normalized to be lowercase if it is enabled.

Handling negation words

neg y es, no Determines if negation operators in the text are normalized and directly connected with the modiﬁed object.

Tokenizing the transformation

n-words {1, 2} Determines the number of words used to describe a token.

q-grams {3, 4, 5, 6, 7} Determines the length in characters of the q -grams ( q ).

andhuman tagged itaspositive, neutral, negativeor none. After thisprocedure,everytweetwastaggedbyseveralhumans,thela- belwithmajorconsensus wasassignedasa ﬁnaltagged.Wedis- cardtweetsbeingontie.

On theother hand,our second benchmark isthe one usedin TASS’15 workshop (Taller de Análisis de Sentimientos en la SE- PLN) (Román et al., 2015). Here, the whole corpus containsover 68,000tweets,writteninSpanish, relatedtowell-known person- alities andcelebrities of severaltopics such aspolitics, economy, communication, massmedia, and culture. Thesetweets were ac- quired betweenNovember2011 andMarch2012. The whole cor- puswassplitintoatrainingset(about10%)andtestset(remaining 90%).Eachtweetwastaggedwithitsglobalpolarity(positive,negativeorneutral)ornopolarityatall(fourclassesintotal).Thetag- gingprocesswasdoneinasemi-automaticallywaywhereabase- line machine learningalgorithm classiﬁesthem, andthen all the taggedtweets aremanually checkedby humanexperts;formore detailsofthisdatabaseconstructionsee Románetal.,2015.

We partitioned INEGI in 10% fortraining and90% for testing, followingthesetupofTASS’15;thislargetest-setpursuesthegen- erality of the method. Hereafter, we name the test set as the gold-standard,andwe interchange bothnames assynonyms.The accuracy is the major score in both benchmarks, again because TASS’15usesthisscoreasitsmeasure.Wealsoreportthemacro- F1scoretohelptounderstandtheperformanceonheavilyunbal- anceddatasets,seeTable2.

In general, both benchmarks are full of errors, and these er- rorsvaryfromsimplemistakestodeliberatemodiﬁcationofwords and syntactic rules. However, it is worth to mention that INEGI is a collection of an open domain, andmoreover, it comes from thegeneralpublic;thenwecan seethefrequencyofmisspellings

andgrammatical errorsasa majordifference betweenINEGI and TASS’15.

Fig.6showsthesizeofthevocabularyasthenumberofwords inthecollectionincreases.TheHeaps’law,(Baeza-Yates&Ribeiro- Neto,2011),statesthatthegrowthofthevocabularyfollowsO(n^α) for0<

α

<1,foradocumentofsizen.Theﬁgureillustrates the growthrateofourbothbenchmarks,alongwithawell-writtenset ofdocuments,i.e.,classicBooksoftheSpanishliteraturefromthe Gutenbergproject (Gutenberg,2016). TheBooks collectionscurve is belowthan anyof our collections; its growthfactor is clearly smaller.Theprecise valuesof

α

^for^each^collection^are

α

TASS15= 0.718,

α

^INEGI=0.756,and

α

^Books=0.607,thesevaluesweredeter- minedwitharegressionovertheformulae.⁶There isasigniﬁcant differencebetweenthethreecollections,anditcorrespondstothe highamountoferrorsinTASS’15,and,thehigheroneinINEGI.

4.1. Parametersofthetexttransformations

As described in Section 3 the different text transformation methods areexplored inthisresearch. Table 3 complementsthis description by listing the different values these transformations have.Fromthetable,itcanbeobservedthatmostparameters are eithertheuseorabsenceoftheparticulartransformationwiththe exceptionsn-wordsandq-grams.

Basedonthedifferentvaluesoftheparameters,we cancount the number of different text transformation which is 7×2¹⁵= 229,369conﬁgurations(theconstant7correspondstothenumber oftokenizers).Evaluating allthesesetups, foreachbenchmark, is

6The tweets were slightly normalized removing all URLs and standardizing all characters to lowercase.

(8)

Fig. 6. On the left, the growth of the vocabulary in our benchmarks and a collection of books from the Gutenberg project. On the right, the vocabulary growth in 42 million tweets.

computationallyexpensive.Also, weperformthe sameexhaustive inthetestsettocomparetheachievedresultandthebestpossible underour approach.Alongwiththeseexperiments,wealsoeval- uatea number of experimentsto prove andcompare a series of improvements.Intheend,weevaluated closetoone millioncon- ﬁgurations.Forinstance,usinganIntel(R)Xeon(R)CPUE5-2630v2

@2.60GHz workstation, we need∼ 12 minutesin averagefora singleconﬁguration,running ona singlecore.Therefore, itneeds roughly24yearsofcomputingtime.Nonetheless,weusedasmall cluster tocompute all conﬁgurations insome weeks.Notice that thetimeofdeterminingthepart-of-the-speech,neededbyparam- etersstemandlem,isnotreportedsinceitwasexecutedonlyonce for all texts and loaded from a cache whenever is needed. The lemmatizationstepneeds closeto56mintotransformtheINEGI datasetinthesamehardware.

4.2.Aboutthescorefunctions

Theobjectiveofaclassiﬁeristomaximizeascorefunctionthat measuresthequalityofthepredictions.Wemeasureourcontribu- tionwithaccuracyandmacro-F1,whicharedeﬁnedasfollows.

accuracy=totalTP+totalTN totalsamples

Where, TP denotes the numberof true positives, TN the true negatives;FP andFN are the numberof false positivesandfalse negatives,respectively.Allthesemeasurescanbeparametrizedby classc. So,theaccuracyiscalculatedby dividingthe totalofcor- rectlyclassiﬁedsamplesbythetotalnumberofsamples.

precision_c= TPc

TPc+FPc

recallc= TPc

TPc+FNc

The F_1,_c isdeﬁnedastheharmonicmeanoftheprecision and therecall,perclass,deﬁnedasfollows.

F1,c=2·accuracy_c·recallc

accuracy_c+recallc

Themacro-F1scoreistheaverageofallper-classF1scores:

macro-F1= 1

|

L

|

c∈L

F1,c

whereListhesetofvalidclasses.Inthissense,macro-F1summa- rizestheprecisionandrecallscores,alwaystendingtosmallvalues since itisdeﬁned intermsofthe harmonicmean.It isworth to mentionthatmacro-F1canbesmallevenonhighaccuracyvalues, especially on unbalanceddatasets. The interested readeris refer- enced to an excellent survey on text classiﬁers andperformance metrics,inSebastiani(2008).

5. Experimentalanalysis

Thissectionisdevotedtodescribeandanalyzetheperformance oftheconfigurationspace, providethesufficientexperimentalev- idence to prove that q-gram tokenizers are better than n-words, atleastunderthesentimentanalysisdomaininSpanish.Further- more,wealsoprovidetheexperimental analysisforthecombina- tionoftokenizers,whichimprovesthewholeperformancewithout movingtoofarfromourtextclassifierstructure.

Weusebothtrainingandtestdatasetsinourexperiments.The performanceonthetrainingsetiscomputedusing5-foldcrossval- idation,andthe performance ontest set iscomputed directlyon the gold-standard. As previously described, training and test are disjointsets, see Table2 fordetails of ourbenchmarks. Asmen- tioned,theclassiﬁer wasﬁxed tobeSVM; weusetheimplemen- tationfrom theScikit-learn project (Pedregosa etal., 2011) using a linearkernel. Weuse thedefault parameters ofthe library;no additionaltuningwasperformedinthissense.

5.1. Aperformancecomparisonofn-wordsandq-grams

Fig.7showsthehistogramofaccuraciesforourconfiguration- spacein both training andtest partitions.Fig. 7(a) and(c) show theperformanceofconfigurationswithn-wordsastokenizer(unigrams andbigrams),fortrainingandtestdatasets respectively.It ispossibletoseethat theformispreserved, andalsothat TFIDF configurations can perform slightly better than those using only thetermfrequency.However,the accuracyrangebeingshared by bothkindsofconfigurationsislarge.

In contrast, Fig. 7(b) shows the performance of configurations with q-grams as tokenizers. Here, the improvement of the TFIDFclassismoresignificantthanthoseconfigurationsnotusing TFIDF;also,theperformanceachievedbytheq-gramswithTFIDF isconsistentlybetterthantheperformance ofthealln-wordcon- figurationsinourspace.Thisisalsovalidforthetest dataset,see Fig.7(d).

(9)

Fig. 7. Accuracy’s histogram, by tokenizer’s class, for the INEGI benchmark. The performance on the training set was computed with 5-folds. We select to divide each ﬁgure to show the effect of TFIDF, which it is essencial for q -grams’s performance.

Fig. 8 shows the performance of INEGI on conﬁgurations using q-grams astokenizers. Ontheleft, Fig.8(a) and(c)showthe performance of conﬁgurations without TFIDF. In train, the best performance is close to 0.57, and less than 0.58 in the test set.

Thebestperformingtokenizeris7-grams.WhenTFIDFisallowed, Table 8(b) and (d), the best performances are achieved, in both trainingandtest,closeto0.61inthetrainingsetandhigherinthe gold-standard.Thebestconﬁgurationsarethosewith5-gramsand 6-grams. The 5-grams is consistently better, it achieves accuracy valuesof0.6065and0.6148fortrainingandtestsets,respectively.

5.1.1. PerformanceontheTASS’15benchmark

TheperformanceonTASS’15issimilartothatfoundintheIN- EGI benchmark; however,TASS’15showsa higher sparsityofthe accuracy along therange on n-words,ranging from0.35 to close than 0.61.Inthetrainingset,the bestperformancesare achieved usingTFIDF.

Thebestconfigurationsarethoseusingq-grams,asdepictedin Fig. 9(b)and (d),where accuracy valuesachieve closeto 0.63in bothtrainingandtestsets.IncontrasttoINEGIandthetrainingset of TASS’15, the best performing q-gram tokenizer has no TFIDF; however, the configurations with TFIDF are tightly concentrated whichmeansthatismoreeasytopickagoodconfigurationunder arandomselection,orbytheinsightofanexpert.

Fig. 10 shows a ﬁner analysisof the performance of q-grams tokenizersinTASS’15.Wecanobservethat5-gramsappearasthe bestin thetrainingset andinthe gold-standardwithTFIDF,but thebestperformingconﬁgurationuses6-gramstokenizersandno

TFIDF;pleasenotethatTFIDFhasthebestaccuracyonthetrain- ingset,sowehavenotwaytoknowthisbehaviourwithouttesting allpossibleconfigurationsinthegold-standard.Also,thedifference betweenthebestTFIDFandthebestno-TFIDFconfigurationsisof around0.005; that isquite smalltodiscardthe currentbiasthat suggesttouseTFIDFconfigurations.

5.2.Top-kanalysis

Thissectionfocusonthestructuralanalysisofthebestkcon- ﬁgurations (based on theaccuracy score) ofour previous results.

Wecallthistechniquetop-kanalysis,anditdescribestheconfigu- rationswiththeempiricalprobabilityofaparametertobeenabled amongthe bestk configurations. Thescore valuesare definedas theminimumamongtheset.Themainideaistodiscoverpatterns onthecompositionofbestperformingconfigurations.Aswedou- blekateachrow,thenkand2ksharekconfigurationswhichpro- ducesa smoothly convergence to 0.5foreach probability. At the bestofourknowledge, thiskindofanalysishasneverbeen used intheliterature.

Alltablesinthissubsectionare inducedby theaccuracyscore (i.e., best k as measured with accuracy). Also, we display the macro-F1 score asa secondary measure ofperformance that can helptodescribethebehaviourofunbalancedmulti-classdatasets.

WeomittoshowthetokenizerprobabilitiesinfavorofFigs.8and 10;pleaseremindthatalmostalltopconﬁgurationsuseq-grams.

Table4showsthecompositionofINEGI’sbestconﬁgurationsin bothtrainingandtestsets.Aspreviouslyshown,almostallbestse-

(10)

Fig. 8. Accuracy’s histogram for q-gram conﬁgurations in the INEGI benchmark. As before, the performance on the training set was computed with 5-folds.

Table 4

Analysis of the k best conﬁgurations for the INEGI benchmark in both training and test datasets.

k Accuracy Macro-F1 tﬁdf del-sw lem s tem del-d1 del-d2 del-punc del-diac del-ent emo num url usr lc neg 1 0.6065 0.4524 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00 0.00 1.00 2 0.6065 0.4524 1.00 0.00 0.00 1.00 0.00 0.00 0.50 0.00 1.00 1.00 0.00 0.00 1.00 0.00 0.50 4 0.6065 0.4524 1.00 0.00 0.00 1.00 0.00 0.00 0.50 0.00 1.00 1.00 0.00 0.00 1.00 0.00 0.50 8 0.6059 0.4511 1.00 0.00 0.00 1.00 0.00 0.00 0.50 0.00 1.00 1.00 0.50 0.00 1.00 0.00 0.50 16 0.6058 0.4568 1.00 0.19 0.00 1.00 0.19 0.00 0.44 0.13 0.81 1.00 0.38 0.19 1.00 0.19 0.56 32 0.6052 0.4507 1.00 0.31 0.00 0.69 0.25 0.00 0.47 0.19 0.56 1.00 0.31 0.38 1.00 0.44 0.53 64 0.6047 0.4516 1.00 0.22 0.00 0.78 0.44 0.00 0.50 0.33 0.66 1.00 0.38 0.19 1.00 0.53 0.52 128 0.6037 0.4643 1.00 0.20 0.00 0.77 0.45 0.03 0.50 0.31 0.53 1.00 0.42 0.28 1.00 0.58 0.49 256 0.6024 0.4489 1.00 0.14 0.00 0.77 0.36 0.09 0.50 0.40 0.51 1.00 0.44 0.43 1.00 0.62 0.51 512 0.6008 0.4315 1.00 0.17 0.00 0.73 0.42 0.17 0.50 0.43 0.41 1.00 0.41 0.48 0.99 0.62 0.51

a) Performance on the training dataset (5-folds)

k Accuracy Macro-F1 tﬁdf del-sw lem s tem del-d1 del-d2 del-punc del-diac del-ent emo num url usr lc neg 1 0.6148 0.4 4 42 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 2 0.6148 0.4 4 42 1.00 0.00 0.00 0.00 0.00 0.00 0.50 1.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 4 0.6136 0.4405 1.00 0.00 0.00 0.00 0.00 0.00 0.50 1.00 0.00 1.00 0.50 0.00 1.00 1.00 1.00 8 0.6135 0.4545 1.00 0.00 0.00 0.25 0.00 0.00 0.62 0.75 0.00 1.00 0.50 0.00 1.00 1.00 0.88 16 0.6134 0.4546 1.00 0.00 0.00 0.38 0.00 0.00 0.50 0.88 0.00 1.00 0.62 0.38 1.00 1.00 0.81 32 0.6130 0.4528 1.00 0.00 0.00 0.50 0.06 0.00 0.50 0.94 0.00 1.00 0.44 0.44 1.00 1.00 0.62 64 0.6119 0.4403 1.00 0.12 0.00 0.44 0.19 0.00 0.50 0.72 0.00 1.00 0.44 0.41 1.00 1.00 0.62 128 0.6112 0.4547 1.00 0.30 0.00 0.48 0.27 0.00 0.50 0.61 0.00 1.00 0.50 0.52 1.00 0.98 0.61 256 0.6099 0.4379 1.00 0.35 0.00 0.46 0.37 0.00 0.50 0.50 0.05 1.00 0.46 0.48 1.00 0.92 0.53 512 0.6083 0.4479 1.00 0.27 0.00 0.52 0.27 0.05 0.50 0.51 0.13 1.00 0.45 0.48 1.00 0.75 0.55

b) Performance on the gold-standard dataset

(11)

Fig. 9. Accuracy’s histogram, by tokenizer’s class, for the TASS benchmark. The performance on the training set was computed with 5-folds. We select to divide each ﬁgure to show the effect of TFIDF, which it is essencial for q -grams’s performance.

tupsenableTFIDF,andproperlyhandleemoticonsandusers.The parametersdel-sw,lem,del-d1,del-d2,num,andurl,arealmostde- activatedinbothtrainingandtestsets.Therestoftheparameters (stem,del-diac, del-ent, and neg) do not remain between training andtestsets.However,thelatersetofparametersaredisabledin thegold-standardbestconﬁgurations,exceptingforneg.Suchfact supports the idea that fasterconﬁgurations also can produce excellent performances. Please notice that lemmatization(lem) and stemming(stem)arealsodisabled,whicharethelinguisticopera- tionswithhighercomputationalcostsinourpipelineoftexttrans- formations.

Table 5 showsthe top-k analysis forTASS’15. Again, TFIDF is a commoningredientofthe majorityofthebetter configurations inthetrainingset;however,thebestonesdeactivatethisparam- eter to use only the frequency of the term; reflected in a mini- mumimprovement.Thetransformationsthatremainactiveinboth trainingandsetaredel-sw,del-d1,url,usr,lc,andneg.Thedeacti- vatedonesinbothsetsarelem,stem,del-d2,anddel-ent,andemo. The restofthe parametersthat changebetweentrainingandtest setsaretfidf,del-diac,andnum.Notethataskgrows,del-puncand emo,are closeto be randomchoices. It iscounterintuitive tosee theemoparameteroutsidethetop-kitems,thesamehappensfor the del-ent parameter.The emo parameter isused tomap emoti- consandemojistosentiments,anddel-entisanheuristicdesigned to generalize the sentiment expression in the text (see Table 3).

Thisbehaviourrememberusthat,intheend,everything depends onthe particulardistributionofthedataset.Ingeneral,itis clear that there isno a rule-of-thumb to compute thebest conﬁgura-

tion.Therefore,aprobabilisticapproach,asitistheoutputoftop- k analysis, is useful to reduce the cost of the exploration of the conﬁgurationspace.

5.3.Improvingtheperformancewithcombinationoftokenizers

Inprevious experiments,we performedan exhaustive evalua- tionoftheconfiguration space;then,toimproveover ourresults weneedtomodifytheconfigurationspace.Insteadofaddingmore complextexttransformations,wedecidetousemorethanoneto- kenizerper configuration.Moredetailed, thereexists127possible combinationsoftokenizers,thatis,thepowersetof

{

²^-words^,¹^-words^,³^-grams^,⁴^-grams^,⁵^-grams^,⁶^-grams^,⁷^-grams

}

^,

minus the empty set. For this experiment, we only applied the expansion of tokenizers to the best configurations found in the previous experiments,since performing an exhaustiveanalysis of thenewconfigurationspacebecomesunfeasible.Thehypothesisis thatthepreviousbestconfigurationswillbealsocomposesomeof thebestconfigurationsinthenewspace,thisisafairassumption thatnevergetworstunderanexhaustiveanalysis.

Fig. 11(a) shows the performance of 4064 configurations that correspondtoallcombinationsoftokenizersoverthetop-32con- figurations on the training set, see Table 4. The performance in bothtrainingandtestsets isprettysimilar, andsignificantlybet- terthanthatachievedwithasingletokenizer(Table4).InTable6 andFig.11,wecan seeasignificant improvementwithrespective to single tokenizers. The top-k analysis for the test set is listed