semi-supervised features Robust multilingual Named Entity Recognition with shallow Artiﬁcial Intelligence

(1)

Contents lists available atScienceDirect

Artiﬁcial Intelligence

www.elsevier.com/locate/artint

Robust multilingual Named Entity Recognition with shallow semi-supervised features

Rodrigo Agerri

^∗

, German Rigau

IXANLPGroup,UniversityoftheBasqueCountry(UPV/EHU),Donostia–SanSebastián,Spain

a r t i c l e i n f o a b s t r a c t

Articlehistory:

Received1September2015 Receivedinrevisedform11May2016 Accepted15May2016

Availableonline19May2016

Keywords:

NamedEntityRecognition InformationExtraction Clustering

Semi-supervisedlearning NaturalLanguageProcessing

We present a multilingual Named EntityRecognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information withclustering semi-supervisedfeaturesinduced onlarge amounts of unlabeled text.Understanding viaempirical experimentationhowto effectivelycombine various types of clusteringfeatures allows us to seamlessly export our system to other datasetsandlanguages.Theresultisasimplebuthighlycompetitivesystemwhichobtains stateoftheartresultsacrossﬁvelanguagesandtwelvedatasets.Theresultsarereported onstandard shared task evaluationdata suchas CoNLL for English,Spanishand Dutch.

Furthermore,anddespitethelackoflinguisticallymotivatedfeatures,wealsoreportbest resultsfor languagessuchas Basqueand German.Inaddition,wedemonstratethatour methodalsoobtainsverycompetitiveresultseven whentheamountofsuperviseddata iscutbyhalf,alleviatingthedependencyonmanuallyannotateddata.Finally,theresults showthatouremphasisonclusteringfeaturesiscrucialtodeveloprobustout-of-domain models.Thesystemandmodelsarefreelyavailabletofacilitateitsuseandguaranteethe reproducibilityofresults.

©²⁰¹⁶^Elsevier^B.V.^All^rights^reserved.

1. Introduction

Anamedentitycanbe mentionedusingagreatvarietyofsurfaceforms(BarackObama,PresidentObama,Mr.Obama, B.Obama,etc.)andthesamesurface formcan refertoavarietyofnamedentities.Forexample,accordingtotheEnglish Wikipedia,theform‘Europe’canambiguouslybeusedtoreferto18differententities,includingthecontinent,theEuropean Union,variousGreek mythologicalentities,arockband,some musicalbums,amagazine,a shortstory,etc.¹ Furthermore, it ispossible to referto a named entityby meansof anaphoricpronouns andco-referent expressions such as‘he’, ‘her’,

‘their’,‘I’, ‘the35yearold’,etc. Therefore,inorderto providean adequateandcomprehensiveaccount ofnamedentities in text it is necessary to recognize themention of a named entity and to classify it by a pre-deﬁnedtype (e.g, person, location, organization). Named Entity Recognition andClassiﬁcation (NERC)is usually a requiredstep to perform Named EntityDisambiguation(NED),namelytolink‘Europe’totherightWikipediaarticle,andtoresolveeveryformofmentioning orco-referringtothesameentity.

NowadaysNERCsystemsarewidelybeingusedinresearchfortaskssuchasCoreferenceResolution[51],NamedEntity Disambiguation [19,27,29,38,26] forwhicha lot ofinterest hasbeen createdby the TACKBP sharedtasks [32],Machine

*

Correspondingauthor.

E-mailaddresses:[email protected](R. Agerri),[email protected](G. Rigau).

1 http://en.wikipedia.org/wiki/Europe_(disambiguation).

http://dx.doi.org/10.1016/j.artint.2016.05.003 0004-3702/©²⁰¹⁶^Elsevier^B.V.^All^rights^reserved.

(2)

Translation [3,33,5,35], AspectBased Sentiment Analysis [37,11,50,49], Event Extraction [22,2,31,20,30] and Event Order- ing[41].

Moreover, NERC systems are integrated in the processing chain of many industrial software applications, mostly by companies offeringspecificsolutionsforaparticularindustrialsectorwhichrequirerecognizingnamedentities specificof their domain. There istherefore a clear interest inboth academicresearch andindustry to develop robust andefficient NERCsystems:ForindustrialvendorsitisparticularlyimportanttodiversifytheirservicesbyincludingNLPtechnologyfor avarietyoflanguageswhereasinacademicresearchNERCisoneofthefoundationsofmanyotherNLPend-tasks.

Most NERCtaggers aresupervised statisticalsystems that extractpatterns andtermfeatures which are consideredto be indicationsofNamed Entity (NE)types usingthemanually annotatedtraining data(extractingorthographic, linguistic and other typesof evidence) andoften external knowledgeresources. As in other NLPtasks, supervisedstatistical NERC systems are more robust and obtain better performance on available evaluation sets, although sometimes the statistical modelscanalsobecombinedwithspeciﬁcrulesforsomeNEtypes.Forbestperformance,supervisedstatisticalapproaches require manually annotatedtraining data,which is both expensiveandtime-consuming. Thishas seriously hinderedthe development ofrobusthighperformingNERCsystemsformanylanguagesbutalsoforother domainsandtext genres[45, 54],inwhatwewillhenceforthcall‘out-of-domain’evaluations.

Moreover, supervisedNERC systems often requireﬁne-tuning for each language and, assome of the features require language-speciﬁcknowledge,thisposesyetanextracomplicationforthedevelopmentofrobustmultilingualNERCsystems.

Forexample,itiswell-known thatinGermaneverynouniscapitalizedandthat compoundsincludingnamedentitiesare pervasive.Thisalsoapplies toagglutinativelanguagessuchasBasque,Korean,Finnish,Japanese,HungarianorTurkish.For this type of languages,it hadusually been assumedthat linguisticfeatures (typically Part ofSpeech (POS) andlemmas, but also semantic features based on WordNet, for example) and perhaps speciﬁc hand-crafted rules, were a necessary condition for good NERC performance as they would allow to capture better the mostrecurrent declensions (cases) of named entitiesforBasque[4]orto addressproblemssuchassparsity andcapitalizationofevery nounforGerman[23,7, 8].This language dependencywas easy to see inthe CoNLL2002and 2003tasks,in whichsystems participatingin the two availablelanguagesforeacheditionobtainedingeneraldifferentresultsforeach language.Thissuggeststhat without ﬁne-tuningforeachcorpusandlanguage,thesystemsdidnotgeneralizewellacrosslanguages[46].

ThispaperpresentsamultilingualandrobustNERCsystembasedon simple,generalandshallowfeatures thatheavily relies on word representation features for high performance. Even though we do not use linguistic motivated features, our approachalsoworkswellforinﬂected languagessuchasBasqueandGerman.Wedemonstratetherobustness ofour approachbyreportingbestresultsforﬁvelanguages(Basque,Dutch,German,EnglishandSpanish)on12differentdatasets, includingsevenin-domainandeightout-of-domainevaluations.

1.1. Contributions

Themaincontributionsofthispaperarethefollowing:First,weshowhowtoeasilydeveloprobustNERCsystemsacross datasetsandlanguageswithminimalhumanintervention,evenforlanguageswithdeclensionand/orcomplexmorphology.

Second,weempiricallyshowhowtoeffectivelyusevarioustypesofsimplewordrepresentationfeaturestherebyproviding aclearmethodologyforchoosingandcombiningthem.Third,wedemonstratethatoursystemstillobtainsverycompetitive results evenwhenthesuperviseddatais reducedby half(even lessinsomecases), alleviatingthe dependencyoncostly handannotateddata.Thesethreemaincontributionsarebasedon:

1. Asimpleandshallowrobustsetoffeaturesacrosslanguagesanddatasets,eveninout-of-domainevaluations.

2. Thelackoflinguisticmotivatedfeatures,evenforlanguageswithagglutinative(e.g.,Basque)and/or complexmorphol- ogy(e.g.,German).

3. Aclearmethodologyforusingandcombiningvarioustypesofwordrepresentationfeaturesbyleveragingpublicunla- beleddata.

Our approachconsistsofshallow localfeatures complementedby three typesofwordrepresentation (clustering)features: Brown clusters [10], Clark clusters [15] and K-means clusters on top of the word vectors obtained by using the Skip-gram algorithm[39].Wedemonstratethat combining andstacking differentclustering features induced fromvarious datasources(Reuters,Wikipedia,Gigaword,etc.)allowstocoverdifferentandmorevariedtypesofnamedentitieswithout manualfeaturetuning.Eventhoughourapproachismuchsimplerthanmost,weobtainthebestresultsforDutch,Spanish andEnglishandcomparableresultsinGerman(onCoNLL2002and2003).WealsoreportbestresultsforGermanusingthe GermEval2014sharedtaskdataandforBasqueusingtheEgunkariatestset[4].

We report out-of-domain evaluations inthree languages(Dutch, English andSpanish) using fourdifferent datasets to compare our systemwiththe best publicly available systems forthose languages: Illinois NER [52] for English,Stanford NER[24]forEnglishandSpanish,SONAR-1NERDforDutch[21]andFreelingforSpanish[47].Weoutperformeveryother system intheeight out-of-domainevaluationsreported inSection 4.3. Furthermore,theout-of-domainresults showthat ourclusteringfeaturesprovideasimpleandeasymethodtoimprovetherobustnessofNERCsystems.

Finally,andinspired byprevious work [34,9]we measurehowmuch supervisionisrequiredto obtainstate oftheart results.InSection4.2weshowthatwecanstill obtainverycompetitiveresultsreducing thesuperviseddatabyhalf(and

(3)

sometimes evenmore).This,together withthe lackoflinguisticfeatures, meansthat oursystemconsiderablysaves data annotationcosts,whichisquiteconvenientwhentryingtodevelopaNERCsystemforanewlanguageand/ordomain.

Our system learnsPerceptron models [17] using the Machine Learning machinery provided by the Apache OpenNLP project² withour own customized (local andclustering) features. Our NERCsystem is publicly available anddistributed undertheApache 2.0LicenseandpartoftheIXApipestools[1].Everyresultreportedinthispaperisobtainedusingthe conllevalscriptfromtheCoNLL2002andCoNLL2003sharedtasks.³ Toguaranteereproducibilityofresultswealsomake publicly availablethe modelsandthe scriptsused toperformthe evaluations.The system,models andevaluationscripts canbefoundintheixa-pipe-nerc website.⁴

NextSectionreviewsrelatedwork,focusingonbestperforming NERCsystemsforeach languageevaluatedonstandard shared evaluationtaskdata.Section 3presentsthe designof oursystemandouroverall approach toNERC.In Section4 we report the evaluationresults obtained by our system for5 languages (Basque, Dutch, German, English and Spanish) on12 differentdatasets,distributedin7 in-domainand8out-of-domainevaluations.Section 5 discussesthe resultsand contributionsofourapproach.InSection6wehighlightthemainaspectsofourworkprovidingsomeconcludingremarks andfutureworktobedoneusingourNERCapproachappliedtoothertextgenres,domainsandsequencelabelingtasks.

2. Relatedwork

TheNamedEntityRecognitionandClassification(NERC)taskwasfirstdefinedfortheSixthMessageUnderstandingCon- ference(MUC6)[44].TheMUC6tasksfocusedonInformationExtraction(IE)fromunstructuredtextandNERCwasdeemed tobeanimportantIEsub-taskwiththeaimofrecognizingandclassifyingnominalmentionsofpersons,organizationsand locations, and alsonumeric expressions of dates, money, percentageandtime. In the following years,research on NERC increasedasit was consideredto be acrucial sourceofinformation forother NaturalLanguageProcessing tasks suchas Question Answering(QA) andTextualEntailment(RTE)[44].Furthermore,whileMUC 6was solelydevoted toEnglish as targetlanguage,theCoNLLsharedtasks(2002and2003)boostedresearchonlanguageindependentNERCfor3additional targetlanguages:Dutch,GermanandSpanish[57,58].

ThevariousMUC,ACEandCoNLLevaluationsprovidedaveryconvenientframeworktotestandcompareNERCsystems, algorithms and approaches. They provided manually annotated data for training andtesting the systems as well as an objectiveevaluationmethodology.Usingsuch framework,research rapidlyevolvedfromrule-basedapproaches(consisting ofmanually handcrafted rules)to language independent systemsfocused on learning supervisedstatisticalmodels. Thus, whileintheMUC6competition5out of8systemswere rule-based,inCoNLL200316teamsparticipatedintheEnglish taskallusingstatistical-basedNERC[44].

2.1. Datasets

Table 1describesthe12datasetsusedinthispaper.Theﬁrsthalfliststhecorporausedforin-domainevaluationwhereas thelowerhalfcontainstheout-of-domaindatasets.TheCoNLLNERsharedtasksfocusedonlanguageindependentmachine learning approachesfor4 entitytypes: person,location,organization and miscellaneous entities.The2002editionprovided manually annotateddata inDutch andSpanishwhereas in 2003thelanguages wereGerman andEnglish. Inaddition to theCoNLLdata,forEnglishwealsousetheformalrunofMUC7andWikigoldforout-of-domainevaluation.Verydetailed descriptionsofCoNLLandMUCdatacaneasilybefoundintheliterature,includingthesharedtaskdescriptionsthemselves [13,57,58],sointhefollowingwewilldescribetheremaining,newerdatasets.

TheWikigoldcorpus consistsof39KwordsofEnglishWikipediamanuallyannotatedfollowingtheCoNLL2003guide- lines [46].For SpanishandDutch, wealso useAncora2.0[56] andSONAR-1 [21] respectively. SONAR-1isa one million wordDutchcorpuswithbothcoarse-grainedandﬁne-grainednamedentityannotations.Thecoarse-grainedlevelincludes product andevent entitytypesinadditiontothefourtypesdeﬁnedinCoNLLdata.Ancoraaddsdate andnumber typesto theCoNLLfourmaintypes.InBasquetheonlygoldstandardcorpusisEgunkaria[4].AlthoughtheBasqueEgunkariadataset isannotatedwithfourentitytypes,themiscellaneous classisextremelysparse,occurringonly inaproportionof1to 10.

Thus,inthetrainingdatathereare156entitiesannotatedasMISCwhereaseachoftheotherthreeclassescontainaround 1200entities.

In the datasets described so far, named entities were assumed to be non-recursive and non-overlapping. During the annotationprocess, ifanamed entitywas embedded ina longerone, thenonly thelongestmentionwas annotated. The exceptionsaretheGermEval 2014sharedtaskdataforGerman andMEANTIME,wherenestedentitiesare alsoannotated (bothinnerandouterspans).

TheGermEval2014NERsharedtask[7]aimedatimprovingthestateoftheartofGermanNERCwhichwas perceived tobecomparativelylowerthantheEnglishNERC.TwomainextensionswereintroducedinGermEval2014;(i) ﬁnegrained namedentitysub-typestoindicatederivations and compounds;(ii) embeddedentities (andnotonlythelongestspan) are

2 http://opennlp.apache.org/.

3 http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt.

4 https://github.com/ixa-ehu/ixa-pipe-nerc.

(4)

Table 1

Datasetsusedfortraining,developmentandevaluation.MUC7:onlythreeclasses(LOC,ORG,PER)oftheformalrunareusedforout-of-domainevaluation.

AstherearenotstandardpartitionsofSONAR-1andAncora2.0,thefullcorpuswasusedfortrainingandlaterevaluatedin-out-of-domainsettings.

Corpus Source Number of tokens and named entities

train dev test

tok ne tok ne tok ne

In-domain datasets

en CoNLL 2003 Reuters RCV1 203621 23499 51362 5942 46435 5648

de CoNLL 2003 Frankfurter Rundschau 1992 206931 11851 51444 4833 51943 3673

GermEval 2014 Wikipedia/LCC news 452853 31545 41653 2886 96499 6893

es CoNLL 2002 EFE 2000 264715 18798 52923 4352 51533 3558

nl CoNLL 2002 De Morgen 2000 199069 13344 36908 2616 67473 3941

eu Egunkaria Egunkaria 1999–2003 44408 3817 15351 931

Out-of-domain datasets

en MUC7 Newswire 53749 3514

Wikigold Wikipedia 2008 39007 3558

MEANTIME Wikinews 2013 13957 1432

nl SONAR-1 Various genres 1000000 62505

es Ancora 2.0 Newswire 547198 36938

Table 2

Featuresofbestpreviousin-domainresults.Local:shallowlocalfeaturesincludingcapitalization,wordshape,etc.;Ling:linguisticfeaturessuchasPOS, lemma,chunks andsemanticinformationfrom Wordnet;Global:globalfeatures;Gaz:gazetteers; WR:wordrepresentationfeatures;Rules:manually encodedrules;Ensemble:stackofclassiﬁersorensemblesystem;Public:ifthesystemispubliclydistributed.Res:Ifanyexternalresourcesusedare publiclydistributedtoallowre-training.

System Local Ling Global Gaz WR Rules Ensemble Public Res

Ratinov and Roth (2009)

Passos et al. (2014)

ExB

Faruqui et al. (2010)

Carreras et al. (2002)

Clark and Curran (2003)

Sonar nerd

Alegria et al. (2006)

ixa-pipe-nerc

annotated. Intotal,thereare12typesforclassiﬁcation:person,location, organization,other plustheir sub-typesannotated attheirinnerandouterlevels.

Finally, theMEANTIME corpus[42] is amultilingual (Dutch,English, ItalianandSpanish) publicly available evaluation set annotatedwithinthe Newsreaderproject.⁵ Itconsistsof120documents,divided into4topics:Apple Inc.,Airbusand Boeing, GeneralMotors,Chrysler andFord,andthestock market.The articlesare selectedinsuch a waythat thecorpus contains different articlesthat dealwiththe sametopicover time (e.g.launchofa newproduct, discussionofthe same ﬁnancialindexes).Moreover,itcontainsnestedentitiessotheevaluationresultswillbeprovidedintermsoftheouterand the innerspans ofthenamedentities. MEANTIMEincludessixnamedentitytypes: person,location,organization, product, ﬁnancial andmixed.

2.2. Relatedapproaches

Named entityrecognitionisataskwithalonghistoryinNLP.Therefore,wewillsummarizethoseapproachesthat are most relevant to ourwork, especially those we will directly compared within Section 4. Since CoNLLshared tasks, the mostcompetitiveapproacheshavebeensupervisedsystemslearningCRF,SVM,MaximumEntropy orAveragedPerceptron models.Inanycase,whilethemachinelearningmethodisimportant,ithasalsobeendemonstratedthatgoodperformance mightlargelybeduetothefeaturesetused[16].Table 2providesanoverviewofthefeaturesusedbypreviousbestscoring approachesforeachoftheﬁvelanguagesweaddressinthispaper.

Traditionally,localfeatureshaveincludedcontextualandorthographicinformation,aﬃxes,character-basedfeatures,pre- dictionhistory,etc.AsarguedbytheCoNLL2003organizers,nofeaturesetwasdeemedtobeidealforNERC[58],although manyapproachesforEnglishrefertoZhangandJohnson[61]asausefulgeneralapproach.

5 http://www.newsreader-project.eu.

(5)

Linguisticinformation. SomeoftheCoNLLparticipantsuselinguisticinformation(POS,lemmas,chunks,butalsospecificrules orpatterns)forDutchandEnglish[12,16],althoughthesetype offeatures wasdeemedtobemostimportantforGerman, forwhichtheuseoflinguisticfeaturesispervasive[7].Thisiscausedby thesparsitycausedbythedeclensioncases,the tendencytoformcompoundscontainingnamedentitiesandbythecapitalizationofeverynoun[23].Forexample,thebest system amongthe 11 participantsin GermEval 2014, ExB, uses morphological features andspecific suffix lists aimed at capturingfrequentpatternsintheendingsofnamedentities[28].

In agglutinativelanguages such asBasque, whichcontains declension casesfor namedentities, linguistic features are considered to be a requirement. Forexample, the countryname ‘Espainia’(Spain inBasque) can occur in severalforms, Espainian,Espainiera,Espainiak, Espainiarentzat,Espainiako,andmanymore.⁶ Linguistic informationhasbeenusedtotreat this phenomenon. The only previous work for Basque developed Eihera, a rule-based NERC system formalized as ﬁnite state transducerstotake intoaccountdeclensionclasses[4].ThefeaturesofEiheraincludeword,lemma,POS,declension case, capitalized lemma,etc. Thesefeatures are complemented withgazetteers extractedfromthe EuskaldunonEgunkaria newspaperandsemanticinformationfromtheBasqueWordNet.

Gazetteers. Dictionaries arewidely used toinject worldknowledge via gazetteermatches asfeatures inmachine learning approachesto NERC. Thebest performing systems carefullycompile their own gazetteers froma variety ofsources [12].

Ratinov and Roth [52] leverage a collection of 30gazetteers and matches against each one are weighted as a separate feature.Inthiswaytheytrust eachgazetteer toadifferentdegree.Passosetal.[48]carefullycompiled alarge collection of English gazetteers extractedfrom US Census data and Wikipediaand applied them to the process of inducing word embeddingswithverygoodresults.

While itis possibleto automaticallyextract themfrom variouscorpora orresources, they still requirecarefulmanual inspection ofthe target data.Thus, ourapproach onlyuses offthe shelf gazetteerswhenever they arepublicly available.

Furthermore,ourmethodcollapsesevery gazetteerintoonedictionary. Thismeansthat we onlyadda featureper token, insteadofafeaturepertokenandgazetteer.

Globalfeatures. The intuitionbehind non-local(or global)features isto treatsimilarlyall occurrencesof thesamenamed entityinatext.Carrerasetal.[12]proposedamethodtoproducethesetofnamedentitiesforthewholesentence,where theoptimalsetofnamedentitiesforthesentenceisthecoherentsetofnamedentitieswhichmaximizesthesummationof conﬁdencesofthenamedentitiesintheset.RatinovandRoth[52] developedthreetypesofnon-localfeatures,analyzing globaldependenciesinawindowofbetween200and1000tokens.

Wordrepresentations. Semi-supervised approaches leveraging unlabeledtext hadalready been applied to improveresults invariousNLPtasks.Morespeciﬁcally, ithadbeenpreviouslyshownhowto applyBrownclusters[10] forChineseWord Segmentation[36],dependencyparsing[34],NERC[55]andPOStagging[9].

RatinovandRoth[52] usedBrownclustersasfeaturesobtaining whatwas atthe timethebestpublished resultofan EnglishNERCsystemontheCoNLL2003testset.Turianetal.[60]madearatherexhaustivecomparisonofBrownclusters, CollobertandWeston’sembeddings[18]andHLBLembeddings[43]toimprovechunkingandNERC.Theyshowthatinsome casesthecombinationofwordrepresentationfeatureswas positivebut, althoughtheyused RatinovandRoth’ssystemas startingpoint,theydidnotmanagetoimproveoverthestateoftheart.Furthermore,theyreportedthatBrownclustering featuresperformedbetterthanthewordembeddings.

Passosetal.[48] extendthe Skip-gramalgorithmtolearn50-dimensionallexiconinfusedphraseembeddingsfrom22 differentgazetteersandtheWikipedia.Theresultingembeddingsareusedasfeaturesbyscalingthembyahyper-parameter whichisarealnumbertunedonthe developmentdata.Passosetal.[48] reportbestresultsupto dateforEnglishNERC onCoNLL2003testdata,90.90F1.

ThebestGermanCoNLL2003system(anensemble)was outperformedbyFaruqui etal.[23].TheytrainedtheStanford NER system[24], whichuses a linear-chain ConditionalRandom Field(CRF)witha variety offeatures, includinglemma, POS tag, etc. Crucially,they included “distributional similarity” features in the form of Clark clusters [15] induced from largeunlabeledcorpora:theHugeGermanCorpus(HGC) ofaround175MtokensofnewspapertextandthedeWaccorpus [6]consistingof1.71Btokens ofweb-crawleddata.Using the clustersinduced fromdeWac asaformof semi-supervision improvedtheresultsoverthebestCoNLL2003systemby4pointsinF1.

Ensemblesystems. ThebestparticipantoftheEnglishCoNLL2003sharedtaskusedtheresultsoftwoexternallytrainedNERC taggerstocreateanensemblesystem[25].Passosetal.[48]developastackedlinear-chainCRFsystem:theytraintwoCRFs withroughly thesamefeatures;the secondCRFcan condition onthe predictionsmadeby theﬁrst CRF.Their“baseline”

systemusesasimilar localfeaturesetasRatinovandRoth’s[52]butcomplementedwithgazetteers.Theirbaseline system combinedwiththeirphraseembeddingstrainedwithinfusedlexiconsallowthemtoreportthebestCoNLL2003resultso far.

ThebestsystemoftheGermEval2014taskbuiltanensembleofclassiﬁersandpatternextractorstoﬁndthemostlikely tag sequence[28]. Theypaidspecialattentionto out ofvocabulary wordswhich areaddressedby semi-supervisedword

6 English:inSpain,toSpain,Spain(intransitiveclause),forSpain,inSpain.

(6)

Table 3

FeaturesgeneratedfortheBasquesentence“Morrasmundukotxapeldunizanzenjuniorretan1994an,Ekuadorkohiriburuan,Quiton”.English:Morraswas juniorworldchampionin1994,inthecapitalofEcuador,Quito.Currenttokenis‘Ekuadorko’.

Feature wi−2 wi−1 wi wi+1 wi+2

Token w=^1994an ^w=^, ^w=^ekuadorko ^w=^hiriburuan ^w=^,

Token shape wc=^1994an,4d ^wc=^„other ^wc=ekuadorko,ic wc=hiriburuan,lc wc=^„other

Previous pred pd=^null ^pd=^other ^pd=^null ^pd=^null ^pd=^other

Brown token bt=⁰¹¹¹ ^bt=⁰⁰¹⁰ ^bt=⁰¹⁰¹

bt=⁰¹¹¹¹¹ ^bt=⁰⁰¹⁰⁰¹ ^bt=⁰¹⁰¹¹⁰

Brown token, class c,bt=^4d,0111 ^c,bt=^ic,0010 ^c,bt=^lc,0101

c,bt=^4d,011111 ^c,bt=^ic,001001 ^c,bt=^lc,010111

Clark-a ca=158 ca=O ca=175 ca=184 ca=O

Clark-b cb=¹⁴⁹ ^cb=^O ^cb=¹⁷⁶ ^cb=¹⁰⁴ ^cb=^O

Word2vec-a w2va=55 w2va=O w2va=14 w2va=14 w2va=O

Word2vec-b w2vb=524 w2vb=O w2vb=464 w2vb=139 w2vb=O

Preﬁx (wi) pre=^{Eku; pre}=^Ekua

Suﬃx (wi) suf=o; suf=ko; suf=rko; suf=orko

Bigram (wi) pw,w=„Ekuadorko; pwc,wc=other,ic; w,nw=Ekuadorko,hiriburuan; wc,nc=ic,lc Trigram (wi) ppw,pw,w=1994an„,Ekuadorko; ppwc,pwc,wc=4d,other,ic;. . .

char n-grams (wi) ng=^{adorko; ng}=^{rko; ng}=^{dorko; ng}=^{ko; ng}=^orko. . .

representationfeaturesandanensembleofPOStaggers.Furthermore,remainingunknowncandidatementionsaretackled bylook-upviatheWikipediaAPI.

Apartfromthe featuretypes,thelast twocolumns ofTable 2refer towhetherthesystems arepublicly availableand whetheranyexternalresourcesusedfortrainingaremadeavailable(e.g.,inducedwordembeddings,gazetteersorcorpora).

This isdesirableto be abletore-train thesystems ondifferentdatasets.Forexample, wewouldhavebeen interested in training the StanfordNER systemwiththefull Ancoracorpus fortheevaluationpresentedinTable 16,buttheir Spanish clusterlexiconisnotavailable.Alternatively,wewouldhavelikedtotrainoursystemwiththesameAncorapartitionused totrainStanfordNER,butthatisnotavailableeither.

3. Systemdescription

The design of ixa-pipe-nerc aims at establishing a simple and shallow feature set, avoiding any linguistic motivated features,withtheobjectiveofremovinganyrelianceoncostlyextragoldannotations(POStags,lemmas,syntax,semantics) and/or cascading errorsifautomatic language processors are used.The underlyingmotivation isto obtain robust models to facilitatethe development ofNERC systemsforother languagesanddatasets/domains whileobtaining state ofthe art results.Oursystemconsistsof:

•^Local,^shallow^features^based^mostly^onorthographic,wordshapeandn-gramfeaturesplustheircontext.

•^Three^typesôf^simple^clustering^features,^basedônûnigram^matching.

•^Publicly^availablegazetteers,widelyusedinpreviousNERCsystems[58,44].

Table 3providesanexampleofthefeaturesgeneratedbyoursystem.⁷ 3.1. Localfeatures

The localfeaturesconstitute ourbaseline systemontopofwhichtheclustering featuresareadded.Weimplementthe followingfeatureset,partiallyinspiredbypreviouswork[61]:

•^Token:^Current^lowercase^token^(w),^namely,ekuadorko inTable 3.

•^Token^Shape: ^Current^lowercase^token^(w) plus currenttokenshape(wc),wheretokenshape consistof:(i)Thetoken iseitherlowercaseora2digitwordora4digitword;(ii)Ifthetokencontains digits,then whetheritalsocontains letters, orslashes, or hyphens, orcommas, orperiods oris numeric; (iii)The token is all uppercase lettersor is an acronymor isaone letteruppercasewordor startswithcapitalletter.Thus,inTable 31994an isa4digit word(4d), Ekuadorko hasaninitialcapitalshape(ic)andhiriburuan islowercase(lc).

•^Previousprediction:thepreviousoutcome(pd)forthecurrenttoken.Thepreviouspredictionsinourexamplearenull becausethesewordshavenotbeenseenpreviously,exceptforthecomma.

7 Toavoidtoomuchrepetition,Brown,Trigramandcharactern-gramfeatureshavebeenabbreviated.

(7)

• ^Sentence:^Whether^the^tokenîs^the^beginningôf^the^sentence.^Noneôf^the^tokens ⁱⁿôurêxampleîsât^the^beginning ofthesentence,sothisfeatureisnotactiveinTable 3.

• ^Prefix:^Two^prefixes^consistingôf^the^first^threeând^four^charactersôf^the^current^token:^{Eku and}Êkua.

• ^Suffix:^The^four^suffixesôf^lengthône^to^four^from^the^last^four^charactersôf^the^current^token.

• ^Bigram:^Bigrams^including^the^current^token^and^the^token^shape.

• ^Trigram:^Trigrams^including^the^current^token^and^the^token^shape.

• ^Character^n-gram:Âll^lowercase^character^bigrams,^trigrams,^fourgramsând^fivegrams^from^the^current^token^(ng).

Token,tokenshapeandpreviouspredictionfeaturesareplacedina5tokenwindow,namely,forthesethreefeatureswe alsoconsiderthepreviousandthenexttwowords,asshowninTable 3.

3.2. Gazetteers

We add gazetteers to oursystem only ifthey are readily available to use, butour approach doesnot fundamentally dependupon them.Weperforma look-upinagazetteerto checkifanamedentityoccursinthesentence.The resultof the look-up is representedwith the sameencoding chosen forthe training process, namely, the BIO orBILOU scheme.⁸ Thus,forthecurrenttokenweaddthefollowingfeatures:

1. Thecurrentnamedentityclassintheencodingschema.Thus,intheBILOUencodingwewouldhave“unit”,“beginning”,

“last”,“inside”,orifnotmatchisfound,“outside”,combinedwiththespeciﬁcnamedentitytype(LOC,ORG,PER,MISC, etc.).

2. Thecurrentnamedentityclassasaboveand thecurrenttoken.

3.3. Clusteringfeatures

Thegeneralideaisthatbyusingsometypeofsemanticsimilarityorwordclusterinducedoverlargeunlabeledcorpora it ispossible toimprove thepredictions forunseenwords inthe test set.Thistype of semi-supervisedlearning may be aimedatimprovingperformanceoveraﬁxedamountoftrainingdataor,givenaﬁxedtargetperformancelevel,toestablish howmuchsuperviseddataisactuallyrequiredtoreachsuchperformance[34].

Sofarthemostsuccessfulapproacheshaveonlyusedone typeofwordrepresentation[48,23,52].However, oursimple baseline combinedwithone type ofwordrepresentationfeatures are not abletocompete withprevious, morecomplex, systems.Thus,insteadofencodingmoreelaboratefeatures,wehavedevisedasimplemethodtocombine andstack various typesofclustering featuresinducedoverdifferentdatasourcesorcorpora.Inprinciple,ourmethodcanbeusedwithany typeofwordrepresentations.However,forcomparisonpurposes,wedecidedtousewordrepresentationspreviouslyusedin successfulNERCapproaches:Brownclusters[52,60],Word2vecclusters[48] andClarkclusters[24,23].Ascanbeobserved inTable 3,ourclusteringfeaturesareplacedina5tokenwindow.

3.3.1. Brownfeatures

TheBrownclusteringalgorithm[10]isahierarchicalalgorithmwhichclusterswordstomaximizethemutualinformation ofbigrams.Thus,itisaclass-basedbigrammodelinwhich:

• the probabilityofadocumentcorrespondstotheproductoftheprobabilitiesofitsbigrams,

• ^theprobabilityofeachbigramiscalculatedbymultiplyingtheprobabilityofabigrammodeloverlatentclassesbythe probabilityofeachclassgeneratingtheactualwordtypesinthebigram,and

• ^each^word^type^has^non-zeroprobabilityonlyonasingleclass.

TheBrownalgorithmtakesavocabulary ofwordstobeclusteredandacorpusoftext containingthesewords.Itstarts byassigningeachwordinthevocabularytoitsownseparatecluster,theniterativelymergesthepairofclusterswhichleads tothesmallestdecreaseinthelikelihoodofthetext corpus.Thisproducesahierarchicalclustering ofthewords,whichis usuallyrepresentedasabinarytree,asshowninFig. 1.Inthistreeeverywordisuniquelyidentiﬁedbyitspathfromthe root, andthepathcan berepresentedby abit string.Itisalsopossible tochoosedifferentlevels ofwordabstraction by choosingdifferentdepthsalongthepathfromtheroottotheword.Therefore,byusingpathsofvariouslengths,weobtain clusteringfeaturesofdifferentgranularities[40].

Weusepaths oflength 4,6,10and20asfeatures [52].However, we introduceseveralnoveltiesinthedesignofour Brownclusteringfeatures:

1. For each featurewhich istoken-based,we add afeature containing thepaths computedforthecurrenttoken. Thus, takingintoaccountourbaselinesystem,wewilladdthefollowingBrownclusteringfeatures:

8 TheBIOschemesuggeststolearnmodelsthatidentifytheBeginning,theInsideandtheOutsideofsequences.TheBILOUschemeproposestolearn modelsBeginning,theInsideandtheLasttokensofmulti-tokenchunksaswellasUnit-lengthchunks.

(8)

Fig. 1. A Brown clustering hierarchy.

(a) BrownToken:existingpathsoflength4,6,10and20forthecurrenttoken.

(b) BrownTokenShape:existingpathsoflength4,6,10,20forthecurrenttokenand currenttokenshape.

(c) BrownBigram:existingpathsoflength4,6,10,20forbigramsincludingthecurrenttoken.

2. Brownclusteringfeaturesbeneﬁtfromtwoadditionalfeatures:

(a) Previouspredictionplustoken:thepreviousprediction(pd)forthecurrenttokenand thecurrenttoken.

(b) Previoustwopredictions:thepreviouspredictionforthecurrentand theprevioustoken.

Forspacereasons,Table 3onlyshowstheBrownToken(bt) andBrownTokenShape(c)featuresforpaths oflength 4 and6.Weusethepubliclyavailabletool⁹ implementedbyLiang[36] withdefaultsettings.Theinputconsistsofacorpus tokenized andsegmentedone sentenceperline, withoutpunctuation.Furthermore,we followpreviouswork andremove allsentenceswhichconsistoflessthan90%lowercasecharacters[36,60]beforeinducingtheBrownclusters.

3.3.2. Clarkfeatures

Clark [15] presentsa number ofunsupervised algorithms, based on distributional andmorphological information,for clustering words into classes from unlabeled text. The focus is on clustering infrequent words on a small numbers of clustersfromcomparativelysmallamountsofdata.Inparticular,Clark[15]presentsanalgorithmcombiningdistributional informationwithmorphologicalinformationofwords“bycomposingtheNey–Essenclusteringmodelwithamodelforthe morphology within a Bayesian framework”. The objectiveis to bias the distributionalinformation to putwords that are morphologicallysimilarinthesamecluster.WeusethecodereleasedbyClark[15]offtheshelf¹⁰toinduceClarkclusters usingtheNey–Essenwithmorphologicalinformationmethod.Theinputofthealgorithmisasequenceoflowercasetokens withoutpunctuation,onetokenperlinewithsentencebreaks.

OurClark clusteringfeaturesare verysimple:we performalook-upofthecurrenttokenintheclusteringlexicon. Ifa match isfound,we add asa featuretheclustering class, orthelack ofmatchifthetoken isnot found (see Clark-aand Clark-binTable 3).

3.3.3. Word2vecfeatures

Anotherfamilyoflanguagemodelsthatproduceswordrepresentationsaretheneurallanguagemodels.Theseapproaches produce representationofwordsascontinuous vectors[18,43],alsocalledwordembeddings.Nowadays,perhaps themost popularamongthemistheSkip-gramalgorithm[39].TheSkip-gramalgorithmusesshallowlog-linearmodelstocompute vector representation ofwords whichare more eﬃcientthan previous wordrepresentations induced onneural language models.Theirobjectiveistoproducewordembeddingsbycomputingtheprobabilityofeachn-gramastheproductofthe conditionalprobabilitiesofeachcontextwordinthen-gramconditionedonitscentralword[39].

Instead of using continuous vectors as real numbers, we induce clusters or word classes from the word vectors by applyingK-meansclustering.Inthiswaywecanusetheclusterclassesassimplebinaryfeaturesbyinjectingunigrammatch features.WeusetheWord2vectool¹¹releasedbyMikolovetal.[39]witha5windowcontexttotrain50-dimensionalword embeddingsandtoobtainthewordclustersontopofthem.Theinputofthealgorithmisacorpustokenized,lowercased, withpunctuationremovedandinoneline.TheWord2vecfeaturesareimplementedexactlyliketheClarkfeatures.

9 https://github.com/percyliang/brown-cluster.

10 https://github.com/ninjin/clark_pos_induction.

11 https://code.google.com/p/Word2vec/.

(9)

Table 4

Unlabeledcorporausedtoinducedclusters.Foreachcorpusandclustertypethenumberofwords(inmillions)isspeciﬁed.Averagetrainingtimes:

dependingonthenumberofwords,Brownclusterstrainingtimerequiredbetween5 hand48 h.Word2vecrequired1–4hourswhereasClarkclusters traininglastedbetween5hoursand10days.

Millionwordsincorpus Million words for training

Brown Clark Word2vec

en Reuters RCV1 63 35 63 63

Wikipedia (20141208) 1700 790 790 1700

Gigaword 5th ed. 4000 – – 4000

de Wikipedia (20140725) 650 190 190 650

deWac[6] 1100 500 500 1100

es Wikipedia (20140810) 428 246 246 428

elperiodico (1998–2002) 60 35 60 60

Gigaword 3rd ed. 1150 330 (afp) 330 (afp) 1150

nl Wikipedia (20140804) 235 128 128 235

eu Wikipedia (20141208) 60 12 60 60

Egunkaria (1999–2003) 38 28 38 38

Berria (2003–2014) 90 78 90 90

3.3.4. Stackingandcombiningclusteringfeatures

Wesuccessfullycombineclusteringfeaturesfromdifferentwordrepresentations.Furthermore,wealsostack oraccumu- late featuresofthesametypeofwordrepresentationinduced fromdifferentdatasources, trustingeachclustering lexicon toadifferentdegree,asshownbytheﬁveencodedclusteringfeatures inTable 3:twoClark andWord2vecfeatures from differentsourcedataandoneBrownfeature.Whenusingwordrepresentationsassemi-supervisedfeaturesforatasklike NERC,twoprincipalfactorsneedtobetakenintoaccount:(i)thesourcedataorcorpususedtoinducethewordrepresen- tationsand(ii)theactualwordrepresentationusedtoencodeourfeatureswhichinturnmodifytheweightofourmodel’s parametersinthetrainingprocess.

Fortheclusteringfeaturestobeeffectivetheinducedclustersneedtocontainasmanywordsappearinginthetraining, development andtest setsaspossible.Thiscan beachievedby usingcorporacloselyrelatedtothetext genreordomain ofthedatasets orbyusingvery largeunlabeledcorporawhich,although notcloselydomain-related, belarge enoughto includemanyrelevantwords.Forexample,withrespecttotheCoNLL2003Englishdatasetanexampleoftheformerwould betheReuterscorpuswhiletheWikipediawouldbeanexampleofthelatter.

Thewordrepresentationsobtainedbydifferentalgorithmswouldcapturedifferentdistributionalpropertiesofwordsin a given corpusor data source.Therefore, each type of clustering would allowus to capturedifferent typesof occurring namedentitytypes.Inotherwords,combining andstacking differenttypesofclusteringfeaturesinducedovera varietyof datasourcesshouldhelptocapturemoresimilaritiesbetweendifferentwordsinthetrainingandtest sets,increasingthe contributiontotheweightsofthemodelparametersinthetrainingprocess.

4. Experimentalresults

InthisSectionwereportontheexperimentsperformedwiththeixa-pipe-nerc systemasdescribedintheprevioussec- tion.Theexperimentsareperformedin5languages:Basque,Dutch,English,GermanandSpanish.Forcomparisonpurposes, in-domainresultsarepresentedinSection4.1usingthemostcommonNERCdatasetsforeachlanguage assummarizedin Table 1. Section 4.2analyzesthe performance whenreducing training dataandSection 4.3presentseightout-of-domain evaluationsforthreelanguages:Dutch,EnglishandSpanish.

TheresultsforDutch,EnglishandSpanishdonotincludetrigramsandcharactern-gramsinthelocalfeaturesetdescribed inSection3.1,exceptforthemodelsineachin-domainevaluationwhicharemarkedwith“charngram1:6”.

Wealsoexperimentwithdictionaryfeaturesbut,incontrasttoprevious approachessuchasPassosetal.[48],weonly usecurrentlyavailablegazetteersoff-the-shelf.Forevery modelmarkedwith“dict” weusethethirtyEnglish IllinoisNER gazetteers [52], irrespective of the target language. Additionally, the English models use six gazetteers about the Global AutomotiveIndustryprovidedbyLexisNexistotheNewsreaderproject,¹²whereastheGermanmodelsinclude, inaddition totheIllinois gazetteers,theGermandictionariesdistributedintheCoNLL2003sharedtask.The gazetteersarecollapsed intoonelargedictionaryanddeployedasdescribedinSection3.2.

Finally,theclusteringfeaturesareobtainedbyprocessingthefollowingclustersfrompubliclyavailablecorpora:(i)1000 Brownclusters;(ii)ClarkandWord2vecclustersinthe100-600range.Tochoosethebestcombinationofclusteringfeatures wetesttheavailablepermutationsofClarkandWord2vecclusterswithandwithouttheBrownclustersonthedevelopment data.Table 4provides detailsofeverycorpususedtoinducetheclusters.Forexample,theﬁrstrowreads:“ReutersRCV1 wasused;theoriginal63millionwordswerereducedto35millionafterpre-processingforinducingBrownclusters.Clark

12 http://www.newsreader-project.eu.

(10)

Table 5

CoNLL2003Englishresults.

Features Development Test

P R F1 P R F1

Local (L) 93.02 87.75 90.31 87.27 81.32 84.19

L+Brown Reuters (BR) 92.83 89.33 91.05 90.28 86.79 88.50

L+Clark wiki 600 (CW600) 93.98 90.58 92.24 90.85 87.16 88.97

L+Word2vec giga 200 (W2VG200) 93.16 89.90 91.45 89.64 85.06 87.29

L+Word2vec wiki 400 (W2VW400) 93.22 90.02 91.59 88.98 85.09 86.99

L+^BR+^CW600+W2VW400 (light) 94.16 91.96 93.04 91.20 89.36 90.27

light+CR600+W2VG200 (comp) 94.32 92.22 93.26 91.75 89.64 90.69

comp+BW (best cluster) 94.21 92.23 93.26 91.67 89.98 90.82

comp+^dict ^94.60 ^92.78 ^93.68 ^91.86 ^90.53 ^91.19

BR+CR600-CW600+W2VG200+dict 94.58 92.53 93.54 92.20 90.19 91.18

charngram 1:6+en-91-18 94.56 92.81 93.68 92.16 90.56 91.36

Stanford NER (distsim-conll03) 93.64 92.27 92.95 89.37 87.95 88.65

Illinois NER – – 93.50 n/a n/a 90.57

Turian et al. (2010) 94.11 93.81 93.95 90.10 90.61 90.36

Passos et al. (2014) – – 94.46 – – 90.90

andWord2vecclusterswere trainedonthewholecorpus”.Thepre-processingandtokenizationisperformedwiththeIXA pipestools[1].

EveryevaluationiscarriedoutusingtheCoNLLNERevaluationscript.¹³TheresultsareobtainedwiththeBILOUencod- ingforeveryexperimentalsettingexceptforGermanCoNLL2003.

4.1. In-domainevaluation

In thissectionthe resultsarepresented bylanguage. Intwo cases,DutchandGerman,we usetwo differentdatasets, makingitatotalofsevenin-domainevaluations.

4.1.1. English

WetestedoursysteminthehighlycompetitiveCoNLL2003dataset.Table 5showsthatthreeofourmodelsoutperform previousbestresultsreportedforEnglishintheCoNLL2003dataset[48].NotethatthebestF1score(91.36)isobtainedby addingtrigramsandcharactern-gramfeaturestothebestmodel(91.18).Theresultsalsoshowthatthesemodelsimprove the baselineprovidedby thelocalfeaturesby around7points inF1 score.The mostsigniﬁcant gainisintermsofrecall, almost9pointsbetterthanthebaseline.

We alsoreportvery competitiveresults,onlymarginally lowerthan Passosetal.[48],based onthestacking andcom- bination of clustering features asdescribed inSection 3.3.4. Thus,both bestcluster and comp models,based onlocal plus clustering features only, outperform verycompetitive andmore complexsystemssuch asthose ofRatinovandRoth [52]

andTurianetal.[60].Thestacking andcombining effectmanifestsitselfveryclearlywhenwecomparethesingleclustering feature models (BR,CW600, W2VG200 andW2VW400) with the light,comp and bestcluster models which improve the overallF1scoreby1.30,1.72and1.85respectivelyoverthebestsingleclusteringmodel(CW600).

Itisworthmentioningthatourmodelsdonotscorebestinthedevelopmentdata.Asthedevelopmentdataiscloserin styleandgenretothetrainingdata[52],thismaysuggestthatoursystemgeneralizesbetterontestdatathatisnotclose tothetrainingdata;indeed,theresultsreportedinSection4.3seemtoconﬁrmthishypothesis.

WealsocomparedourresultswithrespecttothebesttwopubliclyavailableEnglishNERsystemstrainedonthesame data. We downloaded theStanford NER systemdistributed inthe 2015-01-30 package.We evaluated their CoNLL model and,whiletheresultissubstantiallybetterthantheirreferencepaper[24],ourclusteringmodelsobtainbetterresults.The IllinoisNERtaggerisusedbyRatinovandRoth[52]andTurianetal.[60],bothofwhichareoutperformedbyoursystem.

4.1.2. German

We tested our system in the GermEval 2014 dataset. Table 6 compares our results with the best two systems (ExB andUKP) bymeansoftheM3metric,whichseparately analyzestheperformanceintermsoftheouter andinner named entityspans.Table 6makesexplicitthesignificantimprovementsachievedbytheclusteringfeaturesontopofthebaseline system, particularlyin termsof recall(almost 11points in the outer level). The officialresults ofour best configuration (de-cluster-dict)arereportedinTable 7showingthatoursystemmarginallyimprovesthebestsystems’resultsonthattask (ExBandUKP).

Wealsocompareoursystem,inthelastthreerows,withthepubliclyavailableGermaNER[8],whichreportsresultsfor the4mainouterlevelentitytypes(person,location,organizationandother).Forthisexperimentwetrainedthede-cluster

13 http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt.

(11)

Table 6

GermEval2014M3metricresultsandcomparisontoGermaNERsystemontheouterspans.

Features Outer NEs Inner NEs

P R F1 P R F1

Local (L) 74.64 67.35 70.81 61.30 34.76 44.36

L+Brown wiki (BW) 79.71 73.20 76.31 63.82 37.67 47.37

L+Clark wiki 200 (CW200) 78.86 72.00 75.27 65.49 36.12 46.56

L+Word2vec deWac 100 (W2VWac100) 71.36 71.63 74.38 62.84 36.12 45.87

BW+^CW200+^W2V(W200+Wac100) (de-cluster) 81.00 75.14 77.96 62.80 40.00 48.87

de-cluster+^dict ^81.52 ^75.54 ^78.42 ^60.28 ^41.55 ^49.20

ExB[28] 80.67 77.55 79.08 45.20 41.17 43.09

UKP[53] 79.90 74.13 76.91 58.74 41.75 48.81

de-4-class-cluster 82.49 75.96 79.09 – – –

de-4-class-cluster+dict 82.70 76.44 79.45 – – –

GermaNER (4 classes) 82.72 71.19 76.52 – – –

Table 7

GermEval2014Oﬃcialresults.

Features Precision Recall F1

de-cluster+^dict ^80.28 ^72.93 ^76.43

ExB 78.07 74.75 76.38

UKP 79.54 71.10 75.09

Table 8

CoNLL2003Germanresults.

P R F1 P R F1

Local (L) 79.38 55.27 65.16 81.70 59.92 69.14

L+Brown deWac (BWac) 82.12 66.73 73.63 82.26 66.53 73.57

L+Clark deWac 500 (CWac500) 82.40 65.69 73.11 82.76 67.33 74.25

L+Word2vec deWac 100 (W2VWac100) 81.50 63.52 71.40 81.96 66.46 73.41

CWac500+W2VWac100 (de-cluster) 83.35 68.38 75.13 83.43 68.96 75.51

de-cluster+^dict ^85.20 ^70.54 ^77.18 ^83.72 ^70.30 ^76.42

Florian et al. (2003) 84.60 61.93 71.51 80.19 63.71 72.41

Faruqui and Padó (2010) 86.00 70.00 77.20 86.40 68.50 76.40

andde-cluster+dict modelsonthefourmainclasses,improvingGermaNER’sresultsbyalmost3F1points.TheGermaNER methodofevaluationisinterestingbecauseallows researcherstodirectlycomparetheir systemswitha publiclyavailable systemtrainedonGermEvaldata.

Table 8comparesourGermanCoNLL2003resultswiththebestpreviousworktrainedonpublicdata.OurbestCoNLL 2003modelobtainsresultssimilartothestateoftheartperformancewithrespecttothebestsystempublisheduptodate [23]usingpublicdata.Faruquietal.[23]alsoreport78.20 F1withamodeltrainedwithClark clustersinducedusingthe HugeGermanCorpus(HGC).Unfortunately,thecorpusortheinducedclusterswerenotavailable.

4.1.3. Spanish

ThebestsystemuptodateontheCoNLL2002dataset,originallypublishedbyCarrerasetal.[12],isdistributedaspart of theFreeling library [47]. Table 9lists four models that improveover their reported results,almost by 3points inF1 measureinthecaseofthees-cluster model(withourwithouttrigramandcharactern-gramfeatures).

4.1.4. Dutch

Despite using clusters from one data source only (see Table 4), results in Table 10 show that our nl-cluster model outperforms the best result published on CoNLL 2002 [16] by 3.83 points in F1 score. Adding the English Illinois NER gazetteers[52]andtrigramandcharactern-gramfeaturesincreasesthescoreto85.04F1,5.41pointsbetterthanprevious publishedworkonthisdataset.

We also compared our system with the more recentlydeveloped SONAR-1 corpus and the companion NERDsystem distributedinside its release[21]. Theyreport 84.91 F1forthe sixmainnamed entitytypesvia 10-foldcross validation.

Forthiscomparisonwechosethelocal,nl-cluster andnl-cluster-dict conﬁgurationsfromTable 10andrunthemonSONAR-1

(12)

Table 9

CoNLL2002Spanishresults.

P R F1 P R F1

Local (L) 77.78 75.30 76.52 79.49 79.54 79.52

L+Brown periodico (BP) 80.35 78.91 79.62 82.44 82.46 82.45

L+Clark giga 400 (CG400) 81.71 79.27 80.48 81.75 81.82 81.78

L+Clark wiki 400 (CW400) 80.84 78.75 79.78 81.07 81.00 81.03

L+Word2vec giga 400 (W2VG400) 80.04 77.96 78.99 81.56 81.76 81.67

BP+^C(W400+^G400)+W2VG400 (es-cluster) 81.87 80.61 81.23 84.18 84.15 84.16

charngram 1:6+es-cluster 81.95 80.24 81.09 84.27 84.01 84.14

Carreras et al. (2002) 79.15 77.80 78.47 81.38 81.40 81.39

Table 10

CoNLL2002Dutchresults.

P R F1 P R F1

Local (L) 75.66 70.95 73.23 78.97 74.80 76.83

L+Brown wiki (BW) 80.50 76.99 78.70 83.12 79.56 81.32

L+Clark wiki 400 (CW400) 79.44 75.92 77.64 83.18 80.21 81.67

BW+CW400+W2VW100 (nl-cluster) 82.16 79.20 80.65 84.44 82.49 83.46

nl-cluster+dict 83.63 80.66 82.12 85.92 83.00 84.43

charngram 1:6+nl-cluster 83.27 80.47 81.84 85.52 82.42 83.94

charngram 1:6+nl-cluster-dict 84.65 81.57 83.08 86.57 83.56 85.04

Carreras et al. (2002) 76.52 74.82 75.66 77.83 76.29 77.05

Curran and Clark (2003) – – – 79.91 79.35 79.63

Table 11

SONAR-110-foldcrossvalidationresults.

Features Precision Recall F1

Local (L) 86.66 85.57 86.11

nl-cluster 87.89 87.56 87.72

nl-cluster+^dict ^88.08 ^87.91 ^88.00

Sonar-nerd – – 84.91

Table 12

BasqueEgunkariaresults.

Features P R F1

Local 70.52 60.27 65.00

L+Brown egunkaria (BE) 74.54 67.59 70.90

L+Clark egunkaria 200 (CE200) 76.76 68.92 72.63

L+Clark wiki 200 (CW200) 75.57 65.60 70.23

L+Word2vec egunkaria 300 (W2VE300) 74.04 62.71 67.91

L+Word2vec berria 600 (W2WB600) 74.11 64.82 69.15

BE+C(EW)200+W2V(E300+B600) (eu-cluster) 80.66 73.14 76.72

eu-cluster (4 classes) 80.66 70.78 75.40

Alegria et al. (2006) 72.50 70.24 71.35

using thesame settings.The results reportedinTable 11 showsour system’simprovementover previous results onthis dataset.

4.1.5. Basque

Table 12reportsontheexperimentsusingtheEgunkaria NERdatasetprovidedbyAlegriaetal.[4].Duetothesparsity oftheMISCclassmentionedinSection2.1,wedecidedtotrainourmodelsonthreeclassesonly(location,organizationand person).Thus,theresultsareobtainedtrainingourmodelsinthecustomarymannerandevaluatingon3classes.However, fordirectcomparisonwithpreviouswork[4],wealsoevaluateourbesteu-cluster model(trainedon3classes)on4classes.

The results show that our eu-cluster model clearly improves upon previous work by 4 points in F1 measure (75.40 vs 71.35).These results are particularly interesting as it had been so far assumedthat complex linguistic features and language-speciﬁc rules were requiredto perform well foragglutinative languagessuch asBasque[4]. Finally,it isworth