
Available online at www.sciencedirect.com

Journal of Applied Research and Technology

www.jart.ccadet.unam.mx Journal of Applied Research and Technology 15 (2017) 259–270

Original

Automatic speech recognizers for Mexican Spanish and its open resources

Carlos Daniel Hernández-Mena^a,∗, Ivan V. Meza-Ruiz^b, José Abel Herrera-Camacho^a

a Laboratorio de Tecnologías del Lenguaje (LTL), Universidad Nacional Autónoma de México (UNAM), Mexico

b Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México (UNAM), Mexico

Received 21 March 2016; accepted 9 February 2017

Available online 28 April 2017

Abstract

Development of automatic speech recognition systems relies on the availability of distinct language resources such as speech recordings, pronunciation dictionaries, and language models. These resources are scarce for the Mexican Spanish dialect. In this work, we present a revision of the CIEMPIESS corpus, a resource for spontaneous speech recognition in Mexican Spanish of Central Mexico. It consists of 17 h of segmented and transcribed recordings, a phonetic dictionary composed of 53,169 unique words, and a language model composed of 1,505,491 words extracted from 2489 university newsletters. We also evaluate the CIEMPIESS corpus using three well-known state-of-the-art speech recognition engines, with satisfactory results. These resources are open for research and development in the field. Additionally, we present the methodology and the tools used to facilitate the creation of these resources, which can be easily adapted to other variants of Spanish, or even other languages.

Keywords: Automatic speech recognition; Mexican Spanish; Language resources; Language model; Acoustic model

1. Introduction

Current advances in automatic speech recognition (ASR) have been possible given the available speech resources such as speech recordings, orthographic transcriptions, phonetic alphabets, pronunciation dictionaries, large collections of text and computational software for the construction of ASR systems. However, the availability of these resources varies from language to language. Until recently, the creation of such resources has been largely focused on English. This has had a positive effect on the development of research in the field and of speech technology for this language. This effect has been so positive that the information and processes have been transferred to other languages, so that nowadays the most successful recognizers for the Spanish language are not created in Spanish-speaking countries. Furthermore, recent development in the ASR field relies on corpora with restricted access or no access at all. In order to make progress in the study of spoken Spanish and take full advantage of ASR technology, we consider that a greater amount of resources for Spanish needs to be freely available to the research community and industry.

∗ Corresponding author.

Peer Review under the responsibility of Universidad Nacional Autónoma de México.

With this in mind, we present a methodology, and the resources associated with it, for the construction of ASR systems for Mexican Spanish; we argue that with minimal adaptations to this methodology, it is possible to create resources for other variants of Spanish or even other languages.

The methodology that we propose focuses on facilitating the collection of the examples necessary for the creation of an ASR system and the automatic construction of pronunciation dictionaries. This methodology has resulted in two collections that we present in this work. The first is the largest collection of recordings and transcriptions for Mexican Spanish freely available for research, and the second is a large collection of text extracted from a university magazine. The first collection was collected and transcribed over a period of two years and is used to create acoustic models. The second collection is used to create a language model.

We also present our system for the automatic generation of phonetic transcriptions of words. This system allows the creation of pronunciation dictionaries. In particular, these transcriptions are based on the MEXBET (Cuetara-Priede, 2004) phonetic alphabet, a well-established alphabet for Mexican Spanish. Together, these resources are combined to create ASR systems based on three freely available software frameworks: Sphinx, HTK and Kaldi. The final recognizers are evaluated, compared, and made available to be used for research purposes or to be integrated into Spanish speech-enabled systems.

http://dx.doi.org/10.1016/j.jart.2017.02.001

1665-6423/© 2017 Universidad Nacional Autónoma de México, Centro de Ciencias Aplicadas y Desarrollo Tecnológico. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Table 1

ASR corpora for the top five spoken languages in the world.

| Rank | Language | LDC | ELRA | Examples |
|------|----------|-----|------|----------|
| 1 | Mandarin | 24 | 6 | TDT3 (Graff, 2001); TC-STAR 2005 (TC-STAR, 2006) |
| 2 | English | 116 | 23 | TIMIT (Garofolo, 1993); CLEF QAST (CLEF, 2012) |
| 3 | Spanish | 20 | 20 | CALLHOME (Canavan & Zipperlen, 1996); TC-STAR Spanish (TC-STAR, 2007) |
| 4 | Hindi | 9 | 4 | OGI Multilanguage (Cole & Muthusamy, 1994); LILA (LILA, 2012) |
| 5 | Arabic | 10 | 32 | West Point Arabic (LaRocca & Chouairi, 2002); NetDC (NetDC, 2007) |

Finally, we present the creation of the CIEMPIESS corpus. The CIEMPIESS corpus (Hernández-Mena, 2015; Hernández-Mena & Herrera-Camacho, 2014) was designed to be used in the field of automatic speech recognition, and we use our experience in creating it as a concrete example of the whole methodology that we present in this paper. That is why CIEMPIESS is embedded in all our explanations and examples.

The paper has the following outline: in Section 2 we present a review of the corpora available for automatic speech recognition in Spanish and Mexican Spanish; in Section 3 we present how an acoustic model is created from audio recordings and their orthographic transcriptions. In Section 4 we explain how to generate a pronunciation dictionary using our automatic tools. In Section 5 we show how to create a language model. Section 6 shows how we evaluated the database in a real ASR system and how we validated the automatic tools presented in this paper. Finally, in Section 7, we discuss our conclusions.

2. Spanish language resources

According to the "Anuario 2013"¹ created by the Instituto Cervantes² and the "Atlas de la lengua española en el mundo" (Moreno-Fernández & Otero, 2007), the Spanish language is one of the top five most spoken languages in the world. In fact, the Instituto Cervantes makes the following remarks:

• Spanish is the second most spoken native language, just behind Mandarin Chinese.

• Spanish is the second language for international communication.

• Spanish is spoken by more than 500 million people, including speakers who use it as a native language or a second language.

• It is projected that in 2030, 7.5% of the people of the world will speak Spanish.

• Mexico is the country with the most Spanish speakers among Spanish-speaking countries.

This speaks to the importance of Spanish for speech technologies, which can be corroborated by the amount of available resources for ASR in Spanish. This can be seen in Table 1, which summarizes the amount of ASR resources available in the Linguistic Data Consortium³ (LDC) and in the European Language Resources Association⁴ (ELRA) for the top five most spoken languages.⁵ As one can see, the resources for English are abundant compared to the rest of the top languages. However, in the particular case of the Spanish language, there is a good amount of resources reported in these databases. Besides, additional resources for Spanish can be found in other sources such as reviews in the field (Llisterri, 2004; Raab, Gruhn, & Noeth, 2007), the "LRE Map",⁶ and proceedings of specialized conferences, such as LREC (Calzolari et al., 2014).

1 Available for web download at: http://cvc.cervantes.es/lengua/anuario/anuario13/ (August 2015).

2 The Instituto Cervantes (http://www.cervantes.es/).

2.1. Mexican Spanish resources

In the previous section, one can notice that there are several options available for the Spanish language, but when one focuses on a dialect such as Mexican Spanish, the resources are scarcer. In the literature one can find several articles dedicated to the creation of speech resources for Mexican Spanish (Kirschning, 2001; Olguín-Espinoza, Mayorga-Ortiz, Hidalgo-Silva, Vizcarra-Corral, & Mendiola-Cárdenas, 2013; Uraga & Gamboa, 2004). However, researchers usually create small databases to do experiments, so one has to contact the authors and depend on their goodwill to get a copy of the resource (Audhkhasi, Georgiou, & Narayanan, 2011; de Luna Ortega, Mora-González, Martínez-Romo, Luna-Rosas, & Muñoz-Maciel, 2014; Moya, Hernández, Pineda, & Meza, 2011; Varela, Cuayáhuitl, & Nolazco-Flores, 2003).

3 https://www.ldc.upenn.edu/.

4 http://www.elra.info/en/.

5 For more details on the resources per language visit: http://www.ciempiess.org/corpus/CorpusforASR.html.

6 http://www.resourcebook.eu/searchll.php.

Table 2

Corpora for ASR that include the Mexican Spanish language.

| Name | Size | Dialect | Data sources | Availability |
|------|------|---------|--------------|--------------|
| DIMEx100 (Pineda, Pineda, Cuétara, Castellanos, & López, 2004) | 6.1 h | Mexican Spanish of Central Mexico | Read utterances | Free Open License |
| 1997 Spanish Broadcast News Speech HUB4-NE (Others, 1998) | 30 h | Includes Mexican Spanish | Broadcast news | Since 2015 LDC98S74 $400.00 USD |
| 1997 HUB4 Broadcast News Evaluation Non-English Test Material (Fiscus, 2001) | 1 h | Includes Mexican Spanish | Broadcast news | LDC2001S91 $150.00 USD |
| LATINO-40 (Bernstein, 1995) | 6.8 h | Several countries of Latin America including Mexico | Microphone speech | LDC95S28 $1000.00 USD |
| West Point Heroico Spanish Speech (Morgan, 2006) | 16.6 h | Includes Mexican Spanish of Central Mexico | Microphone speech (read) | LDC2006S37 $500.00 USD |
| Fisher Spanish Speech (Graff, 2010) | 163 h | Caribbean and non-Caribbean Spanish (including Mexico) | Telephone conversations | LDC2010S01 $2500.00 USD |
| Hispanic–English Database (Byrne, 2014) | 30 h | Speakers from Central and South America | Microphone speech (conversational and read speech) | LDC2014S05 $1500.00 USD |

Even though the resources in Mexican Spanish are scarce, we identified 7 corpora which are easily available. These are presented in Table 2. As shown in the table, one has to pay in order to have access to most of the resources. The notable exception is the DIMEx100 corpus (Pineda et al., 2010), which was recently made available.⁷ The problem with this resource is that the corpus is composed of read material and is only 6 h long, which limits the type of acoustic phenomena present. This imposes a limit on the performance of a speech recognizer created using this resource (Moya et al., 2011). In this work we present the creation of the CIEMPIESS corpus and its open resources. The CIEMPIESS corpus consists of 17 h of recordings of broadcast interviews in Central Mexico Spanish, which provide spontaneous speech. This makes it a good candidate for the creation of speech recognizers. Besides this, we present the methodology and tools created for harvesting the different aspects of the corpus so that these can be replicated for other underrepresented dialects of Spanish.

3. Acoustic modeling

7 For web downloading at: http://turing.iimas.unam.mx/luis/DIME/CORPUS-DIMEX.html.

For several decades, ASR technology has relied on the machine learning approach. In this approach, examples of the phenomenon are learned. A model resulting from the learning is created and later used to predict such phenomenon. In an ASR system, two sources of examples are needed for its construction. The first one is a collection of recordings and their corresponding transcriptions. These are used to model the relation between sounds and phonemes. The resulting model is usually referred to as the acoustic model. The second source consists of examples of sentences in a language, which are usually obtained from a large collection of texts. These are used to learn a model of how phrases are built from a sequence of words. The resulting model is usually referred to as the language model. Additionally, an ASR system uses a dictionary of pronunciations to link the acoustic and the language models, since it captures how phonemes compose words. Fig. 1 illustrates these elements and how they relate to each other.

Fig. 1. Components and models for automatic speech recognition. (Diagram elements: speech, ASR system, acoustic models, pronouncing dictionary, language model, text.)
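The way the acoustic model, the pronunciation dictionary and the language model fit together is commonly summarized with the textbook probabilistic formulation of speech recognition. The formula below is that general formulation, not something stated in this paper; the symbols O (acoustic observations), W (word sequence) and Q (phoneme sequence) are introduced here only for illustration, and the second identity assumes that the acoustics depend on the words only through their phonemes.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O)
        \;=\; \arg\max_{W} \; P(O \mid W)\, P(W),
\qquad
P(O \mid W) \;=\; \sum_{Q} P(O \mid Q)\, P(Q \mid W)
```

Under this reading, P(O | Q) would be supplied by the acoustic model, P(Q | W) by the pronouncing dictionary, and P(W) by the language model.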

Fig. 2 shows in detail the full process and the elements needed to create acoustic models. First of all, the recordings of the corpus pass through a feature extraction module, which calculates the spectral information of the incoming recordings and transforms them into a format the training module can handle. A list of phonemes must be provided to the system. The task of filling every model with statistical information is performed by the training process.

Fig. 2. Architecture of a training module for automatic speech recognition systems. (Diagram elements: corpus with recordings and transcriptions, feature extraction, list of phonemes, pronouncing dictionary, training module, training results, acoustic models.)

Table 3

CIEMPIESS original audio file properties.

| Description | Properties |
|-------------|------------|
| Number of source files | 43 |
| Format of source files | mp3/44.1 kHz/128 kbps |
| Duration of all the files together | 41 h 21 min |
| Duration of the longest file | 1 h 26 min 41 s |
| Duration of the shortest file | 43 min 12 s |
| Number of different radio shows | 6 |

3.1. Audio collection

The original recordings used to create the CIEMPIESS corpus come from radio interviews in podcast format.⁸ We chose this source because the recordings were easily available, they had several speakers with the accent of Central Mexico, and all of them were speaking freely. The files add up to a total of 43 one-hour episodes.⁹ Table 3 summarizes the main characteristics of the original version of these audio files.

3.2. Segmentation of utterances

From the original recordings, the segments of speech need to be identified. To define a "good" segment of speech, the following criteria were taken into account:

• Segments with a unique speaker.

• Segments correspond to an utterance.

• There should not be music in the background.

• The background noise should be minimal.

• The speaker should not be whispering.

• The speaker should not have an accent other than that of Central Mexico.

Table 4

Characteristics of the utterance recordings of the CIEMPIESS corpus.

| Characteristic | Training | Test |
|----------------|----------|------|
| Number of utterances | 16,017 | 1000 |
| Total of words and labels | 215,271 | 4988 |
| Number of words with no repetition | 12,105 | 1177 |
| Number of recordings | 16,017 | 1000 |
| Total amount of time of the set (hours) | 17.22 | 0.57 |
| Average duration per recording (seconds) | 3.87 | 2.085 |
| Duration of the longest recording (seconds) | 56.68 | 10.28 |
| Duration of the shortest recording (seconds) | 0.23 | 0.38 |
| Average of words per utterance | 13.4 | 4.988 |
| Maximum number of words in an utterance | 182 | 37 |
| Minimum number of words in an utterance | 2 | 2 |

8 Originally transmitted by: "RADIO IUS" (http://www.derecho.unam.mx/cultura-juridica/radio.php) and available for web downloading at: PODSCAT-UNAM (http://podcast.unam.mx/).

9 For more details visit: http://www.ciempiess.org/CIEMPIESSStatistics.html#Tabla2.

At the end, 16,717 utterances were identified. This is equivalent to 17 h of only "clean" speech audio. 78% of the segments come from male speakers and 22% from female speakers.¹⁰ This gender imbalance is not uncommon in other corpora (for example, see Federico, Giordani, & Coletti, 2000; Wang, Chen, Kuo, & Cheng, 2005), since gender balancing is not always possible (as in Langmann, Haeb-Umbach, Boves, & den Os, 1996; Larcher, Lee, Ma, & Li, 2012). The segments were divided into two sets: training (16,017) and test (700). The test set was additionally complemented by 300 utterances from different sources such as interviews, broadcast news and read speech. We added these 300 utterances from a small corpus that belongs to our laboratory, to perform private experiments that are important to some of our students. Table 4 summarizes the main characteristics of the utterance recordings of the CIEMPIESS corpus.¹¹

The audio of the utterances was standardized into recordings sampled at 16 kHz, with 16 bits, in a NIST Sphere PCM mono format, with noise-removal filtering applied when necessary.

3.3. Orthographic transcription of utterances

In addition to the audio collection of utterances, it was necessary to have their orthographic transcriptions. In order to create these transcriptions we followed these guidelines: first, the process begins with a canonical orthographic transcription of every utterance in the corpus. Later, these transcriptions were enhanced to mark certain phenomena of the Spanish language. The considerations we took into account for the enhanced version were:

• Do not use capitalization or punctuation.

• Expand abbreviations (e.g. the abbreviation "PRD" was written as "pe erre de").

• Numbers must be written orthographically.

• Special characters were introduced (e.g. N for ñ, W for ü, $ when the letter "x" sounds like the phoneme /s/, S when the letter "x" sounds like the phoneme /ʃ/).

• The letter "x" has multiple phonemes associated with it.

• We wrote the tonic vowel of every word in uppercase.

• We marked the silences and disfluencies.

10 To see a table that shows which sentences belong to a particular speaker, visit: http://www.ciempiess.org/CIEMPIESSStatistics.html#Tabla8.

11 For more details, see this chart at: http://www.ciempiess.org/CIEMPIESSStatistics.html#Tabla1.

3.4. Enhanced transcription

As mentioned in the previous section, the orthographic transcription was enhanced by adding information about the pronunciation of the words and the disfluencies in the speech. The rest of this section presents such enhancements.

The spelling of most words in Spanish contains enough information about how to pronounce them; however, there are exceptions. In order to facilitate the automatic creation of a dictionary, we added information about the pronunciation of the letter "x", which has several pronunciations in Mexican Spanish. Annotators were asked to replace each "x" with an approximation of its pronunciation. Doing this, we eliminate the need for an exception dictionary. Table 5 exemplifies such cases.

Another mark we annotate in the orthographic transcription is the indication of the tonic vowel in a word. In Spanish, a tonic vowel is usually identified by a rise in pitch; sometimes this vowel is explicitly marked in the orthography of the word by an acute accent (e.g. "acción", "vivía"). In order to make this difference explicit among sounds, the enhanced transcriptions mark the tonic vowel in both the explicit and implicit cases (Table 6 exemplifies this consideration). We did this for two reasons: the most important is that we want to explore the effect of tonic and non-tonic vowels on speech recognition, and the other is that some software tools created for HTK or SPHINX do not handle characters with acute accents properly, so the best thing to do is to use only ASCII symbols.
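The explicit and implicit cases can be illustrated with a small sketch based on the general Spanish stress rules (a written accent wins; otherwise words ending in a vowel, "n" or "s" are stressed on the penultimate syllable, and the rest on the last one). This is not the authors' tool: the function name `mark_tonic_vowel` and the simplified diphthong handling are assumptions made for illustration, and the sketch will fail on hiatus cases and many proper names.

```python
ACCENTED = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}
STRONG = set("aeo")
WEAK = set("iuü")
VOWELS = STRONG | WEAK


def mark_tonic_vowel(word: str) -> str:
    """Return the lowercase word with its estimated tonic vowel in uppercase."""
    w = word.lower()

    # Explicit case: a written acute accent marks the tonic vowel.
    for i, ch in enumerate(w):
        if ch in ACCENTED:
            return w[:i] + ACCENTED[ch].upper() + w[i + 1:]

    # Implicit case: collect syllable nuclei as lists of vowel positions.
    nuclei = []
    i = 0
    while i < len(w):
        # The "u" in que/qui/gue/gui is silent and is not a nucleus.
        silent_u = (
            w[i] == "u" and i > 0 and w[i - 1] in "qg"
            and i + 1 < len(w) and w[i + 1] in "ei"
        )
        if w[i] in VOWELS and not silent_u:
            group = [i]
            j = i + 1
            # Adjacent vowels share a nucleus when at least one is weak.
            while j < len(w) and w[j] in VOWELS and (
                w[j] in WEAK or w[group[-1]] in WEAK
            ):
                group.append(j)
                j += 1
            nuclei.append(group)
            i = j
        else:
            i += 1

    if not nuclei:
        return w
    if len(nuclei) == 1:
        target = nuclei[0]
    elif w[-1] in VOWELS or w[-1] in "ns":
        target = nuclei[-2]  # penultimate syllable
    else:
        target = nuclei[-1]  # last syllable

    # Inside a diphthong, stress the strong vowel; with two weak vowels, the last.
    strong = [k for k in target if w[k] in STRONG]
    k = strong[0] if strong else target[-1]
    return w[:k] + w[k].upper() + w[k + 1:]
```

For the words used as examples in this paper, the sketch marks "ambulancia", "ejemplar" and "tobogán" on the same vowels as Table 6; it does not apply the N and W substitutions for ñ and ü.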

Table 5

Enhancement for the letter "x".

| Canonical transcriptions | Phoneme equivalence (IPA) | Transcription mark | Enhanced transcription |
|--------------------------|---------------------------|--------------------|------------------------|
| sexto, oxígeno | /ks/ | KS | sEKSto, oKSIgeno |
| xochimilco, xilófono | /s/ | $ | $ochimIlco, $ilOfono |
| xolos, xicoténcatl | /ʃ/ | S | SOlos, SicotEncatl |
| ximena, xavier | /x/ | J | JimEna, JaviEr |

Table 6

Example of tonic marks in enhanced transcription.

| Canonical | Enhanced | Canonical | Enhanced |
|-----------|----------|-----------|----------|
| ambulancia | ambulAncia | perla | pErla |
| química | quImica | aglutinado | aglutinAdo |
| niño | nINo | ejemplar | ejemplAr |
| pingüino | pingWIno | tobogán | tobogAn |

Finally, silences and disfluencies were marked following this procedure:

• An automatic and uniform alignment was produced using the words in the transcriptions and the utterance audios.

• Annotators were asked to align the words with their audio using the PRAAT system (Boersma & Weenink, 2013).

• When there was speech which did not correspond to the audio, the annotators were asked to analyze whether it was a case of a disfluency. If so, they were asked to mark it with a ++dis++.

• The annotators were also asked to mark evident silences in the speech with a <sil>.

Table 7

Example of enhanced transcriptions.

Original

<s> a partir del año mil novecientos noventa y siete </s> (S1)

<s> es una forma de expresion de los sentimientos </s> (S2)

Enhanced

<s> <sil> A partIr dEl ANo <sil> mIl noveciEntos ++dis++ novEnta y siEte <sil> </s> (S1)

<s> <sil> Es ++dis++ Una fOrma dE eKSpresiOn <sil> dE lOs sentimiEntos <sil> </s> (S2)

Table 7 compares a canonical transcription and its enhanced version.

Both the segmentation and the orthographic transcriptions, in their canonical and enhanced versions (word alignment), are very time-consuming processes. In order to transcribe and align the full set of utterances, we collaborated with 20 college students, each investing 480 h in the project over two years. Three of them selected the utterances from the original audio files using the Audacity tool (Team, 2012) and created the orthographic transcriptions over a period of six months, at the rate of one hour per week. The rest of the collaborators spent one and a half years aligning the word transcriptions with the utterances in order to detect silences and disfluencies. This last step implies that the orthographic transcriptions were checked at least twice by two different people. Table 8 shows a timeline of the tasks done per semester.

Table 8

Number of students working per semester and their labors.

| Semester | Students | Labor |
|----------|----------|-------|
| 1st | 3 | Audio selection and orthographic transcriptions |
| 2nd | 6 | Word alignment |
| 3rd | 6 | Word alignment |
| 4th | 5 | Word alignment |

Table 9

Word frequency between CIEMPIESS and CREA corpus.

| No. | Words in CREA | Norm. freq. CREA | Norm. freq. CIEMPIESS |
|-----|---------------|------------------|-----------------------|
| 1 | de | 0.065 | 0.056 |
| 2 | la | 0.041 | 0.033 |
| 3 | que | 0.030 | 0.051 |
| 4 | el | 0.029 | 0.026 |
| 5 | en | 0.027 | 0.025 |
| 6 | y | 0.027 | 0.022 |
| 7 | a | 0.021 | 0.026 |
| 8 | los | 0.017 | 0.014 |
| 9 | se | 0.013 | 0.014 |
| 10 | del | 0.012 | 0.008 |
| 11 | las | 0.011 | 0.008 |
| 12 | un | 0.010 | 0.011 |
| 13 | por | 0.010 | 0.008 |
| 14 | con | 0.009 | 0.006 |
| 15 | no | 0.009 | 0.015 |
| 16 | una | 0.008 | 0.010 |
| 17 | su | 0.007 | 0.003 |
| 18 | para | 0.006 | 0.008 |
| 19 | es | 0.006 | 0.017 |
| 20 | al | 0.006 | 0.004 |

3.5. Distribution of words

In order to verify that the distribution of words in CIEMPIESS corresponds to the distribution of words in the Spanish language, we compare the distribution of the function words in the corpus against the "Corpus de Referencia del Español Actual" (Reference Corpus of Current Spanish, CREA).¹² The CREA corpus comprises 140,000 text documents, that is, more than 154 million words extracted from books (49%), newspapers (49%), and miscellaneous sources (2%). It also has more than 700,000 word tokens. Table 9 illustrates the minor differences between the frequencies of the 20 most frequent words in CREA and their frequencies in CIEMPIESS.¹³ As one can see, the distribution of function words is proportional between both corpora.

We also calculated the mean square error (MSE = 9.3×10⁻⁸) of the normalized word frequencies between CREA and the whole CIEMPIESS; it is low, and the correlation of the two distributions is 0.95. Our interpretation is that the distribution of words in CIEMPIESS reflects the distribution of words in the Spanish language. We argue that this is relevant because CIEMPIESS is then a good sample that reflects well the behavior of the language that it intends to model.
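As an illustration of how such a comparison can be computed, the following sketch applies the mean square error and the Pearson correlation to the 20 normalized frequencies listed in Table 9. It is only a sketch: the helper names are hypothetical, and because the values reported in the text were computed over the whole vocabulary rather than over these 20 rounded entries, the numbers produced here are not expected to match them.

```python
from math import sqrt

# Normalized frequencies of the 20 most frequent CREA words (Table 9).
crea = [0.065, 0.041, 0.030, 0.029, 0.027, 0.027, 0.021, 0.017, 0.013, 0.012,
        0.011, 0.010, 0.010, 0.009, 0.009, 0.008, 0.007, 0.006, 0.006, 0.006]
ciempiess = [0.056, 0.033, 0.051, 0.026, 0.025, 0.022, 0.026, 0.014, 0.014, 0.008,
             0.008, 0.011, 0.008, 0.006, 0.015, 0.010, 0.003, 0.008, 0.017, 0.004]


def mean_squared_error(x, y):
    """Average of the squared differences between paired values."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


mse = mean_squared_error(crea, ciempiess)
corr = pearson_correlation(crea, ciempiess)
print(f"MSE over Table 9 entries: {mse:.2e}")
print(f"Correlation over Table 9 entries: {corr:.2f}")
```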

4. Pronouncing dictionaries

The examples of audio utterances and their transcriptions alone are not enough to start the training procedure of the ASR system. In order to learn how the basic sounds of a language sound, it is necessary to translate words into sequences of phonemes. This information is codified in the pronunciation dictionary, which proposes one or more pronunciations for each word. These pronunciations are described as a sequence of phonemes, for which a canonical set of phonemes has to be decided. In the creation of the CIEMPIESS corpus, we proposed the automatic extraction of pronunciations based on the enhanced transcriptions.

12 See: http://www.rae.es/recursos/banco-de-datos/crea-escrito.

13 Download the word frequencies of CREA from: http://corpus.rae.es/lfrecuencias.html.

4.1. Phonetic alphabet

In this work we used the MEXBET phonetic alphabet, which has been proposed to encode the phonemes and the allophones of Mexican Spanish (Cuetara-Priede, 2004; Hernández-Mena, Martínez-Gómez, & Herrera-Camacho, 2014; Uraga & Pineda, 2000). MEXBET is a heritage of our University and it has been successfully used over the years in several articles and theses. Nevertheless, the best reason for choosing MEXBET is that it is the most up-to-date phonetic alphabet for the Mexican Spanish dialect.

This alphabet has three levels of granularity, from the phonological (T22) to the phonetic (T44 and T54).¹⁴ For the purpose of the CIEMPIESS corpus we extended the T22 and T54 levels to what we call T29 and T66.¹⁵ In the case of T29 these are the main changes:

• For T29 we added the phoneme /tl/ as in iztaccíhuatl → /is.tak.ˈsi.ua.tl/ → [ .tak. si.wa.tl]. Even though the counts of the phoneme /tl/ are so low in the CIEMPIESS corpus, we decided to include it in MEXBET because in Mexico many proper names of places need it to have a correct phonetic transcription.

• For T29 we added the phoneme /S/ as in xolos → /ʃo.los/ → [ʃo.l s].

• For T29 we considered the symbols /a7/, /e7/, /i7/, /o7/ and /u7/ of the levels T44 and T54, used to indicate tonic vowels in word transcriptions.

All these changes were motivated by the need to produce more accurate phonetic transcriptions following the analysis of Cuetara-Priede (2004). Table 10 illustrates the main differences between the T54 and T66 levels of the MEXBET alphabet.

14 For more detail on the different levels and the evolution of MEXBET through time see the charts in: http://www.ciempiess.org/AlfabetosFoneticos/EVOLUTIONofMEXBET.html.

15 In our previous papers, we refer to the level T29 as T22 and the level T66 as T50, but this is incorrect because the number "22" or "44", etc. must reflect the number of phonemes and allophones considered in that level of MEXBET.

Table 10

Comparison between MEXBET T66 for the CIEMPIESS (left or in bold) and DIMEx100 T54 (right) databases.

| Consonants | Symbols (ordered from labial to velar: labial, labiodental, dental, alveolar, palatal, velar) |
|------------|------|
| Unvoiced stops | p: p/pc; t: t/tc; kj: kj/kc; k: k/kc |
| Voiced stops | b: b/bc; d: d/dc; g: g/gc |
| Unvoiced affricate | tS: tS/tSc |
| Voiced affricate | dZ: dZ/dZc |
| Unvoiced fricative | f; s[; s; S; x |
| Voiced fricative | V; D; z[; z; Z; G |
| Nasal | m; M; n[; n; nj; n∼; N |
| Voiced lateral | l[; l; tl; lj |
| Voiceless lateral | l0 |
| Rhotic | r(; r |
| Voiceless rhotic | r(0; r(\ |

| Vowels | Palatal | Central | Velar |
|--------|---------|---------|-------|
| Semi-consonants | j, i( | | w, u( |
| Close | i, I | | u, U |
| Mid | e, E | | o, O |
| Open | aj | a | a2 |

| Tonic vowels | Palatal | Central | Velar |
|--------------|---------|---------|-------|
| Semi-consonants | j7, i(7 | | w7, u(7 |
| Close | i7, I7 | | u7, U7 |
| Mid | e7, E7 | | o7, O7 |
| Open | aj7 | a7 | a27 |

Table 11

Example of transcriptions in IPA against transcriptions in MEXBET.

| Word | Phonological IPA | Phonetic IPA |
|------|------------------|--------------|
| ineptitud | i.nep.ti.ˈtud | i.n p.ti. t |
| indulgencia | in.dul.ˈxen.sia | .d l. xen.sja |
| institución | ins.ti.tu.ˈsion | n .ti.tu. sj n |

| Word | MEXBET T29 | MEXBET T66 |
|------|------------|------------|
| ineptitud | ineptitu7d | inEptitU7D |
| indulgencia | indulxe7nsia | In[dUlxe7nsja |
| institución | institusio7n | Ins[titusiO7n |

Table 11 shows examples of different Spanish words transcribed using the symbols of the International Phonetic Alphabet (IPA) against the symbols of MEXBET.¹⁶

Table 12 shows the distribution of the phonemes in the automatically generated dictionary and compares it with the DIMEx100 corpus. We observe that both corpora share a similar distribution.¹⁷

16 To see the equivalences between IPA and MEXBET symbols see: http://www.ciempiess.org/AlfabetosFoneticos/EVOLUTIONofMEXBET.html#Tabla5.

4.2. Characteristics of the pronouncing dictionaries

The pronunciation dictionary consists of a list of words for the target language and their pronunciation at the phonetic or phonological level. Based on the enhanced transcription, we automatically transcribe the pronunciation of each word. For this, we followed the rules from Cuetara-Priede (2004) and Hernández-Mena et al. (2014). The resulting dictionary consists of 53,169 words. Table 13 shows some examples of the automatically created transcriptions.

Table 12

Phoneme distribution of the T29 level of the CIEMPIESS compared to the T22 level of DIMEx100 corpus.

| No. | Phoneme | Instances DIMEx100 | Percentage DIMEx100 | Instances CIEMPIESS | Percentage CIEMPIESS |
|-----|---------|--------------------|---------------------|---------------------|----------------------|
| 1 | p | 6730 | 2.42 | 19,628 | 2.80 |
| 2 | t | 12,246 | 4.77 | 35,646 | 5.10 |
| 3 | k | 8464 | 3.30 | 29,649 | 4.24 |
| 4 | b | 1303 | 0.51 | 15,361 | 2.19 |
| 5 | d | 3881 | 1.51 | 34,443 | 4.92 |
| 6 | g | 426 | 0.17 | 5496 | 0.78 |
| 7 | tS/TS | 385 | 0.15 | 1567 | 0.22 |
| 8 | f | 2116 | 0.82 | 4609 | 0.65 |
| 9 | s | 20,926 | 8.15 | 68,658 | 9.82 |
| 10 | S | 0 | 0.0 | 736 | 0.10 |
| 11 | x | 1994 | 0.78 | 4209 | 0.60 |
| 12 | Z | 720 | 0.28 | 3081 | 0.44 |
| 13 | m | 7718 | 3.01 | 21,601 | 3.09 |
| 14 | n | 12,021 | 4.68 | 51,493 | 7.36 |
| 15 | n∼ | 346 | 0.13 | 855 | 0.12 |
| 16 | r( | 14,784 | 5.76 | 38,467 | 5.50 |
| 17 | r | 1625 | 0.63 | 3546 | 0.50 |
| 18 | l | 14,058 | 5.48 | 32,356 | 4.63 |
| 19 | tl | 0 | 0.0 | 1 | 0.00014 |
| 20 | i | 9705 | 3.78 | 34,063 | 4.87 |
| 21 | e | 23,434 | 9.13 | 43,267 | 6.19 |
| 22 | a | 18,927 | 7.38 | 41,601 | 5.95 |
| 23 | o | 15,088 | 5.88 | 41,888 | 5.99 |
| 24 | u | 3431 | 1.34 | 13,099 | 1.87 |
| 25 | i7 | 0 | 0.0 | 16,861 | 2.41 |
| 26 | e7 | 0 | 0.0 | 61,711 | 8.83 |
| 27 | a7 | 0 | 0.0 | 39,234 | 5.61 |
| 28 | o7 | 0 | 0.0 | 26,233 | 3.75 |
| 29 | u7 | 0 | 0.0 | 9417 | 1.34 |

Table 13

Comparison of transcriptions in Mexbet T29.

| Word | Enhanced-Trans | Mexbet T29 |
|------|----------------|------------|
| peñasco | peNAsco | pen∼a7sko |
| sexenio | seKSEnio | sekse7nio |
| xilófono | $ilOfono | silo7fono |
| xavier | JaviEr | xabie7r( |
| xolos | SOlos | So7los |

17 To see a similar table which shows distributions of the T66 level of CIEMPIESS, see http://www.ciempiess.org/CIEMPIESSStatistics.html#Tabla6.

18 Available at http://www.ciempiess.org/downloads, and for a demonstration, go to http://www.ciempiess.org/tools.

The automatic transcription was done using the fonetica2¹⁸ library, which includes rule-based transcription routines for the T29 and the T66 levels of MEXBET. This library implements the following functions:

• vocal_tonica(): Returns the incoming word with its tonic vowel in uppercase (e.g. cAsa, pErro, gAto, etc.).

• TT(): "TT" is the acronym for "Text Transformation". This function applies the text transformations in Table 14 to the incoming word. All of them are perfectly reversible.

• TTINV(): Reverses the transformations made by the TT() function.

• div_sil(): Returns the syllabification of the incoming word.

• T29(): Produces a phonological transcription in Mexbet T29 of the incoming word.

• T66(): Produces a phonetic transcription in Mexbet T66 of the incoming word.

5. Language model

The language model captures how words are combined in a language. In order to create a language model, a large set of examples of sentences in the target language is necessary.

Table 14

Transformations adopted to do phonological transcriptions in Mexbet.

| Non-ASCII symbol: example | Transformation: example | Orthographic irregularity: example | Phoneme equivalence: example | Orthographic irregularity: example | Phoneme equivalence: example |
|---|---|---|---|---|---|
| á: cuál | cuAl | cc: accionar | /ks/: aksionar | gui: guitarra | /g/: gitaRa |
| é: café | cafE | ll: llamar | /Z/: Zamar | que: queso | /k/: keso |
| í: maría | marIa | rr: carro | R: caRo | qui: quizá | /k/: kisA |
| ó: noción | nociOn | ps: psicología | /s/: sicologIa | ce: cemento | /s/: semento |
| ú: algún | algUn | ge: gelatina | /x/: xelatina | ci: cimiento | /s/: simiento |
| ü: güero | gwero | gi: gitano | /x/: xitano | y (end of word): buey | /i/: buei |
| ñ: niño | niNo | gue: guerra | /g/: geRa | h (no sound): hola | : ola |

Table 15

Characteristics of the raw text of the newsletters used to create the language model.

| Description | Value |
|-------------|-------|
| Total of newsletters | 2489 |
| Oldest newsletter | 12:30 h 01/Jan/2010 |
| Newest newsletter | 20:00 h 18/Feb/2013 |
| Total number of text lines | 197,786 |
| Total words | 1,642,782 |
| Vocabulary size | 113,313 |
| Average of words per newsletter | 660 |
| Largest newsletter | 2710 |
| Smallest newsletter | 21 |

Table 16

Characteristics of the processed text utilized to create the language model.

| Description | Value |
|-------------|-------|
| Total number of words | 1,505,491 |
| Total number of words with no repetition | 49,085 |
| Total number of text lines | 279,507 |
| Average of words per text line | 5.38 |
| Number of words in the largest text line | 43 |
| Number of words in the smallest text line | 2 |

As a source of such examples, we use a university newsletter about academic activities.¹⁹ The characteristics of the newsletters are presented in Table 15.

Even though the amount of text is relatively small compared with other newsletters, it is still one order of magnitude larger than the amount of transcriptions in the CIEMPIESS corpus, and it will not bring legal issues to us because it belongs to our own university.

The text taken from the newsletters was later post-processed. First, it was divided into sentences, and we filtered out punctuation signs and extra codes (e.g., HTML and stylistic marks). The dots and commas were substituted with the newline character to create a basic segmentation of the sentences. Every text line that included any word that could not be phonetized with our T29() or T66() functions was excluded from the final version of the text. Additionally, the lines with only one word were excluded. Every word was marked with its corresponding tonic vowel with the help of our automatic tool, the vocal_tonica() function. Table 16 shows the properties of the text utilized to create the language model after being processed.
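The post-processing steps described above can be sketched as a small text-cleaning function. This is an illustrative approximation rather than the pipeline actually used: the name `preprocess_newsletter` is hypothetical, the `can_phonetize` predicate is a stand-in for "T29() or T66() can transcribe this word" (here simply "the word is purely alphabetic"), and the tonic-vowel marking step is left out.

```python
import re


def preprocess_newsletter(raw_text, can_phonetize=str.isalpha):
    """Turn raw newsletter text into a list of cleaned text lines.

    can_phonetize is a placeholder predicate; by default a word is accepted
    when it contains only letters.
    """
    # Drop extra codes such as HTML tags.
    text = re.sub(r"<[^>]+>", " ", raw_text).lower()

    # Dots and commas become line breaks: a basic sentence segmentation.
    candidate_lines = re.split(r"[.,\n]+", text)

    kept = []
    for line in candidate_lines:
        # Filter out the remaining punctuation signs.
        words = re.sub(r"[^\w\s]", " ", line).split()
        if len(words) < 2:
            continue  # lines with only one word are excluded
        if not all(can_phonetize(word) for word in words):
            continue  # lines with a word that cannot be phonetized are excluded
        kept.append(" ".join(words))
    return kept


sample = "<p>La UNAM presentó, hoy. Asistieron 300 alumnos, muchas gracias</p>"
print(preprocess_newsletter(sample))
```

In this toy input the line containing "300" is discarded because the default predicate rejects digits, which mirrors the need to write numbers orthographically before they can be transcribed.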

Table 17 compares the frequencies of the 20 most common words in CREA with their frequencies in the processed text utilized to create the language model; the MSE between the two word distributions is 7.5×10⁻⁹, with a correlation of 0.98. These metrics point to a good representation of the Spanish language of Mexico City.

19 Newsletter from website: http://www.dgcs.unam.mx/boletin/bdboletin/basedgcs.html.

6. Evaluation experiments

In this section we show some evaluations of different aspects of the corpus. First, we show the evaluation of the automatic transcription used during the creation of the pronunciation dictionary. Second, we show different baselines for different speech recognizer systems: HTK (Young et al., 2006), Sphinx (Chan, Gouvea, Singh, Ravishankar, & Rosenfeld, 2007; Lee, Hon, & Reddy, 1990) and Kaldi (Povey et al., 2011), which are state-of-the-art ASR systems. Third, we show an experiment in which the benefit of marking the tonic syllables during the enhanced transcription can be seen.

6.1. Automatic transcription

In these evaluations we measure the performance of different functions of the automatic transcription. First we evaluated the performance of the vocal_tonica() function, which indicates the tonic vowel of an incoming Spanish word. For this, we randomly took 1452 words from the CIEMPIESS vocabulary (12% of the vocabulary) and predicted their tonic transcription. The automatically generated transcriptions were manually checked by an expert. The result is that 90.35% of the words were correctly predicted. Most of the errors occurred in conjugated verbs and proper names. Table 18 summarizes the results of this evaluation.

The second evaluation focuses on the T29() function, which transcribes words at the phonological level. In this case we evaluated against TRANSCRÍBEMEX (Pineda et al., 2004) and the transcriptions done manually by experts for the DIMEx100 corpus (Pineda et al., 2004). In order to compare with the TRANSCRÍBEMEX system, we took the vocabulary of the DIMEx100 corpus but we had to eliminate some words. First, we removed entries with the archiphonemes²⁰ [-B], [-D], [-G], [-N], [-R], since they do not have a one-to-one correspondence with a phonological transcription. Then, words with the grapheme x were eliminated, since TRANSCRÍBEMEX only supports one of four pronunciations. After this, both systems produced the same transcription 99.2% of the time. In order to evaluate against transcriptions made by experts, we took the pronouncing dictionary of the DIMEx100 corpus and removed the words with the grapheme x and the alternative pronunciations, if there were any. This shows us that the transcriptions made by our T29() function matched the transcriptions made by experts 90.2% of the time.

Tables 19 and 20 summarize the results for both comparisons. In conclusion, apart from the different conventions, there is no noticeable difference between our system and TRANSCRÍBEMEX, but when compared with human experts, there is still room for improvement of our system.
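The percentages reported in this subsection follow from the counts in Tables 18 to 20 as the ratio of correct (or identical) items to analyzed items. The short sketch below reproduces that arithmetic; the helper name `percent` and the dictionary labels are illustrative only.

```python
def percent(part, total):
    # Share of `part` in `total`, expressed as a percentage.
    return 100.0 * part / total

# (correct or identical items, items analyzed), copied from Tables 18 to 20.
evaluations = {
    "Tonic vowel vs. expert check (Table 18)": (1312, 1452),
    "T29() vs. TRANSCRÍBEMEX (Table 19)": (8670, 8738),
    "T29() vs. DIMEx100 experts (Table 20)": (10184, 11286),
}

for name, (matching, analyzed) in evaluations.items():
    print(f"{name}: {percent(matching, analyzed):.2f}%")
```

When rounded to two decimals, the last digit can differ from the figures given in the tables, which appear to be truncated rather than rounded.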

6.2. Benchmark systems

We created three benchmarks based on state-of-the-art systems: HTK (Young et al., 2006), Sphinx (Chan et al., 2007; Lee

20 An archiphoneme is a phonological symbol that groups several phonemes together. For example, [-D] is equivalent to any of the phonemes /d/ or /t/.


Table 17

Word frequency of the language model and the CREA corpus.

No.   Word in CREA   Norm. freq. CREA   Norm. freq. News
1     de             0.065              0.076
2     la             0.041              0.045
3     que            0.030              0.028
4     el             0.029              0.029
5     en             0.027              0.034
6     y              0.027              0.032
7     a              0.021              0.017
8     los            0.017              0.016
9     se             0.013              0.015
10    del            0.012              0.013
11    las            0.011              0.011
12    un             0.010              0.009
13    por            0.010              0.010
14    con            0.009              0.010
15    no             0.009              0.006
16    una            0.008              0.008
17    su             0.007              0.005
18    para           0.006              0.010
19    es             0.006              0.008
20    al             0.006              0.005

Table 18

Evaluation of the vocal_tonica() function.

Words taken from the CIEMPIESS database    1539
Number of foreign words omitted            87
Number of words analyzed                   1452
Wrong accentuation                         140
Correct accentuation                       1312
Percentage of correct accentuation         90.35%

Table 19

Comparison between TRANSCRÍBEMEX and the T29() function.

Words in DIMEx100                                    11,575
Alternate pronunciations                             2590
Words with grapheme “x”                              202
Words with grapheme “x” into an alternate pron.      87
Archiphonemes                                        45
Number of words analyzed                             8738
Non-identical transcriptions                         67
Identical transcriptions                             8670
Percentage of identical transcriptions               99.2%

Table 20

Comparison between transcriptions in DIMEx100 dictionary (made by humans) and the T29() function.

Words in DIMEx100                          11,575
Words with grapheme “x”                    289
Number of words analyzed                   11,286
Non-identical transcriptions               1102
Identical transcriptions                   10,184
Percentage of identical transcriptions     90.23%

et al., 1990) and Kaldi (Povey et al., 2011). The CIEMPIESS corpus is formatted to be used directly in a Sphinx setting. In the case of HTK, we created a series of tools that can directly read the CIEMPIESS corpus,²¹ and finally, for Kaldi, we created the setting files for the CIEMPIESS. We set up a speech recognizer for each system using the train set of the CIEMPIESS corpus

21 See the “HTK2SPHINX-CONVERTER” (Hernández-Mena & Herrera-Camacho, 2015) and the “HTK-BENCHMARK” available at http://www.ciempiess.org/downloads.

Table 21

Benchmark among different systems.

System    WER (the lower the better)
Sphinx    44.0%
HTK       42.45%
Kaldi     33.15%

Table 22

Best recognition results in learning curve.

Condition       WER (the lower the better)
T29 TONICS      44.0%
T29 NOTONICS    45.7%
T66 TONICS      50.5%
T66 NOTONICS    48.0%

and we evaluated the performance utilizing its test set. Every system was configured with its corresponding default parameters and a trigram-based language model. Table 21 shows the performance for each system.
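For readers unfamiliar with the metric in Table 21, the word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcription, divided by the number of words in the reference. The sketch below implements that standard definition with a word-level edit distance. It is not the scoring tool of HTK, Sphinx, or Kaldi, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER (in percent) of `hypothesis` against `reference`."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substitution and one deletion over a four-word reference.
print(word_error_rate("el gato come pescado", "el pato come"))
```

Because insertions count as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.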

6.3. Tonic vowels

Given the characteristics of the CIEMPIESS corpus, we decided to evaluate the effect of the tonic vowels marked in the corpus. For this, we trained four acoustic models for the Sphinx system. These were tested on the corpus using the standard language model of the corpus. Table 22 presents the word error rate (WER, the lower the better) for such cases. We can observe that the distinction of the tonic vowel helps improve the performance at the T29 level (44.0% versus 45.7% WER).²² However, the use of phonetic transcriptions (T66 level of MEXBET) does have a negative effect on the performance of the speech recognizer: both T66 conditions yield a higher WER than either T29 condition.
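The differences in Table 22 can also be expressed in relative terms. The sketch below is plain arithmetic on the four reported values; the helper `relative_reduction` and the dictionary layout are illustrative and not part of the corpus tools.

```python
# WER values (in percent) copied from Table 22.
wer = {
    ("T29", "TONICS"): 44.0,
    ("T29", "NOTONICS"): 45.7,
    ("T66", "TONICS"): 50.5,
    ("T66", "NOTONICS"): 48.0,
}

def relative_reduction(baseline, system):
    # Positive when `system` has a lower WER than `baseline`.
    return 100.0 * (baseline - system) / baseline

for level in ("T29", "T66"):
    without_marks = wer[(level, "NOTONICS")]
    with_marks = wer[(level, "TONICS")]
    change = relative_reduction(without_marks, with_marks)
    print(f"{level}: {without_marks}% -> {with_marks}% "
          f"(relative WER reduction: {change:.2f}%)")
```

A negative value means that the WER was higher with tonic marks than without them.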

Using the same configurations found with the experiment of the tonic vowels, we created four learning curves. Fig. 3 presents the curves; we can also notice that a phonetic transcription (T66)

22 In Table 22, “TONICS” means that we used tonic vowel marks for the recognition experiment and “NOTONICS” means that we did not.