Availableonlineatwww.sciencedirect.com
Journal of Applied Research and Technology
www.jart.ccadet.unam.mx JournalofAppliedResearchandTechnology15(2017)259–270
Original
Automatic speech recognizers for Mexican Spanish and its open resources
Carlos Daniel Hernández-Mena
a,∗, Ivan V. Meza-Ruiz
b, José Abel Herrera-Camacho
aaLaboratoriodeTecnologíasdelLenguaje(LTL),UniversidadNacionalAutónomadeMéxico(UNAM),Mexico
bInstitutodeInvestigacionesenMatemáticasAplicadasyenSistemas(IIMAS),UniversidadNacionalAutónomadeMéxico(UNAM),Mexico Received21March2016;accepted9February2017
Availableonline28April2017
Abstract
Developmentof automaticspeechrecognitionsystemsrelieson the availabilityofdistinctlanguageresourcessuchas speechrecordings, pronunciationdictionaries,andlanguagemodels.TheseresourcesarescarcefortheMexicanSpanishdialect.Inthiswork,wepresentarevisionof theCIEMPIESScorpusthatisaresourceforspontaneousspeechrecognitioninMexicanSpanishofCentralMexico.Itconsistsof17hofsegmented andtranscribedrecordings,aphoneticdictionarycomposedby53,169uniquewords,andalanguagemodelcomposedby1,505,491wordsextracted from2489universitynewsletters.WealsoevaluatetheCIEMPIESScorpususingthreewellknownstateoftheartspeechrecognitionengines, havingsatisfactoryresults.Theseresourcesareopenforresearchanddevelopmentinthefield.Additionally,wepresentthemethodologyandthe toolsusedtofacilitatethecreationoftheseresourceswhichcanbeeasilyadaptedtoothervariantsofSpanish,orevenotherlanguages.
©2017UniversidadNacionalAutónomadeMéxico,CentrodeCienciasAplicadasyDesarrolloTecnológico.Thisisanopenaccessarticleunder theCCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords:Automaticspeechrecognition;MexicanSpanish;Languageresources;Languagemodel;Acousticmodel
1. Introduction
Current advances in automatic speech recognition (ASR) havebeenpossiblegiventheavailablespeechresourcessuchas speechrecordings,orthographictranscriptions,phoneticalpha- bets,pronunciation dictionaries, largecollections of text and computationalsoftwarefor the constructionof ASRsystems.
However,theavailabilityoftheseresourcesvariesfromlanguage tolanguage.Untilrecently,thecreationofsuchresourceshas beenlargelyfocusedonEnglish.Thishashadapositiveeffecton thedevelopmentofresearchofthefieldandspeechtechnology forthislanguage.Thiseffecthasbeensopositivethattheinfor- mationandprocesseshavebeentransferredtootherlanguages sothatnowadaysthe mostsuccessfulrecognizersforSpanish languagearenotcreatedinSpanish-speakingcountries.Further- more,recentdevelopmentintheASRfieldreliesonrestricted corporawithrestrictedaccessor notaccessatall.Inorderto make progress in the study of spoken Spanish and take full
∗Correspondingauthor.
PeerReviewundertheresponsibilityofUniversidadNacionalAutónomade México.
advantage of theASR technology,we considerthat agreater amountofresourcesforSpanishneedstobefreelyavailableto theresearchcommunityandindustry.
Withthisinmind,wepresentamethodologyandresources associatedtoitfortheconstructionof ASRsystemsfor Mex- ican Spanish; we arguethat withminimaladaptations tothis methodology,itispossibletocreateresourcesforothervariants ofSpanishorevenotherlanguages.
Themethodologythatweproposefocusesonfacilitatingthe collectionoftheexamplesnecessariesforthecreationofanASR systemandtheautomaticconstructionofpronunciationdictio- naries.Thismethodologyhasbeenconcludedontwocollections thatwepresentinthiswork.Thefirstisthelargestcollectionof recordingsandtranscriptionsforMexicanSpanishfreelyavail- able forresearch,andthe second isalargecollection of text extractedfromauniversitymagazine.Thefirstcollectionwas collectedandtranscribedinaperiodoftwoyearsanditisuti- lizedtocreateacousticmodels.Thesecondcollectionisusedto createalanguagemodel.
We also present our system for the automatic generation of phonetic transcriptions of words. This system allows the creationofpronunciationdictionaries.Inparticular,thesetran- scriptions are basedonthe MEXBET (Cuetara-Priede, 2004)
http://dx.doi.org/10.1016/j.jart.2017.02.001
1665-6423/©2017UniversidadNacionalAutónomadeMéxico,CentrodeCienciasAplicadasyDesarrolloTecnológico.Thisisanopenaccessarticleunderthe CCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Table1
ASRCorporaforthetopfivespokenlanguagesintheworld.
Rank Language LDC ELRA Examples
1 Mandarin 24 6 TDT3(Graff,2001)
TC-STAR2005(TC-STAR,2006)
2 English 116 23 TIMIT(Garofolo,1993)
CLEFQAST(CLEF,2012)
3 Spanish 20 20 CALLHOME(Canavan&Zipperlen,1996)
TC-STARSpanish(TC-STAR,2007)
4 Hindi 9 4 OGIMultilanguage(Cole&Muthusamy,1994)
LILA(LILA,2012)
5 Arabic 10 32 WestPointArabic(LaRocca&Chouairi,2002)
NetDC(NetDC,2007)
phoneticalphabet,awellestablishalphabetforMexicanSpan- ish. Together, these resources are combined to create ASR systemsbased onthreefreelyavailablesoftware frameworks:
Sphinx, HTKandKaldi.The finalrecognizers areevaluated, compared,andmadeavailabletobeusedforresearchpurposes ortobeintegratedinSpanishspeechenabledsystems.
Finally,wepresentthecreationoftheCIEMPIESScorpus.
TheCIEMPIESS(Hernández-Mena,2015;Hernández-Mena&
Herrera-Camacho,2014)corpuswasdesignedtobeusedinthe field of the automatic speech recognition and we utilize our experience inthe creation of it as a concreteexample of the wholemethodologythatwepresentatthispaper.Thatiswhy theCIEMPIESSwillbeembeddedinallourexplanationsand examples.
Thepaperhasthefollowingoutline:InSection2wepresent arevisionofcorporaavailableforautomaticspeechrecognition inSpanishandMexicanSpanish,inSection3wepresenthowan acousticmodeliscreatedfromaudiorecordingsandtheirortho- graphictranscriptions.InSection4weexplainhowtogeneratea pronunciationdictionaryusingourautomatictools.InSection5 weshowhowtocreatealanguagemodel,Section6showshow we evaluatedthe databaseinareal ASRsystemandhowwe validatetheautomatictoolswearepresentingatthispaper.At theend,inSection7,wediscussourfinalconclusions.
2. Spanishlanguageresources
Accordingtothe“Anuario2013”1 createdbythe Instituto Cervantes2andthe“Atlasdelalenguaespa˜nolaenelmundo”
(Moreno-Fernández&Otero,2007)theSpanishlanguageisone ofthetopfivemorespokenlanguagesintheworld.Actually,the InstitutoCervantesmakesthefollowingremarks:
• Spanishisthesecondmorespokennativelanguagejustbehind MandarinChinese.
• Spanishisthesecondlanguageforinternationalcommunica- tion.
1 Available for web download at: http://cvc.cervantes.es/lengua/anuario/
anuario13/(August2015).
2 TheInstitutoCervantes(http://www.cervantes.es/).
• Spanishisspokenbymorethan500millionspersons,includ- ing speakers who use it as a native language or a second language.
• Itisprojectedthatin2030,7.5%ofthepeopleoftheworld willspeakSpanish.
• MexicoisthecountrywiththemostSpanishspeakersamong Spanishspeakingcountries.
Thisspeakstothe importanceof Spanish forspeechtech- nologieswhichcanbecorroboratedbytheamountofavailable resources forASRinSpanish.ThiscanbenoticedinTable1 thatsummarizestheamountofASRresourcesavailableinthe LinguisticDataConsortium3(LDC)andintheEuropeanLan- guage Resources Association4 (ELRA)for the topfive more spoken languages.5Asonecan seethe resources forEnglish areabundantcomparedtotherestof thetoplanguages.How- ever, intheparticularcaseoftheSpanishlanguage,thereisa goodamountofresourcesreportedinthesedatabases.Besides additionalresourcesforSpanishcanbefoundinothersources such as:reviewsinthe field(Llisterri,2004;Raab,Gruhn,&
Noeth,2007),the“LREMap”,6andproceedingsofspecialized conferences,suchasLREC(Calzolarietal.,2014).
2.1. MexicanSpanishresources
In theprevious section, onecan noticethat there are sev- eral optionsavailablefortheSpanishlanguage,butwhenone focuses ina dialect such as Mexican Spanish, the resources are scarcer.In theliteratureonecanfindseveralarticles ded- icated to the creation of speech resources for the Mexican Spanish,(Kirschning,2001;Olguín-Espinoza,Mayorga-Ortiz, Hidalgo-Silva, Vizcarra-Corral, & Mendiola-Cárdenas, 2013;
Uraga & Gamboa, 2004). However, researchers usually cre- ate small databases todo experiments so onehas to contact theauthorsanddependontheirgoodwilltogetacopyofthe resource(Audhkhasi,Georgiou,&Narayanan,2011;deLuna Ortega, Mora-González,Martínez-Romo, Luna-Rosas, &Mu
3https://www.ldc.upenn.edu/.
4http://www.elra.info/en/.
5Formoredetailsontheresourcesperlanguagevisit:http://www.ciempiess.
org/corpus/CorpusforASR.html.
6http://www.resourcebook.eu/searchll.php.
Table2
CorpusforASRthatincludetheMexicanSpanishlanguage.
Name Size Dialect Datasources Availability
DIMEx100(Pineda,Pineda,Cuétara, Castellanos,&López,2004)
6.1h MexicanSpanishofCentral Mexico
ReadUtterances FreeOpenLicense
1997SpanishBroadcastNews SpeechHUB4-NE(Others,1998)
30h IncludesMexicanSpanish BroadcastNews Since2015
LDC98S74$400.00 USD
1997HUB4BroadcastNews EvaluationNon-EnglishTest Material(Fiscus,2001)
1h IncludesMexicanSpanish BroadcastNews LDC2001S91
$150.00USD
LATINO-40(Bernstein,1995) 6.8h SeveralcountriesofLatin AmericaincludingMexico
MicrophoneSpeech LDC95S28$1000.00 USD
WestPointHeroicoSpanishSpeech (Morgan,2006)
16.6h IncludesMexicanSpanishof CentralMexico
MicrophoneSpeech (read)
LDC2006S37
$500.00USD FisherSpanishSpeech(Graff,2010) 163h Caribbeanandnon-Caribbean
Spanish(includingMexico)
TelephoneConversations LDC2010S01
$2500.00USD Hispanic–EnglishDatabase(Byrne,
2014)
30h SpeakersfromCentraland SouthAmerica
MicrophoneSpeech (conversationalandread speech)
LDC2014S05
$1500.00USD
noz-Maciel,2014;Moya, Hernández,Pineda, &Meza,2011;
Varela,Cuayáhuitl,&Nolazco-Flores,2003).
Eventhoughthe resourcesinMexicanSpanish arescarce, we identified7 corporawhichare easilyavailable. Theseare presentedinTable2.Asshowninthetable,onehastopayin ordertohaveaccesstomostoftheresources.Thenotableexcep- tionistheDIMEx100corpus(Pinedaetal.,2010)whichwas recentlymadeavailable.7Theproblemwiththisresourceisthat thecorpusiscomposedforreadingmaterialanditisonly6h longwhichlimitsthetypeofacousticphenomenapresent.This imposesalimit on theperformanceof the speechrecognizer createdutilizingthisresource(Moyaetal.,2011).Inthiswork wepresentthecreationoftheCIEMPIESScorpusanditsopen resources.The CIEMPIESS corpus consistsof 17h of recor- dingsofCentralMexicoSpanishbroadcastofinterviewswhich providesspontaneousspeech.Thismakesitagoodcandidate forthecreationofspeechrecognizers.Besidesthis,weexpose themethodologyandtoolscreatedfortheharvestingofthedif- ferentaspectsofthecorpussothesecanbereplicatedinother underrepresenteddialectsofSpanish.
3. Acousticmodeling
For several decades, ASR technology has relied on the machinelearningapproach.In thisapproach,examplesofthe phenomenon are learned. A model resulting from the learn- ingis createdandlater usedto predict suchphenomenon. In anASRsystem,therearetwosourcesofexamplesneededfor itsconstruction.Thefirstoneisacollectionofrecordingsand theircorrespondingtranscriptions.Theseareusedtomodelthe relationbetweensoundandphonemes.Theresultingmodelis usuallyreferredas theacousticmodel.Thesecond sourceof examplesofsentencesinalanguagewhichisusuallyobtained
7 For web downloading at: http://turing.iimas.unam.mx/luis/DIME/
CORPUS-DIMEX.html.
Pronouncing dictionary
al al de de por por(
Language model
ASR System
Speech
Acoustic models
Text...
Text...
Text...
Text
Fig.1.Componentsandmodelsforautomaticspeechrecognition.
fromalargecollectionoftexts.Theseareusedtolearnamodel ofhowphrasesarebuiltbyasequenceofwords.Theresulting modelisusuallyreferredasthelanguagemodel.Additionally, anASRsystemusesadictionaryofpronunciationstolinkboth theacousticandthelanguagemodels,sincethiscaptureshow phonemescomposewords.Fig.1illustratestheseelementsand howtheyrelatetoeachother.
Fig.2showsindetailthefullprocessandtheelementsneeded to createacoustic models.First of all,recordings of the cor- puspassthroughafeatureextractionmodulewhichcalculates thespectralinformationoftheincomingrecordingsandtrans- formsthemintoaformatthetrainingmodulecanhandle.Alist ofphonemesmustbeprovidedtothesystem.Thetaskoffill- ingeverymodelwithstatisticinformationisperformedbythe trainingprocess.
Pronouncing dictionary
tS a r(
al al de de por por(
Training module
Feature extraction Recordings
&
transcriptions
Acoustic models
Training results
Corpus List of phonemes
Fig.2.Architectureofa training moduleforautomaticspeechrecognition systems.
Table3
CIEMPIESSoriginalaudiofileproperties.
Description Properties
Numberofsourcefiles 43
Formatofsourcefiles mp3/44.1kHz/128kbps Durationofallthefilestogether 41h21min
Durationofthelongestfile 1h26min41s Durationoftheshortestfile 43min12s Numberofdifferentradioshows 6
3.1. Audiocollection
TheoriginalsourceofrecordingsusedtocreatetheCIEM- PIESS corpus comes from radio interviews in the format of a podcast.8 We chose this source because they were easily available,theyhadseveralspeakerswiththeaccentofCentral Mexico, andallof themwere freelyspeaking. Thefiles sum togetheratotalof43one-hourepisodes.9Table3summarizes themaincharacteristics ofthe originalversionof theseaudio files.
3.2. Segmentationofutterances
Fromtheoriginal recordings,thesegmentsof speechneed to be identified. To define a “good” segment of speech, the followingcriteriawastakenintoaccount:
• Segmentswithauniquespeaker.
• Segmentscorrespondtoanutterance.
• Thereshouldnotbemusicinthebackground.
8 Originallytransmittedby:“RADIOIUS”(http://www.derecho.unam.mx/
cultura-juridica/radio.php)andavailableforwebdownloadingat:PODSCAT- UNAM(http://podcast.unam.mx/).
9 For moredetails visit: http://www.ciempiess.org/CIEMPIESSStatistics.
html#Tabla2.
Table4
CharacteristicsoftheutterancerecordingsoftheCIEMPIESScorpus.
Characteristic Training Test
Numberofutterances 16,017 1000
Totalofwordsandlabels 215,271 4988
Numberofwordswithnorepetition 12,105 1177
Numberofrecordings 16,017 1000
Totalamountoftimeofthetrainset (hours)
17.22 0.57
Averagedurationperrecording (seconds)
3.87 2.085
Durationofthelongestrecording (seconds)
56.68 10.28
Durationoftheshortestrecording (seconds)
0.23 0.38
Averageofwordsperutterance 13.4 4.988
Maximumnumberofwordsinan utterance
182 37
Minimumnumberofwordsinan utterance
2 2
• Thebackgroundnoiseshouldbeminimum.
• Thespeakershouldnotbewhispering.
• The speaker should not have an accent otherthan Central Mexico.
Attheend,16,717utteranceswereidentified.Thisisequiv- alentto17hofonly“clean”speechaudio.78%ofthesegments comefrommalespeakersand22%fromfemalespeakers.10This genderimbalanceisnotuncommoninothercorpora(forexam- pleseeFederico,Giordani,&Coletti,2000;Wang,Chen,Kuo,
&Cheng,2005sincegenderbalancingisnotalwayspossibleas inLangmann,Haeb-Umbach,Boves,&denOs,1996;Larcher, Lee,Ma,&Li,2012).Thesegmentsweredividedintotwosets:
training (16,017)andtest (700),the test setwas additionally complemented by300 utterancesfrom differentsources such as:interviews,broadcastnewsandreadspeech.Weaddedthese 300utterancesfromalittlecorpusthatbelongtoourlaboratory, to performprivateexperiments thatare important tosomeof ourstudents.Table4summarizesthemaincharacteristicsofthe utterancerecordingsoftheCIEMPIESScorpus.11
Theaudiooftheutteranceswasstandardizedintorecordings sampledat16kHz,with16-bitsinaNISTSpherePCMmono formatwithanoiseremovalfilteringwhennecessary.
3.3. Orthographictranscriptionofutterances
Inadditiontotheaudiocollectionofutterances,itwasneces- sarytohavetheirorthographictranscriptions.Inordertocreate these transcriptions we followed these guidelines: First, the process beginswithacanonical orthographic transcriptionof everyutteranceinthe corpus.Later,thesetranscriptionswere enhancedtomarkcertainphenomenaoftheSpanishlanguage.
10Toseeatablethatshowswhichsentencesbelongtoaparticularspeaker, visit:http://www.ciempiess.org/CIEMPIESSStatistics.html#Tabla8.
11Formoredetails,seethischartat:http://www.ciempiess.org/CIEMPIESS Statistics.html#Tabla1.
Theconsiderationswetookintoaccountfortheenhancedver- sionwere:
• Donotusecapitalizationandpunctuation
• Expandabbreviations(e.g.theabbreviation“PRD”waswrit- tenas:“peerrede”).
• Numbersmustbewrittenorthographically.
• Specialcharacterswereintroduced(e.g.Nfor ˜n,Wforu,¨ $ whenletter“x”soundslikephoneme/s/,S whenletter“x”
soundslikephoneme/ʃ/).
• Theletter“x”hasmultiplephonemesassociatedtoit.
• Wewrotethetonicvowelofeverywordinuppercase.
• Wemarkedthesilencesanddisfluencies.
3.4. Enhancedtranscription
Aswe have mentioned inthe previous section, the ortho- graphictranscriptionwasenhancedbyaddinginformationofthe pronunciationofthewordsandthedisfluenciesinthespeech.
Therestofthissectionpresentssuchenhancements.
Mostof thewords inSpanish containenough information abouthow topronounce them,however thereare exceptions.
Inordertofacilitatetheautomaticcreationofadictionary,we addedinformation of the pronunciation of the x’s whichhas severalpronunciationintheMexicanSpanish.Annotatorswere askedtoreplace x’sbyan approximation ofits possiblepro- nunciations.Doingthis,weeliminatetheneedforanexception dictionary.Table5exemplifiessuchcases.
Anothermarkweannotateintheorthographictranscription istheindicationofthetonicvowelinaword.InSpanish,atonic vowelisusuallyidentifiedbyaraiseinthepitch;sometimesthis vowelisexplicitlymarkedintheorthographyofthewordbyan acutesign(e.g.“acción”,“vivía”).Inordertomakethisdiffer- ence explicitamongsounds, theenhanced transcriptions also markedassuchinboththeexplicitandimplicitcases(Table6 exemplifiesthisconsideration).Wedidthisfortworeasons,the mostimportantisthatwewanttoexploretheeffectoftonicand non-tonicvowelsinthespeechrecognition,andtheotherone isthatsomesoftwaretoolscreatedforHTKorSPHINXdonot managecharacterswithacutesignsproperly,sothebestthing todoistouseonlyASCIIsymbols.
Finally,silencesanddisfluenciesweremarkedfollowingthis procedure:
Table5
Enhancementfor“x”letter.
Canonicaltranscriptions Phoneme equivalence (IPA)
Transcription mark
Enhanced transcription
sexto,oxígeno /ks/ KS sEKSto,
oKSIgeno
xochimilco,xilófono /s/ $ $ochimIlco,
$ilOfono
xolos,xicoténcatl /ʃ/ S SOlos,
SicotEncatl
ximena,xavier /x/ J JimEna,
JaviEr
Table6
Exampleoftonicmarksinenhancedtranscription.
Canonical Enhanced Canonical Enhanced
ambulancia ambulAncia perla pErla
química quImica aglutinado aglutinAdo
ni˜no nINo ejemplar ejemplAr
ping˜nino pingWIno tobogán tobogAn
Table7
Exampleofenhancetranscriptions.
Original
<s>apartirdela˜nomilnovecientosnoventaysiete</s>(S1)
<s>esunaformadeexpresiondelossentimientos</s>(S2) Enhanced
<s><sil>ApartIrdElANo<sil>mIlnoveciEntos++dis++novEnta ysiEte<sil></s>(S1)
<s><sil>Es++dis++UnafOrmadEeKSpresiOn<sil>dElOs sentimiEntos<sil></s>(S2)
• Anautomaticanduniformalignmentwasproducedusingthe wordsinthetranscriptionsandtheutteranceaudios.
• Annotators were askedtoalignthe words withtheir audio usingthePRAATsystem(Boersma&Weenink,2013).
• Whentherewasspeechwhichdidnotcorrespondtotheaudio, the annotators were asked toanalyzeif it was a caseof a disfluency. If positive, they were asked to mark it with a ++dis++.
• Theannotatorswerealsoasktomarkevidentsilencesinthe speechwitha<sil>.
Table7comparesacanonicaltranscriptionanditsenhanced version.
Bothsegmentation andorthographic transcriptionsintheir canonicalandenhanceversion(wordalignment)areverytime consumingprocesses.Inordertotranscribeandalignthefull utterances,we collaboratedwith20collegestudentsinvesting 480heachtotheprojectovertwoyears.Threeofthemmade theselectionofutterancesfromtheoriginalaudiofilesusingthe Audacitytool(Team,2012)andtheycreatedtheorthographic transcriptionsinaperiodofsixmonths,attherateofonehourper week.Therestofcollaboratorsspentoneandahalfyearaligning thewordtranscriptionswiththeutterancesfordetectingsilences anddisfluencies.Thislaststepimpliesthatorthographic tran- scriptionswerecheckedatleasttwicebytwodifferentperson.
Table8showsachronographofeachtaskdoneperhalfyear.
Table8
Numberofstudentsworkingpersemesterandtheirlabors.
Semester Students Labor
1st 3 Audioselectionand
orthographictranscriptions
2nd 6 Wordalignment
3rd 6 Wordalignment
4th 5 Wordalignment
Table9
WordfrequencybetweenCIEMPIESSandCREAcorpus.
No. WordsinCREA Norm.freq.
CREA
Norm.freq.
CIEMPIESS
No. WordsinCREA Norm.freq.CREA Norm.freq.
CIEMPIESS
1 de 0.065 0.056 11 las 0.011 0.008
2 la 0.041 0.033 12 un 0.010 0.011
3 que 0.030 0.051 13 por 0.010 0.008
4 el 0.029 0.026 14 con 0.009 0.006
5 en 0.027 0.025 15 no 0.009 0.015
6 y 0.027 0.022 16 una 0.008 0.010
7 a 0.021 0.026 17 su 0.007 0.003
8 los 0.017 0.014 18 para 0.006 0.008
9 se 0.013 0.014 19 es 0.006 0.017
10 del 0.012 0.008 20 al 0.006 0.004
3.5. Distributionofwords
In ordertoverify that the distributionof words inCIEM- PIESScorrespondstothedistributionofwordsintheSpanish language,wecomparethedistributionofthefunctionalwordsin thecorpusversusthe“CorpusdeReferenciadelEspa˜nolActual”
(ReferenceCorpusofCurrentSpanish,CREA).12 The CREA corpusisconformedby140,000textdocuments.Thatis,more than154millionofwordsextractedfrombooks(49%),news- papers(49%),andmiscellaneoussources(2%).Italsohasmore than700,000wordtokens.Table9illustratestheminordiffer- ences between frequenciesof the 20most frequent words in CREAandtheirfrequencyinCIEMPIESS.13 Asonecansee, thedistributionoffunctionalwordsisproportionalbetweenboth corpora.
Wealsocalculatethemeansquareerror(MSE=9.3×10−8) ofthenormalizedfrequenciesofthewordsbetweentheCREA andthewholeCIEMPIESSsowefoundthatitislowandthe correlationofthetwodistributionsis0.95.Ourinterpretationof thisisthatthedistributionofwordsintheCIEMPIESSreflects thedistributionofwordsintheSpanishlanguage.Wearguethat thisisrelevantbecausetheCIEMPIESSisthenagoodsample thatreflectswellthebehaviorofthelanguagethatitpretendsto model.
4. Pronouncingdictionaries
The examples of audio utterances and their transcriptions alone are not enough to start the training procedure of the ASRsystem.Inordertolearnhowthebasicsoundsof alan- guagesound, itisnecessarytotranslatewordsintosequences of phonemes. This information is codified in the pronuncia- tiondictionary,whichproposesoneormorepronunciationfor eachword. Thesepronunciationsaredescribedas asequence ofphonemesforwhichacanonicalsetofphonemeshastobe decided. In the creation of the CIEMPIESS corpus,we pro-
12See:http://www.rae.es/recursos/banco-de-datos/crea-escrito.
13DownloadthewordfrequenciesofthewordsofCREAfrom:http://corpus.
rae.es/lfrecuencias.html.
posedtheautomaticextractionofpronunciationsbasedonthe enhancedtranscriptions.
4.1. Phoneticalphabet
In thisworkweused theMEXBET phoneticalphabetthat hasbeenproposedtoencodethephonemesandtheallophones of MexicanSpanish(Cuetara-Priede, 2004;Hernández-Mena, Martínez-Gómez,&Herrera-Camacho,2014;Uraga&Pineda, 2000).MEXBETisaheritageofourUniversityandithasbeen successfullyused overtheyearsinseveralarticlesandthesis.
Nevertheless,thebestreasonforchoosingMEXBETisthatthis isthemostupdatedphoneticalphabetfortheMexicanSpanish dialect.
Thisalphabethasthreelevelsofgranularityfromthephono- logical(T22)tothephonetic(T44andT54).14Forthepurpose oftheCIEMPIESScorpusweextendedtheT22andT54levels towhatwecallT29andT66.15InthecaseofT29thesearethe mainchanges:
• ForT29weaddedthephoneme/tl/asiniztaccíhuatl→/is.
tak. si.ua.tl/→[ .tak. si.wa.tl].Eventhough thecountsofthephoneme/tl/aresolowintheCIEMPIESS corpus,we decided toinclude itintoMEXBET becuasein Mexico,manypropernamesofplacesneedittohaveacorrect phonetictranscription.
• ForT29weaddedthephoneme/S/asinxolos→/ ʃo.los /→[ʃo.l s]
• ForT29we consideredthesymbols/a7/, /e7/,/i 7/,/o7/
and/u7/of the levelsT44 andT54used toindicate tonic vowelsinwordtranscriptions.
All these changeswere motivated by the need toproduce moreaccuratephonetictranscriptionsfollowingtheanalysisof
14For more detail on the different levels and the evolution of MEX- BET through time see the charts in: http://www.ciempiess.org/Alfabetos Foneticos/EVOLUTIONofMEXBET.html.
15Inourpreviouspapers,werefertothelevelT29asT22andthelevelT66as T50butthisisincorrectbecausethenumber“22”or“44”,etc.mustreflectthe numberofphonemesandallophonesconsideredinthatlevelofMEXBET.
Table10
ComparisonbetweenMEXBETT66fortheCIEMPIESS(leftorinbold)andDIMEx100T54(right)databases.
Consonants Labial Labiodental Dental Alveolar Palatal Velar
UnvoicedStops p:p/pc t:t/tc kj:kj/kc k:k/kc
VoicedStops b:b/b c d:d/dc g:g/gc
UnvoicedAffricate tS:tS/tSc
VoicedAffricate dZ:dZ/dZc
UnvoicedFricative f s[ s S x
VoicedFricative V Dz[ z Z G
Nasal m M n[ n njn∼ N
VoicedLateral l[ l tllj
VoicelessLateral l0
Rhotic r(r
VoicelessRhotic r(0r(\
Vowels Palatal Central Velar
Semi-consonants j w
i( u(
Close i u
I U
Mid e o
E O
Open aj a a2
TonicVowels Palatal Central Velar
Semi-consonants j7 w7
i( 7 u( 7
Close i7 u7
I7 U7
Mid e7 o7
E7 O7
Open aj7 a7 a27
Table11
ExampleoftranscriptionsinIPAagainsttranscriptionsinMEXBET.
Word PhonologicalIPA PhoneticIPA
ineptitud i.nep.ti. tud i.n p.ti. t indulgencia in.dul. xen.sia .d l. xen.sja institución ins.ti.tu. sion n .ti.tu. sj n
MEXBETT29 MEXBETT66
ineptitud ineptitu7d inEptitU7D indulgencia indulxe7nsia In[dUlxe7nsja institución institusio7n Ins[titusiO7n
Cuetara-Priede(2004).Table10illustratesthemaindifference betweenT54andT66levelsoftheMEXBETalphabets.
Table11showsexamples ofdifferent Spanishwords tran- scribedusingthesymbolsoftheInternationalPhoneticAlphabet (IPA)againstthesymbolsofMEXBET.16
Table 12 shows the distribution of the phonemes in the automatically generated dictionary and compares it with the
16 To see the equivalences between IPA and MEXBET symbols see:
http://www.ciempiess.org/AlfabetosFoneticos/EVOLUTIONofMEXBET.
html#Tabla5.
DIMEx100corpus.Weobservethatbothcorporashareasimilar distribution.17
4.2. Characteristicsofthepronouncingdictionaries
Thepronunciationdictionaryconsistsofalistofwordsfor the target language and their pronunciation at the phonetic or phonological level. Based on the enhanced transcription, we automatically transcribethe pronunciation for eachword.
Forthis,wefollowedtherulesfromCuetara-Priede(2004)and Hernández-Menaet al.(2014). Theproduceddictionarycon- sistsof 53,169 words.Table13 showssomeexamples of the automaticallycreatedtranscriptions.
Theautomatictranscriptionwasdoneusingthefonetica218 library,thatincludestranscriptionroutinesbasedonrulesforthe T29andtheT66levelsofMEXBET.Thislibraryimplements thefollowingfunctions:
• vocal tonica():Returnsthesameincomingwordbutwithits tonicvowelinuppercase(e.g.cAsa,pErro,gAto,etc.).
17ToseeasimilartablewhichshowsdistributionsoftheT66levelofCIEM- PIESS,seehttp://www.ciempiess.org/CIEMPIESSStatistics.html#Tabla6.
18Availableathttp://www.ciempiess.org/downloads,andforademonstration, gotohttp://www.ciempiess.org/tools.
Table12
PhonemedistributionoftheT29leveloftheCIEMPIESScomparedtotheT22levelofDIMEx100corpus.
No. Phoneme InstancesDIMEx100 PercentageDIMEx100 InstancesCIEMPIESS PercentageCIEMPIESS
1 p 6730 2.42 19,628 2.80
2 t 12,246 4.77 35,646 5.10
3 k 8464 3.30 29,649 4.24
4 b 1303 0.51 15,361 2.19
5 d 3881 1.51 34,443 4.92
6 g 426 0.17 5496 0.78
7 tS/TS 385 0.15 1567 0.22
8 f 2116 0.82 4609 0.65
9 s 20,926 8.15 68,658 9.82
10 S 0 0.0 736 0.10
11 x 1994 0.78 4209 0.60
12 Z 720 0.28 3081 0.44
13 m 7718 3.01 21,601 3.09
14 n 12,021 4.68 51,493 7.36
15 n∼ 346 0.13 855 0.12
16 r( 14,784 5.76 38,467 5.50
17 r 1625 0.63 3546 0.50
18 l 14,058 5.48 32,356 4.63
19 tl 0 0.0 1 0.00014
20 i 9705 3.78 34,063 4.87
21 e 23,434 9.13 43,267 6.19
22 a 18,927 7.38 41,601 5.95
23 o 15,088 5.88 41,888 5.99
24 u 3431 1.34 13,099 1.87
25 i7 0 0.0 16,861 2.41
26 e7 0 0.0 61,711 8.83
27 a7 0 0.0 39,234 5.61
28 o7 0 0.0 26,233 3.75
29 u7 0 0.0 9417 1.34
Table13
ComparisonoftranscriptionsinMexbetT29.
Word Enhanced-Trans MexbetT29
pe˜nasco peNAsco pen∼a7sko
sexenio seKSEnio sekse7nio
xilófono $ilOfono silo7fono
xavier JaviEr xabie7r(
xolos SOlos So7los
• TT():“TT”istheacronymfor “TextTransformation”.This function produces the text transformations in Table 14 over the incoming word. All of them are perfectly reversible.
• TTINV():Producesthereversetransformationsmadebythe TT()function.
• divsil():Returnsthesyllabificationoftheincomingword.
• T29():ProducesaphonologicaltranscriptioninMexbetT29 oftheincomingword.
• T66():Producesaphonetic transcriptioninMexbetT66 of theincomingword.
5. Languagemodel
The language modelcaptures howwords are combinedin a language. In order to createa language model, a largeset of examples of sentences inthe targetlanguage isnecessary.
Table14
TransformationsadoptedtodophonologicaltranscriptionsinMexbet.
NoASCIIsymbol:
example
Transformation:
example
Orthographic irregularity:example
Phonemeequivalence:
example
Orthographic irregularity:example
Phonemeequivalence:
example
á:cuál cuAl cc:accionar /ks/:aksionar gui:guitarra /g/:gitaRa
é:café cafE ll:llamar /Z/:Zamar que:queso /k/:keso
í:maría marIa rr:carro R:caRo qui:quizá /k/:kisA
ó:noción nociOn ps:psicología /s/:sicologIa ce:cemento /s/:semento
ú:algún algUn ge:gelatina /x/:xelatina ci:cimiento /s/:simiento
ü:güero gwero gi:gitano /x/:xitano y(endofword):buey /i/:buei
˜n:ni˜no niNo gue:guerra /g/:geRa h(nosound):hola :ola
Table15
Characteristicsoftherawtextofthenewslettersusedtocreatethelanguage model.
Description Value
Totalofnewsletters 2489
Oldestnewsletter 12:30h01/Jan/2010
Newestnewsletter 20:00h18/Feb/2013
Totalnumberoftextlines 197,786
Totalwords 1,642,782
Vocabularysize 113,313
Averageofwordspernewsletters 660
Largestnewsletter 2710
Smallestnewsletter 21
Table16
Characteristicsoftheprocessedtextutilizedtocreatethelanguagemodel.
Description Value
Totalnumberofwords 1,505,491
Totalnumberofwordswithnorepetition 49,085
Totalnumberoftextlines 279,507
Averageofwordspertextline 5.38
Numberofwordsinthelargesttextline 43 Numberofwordsinthesmallesttextline 2
Asasourceof suchexamples,weuseauniversitynewsletter whichisaboutacademicactivities.19Thecharacteristicsofthe newslettersarepresentedinTable15.
Eventhoughtheamountoftextisrelativelysmallcompared withothernewsletters,itisstillbeingoneordermagnitudebig- gerthantheamountoftranscriptionsintheCIEMPIESScorpus, anditwillnotbringlegalissuestousbecauseitbelongstoour ownuniversity.
Thetexttakenfromthenewsletterswaslaterpost-processed.
First,theyweredividedintosentences,andwefilteredpunctu- ationsignsandextracodes(e.g,HTMLandstylisticsmarks).
Thedotsandcommasweresubstitutedwiththenewlinechar- acter to create a basic segmentation of the sentences. Every text line that included any word unable to be phonetized withourT29() or T66() functionswere excludedof the final version of the text. Additionally the lines with one unique word were excluded. Every word was marked with its cor- responding tonic vowel with the help of our automatic tool:
the vocal tonica() function. Table 16 shows the properties of the text utilized to create the language model after being processed.
Table17showsthecomparisonbetweenthe20mostcommon wordsintheprocessedtextutilizedtocreatethelanguagemodel;
the MSE amongthe two worddistributions is of 7.5×10−9 withacorrelationof0.98. Thesemetricspointouttoagood comprehensionoftheSpanishlanguageofMexicoCity.
19 Newsletter from website: http://www.dgcs.unam.mx/boletin/bdboletin/
basedgcs.html.
6. Evaluationexperiments
Inthissectionweshowsomeevaluationsofdifferentaspects of the corpus.First, we showthe evaluationof the automatic transcriptionusedduringthecreationofthepronunciationdic- tionary.Second,weshowdifferentbaselinesfordifferentspeech recognizersystems:HTK(Youngetal.,2006),Sphinx(Chan, Gouvea,Singh,Ravishankar,&Rosenfeld,2007;Lee,Hon,&
Reddy,1990)andKaldi(Poveyetal.,2011)whicharestateof theartASRsystems.Third,weshowanexperimentinwhich the benefitof markingthe tonicsyllables duringthe enhance transcriptioncanbeseen.
6.1. Automatictranscription
Intheseevaluationswemeasuretheperformanceofdiffer- entfunctionsoftheautomatictranscription.Firstweevaluated theperformanceofthevocal tonica()functionwhichindicates thetonicvowelofanincomingSpanishword.Forthis,weran- domlytook1452wordsfromtheCIEMPIESSvocabulary(12%
of thecorpus)andwe predictedtheir tonictranscription. The automaticgeneratedtranscriptionsweremanuallycheckedby anexpert.Theresultisthat90.35%ofthewordswerecorrectly predicted.Mostoftheerrorsoccurredinconjugatedverbsand propernames.Table18summarizestheresultsofthisevaluation.
ThesecondevaluationfocusesontheT29()functionwhich transcribeswordsatthephonologicallevel.Inthiscaseweeval- uatedagainsttheTRANSCRÍBEMEX(Pinedaetal.,2004)and thetranscriptionsdonemanuallybyexpertsfortheDIMEx100 corpus(Pinedaetal.,2004).Inordertocomparewiththesystem TRANSCRÍBEMEX,wetookthevocabularyoftheDIMEX100 corpusbutwehadtoeliminatesomewords.Firstweremoved entrieswiththe archiphonemes20:[-B],[-D], [-G], [-N], [-R]
sincetheyarenotaone-to-onecorrespondencewithaphono- logical transcription. Then, words with the graphemex were eliminatedsinceTRANSCRÍBEMEXonlysupportsoneoffour pronunciations. After this, both systems produced the same transcription99.2%of thetimes.In ordertoevaluate against transcriptionsmadebyexperts,wetookthepronouncingdictio- naryoftheDIMEX100corpusandweremovedwordswiththe xphonemesandthealternativepronunciationsiftherewasany.
ThisshowsusthatthetranscriptionsmadebyourT29()function weresimilartothetranscriptionsmadebyexperts90.2%ofthe times.
Tables19and20summarizesthe resultsfor bothcompar- isons.Inconclusion,besidesthedifferentconventionsthereis not a noticeable difference, but when compared withhuman experts,therestillroomforimprovementofoursystem.
6.2. Benchmarksystems
Wecreatedthreebenchmarksbasedonstateoftheartsys- tems:HTK(Youngetal.,2006),Sphinx(Chanetal.,2007;Lee
20Anarchiphonemeisaphonologicalsymbolthatgroupsseveralphonemes together.Forexample,[-D]isequivalenttoanyofthephonemes/d/or/t/.
Table17
WordFrequencyofthelanguagemodelandtheCREAcorpus.
No. WordsinCREA Norm.freq.CREA Norm.freq.News No. WordsinCREA Norm.freq.CREA Norm.freq.News
1 de 0.065 0.076 11 las 0.011 0.011
2 la 0.041 0.045 12 un 0.010 0.009
3 que 0.030 0.028 13 por 0.010 0.010
4 el 0.029 0.029 14 con 0.009 0.010
5 en 0.027 0.034 15 no 0.009 0.006
6 y 0.027 0.032 16 una 0.008 0.008
7 a 0.021 0.017 17 su 0.007 0.005
8 los 0.017 0.016 18 para 0.006 0.010
9 se 0.013 0.015 19 es 0.006 0.008
10 del 0.012 0.013 20 al 0.006 0.005
Table18
Evaluationofthevocaltonica()function.
WordstakenfromtheCIEMPIESSdatabase 1539
NumberofForeignwordsomitted 87
NumberofWordsanalyzed 1452
Wrongaccentuation 140
Correctaccentuation 1312
Percentageofcorrectaccentuation 90.35%
Table19
ComparisonbetweenTRANSCRÍBEMEXandtheT29()function.
WordsinDIMEx100 11,575
Alternatepronunciations 2590
Wordswithgrapheme“x” 202
Wordswithgrapheme“x”intoanalternatepron. 87
Archiphonemes 45
Numberofwordsanalyzed 8738
Non-identicaltranscriptions 67
Identicaltranscriptions 8670
Percentageofidenticaltranscriptions 99.2%
Table20
ComparisonbetweentranscriptionsinDIMEx100dictionary(madebyhumans) andtheT29()function.
WordsinDIMEx100 11,575
Wordswithgrapheme“x” 289
Numberofwordsanalyzed 11,286
Non-identicaltranscriptions 1102
Identicaltranscriptions 10,184
Percentageofidenticaltranscriptions 90.23%
etal.,1990)andKaldi(Poveyetal., 2011).TheCIEMPIESS corpusisformattedtobeuseddirectlyinaSphinxsetting.Inthe caseofHTKwecreatedaseriesoftoolsthatcanreaddirectly theCIEMPIESScorpus21andfinallyforKaldiwecreatedthe settingfilesfortheCIEMPIESS.Wesetupaspeechrecognizer foreach systemusingthetrain setoftheCIEMPIESS corpus
21See the “HTK2SPHINX-CONVERTER” (Hernández-Mena & Herrera- Camacho, 2015) and the “HTK-BENCHMARK” available at http://www.
ciempiess.org/downloads.
Table21
Benchmarkamongdifferentsystems.
System WER(thelower
thebetter)
Sphinx 44.0%
HTK 42.45%
Kaldi 33.15%
Table22
Bestrecognitionresultsinlearningcurve.
Condition WER(thelower
thebetter)
T29TONICS 44.0%
T29NOTONICS 45.7%
T66TONICS 50.5%
T66NOTONICS 48.0%
andwe evaluatedtheperformanceutilizingitstestset. Every systemwereconfiguredwiththeircorrespondingdefaultparam- etersandatrigram-basedlanguagemodel.Table21showsthe performanceforeachsystem.
6.3. Tonicvowels
Given the characteristics of the CIEMPIESS corpus, we decidedtoevaluatetheeffectofthetonicvowelsmarkedinthe corpus.ForthiswetrainedfouracousticmodelsfortheSphinx system. These were tested in the corpus using the standard language model of the corpus. Table 22 presents the word errorrate(WER,thelowerthebetter)forsuchcases.Wecan observe that the distinctionof the tonicvowel helpsimprove theperformance.22However,theuseofphonetictranscriptions (T66 level of MEXBET) does have a negative effect on the performanceofthespeechrecognizer.
Usingthesameconfigurationsfoundwiththeexperimentof thetonicvowels,wecreatedfourlearningcurves.Fig.3presents thecurves,wealsocannoticethataphonetictranscription(T66)
22In Table22“TONICS”meansthatweusedtonicvowelmarksforthe recognitionexperimentand“NOTONICS”meansthatwedidnot.