of the
ESSLLI Student Session
1999
Amalia Todiras u (editor)
Combining Chart-Parsing and Finite State
Parsing
I. Aldezabal,K. Gojenola, M. Oronoz
InformatikaFakultatea,649 P. K.Euskal Herriko Unibertsitatea,
20080 Donostia (Euskal Herria)
E-mail: jipgogaksi.ehu.es
Abstra t
Thispaperpresentsthedevelopmentofaparsingsystemthat ombinesauni ation-basedpartial hart-parser
with nitestatete hnology. It isbeingappliedtoBasque, anagglutinativelanguage. Ina rst phase,a
uni ationgrammarisappliedtoea hsenten e,givinga hartasaresult. Thegrammarispartialandgives
agood overageofthemainelementsofthesenten e,buttheoutputisambiguous. Afterthat,theresulting
hartistreatedasanautomatontowhi hnitestatedisambiguation onstraintsandlters anbeappliedto
hoosethebestsingleanalysis. Thesystemhasbeentestedontwodierentappli ations: thea quisitionof
sub ategorizationinformationforverbs,andthedete tionofsynta ti errors.
0.1 Introdu tion
Thispaperpresentsa proje tforthedevelopment of aparsing systemthat
ombines a uni ation-based partial hart-parser with nite state te hnol-
ogy. ThesystemisbeingappliedtoBasque,anagglutinativelanguage,with
freeorder among senten e omponents.
Inarstphase,auni ationgrammarisappliedtoea hsenten e,giving
a hart as a result. The grammar is partialbut gives a omplete overage
of the main elements of the senten e, su h as noun phrases, prepositional
phrases, sentential omplements and simple senten es. It an be seen as a
shallowparser[2 ,3℄that anbeusedforsubsequentpro essing. However,it
ontainsbothmorphologi alansynta ti ambiguities,givingahugenumber
ofdierent interpretations persenten e.
After that, the resulting hart is treated as an automaton to whi h -
nite state disambiguation onstraints and lters an be applied, so that
unwanted readings or information an be dis arded or parts of a senten e
anbe hosen,dependingontheappli ation. Thesystemhasbeentestedon
two dierent appli ations: the extra tion of sub ategorization information
fora givenverb, and thedete tion of synta ti errors.
In this manner, we ombine onstru tive and redu tionisti approa hes
to parsing: inarstphase theuni ation-grammarobtainssynta ti units,
usually with many dierent interpretations, while in a se ond phase the
analysesarerestri teda ordingto the ontext andthekindofappli ation.
The remainder of this paper is organized as follows. In Se tion 2 we
presentades riptionofthe hart-parser. Se tion3des ribesthenitestate
parser and its ombination with the hart. Se tion 4 spe ies theappli a-
0.2 The hart parser
0.2.1 Previous work
Thesystemreliesondierentwide- overage toolsthathavebeendeveloped
forBasque:
The Lexi alDatabaseforBasque [4 ℄. Itisa largerepositoryof lexi al
informationwithabout70.000entries(60.000lemmas),ea hone with
its asso iatedlinguisti features: ategory,sub ategory, ase, number,
deniteness, mood and tense, among others. As this database is the
basisofthesynta ti analyzer,wemustsaythatthereisnoinformation
related to verbsub ategorization.
A morphologi al analyzer [4℄. The analyzer applies Two-Level Mor-
phology forthemorphologi al des riptionand obtains,forea h word,
itssegmentation(s)into omponentmorphemes. Themoduleisrobust
and hasfull overage of free-runningtexts inBasque.
A morphologi al disambiguator based on boththe Constraint Gram-
mar formalism [13 , 5℄ and a statisti al tagger [9℄. This tool redu es
the high word-level ambiguity from 2.65 to 1.19 interpretations, but
stillleavesa numberofdierent interpretationspersenten e.
0.2.2 The synta ti grammar
Basque being an agglutinative language, linguisti information like ase,
number and determination are given by means of morphemes appended
to the last element of a noun/prepositional phrase. Following the most
extendedlinguisti des riptionsforBasque,we tookmorphemes(seeFigure
1)asthebasi unitsuponwhi hsynta ti analysisisbased,departingfrom
thetraditionaluseofthegrammati alwordasthesynta ti unit,asisdone
innon agglutinativelanguages likeEnglish. Althoughthegureonlyshows
the ategory of ea h lexi al/synta ti unit, there is a ri h information for
ea hofthem,en odedintheformoffeaturestru tures. Moreover, afterthe
synta ti unitsareobtained, theanalysis tree isnotusedanymore.
ThePATR-IIformalismwasusedforthedenitionofthesynta ti rules.
Therewere two mainreasonsforthisele tion:
Simpli ity. Thegrammaris notlinkedto alinguisti theory,likeLFG
or HPSG,asthere hasnotbeenabroad des riptionfor Basqueusing
those formalisms [1℄. Moreover, theirappli ation would requireinfor-
mation notavailableat the moment, su h as verb sub ategorization.
PATR-II is more exibleat the ost of extra writing, as it is dened
at a lowerlevel. Aswe willexplainlater,weplanto usethisanalyzer
mendi 0 ko etxe polit 0 an mountain the of/at house nice the in noun sing genitive noun adj sing inessive
DECLENSION NMODIFIER
NP1 PP
NP2
DECLENSION
Figure 1: Parse tree for mendiko etxe politan ('in the ni e house at the
mountain')
The formalism is based on uni ation. This is useful for the ma-
nipulation of omplex linguisti stru tures for the representation of
grammati al onstituents. We must stress that the grammar uses
good linguisti granularity, in the sense that we keep all the linguis-
ti ally relevant morphosynta ti information, very ri h ompared to
most hunkingsystems. Thisfa t willenableusto useitasa general
toolfordierent appli ations.
Thegrammaratthemoment ontains120rules. Thereisanaveragenumber
of15equationsperrule,some ofthemfortesting onditionslikeagreement,
and othersforstru turebuilding. The mainphenomena overed are:
Nounphrases and prepositional phrases. Agreement among the om-
ponentelementsis veried, addedto theproperuse ofdeterminers.
Subordinate senten es, su h as sentential omplements ( ompletive
lauses,indire tquestions,...) and relative lauses.
Simplesenten esusingall thepreviouselements. Theri h agreement
betweentheverband the mainsenten e onstituents(subje t, obje t
and se ondobje t)in ase, numberand personis veried.
The omponentsfound by theparser are very reliable,as onlywell-formed
elementsareobtained,whi harene essaryforafullsenten einterpretation.
Duetothevarietyof synta ti stru turesfoundinreal senten es,thegram-
mar is limited in the sensethat onlya smallfra tion of the senten es will
re eive a full analysis. However, onstru ting a omplete grammar would
be a ostly enterprise. For that reason, this grammar is applied bottom-
up,resulting ina parser that obtainspie es of analysesor hunks, and the
resulting hart an be usedforsubsequent pro essing.
Charts have beenused before as a sour e of informationin [11 ℄, where
...
polit 0 an
mendi
mendi
0
0
ko
etxe etxe
ikusi dut nik
ko 0 0
Noun-modifier (of/at the mountain)
PP (in the nice house) PP
(in the nice house at the mountain)
S
((I) have seen (it)) NP (I) S
(I have seen (it))
Figure2: State of the hartafter theanalysisof Mendiko etxe politan ikusi
dutnik ('Ihave seen (it)intheni e houseat themountain')
is found, with the aim of orre ting a synta ti error. In fa t, this sys-
temassumed thatevery orre tsenten e getsa ompleteanalysis,whilewe
onsiderthat quite often the grammar will notbe able to parse the whole
senten e. Even inthe ase whenwe want todete t agrammati al error,we
donotassume thatthe orre ted senten e wouldgive a ompleteparse.
0.3 The Finite State Parser
In re ent years the eld of nite state systems has gained mu h attention
[12 ,13 ℄. A hara teristi thatallnitestatesystemshavein ommonisthat
theyworkonrealtextsdealingwithunitsbiggerthanthegrammati alword,
usingmore elaborated information,butwithoutrea hingthe omplexityof
fullsenten eanalysis.
Astheoutputofour hartparserisambiguous,weobtainalargenumber
ofpotentialanalyses( on atenationsof hunks overingthewholesenten e).
Weneededatoolforthesele tionofpatternsoverthenal hart,andnite
state te hnology is adequate for this task. Currently we use the Xerox
FiniteState Tool(XFST 1
). Asalinkbetweenbothparsers,the hartmust
be onvertedintoanautomaton,whi h anthenbepro essedbyXFST(see
Figure 2). In the gure, dashed linesare used to indi ate lexi al elements,
whileplainlines denesynta ti units(thoseobtained by the hart-parser,
see Figure 1). The bold ir les represent word-boundaries, and the plain
ones delimit morpheme-boundaries. As before, ea h ar in the gure is
represented by its morphosynta ti information, in the form of a sequen e
offeature-valuepairs.
Inthe dierent appli ations, we have usedsome ommonoperations:
Disambiguation. Twokindsofambiguitywere onsidered: morphosyn-
ta ti ambiguityleft by the Constraint Grammar and sto hasti dis-
ambiguation pro esses and that introdu ed by the hart parser. As
1
wholesynta ti units anbeused,thispro essissimilartothatofCG
disambiguationwiththeadvantageofbeingabletoreferen esynta ti
unitslargerthanthe word.
Filtering. Dependingonea happli ation,sometimesnotalltheavail-
able informationis relevant. For example, one diÆ ultkindof ambi-
guity,noun/adje tive,asinzuriekin (zuri='white','withthewhites'/
'with the white ones') an be ignored when a quiring verb sub ate-
gorization information, as we are mainly interested in the synta ti
ategory and thegrammati al ase (prepositionalphraseand ommi-
tative, respe tively), thesame inbothalternatives.
Extra ting partsof a senten e. Theglobal ambiguityof a senten e is
onsiderablyredu ed ifonly part of it is onsidered. Forinstan e, in
the ase of extra ting verb sub ategorization information, some rules
examine the ontext of the target verb and dene the s ope of the
subsenten eto whi htheprevious operationswillbeapplied.
Among the nite state operators used (see Figure 3 2
), we apply omposi-
tion,interse tionandunionofregularexpressionsandtransdu ers 3
. Weuse
bothordinary ompositionandthere entlydenedlenient omposition [10 ℄.
This operator allows the appli ation of dierent eliminating onstraints to
asenten e,alwayswiththe ertaintythatwhensome given onstraintelim-
inatesall the interpretations, thenthe onstraint is notappliedat all, that
is,theinterpretationsare"res ued". Astheoperatoris denedinterms of
regularrelations it an be omposed withother automata.
0.4 Appli ations
0.4.1 A quisition of sub ategorization information
We are interested in automati ally obtaining verb sub ategorization infor-
mationfrom orpora[7, 8℄. Ourobje tive is to enri h theverbentries on-
tainedinthelexi aldatabasewithinformationaboutthemain onstituents
(noun phrases, prepositional omplements and subordinate senten es) ap-
pearingwithea h verb.
The hartparseranalyzesthemainunitsinea hsenten eandthennite
state rules arry outthefollowingtasks(see Figure 3):
Re ognition of therelevant subsenten e. The mean numberof words
persenten eis22,rangingfromone tosixsubsenten es. Theproblem
isthat ofdelimiting thepart orrespondingto the target verb.
2
The .o. and & operators denote respe tively the omposition and interse tion of
regularlanguagesandrelations.
3
Infa t, Figure 3representsa simpli ationof thesystem, as ea hoperation inthe
Sentence
Chart parser
chart (automaton)
.o.
SubSentenceRecognizers .o.
DisambiguatingFilters .o.
LongestPhraseFilters .o.
OutputCleaningFilters
verb + elements No Error / Error Type(s) .o.
ErrorPattern & NotException .o.
OutputCleaningFilters
Figure 3: Dierent appli ationsof thesystem
Disambiguation. Some heuristi sareused to solve theremainingam-
biguities, related,amongothers, to dierent typesof subordination.
Sele tion of the longest NP/PPs. This heuristi proved to be very
reliable to ex lude multiplereadingsof asingle NP/PP.
Cleaning the output. The result of the previous phase ontains a
lot of morphsynta ti information, but most of it is super uous for
thisappli ation,as we aremainlyinterested inthegrammati al ase,
numberand subordinationtype of ea helement.
Thedesignofthenitestaterulesisanon-trivialtaskwhendealingwith
real texts. Nevertheless, the de larativeness of nite state tools improves
signi antly our previous version of the system implemented by means of
ad ho programs, diÆ ult to orre t and maintain. Moreover, XFST also
improves eÆ ien y. About 350 nite state denitions were made for this
appli ation.
As a rst experiment, 500 senten es for ea h of 5 dierent verbs were
sele ted. Figure 4 shows the ases that have mostlyappeared around ea h
of the sele ted verbs. When there was more than a single analysis for a
subsenten e, it was not taken into a ount. This eliminated about 5% of
thesenten es. Thisisnotaproblemaslongasthe orpusisbigenough[7℄.
Theabsolutive ase (theone that usuallyrepresentsboththeobje tof the
transitiveverbsandthesubje toftheintransitiveones)isnotin ludedsin e
0 5 10 15 20 25 30 35 40 %
agert u ( t o appear)
at era ( t o get out )
erabili ( t o use)
joan ( t o go)
ikusi ( t o s ee)
abl ala erg ine la soz t zeko
Figure4: Frequen yofelementsforea h verb
to hara terize the dierent behaviour of the verbs. On the other hand,
we nd ases that seem to have lose relationship to a given verb, su h as
the ablative ase ('from') in the atera verb ('to go out'), the alative ase
('to') in the joan verb ('to go') and the inessive ase ('in') in the agertu
verb ('to appear'). Both the ablative and alative ases usually represent
respe tively the sour e and the goal of an a tion, whi h means that both
represent a movement. As for the inessive ase, it represents the spatio-
temporal oordinates of an entity or event. That is why it appears in all
theverbs. Finally,thereare ases thatshowanex lusivetenden ytowarda
givenverb,su h asthe ompletivesubordinate-la ('that') intheikusi verb
('to see'), and theadverbial subordinate-tzeko ('for', 'so that').
Allthesepropertiesmake learthatverbstakeotherrelevant asesapart
from the ones representing the obje t and subje t of the senten es, and
thereforeit doesnotstrike usasimplausibleto assume thattheyshouldbe
taken into a ount inthe mainstru ture of the verbs. In other words, the
system anhelp usinde idingwhether agivenphrase anbe onsideredas
an argument forthetreatedverb.
Added to this experiment, we are also working on getting statisti al
patterns from thedierent ombinations ofthe elementsappearinginea h
senten e. Wewillapplythesystemto8.000senten es(about200.000words)
orresponding to 10 new verbs. In thisway, we expe t to obtain patterns
that willlet us groupall verbs indierent lasses. It will be also a wayto
verifythe suggestionsand similaritiesgot from theaboveexperiment.
Regarding evaluation, there are not already existing resour es (in the
form of annotated orpus or di tionaries) with sub ategorization informa-
tionto automati ally omparewiththeresultsofthetool, asin[7,8℄. Asa
rsttestwemanuallyevaluatedtheresultsafterapplyingourtoolto50sen-
Con erningthere ognitionofsubsenten esto whi hea hverb applies,it is
done orre tly 82% of the times. Sometimesthe subsenten e is easily re -
ognizable(forexample, whendelimitedbypun tuationmarks), whileother
ases are diÆ ult due to the nested stru ture of omplex, long senten es.
We also ounted the number of times that the verb is orre tly returned
withall its relevant synta ti omponents (without examiningtheir role as
arguments or adjun ts). This was performed orre tly in 70% of the sen-
ten es. Afterexaminingtheglobalerrorrate(30%,thatis,thetotalnumber
oferrorsin ludingthein orre tlyre ognizedsubsenten es), wefoundroom
forimprovement: halfoftheerrorswereduetothepreviousdisambiguation
pro ess (either the in orre t sele tion was hosen or there was more than
an alternative), while a third of errors were aused by the in ompleteness
of the uni ation grammar. Although these problems will always be error
sour es, we feel that after adding some simple and general linguisti rules
theywill orre t mostof theerrors.
Subsentence recognition Syntactic elements
Correct Incorrect Correct Incorrect
82% 18% 70% 30%
Table1. Resultsofthesystemapplied to50senten es.
0.4.2 Synta ti error dete tion
The system has been tested on errors in real texts written by learners of
Basque. As a preliminary experiment, we hose errors related with date-
expressionsfordierent reasons:
The ontext ofappli ation iswideand well-denedinBasque,aswell
as themost ommonerrors. In Basque,date-expressions like 'Donos-
tian,1995ekomaiatzaren15ean'(Donostia, 15thofJuly,1995)require
that some of the elements are in e ted, sometimes a ording to an-
other ontiguous element.
As it is relatively easy to obtain test data we eliminate one of the
mainproblemswhendealingwithsynta ti errorsinrealtexts: nding
erroneous senten esforea h phenomena.
Inorder to re ognize 6 dierenterror types, we needed more than 100 def-
initionsof nitestate patterns fortheir treatment. We have useda orpus
onsisting of 70 dates (in luding orre t and in orre t ones) for the de-
nitionof the patterns. Apart from the sixkinds of errors, we also tried to
a ountforthemostfrequent ombinationsoferrors. Atthemomentweare
inthe pro ess of getting new orpora with erroneous senten es for testing.
again a exibleand systemati way to solve it. We plan to extend thesys-
tem to other typesof errors, like agreement between themain omponents
ofthesenten e,whi hisvery ri hinBasque,beingasour eofmanyerrors.
0.5 Con lusions
Thisworkpresentsthedevelopment of asystemwhi h ombines:
Asynta ti grammarforBasque. It oversthemain omponentsofthe
senten e. Dueto theagglutinativenature ofBasque,theintrodu tion
of the synta ti level greatly improves the abilityto treat the wealth
of information ontained in ea h word/morpheme. The re ognized
omponents an beusedfor posteriorpro essing.
Finite state rules. They provide a modular, de larative and exible
workben htodealwiththeresulting hart,makinguseoftheavailable
informationfordierenttasks,su hassynta ti errordete tionorthe
a quisitionof sub ategorizationinformation.
This ombinationresultsvery adequate forshallowparsingappli ations
to languages like Basque, with limited grammati al resour es ompared to
otherlanguages. Thefreeorderof onstituents,amongotheraspe ts,would
makethedesignofafullgrammaravery omplextask. Moreover,thepartial
grammaris enoughfor theintended appli ations. The onsiderationof the
hart asthe interfa e betweenboth subsystems also adds to the simpli ity
ofthe ombined tool. Finally,wemustemphasizetwo main on lusions:
Thestratiedpartialparsingapproa hprovidesapowerfulwayto on-
sidersimultaneouslyinformationatmorpheme,wordandphraselevel,
adequate for agglutinative languages where the grammati al word is
notthestartingpoint ofthe analysis.
The uni ationgrammar and thenite state system are omplemen-
tary. The grammar is ne essary to treat aspe ts like omplex agree-
mentandwordordervariations, urrentlyunsolvableusingnitestate
networks, due to the exponential growth in size of the resulting au-
tomata [6℄. On the other hand, regular expressions, in the form of
automataandtransdu ers,aresuitableforoperationslikedisambigua-
tionand ltering.
A knowledgements
Thisworkwas partiallysupported bytheBasque Government,the Univer-
sityoftheBasqueCountry andtheCICYT. Thanksto GorkaElordietafor
hishelp writingthenal versionof thisdo ument.
[1℄ J.Abaitua. Complexpredi ates inBasque: from lexi alformsto fun -
tionalstru tures. PhDThesis, Univ.of Man hester, 1988.
[2℄ S.Abney.Part-of-Spee hTaggingandPartialParsing.InCorpus-Based
Methods in Language and Spee h Pro essing, Kluwer, Dordre ht, 1997.
[3℄ S. Ait-Mokhtar, J-P. Chanod. In remental Finite-State Parsing. In
Pro eedings of ANLP-97,Washington,1997.
[4℄ I.Alegria, X.Artola, K.Sarasola,M. Urkia. Automati morphologi al
analysisofBasque. InLiterary &Linguisti Computing,Vol. 11, 1996.
[5℄ I. Alegria, J. M. Arriola, X. Artola, A.Diaz de Ilarraza, K. Gojenola,
M. Maritxalar,I. Aduriz. Morphosynta ti disambiguationfor Basque
basedon theConstraintGrammar Formalism. InRANLP, 1997.
[6℄ K. Beesley. Constraining Separate Morphota ti Dependen ies in
Finite-StateGrammars. InPro eedings of the International Workshop
onFiniteStateMethodsinNatural Language Pro essing,Ankara,1998.
[7℄ M.Brent. From Grammarto Lexi on: UnsupervisedLearningof Lexi-
alSyntax. In Computational Linguisti s, vol. 19(2),1993.
[8℄ T.Bris oe,J.Carroll. Automati Extra tionofSub ategorizationfrom
Corpora. InPro eedings of ANLP-97,Washington,1997.
[9℄ N. Ezeiza, I. Alegria, J.M. Arriola, R. Urizar, I. Aduriz. Combining
Sto hasti and Rule-Based Methods forDisambiguation in Agglutina-
tiveLanguages. InPro eedings of COLING-ACL-98,Montreal, 1998.
[10℄ L.Karttunen. TheProperTreatment of OptimalityinComputational
Phonology. In Pro eedings of the International Workshop on Finite
StateMethods in Natural Language Pro essing, Ankara, 1998.
[11℄ C.Mellish.SomeChart-BasedTe hniquesforParsingIll-FormedInput.
InPro eedings of EACL-89,1989.
[12℄ E.Ro he,Y. S habes. Finite-State Language Pro essing. MIT,1997.
[13℄ A.Voutilainen. Asyntax-basedpart-of-spee hanalyser. InPro eedings
of EACL-95,Dublin,1995.
11
Pro eedingsoftheESSLLIStudentSession1999
AmaliaTodiras u(editor).