eedig fhe ESSS deSei 1999 Aa

(1)

of the

ESSLLI Student Session

1999

Amalia Todiras u (editor)

(2)

(3)

Combining Chart-Parsing and Finite State

Parsing

I. Aldezabal,K. Gojenola, M. Oronoz

InformatikaFakultatea,649 P. K.Euskal Herriko Unibertsitatea,

20080 Donostia (Euskal Herria)

E-mail: jipgogaksi.ehu.es

Abstra t

Thispaperpresentsthedevelopmentofaparsingsystemthat ombinesauni ation-basedpartial hart-parser

with nitestatete hnology. It isbeingappliedtoBasque, anagglutinativelanguage. Ina rst phase,a

uni ationgrammarisappliedtoea hsenten e,givinga hartasaresult. Thegrammarispartialandgives

agood overageofthemainelementsofthesenten e,buttheoutputisambiguous. Afterthat,theresulting

hartistreatedasanautomatontowhi hnitestatedisambiguation onstraintsandlters anbeappliedto

hoosethebestsingleanalysis. Thesystemhasbeentestedontwodierentappli ations: thea quisitionof

sub ategorizationinformationforverbs,andthedete tionofsynta ti errors.

0.1 Introdu tion

Thispaperpresentsa proje tforthedevelopment of aparsing systemthat

ombines a uni ation-based partial hart-parser with nite state te hnol-

ogy. ThesystemisbeingappliedtoBasque,anagglutinativelanguage,with

freeorder among senten e omponents.

Inarstphase,auni ationgrammarisappliedtoea hsenten e,giving

a hart as a result. The grammar is partialbut gives a omplete overage

of the main elements of the senten e, su h as noun phrases, prepositional

phrases, sentential omplements and simple senten es. It an be seen as a

shallowparser[2 ,3℄that anbeusedforsubsequentpro essing. However,it

ontainsbothmorphologi alansynta ti ambiguities,givingahugenumber

ofdierent interpretations persenten e.

After that, the resulting hart is treated as an automaton to whi h -

nite state disambiguation onstraints and lters an be applied, so that

unwanted readings or information an be dis arded or parts of a senten e

anbe hosen,dependingontheappli ation. Thesystemhasbeentestedon

two dierent appli ations: the extra tion of sub ategorization information

fora givenverb, and thedete tion of synta ti errors.

In this manner, we ombine onstru tive and redu tionisti approa hes

to parsing: inarstphase theuni ation-grammarobtainssynta ti units,

usually with many dierent interpretations, while in a se ond phase the

analysesarerestri teda ordingto the ontext andthekindofappli ation.

The remainder of this paper is organized as follows. In Se tion 2 we

presentades riptionofthe hart-parser. Se tion3des ribesthenitestate

parser and its ombination with the hart. Se tion 4 spe ies theappli a-

(4)

0.2 The hart parser

0.2.1 Previous work

Thesystemreliesondierentwide- overage toolsthathavebeendeveloped

forBasque:

The Lexi alDatabaseforBasque [4 ℄. Itisa largerepositoryof lexi al

informationwithabout70.000entries(60.000lemmas),ea hone with

its asso iatedlinguisti features: ategory,sub ategory, ase, number,

deniteness, mood and tense, among others. As this database is the

basisofthesynta ti analyzer,wemustsaythatthereisnoinformation

related to verbsub ategorization.

A morphologi al analyzer [4℄. The analyzer applies Two-Level Mor-

phology forthemorphologi al des riptionand obtains,forea h word,

itssegmentation(s)into omponentmorphemes. Themoduleisrobust

and hasfull overage of free-runningtexts inBasque.

A morphologi al disambiguator based on boththe Constraint Gram-

mar formalism [13 , 5℄ and a statisti al tagger [9℄. This tool redu es

the high word-level ambiguity from 2.65 to 1.19 interpretations, but

stillleavesa numberofdierent interpretationspersenten e.

0.2.2 The synta ti grammar

Basque being an agglutinative language, linguisti information like ase,

number and determination are given by means of morphemes appended

to the last element of a noun/prepositional phrase. Following the most

extendedlinguisti des riptionsforBasque,we tookmorphemes(seeFigure

1)asthebasi unitsuponwhi hsynta ti analysisisbased,departingfrom

thetraditionaluseofthegrammati alwordasthesynta ti unit,asisdone

innon agglutinativelanguages likeEnglish. Althoughthegureonlyshows

the ategory of ea h lexi al/synta ti unit, there is a ri h information for

ea hofthem,en odedintheformoffeaturestru tures. Moreover, afterthe

synta ti unitsareobtained, theanalysis tree isnotusedanymore.

ThePATR-IIformalismwasusedforthedenitionofthesynta ti rules.

Therewere two mainreasonsforthisele tion:

Simpli ity. Thegrammaris notlinkedto alinguisti theory,likeLFG

or HPSG,asthere hasnotbeenabroad des riptionfor Basqueusing

those formalisms [1℄. Moreover, theirappli ation would requireinfor-

mation notavailableat the moment, su h as verb sub ategorization.

PATR-II is more exibleat the ost of extra writing, as it is dened

at a lowerlevel. Aswe willexplainlater,weplanto usethisanalyzer

(5)

mendi 0 ko etxe polit 0 an mountain the of/at house nice the in noun sing genitive noun adj sing inessive

DECLENSION NMODIFIER

NP1 PP

NP2

DECLENSION

Figure 1: Parse tree for mendiko etxe politan ('in the ni e house at the

mountain')

The formalism is based on uni ation. This is useful for the ma-

nipulation of omplex linguisti stru tures for the representation of

grammati al onstituents. We must stress that the grammar uses

good linguisti granularity, in the sense that we keep all the linguis-

ti ally relevant morphosynta ti information, very ri h ompared to

most hunkingsystems. Thisfa t willenableusto useitasa general

toolfordierent appli ations.

Thegrammaratthemoment ontains120rules. Thereisanaveragenumber

of15equationsperrule,some ofthemfortesting onditionslikeagreement,

and othersforstru turebuilding. The mainphenomena overed are:

Nounphrases and prepositional phrases. Agreement among the om-

ponentelementsis veried, addedto theproperuse ofdeterminers.

Subordinate senten es, su h as sentential omplements ( ompletive

lauses,indire tquestions,...) and relative lauses.

Simplesenten esusingall thepreviouselements. Theri h agreement

betweentheverband the mainsenten e onstituents(subje t, obje t

and se ondobje t)in ase, numberand personis veried.

The omponentsfound by theparser are very reliable,as onlywell-formed

elementsareobtained,whi harene essaryforafullsenten einterpretation.

Duetothevarietyof synta ti stru turesfoundinreal senten es,thegram-

mar is limited in the sensethat onlya smallfra tion of the senten es will

re eive a full analysis. However, onstru ting a omplete grammar would

be a ostly enterprise. For that reason, this grammar is applied bottom-

up,resulting ina parser that obtainspie es of analysesor hunks, and the

resulting hart an be usedforsubsequent pro essing.

Charts have beenused before as a sour e of informationin [11 ℄, where

(6)

...

polit 0 an

mendi

0

0 ko

etxe etxe

ikusi dut nik

ko 0 0

Noun-modifier (of/at the mountain)

PP (in the nice house) PP

(in the nice house at the mountain)

S

((I) have seen (it)) NP (I) S

(I have seen (it))

Figure2: State of the hartafter theanalysisof Mendiko etxe politan ikusi

dutnik ('Ihave seen (it)intheni e houseat themountain')

is found, with the aim of orre ting a synta ti error. In fa t, this sys-

temassumed thatevery orre tsenten e getsa ompleteanalysis,whilewe

onsiderthat quite often the grammar will notbe able to parse the whole

senten e. Even inthe ase whenwe want todete t agrammati al error,we

donotassume thatthe orre ted senten e wouldgive a ompleteparse.

0.3 The Finite State Parser

In re ent years the eld of nite state systems has gained mu h attention

[12 ,13 ℄. A hara teristi thatallnitestatesystemshavein ommonisthat

theyworkonrealtextsdealingwithunitsbiggerthanthegrammati alword,

usingmore elaborated information,butwithoutrea hingthe omplexityof

fullsenten eanalysis.

Astheoutputofour hartparserisambiguous,weobtainalargenumber

ofpotentialanalyses( on atenationsof hunks overingthewholesenten e).

Weneededatoolforthesele tionofpatternsoverthenal hart,andnite

state te hnology is adequate for this task. Currently we use the Xerox

FiniteState Tool(XFST 1

). Asalinkbetweenbothparsers,the hartmust

be onvertedintoanautomaton,whi h anthenbepro essedbyXFST(see

Figure 2). In the gure, dashed linesare used to indi ate lexi al elements,

whileplainlines denesynta ti units(thoseobtained by the hart-parser,

see Figure 1). The bold ir les represent word-boundaries, and the plain

ones delimit morpheme-boundaries. As before, ea h ar in the gure is

represented by its morphosynta ti information, in the form of a sequen e

offeature-valuepairs.

Inthe dierent appli ations, we have usedsome ommonoperations:

Disambiguation. Twokindsofambiguitywere onsidered: morphosyn-

ta ti ambiguityleft by the Constraint Grammar and sto hasti dis-

ambiguation pro esses and that introdu ed by the hart parser. As

1

(7)

wholesynta ti units anbeused,thispro essissimilartothatofCG

disambiguationwiththeadvantageofbeingabletoreferen esynta ti

unitslargerthanthe word.

Filtering. Dependingonea happli ation,sometimesnotalltheavail-

able informationis relevant. For example, one diÆ ultkindof ambi-

guity,noun/adje tive,asinzuriekin (zuri='white','withthewhites'/

'with the white ones') an be ignored when a quiring verb sub ate-

gorization information, as we are mainly interested in the synta ti

ategory and thegrammati al ase (prepositionalphraseand ommi-

tative, respe tively), thesame inbothalternatives.

Extra ting partsof a senten e. Theglobal ambiguityof a senten e is

onsiderablyredu ed ifonly part of it is onsidered. Forinstan e, in

the ase of extra ting verb sub ategorization information, some rules

examine the ontext of the target verb and dene the s ope of the

subsenten eto whi htheprevious operationswillbeapplied.

Among the nite state operators used (see Figure 3 2

), we apply omposi-

tion,interse tionandunionofregularexpressionsandtransdu ers 3

. Weuse

bothordinary ompositionandthere entlydenedlenient omposition [10 ℄.

This operator allows the appli ation of dierent eliminating onstraints to

asenten e,alwayswiththe ertaintythatwhensome given onstraintelim-

inatesall the interpretations, thenthe onstraint is notappliedat all, that

is,theinterpretationsare"res ued". Astheoperatoris denedinterms of

regularrelations it an be omposed withother automata.

0.4 Appli ations

0.4.1 A quisition of sub ategorization information

We are interested in automati ally obtaining verb sub ategorization infor-

mationfrom orpora[7, 8℄. Ourobje tive is to enri h theverbentries on-

tainedinthelexi aldatabasewithinformationaboutthemain onstituents

(noun phrases, prepositional omplements and subordinate senten es) ap-

pearingwithea h verb.

The hartparseranalyzesthemainunitsinea hsenten eandthennite

state rules arry outthefollowingtasks(see Figure 3):

Re ognition of therelevant subsenten e. The mean numberof words

persenten eis22,rangingfromone tosixsubsenten es. Theproblem

isthat ofdelimiting thepart orrespondingto the target verb.

2

The .o. and & operators denote respe tively the omposition and interse tion of

regularlanguagesandrelations.

3

Infa t, Figure 3representsa simpli ationof thesystem, as ea hoperation inthe

(8)

Sentence

Chart parser

chart (automaton)

.o.

SubSentenceRecognizers .o.

DisambiguatingFilters .o.

LongestPhraseFilters .o.

OutputCleaningFilters

verb + elements No Error / Error Type(s) .o.

ErrorPattern & NotException .o.

OutputCleaningFilters

Figure 3: Dierent appli ationsof thesystem

Disambiguation. Some heuristi sareused to solve theremainingam-

biguities, related,amongothers, to dierent typesof subordination.

Sele tion of the longest NP/PPs. This heuristi proved to be very

reliable to ex lude multiplereadingsof asingle NP/PP.

Cleaning the output. The result of the previous phase ontains a

lot of morphsynta ti information, but most of it is super uous for

thisappli ation,as we aremainlyinterested inthegrammati al ase,

numberand subordinationtype of ea helement.

Thedesignofthenitestaterulesisanon-trivialtaskwhendealingwith

real texts. Nevertheless, the de larativeness of nite state tools improves

signi antly our previous version of the system implemented by means of

ad ho programs, diÆ ult to orre t and maintain. Moreover, XFST also

improves eÆ ien y. About 350 nite state denitions were made for this

appli ation.

As a rst experiment, 500 senten es for ea h of 5 dierent verbs were

sele ted. Figure 4 shows the ases that have mostlyappeared around ea h

of the sele ted verbs. When there was more than a single analysis for a

subsenten e, it was not taken into a ount. This eliminated about 5% of

thesenten es. Thisisnotaproblemaslongasthe orpusisbigenough[7℄.

Theabsolutive ase (theone that usuallyrepresentsboththeobje tof the

transitiveverbsandthesubje toftheintransitiveones)isnotin ludedsin e

(9)

0 5 10 15 20 25 30 35 40 %

agert u ( t o appear)

at era ( t o get out )

erabili ( t o use)

joan ( t o go)

ikusi ( t o s ee)

abl ala erg ine la soz t zeko

Figure4: Frequen yofelementsforea h verb

to hara terize the dierent behaviour of the verbs. On the other hand,

we nd ases that seem to have lose relationship to a given verb, su h as

the ablative ase ('from') in the atera verb ('to go out'), the alative ase

('to') in the joan verb ('to go') and the inessive ase ('in') in the agertu

verb ('to appear'). Both the ablative and alative ases usually represent

respe tively the sour e and the goal of an a tion, whi h means that both

represent a movement. As for the inessive ase, it represents the spatio-

temporal oordinates of an entity or event. That is why it appears in all

theverbs. Finally,thereare ases thatshowanex lusivetenden ytowarda

givenverb,su h asthe ompletivesubordinate-la ('that') intheikusi verb

('to see'), and theadverbial subordinate-tzeko ('for', 'so that').

Allthesepropertiesmake learthatverbstakeotherrelevant asesapart

from the ones representing the obje t and subje t of the senten es, and

thereforeit doesnotstrike usasimplausibleto assume thattheyshouldbe

taken into a ount inthe mainstru ture of the verbs. In other words, the

system anhelp usinde idingwhether agivenphrase anbe onsideredas

an argument forthetreatedverb.

Added to this experiment, we are also working on getting statisti al

patterns from thedierent ombinations ofthe elementsappearinginea h

senten e. Wewillapplythesystemto8.000senten es(about200.000words)

orresponding to 10 new verbs. In thisway, we expe t to obtain patterns

that willlet us groupall verbs indierent lasses. It will be also a wayto

verifythe suggestionsand similaritiesgot from theaboveexperiment.

Regarding evaluation, there are not already existing resour es (in the

form of annotated orpus or di tionaries) with sub ategorization informa-

tionto automati ally omparewiththeresultsofthetool, asin[7,8℄. Asa

rsttestwemanuallyevaluatedtheresultsafterapplyingourtoolto50sen-

(10)

Con erningthere ognitionofsubsenten esto whi hea hverb applies,it is

done orre tly 82% of the times. Sometimesthe subsenten e is easily re -

ognizable(forexample, whendelimitedbypun tuationmarks), whileother

ases are diÆ ult due to the nested stru ture of omplex, long senten es.

We also ounted the number of times that the verb is orre tly returned

withall its relevant synta ti omponents (without examiningtheir role as

arguments or adjun ts). This was performed orre tly in 70% of the sen-

ten es. Afterexaminingtheglobalerrorrate(30%,thatis,thetotalnumber

oferrorsin ludingthein orre tlyre ognizedsubsenten es), wefoundroom

forimprovement: halfoftheerrorswereduetothepreviousdisambiguation

pro ess (either the in orre t sele tion was hosen or there was more than

an alternative), while a third of errors were aused by the in ompleteness

of the uni ation grammar. Although these problems will always be error

sour es, we feel that after adding some simple and general linguisti rules

theywill orre t mostof theerrors.

Subsentence recognition Syntactic elements

Correct Incorrect Correct Incorrect

82% 18% 70% 30%

Table1. Resultsofthesystemapplied to50senten es.

0.4.2 Synta ti error dete tion

The system has been tested on errors in real texts written by learners of

Basque. As a preliminary experiment, we hose errors related with date-

expressionsfordierent reasons:

The ontext ofappli ation iswideand well-denedinBasque,aswell

as themost ommonerrors. In Basque,date-expressions like 'Donos-

tian,1995ekomaiatzaren15ean'(Donostia, 15thofJuly,1995)require

that some of the elements are in e ted, sometimes a ording to an-

other ontiguous element.

As it is relatively easy to obtain test data we eliminate one of the

mainproblemswhendealingwithsynta ti errorsinrealtexts: nding

erroneous senten esforea h phenomena.

Inorder to re ognize 6 dierenterror types, we needed more than 100 def-

initionsof nitestate patterns fortheir treatment. We have useda orpus

onsisting of 70 dates (in luding orre t and in orre t ones) for the de-

nitionof the patterns. Apart from the sixkinds of errors, we also tried to

a ountforthemostfrequent ombinationsoferrors. Atthemomentweare

inthe pro ess of getting new orpora with erroneous senten es for testing.

(11)

again a exibleand systemati way to solve it. We plan to extend thesys-

tem to other typesof errors, like agreement between themain omponents

ofthesenten e,whi hisvery ri hinBasque,beingasour eofmanyerrors.

0.5 Con lusions

Thisworkpresentsthedevelopment of asystemwhi h ombines:

Asynta ti grammarforBasque. It oversthemain omponentsofthe

senten e. Dueto theagglutinativenature ofBasque,theintrodu tion

of the synta ti level greatly improves the abilityto treat the wealth

of information ontained in ea h word/morpheme. The re ognized

omponents an beusedfor posteriorpro essing.

Finite state rules. They provide a modular, de larative and exible

workben htodealwiththeresulting hart,makinguseoftheavailable

informationfordierenttasks,su hassynta ti errordete tionorthe

a quisitionof sub ategorizationinformation.

This ombinationresultsvery adequate forshallowparsingappli ations

to languages like Basque, with limited grammati al resour es ompared to

otherlanguages. Thefreeorderof onstituents,amongotheraspe ts,would

makethedesignofafullgrammaravery omplextask. Moreover,thepartial

grammaris enoughfor theintended appli ations. The onsiderationof the

hart asthe interfa e betweenboth subsystems also adds to the simpli ity

ofthe ombined tool. Finally,wemustemphasizetwo main on lusions:

Thestratiedpartialparsingapproa hprovidesapowerfulwayto on-

sidersimultaneouslyinformationatmorpheme,wordandphraselevel,

adequate for agglutinative languages where the grammati al word is

notthestartingpoint ofthe analysis.

The uni ationgrammar and thenite state system are omplemen-

tary. The grammar is ne essary to treat aspe ts like omplex agree-

mentandwordordervariations, urrentlyunsolvableusingnitestate

networks, due to the exponential growth in size of the resulting au-

tomata [6℄. On the other hand, regular expressions, in the form of

automataandtransdu ers,aresuitableforoperationslikedisambigua-

tionand ltering.

A knowledgements

Thisworkwas partiallysupported bytheBasque Government,the Univer-

sityoftheBasqueCountry andtheCICYT. Thanksto GorkaElordietafor

hishelp writingthenal versionof thisdo ument.

(12)

(13)

[1℄ J.Abaitua. Complexpredi ates inBasque: from lexi alformsto fun -

tionalstru tures. PhDThesis, Univ.of Man hester, 1988.

[2℄ S.Abney.Part-of-Spee hTaggingandPartialParsing.InCorpus-Based

Methods in Language and Spee h Pro essing, Kluwer, Dordre ht, 1997.

[3℄ S. Ait-Mokhtar, J-P. Chanod. In remental Finite-State Parsing. In

Pro eedings of ANLP-97,Washington,1997.

[4℄ I.Alegria, X.Artola, K.Sarasola,M. Urkia. Automati morphologi al

analysisofBasque. InLiterary &Linguisti Computing,Vol. 11, 1996.

[5℄ I. Alegria, J. M. Arriola, X. Artola, A.Diaz de Ilarraza, K. Gojenola,

M. Maritxalar,I. Aduriz. Morphosynta ti disambiguationfor Basque

basedon theConstraintGrammar Formalism. InRANLP, 1997.

[6℄ K. Beesley. Constraining Separate Morphota ti Dependen ies in

Finite-StateGrammars. InPro eedings of the International Workshop

onFiniteStateMethodsinNatural Language Pro essing,Ankara,1998.

[7℄ M.Brent. From Grammarto Lexi on: UnsupervisedLearningof Lexi-

alSyntax. In Computational Linguisti s, vol. 19(2),1993.

[8℄ T.Bris oe,J.Carroll. Automati Extra tionofSub ategorizationfrom

Corpora. InPro eedings of ANLP-97,Washington,1997.

[9℄ N. Ezeiza, I. Alegria, J.M. Arriola, R. Urizar, I. Aduriz. Combining

Sto hasti and Rule-Based Methods forDisambiguation in Agglutina-

tiveLanguages. InPro eedings of COLING-ACL-98,Montreal, 1998.

[10℄ L.Karttunen. TheProperTreatment of OptimalityinComputational

Phonology. In Pro eedings of the International Workshop on Finite

StateMethods in Natural Language Pro essing, Ankara, 1998.

[11℄ C.Mellish.SomeChart-BasedTe hniquesforParsingIll-FormedInput.

InPro eedings of EACL-89,1989.

[12℄ E.Ro he,Y. S habes. Finite-State Language Pro essing. MIT,1997.

[13℄ A.Voutilainen. Asyntax-basedpart-of-spee hanalyser. InPro eedings

of EACL-95,Dublin,1995.

11

Pro eedingsoftheESSLLIStudentSession1999

AmaliaTodiras u(editor).

 eedig fhe ESSS deSei 1999 Aa

mendi 0 ko etxe polit 0 an mountain the of/at house nice the in noun sing genitive noun adj sing inessive

DECLENSION NMODIFIER

NP1 PP

NP2

DECLENSION

...

polit 0 an

mendi

mendi

0

0

ko

etxe etxe

ikusi dut nik

ko 0 0

Noun-modifier (of/at the mountain)

PP (in the nice house) PP

(in the nice house at the mountain)

S

((I) have seen (it)) NP (I) S

(I have seen (it))

Sentence

Chart parser

chart (automaton)

.o.

SubSentenceRecognizers .o.

DisambiguatingFilters .o.

LongestPhraseFilters .o.

OutputCleaningFilters

verb + elements No Error / Error Type(s) .o.

ErrorPattern & NotException .o.

OutputCleaningFilters

0 5 10 15 20 25 30 35 40 %

agert u ( t o appear)

at era ( t o get out )

erabili ( t o use)

joan ( t o go)

ikusi ( t o s ee)

abl ala erg ine la soz t zeko

Subsentence recognition Syntactic elements

Correct Incorrect Correct Incorrect

82% 18% 70% 30%

eedig fhe ESSS deSei 1999 Aa