Adam Kaczmarek, University of Wroclaw / Wroclaw University of Technology

Michał Marcińczuk, Wroclaw University of Technology

In this paper we present a heuristic approach to zero subject detection in Polish texts. The algorithm draws on evidence from several sources, including morphological analysis, dependency and syntactic parsing, and a valence dictionary. A zero subject is the phenomenon of a verb referring to a discourse entity while its subject is omitted. Detecting zero subjects is an important preprocessing step for zero anaphora resolution. However, this problem has not yet been widely studied for Polish. The only known attempt was made by Kopeć [1], who presented a supervised machine learning approach based on the RIPPER algorithm with orthographic and morphological features. The importance of zero anaphora resolution is underlined by the fact that zero coreference is the second most frequent coreference relation type in the KPWr corpus [2], right after the coreference of proper names. We divided the task of zero subject detection into two major parts: determining whether a verb can be anaphoric and determining whether a verb has an explicit subject. For the purposes of zero subject detection we adopted definitions of verb and noun following the idea from [1]. Verbs can belong only to the following part-of-speech classes: fin, praet, winien, and bedzie. Words belonging to other classes that are usually assumed to describe verbs are discarded, as they cannot have a subject. Nouns are defined more broadly and also include numerals, gerunds and pronouns, because they often occur as subjects and share most grammatical properties with "standard" nouns. We also treat more complex constructions as subject candidates, such as the combination of an explicitly written number with a genitive noun as the subject of a neuter verb (e.g. "...206 posłów głosowało..."). We explored several lexical and grammatical properties of verbs and used information from the Polish Valence Dictionary [3] to determine whether a verb can be zero anaphoric.
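The part-of-speech restriction described above can be pictured as a simple filter. The following is a minimal sketch, not the authors' implementation: the `Token` type and the function names are hypothetical, the verb classes are the four named in the text, and the subject-candidate classes (`subst`, `ger`, `num`, `ppron12`, `ppron3`) are an assumption about how nouns, gerunds, numerals and pronouns map onto an NKJP-style tagset.

```python
from dataclasses import dataclass

# Verb classes that the text names as able to take a subject.
VERB_CLASSES = {"fin", "praet", "winien", "bedzie"}

# Assumed NKJP-style classes covering the wider "noun" definition
# (nouns, gerunds, numerals, pronouns). This mapping is an assumption.
SUBJECT_CLASSES = {"subst", "ger", "num", "ppron12", "ppron3"}


@dataclass
class Token:
    """Hypothetical token: surface form plus part-of-speech class."""
    orth: str
    pos: str


def is_candidate_verb(tok: Token) -> bool:
    """True if the token belongs to a class that may have a zero subject."""
    return tok.pos in VERB_CLASSES


def is_subject_candidate(tok: Token) -> bool:
    """True if the token belongs to a class treated as a possible subject."""
    return tok.pos in SUBJECT_CLASSES


sentence = [
    Token("posłów", "subst"),
    Token("głosowało", "praet"),
    Token("wczoraj", "adv"),
]
verbs = [t.orth for t in sentence if is_candidate_verb(t)]
subjects = [t.orth for t in sentence if is_subject_candidate(t)]
print(verbs)     # ['głosowało']
print(subjects)  # ['posłów']
```

Other verb-like classes, such as adverbial participles, fall through the filter and are never considered as zero subject candidates.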
Subject detection is the part where we combined results from the Polish Dependency Parser [4], ChunkRel [5] and a deterministic windowed subject search followed by verb-subject agreement checking. We also attempted to retag certain words originally tagged as unknown in the test corpus. The idea was to use the morphological guesser implemented in the WCRFT tagger [6] to fill in the missing morphological tags. This brought some improvement in subject detection and agreement determination, which had a positive effect on the final zero subject detection. The experiments were performed on the Polish Coreference Corpus [7] in order to provide a direct comparison with MentionDetector [1], which is a state-of-the-art method for zero subject detection for Polish. The corpus was divided into two parts (development and test). The development part was used to develop the algorithm and to improve the rules on the basis of error analysis. The test part was used to compare the performance of the algorithm with MentionDetector. Our algorithm outperformed MentionDetector in terms of F-score by more than 8 percentage points, with much higher recall (an increase of ca. 22 percentage points) and slightly lower precision (a decrease of ca. 5 percentage points). Recall is much more important than precision here, as zero subject detection is a preprocessing step for zero anaphora resolution and some false positives may be discarded in subsequent processing steps. We also report results obtained on the KPWr corpus. The resulting algorithm was implemented as a module of the Liner2 framework [8] called Minos.
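The deterministic windowed subject search with verb-subject agreement checking mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the Minos implementation: the `Token` fields, the window size of three tokens, and the agreement test (nominative case plus matching number and, where the verb carries one, gender) are hypothetical simplifications, and the evidence from the dependency parser [4] and ChunkRel [5] is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

VERB_CLASSES = {"fin", "praet", "winien", "bedzie"}
# Assumed NKJP-style classes for subject candidates.
SUBJECT_CLASSES = {"subst", "ger", "num", "ppron12", "ppron3"}


@dataclass
class Token:
    """Hypothetical token with a few morphological attributes."""
    orth: str
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None


def agrees(verb: Token, noun: Token) -> bool:
    """Simplified agreement: nominative noun, same number, compatible gender."""
    if noun.case != "nom":
        return False
    if verb.number != noun.number:
        return False
    # Some verb forms carry no gender; compare only when the verb has one.
    return verb.gender is None or verb.gender == noun.gender


def find_subject(tokens: List[Token], verb_idx: int, window: int = 3) -> Optional[Token]:
    """Return the first agreeing subject candidate within the window, if any."""
    verb = tokens[verb_idx]
    lo = max(0, verb_idx - window)
    hi = min(len(tokens), verb_idx + window + 1)
    for i in range(lo, hi):
        tok = tokens[i]
        if i != verb_idx and tok.pos in SUBJECT_CLASSES and agrees(verb, tok):
            return tok
    return None


def zero_subject_verbs(tokens: List[Token], window: int = 3) -> List[Token]:
    """Verbs of an allowed class for which no explicit subject was found."""
    return [
        tok for i, tok in enumerate(tokens)
        if tok.pos in VERB_CLASSES and find_subject(tokens, i, window) is None
    ]


# "Jan wrócił." has an explicit subject; "Kupił chleb." does not.
s1 = [Token("Jan", "subst", "nom", "sg", "m1"),
      Token("wrócił", "praet", None, "sg", "m1")]
s2 = [Token("Kupił", "praet", None, "sg", "m1"),
      Token("chleb", "subst", "acc", "sg", "m3")]
print([t.orth for t in zero_subject_verbs(s1)])  # []
print([t.orth for t in zero_subject_verbs(s2)])  # ['Kupił']
```

A search of this kind favours recall: any verb without an agreeing nominative candidate nearby is reported, which matches the stated preference for leaving false positives to later processing steps.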

References

[1] Kopeć, M.: Zero subject detection for Polish. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, Gothenburg, Sweden, Association for Computational Linguistics (2014) 221–225

[2] Broda, B., Marcińczuk, M., Maziarz, M., Radziszewski, A., Wardyński, A.: KPWr: Towards a free corpus of Polish. In Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., eds.: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, European Language Resources Association (ELRA) (May 2012)

[3] Przepiórkowski, A., Hajnicz, E., Patejuk, A., Woliński, M., Skwarski, F., Świdziński, M.: Walenty: Towards a comprehensive valence dictionary of Polish. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), European Language Resources Association (ELRA) (2014)

[4] Wróblewska, A.: Polish Dependency Bank. Linguistic Issues in Language Technology 7(1) (2012)

[5] Radziszewski, A., Orłowicz, P., Broda, B.: Classification of predicate-argument relations in Polish data. In Mieczysław A. Kłopotek, Jacek Koronacki, M.M.A.M.S.W., ed.: Language Processing and Intelligent Information Systems – 20th International Conference, IIS 2013, Warsaw, Poland, June 17-18, 2013. Proceedings. Volume 7912 of Lecture Notes in Computer Science. (2013)

[6] Radziszewski, A.: A tiered CRF tagger for Polish. In Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M., eds.: Intelligent Tools for Building a Scientific Information Platform. Volume 467 of Studies in Computational Intelligence. Springer Berlin Heidelberg (2013) 215–230

[7] Ogrodniczuk, M., Głowińska, K., Kopeć, M., Savary, A., Zawisławska, M.: Polish Coreference Corpus. In Vetulani, Z., ed.: Proceedings of the 6th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznań, Poland, Wydawnictwo Poznańskie, Fundacja Uniwersytetu im. Adama Mickiewicza (2013) 494–498

[8] Marcińczuk, M., Kocoń, J., Janicki, M.: Liner2 – a customizable framework for proper names recognition for Polish. In Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M., eds.: Intelligent Tools for Building a Scientific Information Platform. Volume 467 of Studies in Computational Intelligence. Springer Berlin Heidelberg (2013) 231–253

Corpus-based Analysis of Czech Units Expressing Men-
