The individual studies themselves are structured as follows:

1) Paper I is the first part of a study exploring the relation between abstract syntax trees (ASTs) and dependency structures in the Universal Dependencies (UD) scheme. It studies the connection between the two in detail and proposes a generalized transducer that translates ASTs to dependency trees.

2) Paper II is the second part of the study, which addresses the inverse problem: translating dependency trees to abstract syntax trees. The algorithm presented here is a restricted version of the generalized method discussed in Section 1.4. It works on a large fragment of the structures found in the UD treebanks.

3) Papers III and IV both cover the topic of dependency parsing, with Universal Dependencies as the target scheme.

4) While Paper III focuses on the task of delexicalized parsing using synthetic UD treebanks, Paper IV addresses one specific shortcoming of dependency parsers: parsing sentences with out-of-vocabulary words.

Paper IV is a stand-alone study on symbolic dependency parsers and ways to improve them, but it also explores initial steps towards improving the robustness of dependency parsers and of parsers in general. The work sets the tone for Paper III, which shares the objective of improving the robustness of parsers, albeit for GF grammars and using a completely different approach. The rest of this chapter summarizes the individual contributions of each of these studies.

1.7.1 Paper I: gf2ud

From Abstract Syntax to Universal Dependencies

This paper presents a conversion method from abstract syntax trees to dependency trees. This is done in two steps: first, a general algorithm (ast2dep) is proposed that builds dependency trees for a given interlingual grammar in GF; second, this algorithm is applied to convert GF-RGL trees to Universal Dependencies (UD). One of the aims of the study is to describe precisely the correspondence between the two multilingual abstractions, namely GF-RGL grammars and UD. The correspondence between GF-RGL and UD turns out to be good, and the relatively few discrepancies give rise to interesting questions about universality.
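The idea of building a dependency tree from an AST can be illustrated with a minimal sketch. It assumes that each grammar function comes with a dependency configuration that labels exactly one argument as the head and every other argument with a dependency relation, and that the leaves of the AST appear in surface word order. The `AST` class, the `CONFIGS` table, and the function and label choices are hypothetical illustrations, not the configurations or the implementation used in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AST:
    """An abstract syntax tree node; a node without arguments is a lexical leaf."""
    fun: str                                   # function name, or the word at a leaf
    args: List["AST"] = field(default_factory=list)


# Hypothetical dependency configurations: one label per argument of a function,
# with exactly one argument marked "head". Illustrative only.
CONFIGS: Dict[str, List[str]] = {
    "PredVP": ["nsubj", "head"],
    "ComplV2": ["head", "obj"],
    "DetCN": ["det", "head"],
}

Edge = Tuple[int, str, int]  # (head position, label, dependent position)


def ast2dep(tree: AST, configs: Dict[str, List[str]]) -> Tuple[List[str], int, List[Edge]]:
    """Return the words, the position of the root word, and the labelled edges."""
    words: List[str] = []
    edges: List[Edge] = []

    def visit(node: AST) -> int:
        # A leaf contributes one word and is its own head.
        if not node.args:
            words.append(node.fun)
            return len(words) - 1
        labels = configs[node.fun]
        # Head word of each argument subtree, visited left to right.
        tops = [visit(arg) for arg in node.args]
        head = tops[labels.index("head")]
        # Every non-head argument attaches to the head word with its label.
        for label, top in zip(labels, tops):
            if label != "head":
                edges.append((head, label, top))
        return head

    root = visit(tree)
    return words, root, edges


tree = AST("PredVP", [
    AST("DetCN", [AST("the"), AST("cat")]),
    AST("ComplV2", [AST("sees"), AST("DetCN", [AST("a"), AST("dog")])]),
])
words, root, edges = ast2dep(tree, CONFIGS)
for head, label, dep in edges:
    print(f"{words[head]} --{label}--> {words[dep]}")
```

In this sketch the head of a subtree is found by following the head-labelled arguments down to a leaf, so the verb "sees" ends up as the root of the example sentence.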

The conversion also has applications: (a) it bootstraps parallel UD treebanks from GF treebanks; (b) it defines a formal encoding of the UD annotation guidelines in terms of functions in the GF-RGL grammar; (c) it makes information from UD treebanks available for the construction of ASTs; and (d) it gives a method to check the consistency of manually annotated UD trees with respect to the annotation scheme.

The conversion is tested and evaluated by bootstrapping two small treebanks for 31 languages, and by comparing a GF version of the English Penn treebank with the UD version. In the first case, the bootstrapped treebanks are evaluated in terms of the percentage of labelled edges; in the second, the Penn treebank is compared against a UDv1 treebank obtained by running the Stanford dependency converter on the Penn treebank. Furthermore, the work in this paper serves as a precursor to the work in Papers II and III, and leaves some unexplored directions for future research.

1.7.2 Paper II: ud2gf

From Universal Dependencies to Abstract Syntax

This study is a continuation of Paper I, which describes the relation between ASTs and dependency trees. This paper attempts to invert the mapping: it takes dependency trees from standard UD treebanks and reconstructs ASTs from them. The primary aim of this method is to help GF-based interlingual translation by providing a robust, efficient front end as a substitute for the exact parsing used in GF. However, since UD trees are based on natural (as opposed to generated) data and are built manually or by machine learning (as opposed to rules), the conversion is not trivial.

As mentioned above, this work uses both insights and artifacts (mainly the dependency configurations) from Paper I as a starting point. The study provides a stand-alone description of a basic algorithm, dep2ast, that essentially inverts the conversion. This method covers around 70% of nodes, and the rest can be covered by approximative backup strategies. Analyzing the reasons for the incompleteness reveals structures missing in GF grammars, but also some problems in UD treebanks.

Extensions to the core algorithm and improvements of the results presented in this study have already been described in Section 1.4.2.

1.7.3 Paper III: Bootstrapping UD treebanks

Bootstrapping UD treebanks for Delexicalized Parsing

Standard approaches to treebanking traditionally employ a waterfall model (Sommerville, 2010), where annotation guidelines guide the annotation process and insights from the annotation process in turn lead to subsequent changes in the annotation guidelines. This process remains a very expensive step in creating linguistic resources for a target language: it requires both linguistic expertise and manual effort to develop the annotations, and it is subject to inconsistencies in the annotation due to human error. In this paper, we propose an alternative approach to treebanking, one that requires writing grammars. This approach is motivated specifically in the context of Universal Dependencies, an effort to develop uniform and cross-lingually consistent treebanks across multiple languages.

We show here that a bootstrapping approach to treebanking via interlingual grammars is plausible and useful in a process where grammar engineering and treebanking are jointly pursued when creating resources for the target language. We demonstrate the usefulness of synthetic treebanks in the task of delexicalized parsing. Our experiments reveal that simple models for treebank generation are cheaper than human-annotated treebanks, especially at the lower ends of the learning curves for delexicalized parsing, which is particularly relevant in the context of low-resource languages.
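As a point of reference for what delexicalization involves, the following is a minimal sketch that strips lexical information from a treebank in the standard 10-column CoNLL-U format, keeping part-of-speech tags, features, and dependency annotation. The function name `delexicalize` is a hypothetical helper introduced here for illustration, and replacing FORM and LEMMA with underscores is one common convention, not necessarily the preprocessing used in the paper.

```python
def delexicalize(conllu: str) -> str:
    """Blank out the FORM and LEMMA columns of every token line in a CoNLL-U string."""
    out = []
    for line in conllu.splitlines():
        # Comment lines (starting with '#') and blank sentence separators pass through unchanged.
        if line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10:
                cols[1] = "_"  # FORM
                cols[2] = "_"  # LEMMA
                line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


sample = (
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_"
)
print(delexicalize(sample))
```

A parser trained on such input can rely only on tags and structure, which is what makes synthetic treebanks with a limited lexicon usable as training data in this setting.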

1.7.4 Paper IV: OOV words in Dependency parsing

Replacing OOV Words For Dependency Parsing with Distributional Semantics

Lexical information is an important feature in dependency parsers, both in the case of stand-alone parsing given a tagged sequence and in pipeline systems where other components like part-of-speech (POS) taggers rely on forms of the lexical items. However, there is no such information available for out-of-vocabulary (OOV) words, which causes many classification errors. In this study, we propose a method to address this shortcoming: replacing OOV words with known, in-vocabulary words that are similar according to different notions of similarity. Specifically, we study in detail two such notions: semantic and morphological similarity. The replacement candidates are obtained from distributionally similar words computed from a large background corpus, as well as from words that are morphologically similar according to common suffixes.
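The two replacement heuristics can be sketched as follows. This is an illustration under stated assumptions, not the method as implemented in the paper: `neighbours` stands for a hypothetical precomputed table of distributionally similar words ranked by similarity, morphological similarity is approximated by the longest shared suffix, and ties are broken alphabetically. All function names are introduced here for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set


def common_suffix_length(a: str, b: str) -> int:
    """Number of characters shared at the end of both words."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def semantic_replacement(word: str, vocab: Set[str],
                         neighbours: Dict[str, List[str]]) -> Optional[str]:
    """First in-vocabulary word in the (assumed, precomputed) ranked neighbour list."""
    return next((w for w in neighbours.get(word, []) if w in vocab), None)


def suffix_replacement(word: str, vocab: Set[str]) -> Optional[str]:
    """In-vocabulary word sharing the longest suffix with `word`, if any suffix is shared."""
    if not vocab:
        return None
    # sorted() makes tie-breaking deterministic (alphabetical).
    best = max(sorted(vocab), key=lambda w: common_suffix_length(word, w))
    return best if common_suffix_length(word, best) > 0 else None


def replace_oov(tokens: Iterable[str], vocab: Set[str],
                strategy: Callable[[str], Optional[str]]) -> List[str]:
    """Replace each OOV token using `strategy`; keep the token if no candidate is found."""
    return [t if t in vocab else (strategy(t) or t) for t in tokens]


vocab = {"the", "cat", "is", "walking"}
neighbours = {"kitten": ["puppy", "cat"]}  # hypothetical similarity table

print(replace_oov(["the", "kitten", "is", "jogging"], vocab,
                  lambda w: semantic_replacement(w, vocab, neighbours)))
print(replace_oov(["the", "kitten", "is", "jogging"], vocab,
                  lambda w: suffix_replacement(w, vocab)))
```

The parser then sees only in-vocabulary forms, while the original tokens can be restored in the output tree because the replacement is one-to-one per position.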

Extensive experiments are carried out to cover different design parameters: both transition-based and graph-based symbolic dependency parsers; count-based and dense neural vector-based semantic models for distributional similarity; and a set of typologically diverse languages for each of the two similarity heuristics. We show performance differences for both count-based and dense neural vector-based semantic models using the proposed technique. Further, we discuss the interplay of POS and lexical information for dependency parsing and provide a detailed analysis and discussion of the results: while we observe significant improvements for count-based methods, neural vectors do not increase the overall accuracy.