The general idea of the parser is to represent each arc in a fully connected graph as a binary variable of an integer linear program, each of which is associated with a score from a statistical model. A set of constraints imposes the formal properties of a dependency tree, so that finding the highest-scoring combination of arcs under these constraints yields the best-scoring dependency tree. Typologically, this parser belongs to the graph-based paradigm, as it performs global optimization to find the best spanning tree over a given set of tokens.

6.1 Parsing with Morphosyntactic Constraints

Formally, let $T$ be the set of tokens in a given sentence and $T_0 = T \cup \{\mathrm{ROOT}\}$ be the set of tokens including a special root token. Furthermore, let $L$ be the set of dependency relations or dependency labels. $A$ and $U$ then define two index sets, one for labeled and one for unlabeled arcs, respectively.

$$A := \{ \langle h, d, l \rangle \mid h \in T_0,\ d \in T,\ l \in L,\ h \neq d \} \tag{6.3}$$

$$U := \{ \langle h, d \rangle \mid h \in T_0,\ d \in T,\ h \neq d \} \tag{6.4}$$
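As a minimal sketch of these definitions, the two index sets can be enumerated directly. The function name `build_index_sets`, the use of integer positions for tokens (with 0 standing for ROOT), and the label inventory are illustrative assumptions and not part of the parser described here.

```python
from itertools import product

ROOT = 0  # assumption: position 0 stands for the artificial root token


def build_index_sets(n_tokens, labels):
    """Enumerate the labeled (A) and unlabeled (U) arc index sets."""
    tokens = range(1, n_tokens + 1)   # T
    heads = range(0, n_tokens + 1)    # T_0 = T plus ROOT
    # Unlabeled arcs: every head/dependent pair with h != d; ROOT is never a dependent.
    U = [(h, d) for h, d in product(heads, tokens) if h != d]
    # Labeled arcs: one copy of each unlabeled arc per label.
    A = [(h, d, l) for (h, d) in U for l in labels]
    return A, U


A, U = build_index_sets(3, ["subj", "obj"])   # hypothetical label set
print(len(U), len(A))   # |U| = |T|^2 and |A| = |T|^2 * |L|
```

Because ROOT only ever appears in the head position, no arc in either set points into the root.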

A dependency tree is an indicator vector of binary variables,

$$y := \langle y_a \rangle_{a \in A} \tag{6.5}$$

where $y_a = 1$ means that arc $a$ is in the parse, otherwise $y_a = 0$. We define $Y$ to be the set of all well-formed dependency trees (projective and non-projective).

The basic parser assumes an arc-factored model (McDonald et al. 2005), in which the score of a dependency tree is defined as the sum of the scores of the individual arcs in that tree. The objective function of the integer linear program is thus to find the combination of arcs that has the highest sum of arc scores

$$\hat{y} = \arg\max_{y \in Y} \sum_{a \in A} y_a \left( w \cdot \phi_{\mathrm{ARC}}(a) \right) \tag{6.6}$$

with $w$ being the weight vector and $\phi_{\mathrm{ARC}}$ being a function that represents arcs as feature vectors. Solving the integer linear program finds the $\hat{y}$ that maximizes the objective function given a set of additional constraints.

In order to formulate the set of constraints that force the variables in $\hat{y}$ to form a dependency tree, we need some auxiliary definitions. First, we define the set of potential incoming and outgoing arcs for each token (one version each for labeled and unlabeled arcs):

$$A^{\mathrm{in}}_h := \{ \langle g, h, l \rangle \mid g \in T_0,\ l \in L,\ g \neq h,\ \langle g, h, l \rangle \in A \} \tag{6.7}$$

$$A^{\mathrm{out}}_h := \{ \langle h, d, l \rangle \mid d \in T,\ l \in L,\ h \neq d,\ \langle h, d, l \rangle \in A \} \tag{6.8}$$

$$U^{\mathrm{in}}_h := \{ \langle g, h \rangle \mid g \in T_0,\ g \neq h,\ \langle g, h \rangle \in U \} \tag{6.9}$$

$$U^{\mathrm{out}}_h := \{ \langle h, d \rangle \mid d \in T,\ h \neq d,\ \langle h, d \rangle \in U \} \tag{6.10}$$

It holds that $A^{\mathrm{in}}_h \subset A$, $A^{\mathrm{out}}_h \subset A$, $U^{\mathrm{in}}_h \subset U$, and $U^{\mathrm{out}}_h \subset U$.

With these subsets, we can define the first property of a dependency tree, namely that each token has exactly one head

$$\sum_{a \in A^{\mathrm{in}}_t} y_a = 1 \quad \text{for all } t \in T \tag{6.11}$$

Note that this constraint is defined as in Martins et al. (2009) but it ranges over labeled arcs. There is no need to change anything because also for labeled arcs, only one of them can be active at any given time. Furthermore, Martins et al. (2009) have an additional constraint that states that the root node does not have a head. Here, this constraint is not necessary because the index set of arcs does not contain incoming arcs to the root (Equation (6.3)).

For the second property, acyclicity, Martins et al. (2009) employ a single-commodity flow formulation (Magnanti and Wolsey 1995). The idea is to enforce acyclicity by enforcing connectedness with the root for each token. The root sends units of flow along the arcs of the dependency tree and each token consumes one unit of this flow. A token can only consume its unit of flow if there is a path from the root to this token in the tree. Since each token can only have a single head (Equation (6.11)), only acyclic trees can fulfill this condition for each token simultaneously. This is because any cycle necessarily disconnects the nodes in the cycle from the rest of the tree and then there will be no path from the root node to the nodes in the cycle.

To model the flow, we need an additional set of variables. The flow variables are represented as a vector of integer variables, one for each unlabeled arc in the graph. The flow variables are not restricted to binary values because their value represents the flow on the respective arc:

$$f := \langle f_u \rangle_{u \in U} \tag{6.12}$$

The flow variables are connected to the labeled arcs in the graph via an inequality. A flow variable can carry flow only if one of the labeled arcs between the respective head and dependent is active:

$$|T| \sum_{l \in L} y_{\langle h, d, l \rangle} \geq f_{\langle h, d \rangle} \quad \text{for all } h \in T_0,\ d \in T \tag{6.13}$$

Note that because of the single-head constraint, the value of the sum on the left-hand side can be either 1 or 0.

The root node is treated differently from the other nodes. Acyclicity is modeled by ensuring that there is a path from the root node to every other node in the tree. Since every node is supposed to consume one unit of flow, the flow on the outgoing arcs of the root node is set to the number of other nodes (i.e., the number of tokens in the sentence), which means it sends $|T|$ units:

$$\sum_{u \in U^{\mathrm{out}}_{\mathrm{ROOT}}} f_u = |T| \tag{6.14}$$

Every other token consumes one unit of flow, i.e., the difference between the incoming flow and the outgoing flow is one.

$$\sum_{u \in U^{\mathrm{in}}_t} f_u - \sum_{u \in U^{\mathrm{out}}_t} f_u = 1 \quad \text{for all } t \in T \tag{6.15}$$

[Figure 6.2: Schema of how flow constraints prevent cycles. (a) In well-formed dependency trees, flow flows from the root to each of the nodes; the arcs over ROOT, A, B, C, D, E carry the flow values 3, 2, 1, 1, 1. (b) Cycles disconnect the structure such that flow cannot reach all nodes; the arcs carry the flow values 5, 1+?, 1+?, ?, ?.]

Figure 6.2 illustrates the idea of Martins et al. (2009) of using single-commodity flow to enforce acyclicity. Two structures are shown, one that fulfills the constraints (Figure 6.2a) and one that does not (Figure 6.2b). In Figure 6.2a, the root node sends 5 units of flow (distributed over the outgoing arcs as 3+2), one for each other node in the tree. This is the content of Equation (6.14). Every other node consumes one unit of flow such that the difference between the flow on the incoming arcs and the flow on the outgoing arcs is 1. This is the content of Equation (6.15). Figure 6.2b shows a structure that violates Equation (6.15) in several places. The cycle that is formed by nodes B and E makes it impossible to fulfill the constraint simultaneously for both nodes because the incoming flow of B should have a value that is one larger than the outgoing flow, but the same is supposed to hold for E. Furthermore, the root node sends 5 units of flow to node A, which distributes them to its dependents. However, since nodes C and D have no outgoing arcs, the difference between the incoming and the outgoing flow is going to be bigger than one. We see then that cyclic structures cannot fulfill the flow constraints and are therefore excluded as valid output structures of the parser.[2]
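The argument can be checked mechanically on a small example. The following sketch searches by brute force for a flow assignment that satisfies Equations (6.14) and (6.15) on the active arcs of a given structure. The two head assignments are assumptions: `tree` is one structure consistent with the flow values shown in Figure 6.2a, and `cyclic` follows the description of Figure 6.2b. Restricting the search to flow values between 0 and $|T|$ is also an assumption made to keep the enumeration finite; a real solver would treat the flows as variables rather than enumerate them.

```python
from itertools import product

ROOT = "ROOT"


def feasible_flow(heads):
    """Return a flow on the active arcs satisfying (6.14) and (6.15), or None.

    `heads` maps each token to its single head, so the single-head
    constraint (6.11) holds by construction.
    """
    tokens = list(heads)
    arcs = [(h, d) for d, h in heads.items()]   # active arcs only
    n = len(tokens)
    for values in product(range(n + 1), repeat=len(arcs)):
        f = dict(zip(arcs, values))
        # (6.14): the root sends out |T| units of flow.
        if sum(v for (h, _), v in f.items() if h == ROOT) != n:
            continue
        # (6.15): every token consumes exactly one unit.
        if all(
            sum(v for (_, d), v in f.items() if d == t)
            - sum(v for (h, _), v in f.items() if h == t) == 1
            for t in tokens
        ):
            return f
    return None


# Hypothetical structures modeled on Figure 6.2 (token -> head).
tree = {"A": ROOT, "B": ROOT, "C": "A", "D": "A", "E": "B"}
cyclic = {"A": ROOT, "B": "E", "C": "A", "D": "A", "E": "B"}

print(feasible_flow(tree))     # flow on each arc equals the size of the subtree below it
print(feasible_flow(cyclic))   # None: no assignment satisfies both constraints
```

For the tree, the only feasible flow puts 3 and 2 units on the two root arcs and 1 unit on each remaining arc. For the cyclic structure, the search finds nothing, in line with the reasoning above.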

Equations (6.6), (6.11) and (6.13) to (6.15) plus the domain restrictions

$$y \in \mathbb{B}^{d}, \quad f \in \mathbb{Z}^{d} \tag{6.16}$$

form an integer linear program and represent a first-order graph-based dependency parser. Finding the best solution of this integer linear program solves the same task as running the Chu-Liu-Edmonds algorithm (Chu and Liu 1965, Edmonds 1967) as proposed in McDonald et al. (2005). It can be considerably slower than Chu-Liu-Edmonds since solving an integer linear program is of exponential complexity in the worst case. However, unlike the Chu-Liu-Edmonds algorithm, it allows us to add further conditions to the constraint set, for example to model second-order features or morphosyntactic constraints (see Section 6.1.3).
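To make the search problem concrete, the following sketch finds the highest-scoring tree for a three-token sentence by exhaustive enumeration. It is a stand-in for the constraint solver, not the method used by the parser: it ignores labels, checks well-formedness by following head links to the root instead of using flow variables, and uses made-up arc scores. The names `is_tree`, `best_tree`, and `score` are illustrative.

```python
from itertools import product

ROOT = 0  # assumption: position 0 stands for the artificial root token


def is_tree(heads):
    """True if every token reaches ROOT by following head links (no cycles)."""
    for start in heads:
        seen = set()
        node = start
        while node != ROOT:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def best_tree(n, score):
    """Exhaustively search for the highest-scoring tree under an arc-factored score."""
    tokens = range(1, n + 1)
    # Each token picks exactly one head other than itself (single-head constraint).
    choices = [[h for h in range(n + 1) if h != d] for d in tokens]
    best, best_score = None, float("-inf")
    for combo in product(*choices):
        heads = dict(zip(tokens, combo))
        if not is_tree(heads):
            continue
        total = sum(score[(h, d)] for d, h in heads.items())
        if total > best_score:
            best, best_score = heads, total
    return best, best_score


# Hypothetical arc scores, keyed by (head, dependent).
score = {
    (0, 1): 1, (0, 2): 5, (0, 3): 1,
    (1, 2): 6, (1, 3): 2,
    (2, 1): 6, (2, 3): 3,
    (3, 1): 0, (3, 2): 0,
}
print(best_tree(3, score))
```

With these scores, picking the best head for each token independently would attach tokens 1 and 2 to each other and form a cycle. The constrained search instead attaches token 2 to the root and tokens 1 and 3 to token 2, for a total score of 14.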

Second-order Parsing

Martins et al. (2009) add several additional constraints to the basic formulation to facilitate second-order features. In our parser, we adopt two of them, namely what they call all siblings and all grandchildren. The idea in both cases is to introduce an auxiliary variable for each pair of arcs, e.g., two arcs with the same head (siblings). This auxiliary variable is coupled with the respective arc variables and is only active if both of the arcs of the pair are active. If it is active, it contributes its weight to the total score of the tree.

[2] It should be noted that the flow is used as a metaphor in this formulation. The constraint solver searches for a structure that fulfills all constraints simultaneously and there is no distribution of flow values in cyclic trees that would do so. But there is not actually anything that flows.

We define index sets for sibling and grandchildren pairs of arcs (S and G, respectively):

$$S := \{ \langle h, d, s \rangle \mid h \in T_0,\ d \in T,\ s \in T,\ h \neq d,\ h \neq s,\ d \neq s \} \tag{6.17}$$

$$G := \{ \langle g, h, d \rangle \mid g \in T_0,\ h \in T,\ d \in T,\ g \neq h,\ h \neq d,\ g \neq d \} \tag{6.18}$$

Note that we do not include labels in second-order features.

In the following, we demonstrate the all siblings formulation. The all grandchildren formulation works analogously. First, we define the binary variables, one for each sibling factor:

$$s := \langle s_a \rangle_{a \in S} \tag{6.19}$$

A set of three constraints for each sibling factor couples the variable with the two arcs in the dependency tree:

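$$s_{\langle h, d, s \rangle} \leq \sum_{l \in L} y_{\langle h, d, l \rangle} \tag{6.20}$$

$$s_{\langle h, d, s \rangle} \leq \sum_{l \in L} y_{\langle h, s, l \rangle} \tag{6.21}$$

$$s_{\langle h, d, s \rangle} \geq \sum_{l \in L} y_{\langle h, d, l \rangle} + \sum_{l \in L} y_{\langle h, s, l \rangle} - 1 \tag{6.22}$$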

Equations (6.20) and (6.21) ensure that the sibling factor cannot be active if one of the arc variables is not active. Equation (6.22) states that the sibling factor must be active if both of the arc variables are active. The equations sum over the labels for each arc because the actual arc label is unknown at this point. The single-head constraint (Equation (6.11)) guarantees that at most one labeled arc between a given head and dependent will be active, i.e., the sums over labels in these equations will always evaluate to either 0 or 1.
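The effect of the three constraints can be verified with a small truth table. In the sketch below, `y_hd` and `y_hs` stand for the two label sums, which take the value 0 or 1, and the function lists every binary value of the sibling variable that the constraints admit. The function name is an illustrative assumption.

```python
from itertools import product


def allowed_sibling_values(y_hd, y_hs):
    """Binary values of the sibling variable admitted by (6.20)-(6.22)."""
    return [
        s for s in (0, 1)
        if s <= y_hd                 # (6.20)
        and s <= y_hs                # (6.21)
        and s >= y_hd + y_hs - 1     # (6.22)
    ]


table = {
    (y_hd, y_hs): allowed_sibling_values(y_hd, y_hs)
    for y_hd, y_hs in product((0, 1), repeat=2)
}
for arcs, values in table.items():
    print(arcs, values)
```

In every row exactly one value remains, and it equals the logical AND of the two arc indicators. The three inequalities are thus a linear encoding of a conjunction.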

The objective function changes accordingly to also factor over siblings

$$\hat{y} = \arg\max_{y \in Y} \sum_{a \in A} y_a \left( w \cdot \phi_{\mathrm{ARC}}(a) \right) + \sum_{a \in S} s_a \left( w_s \cdot \phi_{\mathrm{SIB}}(a) \right) \tag{6.23}$$

with $w_s$ representing the weights for siblings and $\phi_{\mathrm{SIB}}$ being the feature function for sibling factors. As before, the constraint solver searches for the solution with the highest overall score, which now also includes scores from sibling factors.

Inference, Learning, and Feature Model

Inference is performed using a general-purpose constraint solver for linear programs, in our case the GUROBI constraint solver. Since solving integer linear programs takes exponential time in the worst case, we follow Martins et al. (2009) and solve the relaxation of the problem, which means dropping the integrality constraint on the variables. If the solver outputs an integer solution, as it often does, then this solution is guaranteed to be optimal. If the solution is not integer, we use the first-order formulation as defined above to project the fractional solution to integer space, using the fractional values of the variables as arc weights. To further reduce the complexity of the problem, the graphs are pruned before parsing by choosing for each token the ten most probable heads using a linear classifier that is not restricted by structural requirements. We also use an arc filter that blocks arcs that do not occur in the training data, based on the label of the arc and the part-of-speech tags of the head and the dependent.
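The head-pruning step can be sketched as a top-k selection per token. The input format (a dictionary of classifier scores per dependent), the function name `prune_heads`, and the example scores are assumptions for illustration; the classifier itself is not shown.

```python
import heapq


def prune_heads(head_scores, k=10):
    """Keep the k highest-scoring candidate heads for each dependent.

    head_scores: {dependent: {head: score}}, e.g. from a linear classifier.
    """
    return {
        d: set(heapq.nlargest(k, candidates, key=candidates.get))
        for d, candidates in head_scores.items()
    }


# Hypothetical scores for two tokens; position 0 stands for ROOT.
head_scores = {
    1: {0: 0.1, 2: 0.7, 3: 0.2},
    2: {0: 0.9, 1: 0.08, 3: 0.02},
}
print(prune_heads(head_scores, k=2))
```

Only arcs whose head survives this selection need a variable in the linear program, which shrinks the index sets from quadratic size in the sentence length to at most k arcs per token.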

The weight vector is trained with loss-augmented passive-aggressive online learning (Crammer et al. 2003) and averaged afterwards (Freund and Schapire 1999). The feature model was modeled after mate parser’s but is not identical. The full set is described in Appendix A.
