In the previous section, we assumed that the model parameters were given. In this section, we present the online learning algorithm for SENTI-LSSVMRAE. In order
to understand the algorithm, we need to view the model from a different angle. If the latent structural SVM is treated as the top-level classifier and the RAE is considered as the underlying hidden layers, SENTI-LSSVMRAE is a neural network.
Thus the discriminant function (5.3) is a composite function of the following form:

$$f(\cdot) = f_K(f_{K-1}(\ldots f_1(\cdot) \ldots)) \tag{5.10}$$

where $f_K(\cdot)$ corresponds to the latent structural SVM and the functions for the lower layers indicate the composition of children vectors for each parent vertex in a representation tree. As the number of compositions is proportional to the number of words in a sentence, the neural network has a deep architecture, in which all hidden layers share the same parameters.
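To make the composite view concrete, the following minimal Python sketch composes child vectors bottom-up over a binary tree with a single shared weight matrix and then applies a top-level scorer. The function names (`compose`, `encode`, `score`), the tanh nonlinearity, the toy dimensions and the toy word vectors are assumptions made for illustration; the top layer is a plain linear scorer standing in for the latent structural SVM.

```python
import math

def compose(W, b, left, right):
    """Parent vector p = tanh(W [left; right] + b).
    W is d x 2d and is shared by every vertex (tanh is an assumption)."""
    children = left + right  # list concatenation, i.e. [left; right]
    return [math.tanh(sum(w * c for w, c in zip(row, children)) + bias)
            for row, bias in zip(W, b)]

def encode(tree, W, b, word_vectors):
    """Encode a binary tree given either as a word (str) or as a pair (left, right)."""
    if isinstance(tree, str):
        return word_vectors[tree]
    left, right = tree
    return compose(W, b,
                   encode(left, W, b, word_vectors),
                   encode(right, W, b, word_vectors))

def score(beta, vec):
    """Top-level linear scorer, a stand-in for f_K in Eq. (5.10)."""
    return sum(bk * vk for bk, vk in zip(beta, vec))

# Toy example with d = 2 (all values are hypothetical)
W = [[0.5, -0.2, 0.1, 0.3],
     [0.0, 0.4, -0.3, 0.2]]
b = [0.0, 0.1]
word_vectors = {"not": [0.1, -0.4], "very": [0.3, 0.2], "good": [0.6, 0.5]}
tree = ("not", ("very", "good"))

root = encode(tree, W, b, word_vectors)
print(score([1.0, -1.0], root))
```

Each additional word adds one more application of `compose`, which is why the depth of the network grows with sentence length while the parameter count stays fixed.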
As SENTI-LSSVMRAE can be viewed as a deep neural network, we can apply the corresponding deep learning algorithms to learn the model (cf. Section 2.3.2). Given a set of training sentences $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we alternate between the unsupervised learning step and the supervised fine-tuning step to learn the model parameters $\theta = (\beta, \theta_{rae}, V)$, where $\beta$ is the weight vector in Eq. (5.3), $\theta_{rae} = (W, b, \tilde{b})$ denotes the parameters of the RAE and $V$ is the set of latent feature vectors for all words. For a training sentence, the unsupervised learning step is carried out by minimizing the reconstruction errors (5.8) after creating each new vertex in the representation tree. To gain robustness against noise, we proactively corrupt the distributed representation of each leaf by randomly omitting 20% of its units, which can be regarded as applying the denoising autoencoder (Vincent et al., 2008). In the supervised fine-tuning step, we aim to solve the following optimization problem:
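$$\theta^{*} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \Big[ \max_{(\hat{y},\hat{h}) \in Y(x) \times H(x)} \big(\beta^{\top}\Phi(x,\hat{y},\hat{h}) + \delta(\hat{y},\hat{h},y)\big) - \max_{\bar{h} \in H(x)} \beta^{\top}\Phi(x,y,\bar{h}) + \gamma_1 |\theta_{rae}| + \sum_{v \in V} \gamma_2 |v| + \gamma_3 |\beta| \Big] \tag{5.11}$$

The loss term compares an EMRG $(\bar{h}, y)$ with gold standard edge labels $y$ and an EMRG $(\hat{h}, \hat{y})$ with inferred labeled edges $\hat{y}$ and textual evidences $\hat{h}$. Since we expect a sparse parameterization of the RAE (cf. Section 5.4.2) and certain sparse non-latent features such as dependency paths, we apply L1 regularizers to $\theta_{rae}$, the word vectors and the weight vector $\beta$, where the degree of sparsity is controlled by the hyperparameters $\gamma_1$, $\gamma_2$ and $\gamma_3$ respectively.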
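Returning briefly to the unsupervised step described above, the corruption of a leaf representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "omitting" a unit is interpreted as setting it to zero, exactly a 20% fraction of the units is dropped rather than each unit independently, and the name `corrupt` and the toy vector are hypothetical.

```python
import random

def corrupt(vector, drop_rate=0.2, rng=random):
    """Return a copy of `vector` with round(drop_rate * d) randomly chosen units set to zero.
    Zeroing is an assumed reading of 'omitting' units."""
    d = len(vector)
    n_drop = round(drop_rate * d)
    dropped = set(rng.sample(range(d), n_drop))
    return [0.0 if i in dropped else v for i, v in enumerate(vector)]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
leaf = [0.5] * 10        # hypothetical 10-dimensional leaf representation
noisy = corrupt(leaf, 0.2, rng)
print(noisy)
```

The autoencoder is then trained to reconstruct the uncorrupted children from the parent computed on the corrupted input.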
Because both training criteria involve non-differentiable L1 norms, we apply the online forward-backward splitting (FOBOS) algorithm (Duchi and Singer, 2009a), which can be viewed as a combination of stochastic gradient descent (SGD) (Bottou, 2003) and the projected subgradient method (Calamai and Moré, 1987). In particular, the online FOBOS is applied in a mini-batch setting, in which it considers $k$ training instances at a time to compute gradients for updating the model parameters $\theta$. On iteration $t$, two steps are required to update the parameters:
$$\theta_{t+\frac{1}{2}} = \theta_t - \varepsilon_t \nabla_t \tag{5.12}$$

$$\theta_{t+1} = \arg\min_{\theta} \frac{1}{2} \|\theta - \theta_{t+\frac{1}{2}}\|^2 + \varepsilon_t \gamma |\theta| \tag{5.13}$$

where $\nabla_t$ is the gradient computed for $k$ instances without considering the L1 regularizers and $\varepsilon_t$ is the learning rate. Step (5.12) takes exactly the same form as the weight update formula of SGD, and $\nabla_t$ is computed by backpropagation (Rumelhart et al., 2002). Since the unsupervised and the supervised steps alternate, the gradients are computed based on both training criteria. The projection step (5.13) then starts from the intermediate iterate $\theta_{t+\frac{1}{2}}$, performs L1 regularization and finds a sparse solution of the model parameters. The two steps are repeated over the training data several times until the convergence condition is met.
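The two-step update can be sketched as follows. For an L1 regularizer, a proximal step of the form (5.13) has a well-known closed-form solution, coordinate-wise soft thresholding, which the sketch applies to the result of the gradient step (5.12), following the FOBOS formulation of Duchi and Singer. The function names, the use of a single regularization weight `gamma` for all parameters (instead of separate weights per parameter group) and the toy numbers are assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fobos_l1_step(theta, grad, lr, gamma):
    """One FOBOS iteration: gradient step (5.12) followed by the L1 proximal step (5.13)."""
    # Step (5.12): plain (mini-batch) gradient step without the regularizer
    half = [p - lr * g for p, g in zip(theta, grad)]
    # Step (5.13): closed-form solution is soft thresholding with threshold lr * gamma
    return [soft_threshold(h, lr * gamma) for h in half]

theta = [0.5, -0.03, 0.2]   # hypothetical current parameters
grad = [1.0, 0.1, -2.0]     # hypothetical mini-batch gradient
new_theta = fobos_l1_step(theta, grad, lr=0.1, gamma=0.5)
print(new_theta)
```

Parameters whose intermediate value is smaller in magnitude than the threshold are set exactly to zero, which is how the second step produces sparse solutions.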
Minimizing reconstruction errors requires only direct application of the FOBOS algorithm, whereas the supervised fine-tuning involves two inference problems. For a labeled sentence $x$, the gradient $\nabla_t$ of Eq. (5.11) in the step (5.12) takes the form

$$\nabla_t = \frac{\partial\, \beta^{\top}\Phi(x,\hat{y}^{*},\hat{h}^{*})}{\partial \theta} - \frac{\partial\, \beta^{\top}\Phi(x,y,\bar{h}^{*})}{\partial \theta}$$

where the feature functions of the corresponding EMRGs are inferred by solving

$$(\hat{y}^{*},\hat{h}^{*}) = \arg\max_{(\hat{y},\hat{h}) \in Y(x) \times H(x)} \big[\beta^{\top}\Phi(x,\hat{y},\hat{h}) + \delta(\hat{y},\hat{h},y)\big]$$

and

$$(y,\bar{h}^{*}) = \arg\max_{\bar{h} \in H(x)} \beta^{\top}\Phi(x,y,\bar{h})$$

as indicated in the optimization problem (5.11).
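Restricted to the weight vector β, the gradient above reduces to the difference between the feature vectors of the two inferred EMRGs. The sketch below illustrates this with brute-force enumeration over a tiny candidate set in place of ILP inference. The tuple layout, the function names, the label names and all numbers are hypothetical and serve only to show the two argmax problems side by side.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def beta_subgradient(beta, candidates, gold_label):
    """candidates: list of (label, latent, features, loss) tuples for one sentence.
    Returns Phi(x, y_hat*, h_hat*) - Phi(x, y, h_bar*), the gradient with respect to beta."""
    # Loss-augmented inference over all (label, latent) pairs
    worst = max(candidates, key=lambda c: dot(beta, c[2]) + c[3])
    # Best latent assignment among candidates that carry the gold label
    best_gold = max((c for c in candidates if c[0] == gold_label),
                    key=lambda c: dot(beta, c[2]))
    return [a - b for a, b in zip(worst[2], best_gold[2])]

beta = [0.2, -0.1]
candidates = [
    ("rel",   "h1", [1.0, 0.0], 0.0),  # gold label, zero loss
    ("rel",   "h2", [0.0, 1.0], 0.0),  # gold label, different latent structure
    ("other", "h1", [1.0, 1.0], 1.0),  # wrong label, loss 1
]
grad = beta_subgradient(beta, candidates, "rel")
print(grad)
```

When the loss-augmented maximizer coincides with the gold-labeled maximizer, the difference is the zero vector and the instance contributes no update to β.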
The former inference problem is similar to the one we considered in the previous section, except for the inclusion of the loss function. In effect, it finds the most error-prone EMRG according to the loss between an EMRG $(\hat{h}, \hat{y})$ and a gold standard EMRG. The objective function of the ILP then becomes

$$\max_{z \in B} \; s^{\top} z + \delta(\hat{h},\hat{y},y)$$

where, for ease of computation, the loss is the sum of per-relationship costs:

$$\delta(\hat{h},\hat{y},y) = \sum_{e \in E_0} \phi_e z_e$$

Here $E_0$ is the set of all candidates of mention-based relationships and $\phi_e$ is the misclassification cost of the corresponding relationship $e$. Since we aim to maximize the F-measure of the SRG, which involves edges with non-other relationship types, errors that classify other as a non-other relationship type should be weighted less than errors that misclassify non-other relationship types. Therefore, $\phi_e$ takes one of the two costs $\phi_{fp}$ and $\phi_{fn}$, which are fixed for misclassifying other and non-other relationship types respectively.
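A minimal sketch of such a cost-sensitive, decomposable loss is given below. It is one possible reading of the description above, not the exact formulation used in the ILP: every relationship candidate whose predicted label differs from the gold label contributes a fixed cost, which is smaller when the gold label is other. The cost values, the relationship labels and the function name are hypothetical.

```python
OTHER = "other"

def emrg_loss(predicted, gold, phi_fp=0.5, phi_fn=1.0):
    """Sum of per-relationship misclassification costs.
    predicted, gold: dicts mapping each relationship candidate to a label.
    phi_fp: cost of misclassifying a gold 'other' edge (assumed value).
    phi_fn: cost of misclassifying a gold non-other edge (assumed value)."""
    total = 0.0
    for edge, gold_label in gold.items():
        if predicted.get(edge, OTHER) == gold_label:
            continue  # correctly labeled edges cost nothing
        total += phi_fp if gold_label == OTHER else phi_fn
    return total

gold = {"e1": "prefer", "e2": OTHER, "e3": "equal"}
pred = {"e1": "prefer", "e2": "prefer", "e3": OTHER}
loss = emrg_loss(pred, gold)
print(loss)
```

Because the loss decomposes over relationship candidates, it can be folded into the linear objective of the ILP as additional per-variable coefficients.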
In addition, the non-positive weights of relationship labels in the initial learning phase often lead to EMRGs with few edges, which results in too few error cases. We fix this by adding a constraint on the minimal number of edges in an EMRG,

$$\sum_{e \in A} \sum_{c \in C_e} \eta_{ec} \geq \zeta \tag{5.14}$$

where $A$ is the set of all relationship candidates, $C_e$ is the candidate set of textual evidence for the relationship $e$, and $\zeta$ denotes the lower bound on the number of edges.
Empirically, we find the best way to determine $\zeta$ is to make it equal to the maximal number of edges in an EMRG with the restriction that a textual evidence can be assigned to at most one relationship candidate. Hence we represent all the relationship candidates $A$ and all the textual evidence candidates $C$ as two vertex sets in a bipartite graph $\hat{G} = \langle V = (A, C), E \rangle$ (with edges in $E$ indicating which textual evidence can be assigned to which relationship candidate). Then $\zeta$ corresponds exactly to the size of a maximum matching of the bipartite graph, which is computed by the Hopcroft-Karp algorithm (Hopcroft and Karp, 1973) in our implementation.
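The computation of ζ can be illustrated with a short Python sketch. For brevity it uses the simple augmenting-path method for bipartite matching rather than Hopcroft-Karp; both return a maximum matching, so the resulting size is the same, and only the running time differs. The function name, the dictionary layout and the toy graph are assumptions made for illustration.

```python
def max_matching_size(candidates_of):
    """candidates_of: dict mapping each relationship candidate to the list of
    textual evidence candidates that may be assigned to it.
    Returns the size of a maximum bipartite matching (augmenting-path method)."""
    owner = {}  # evidence -> relationship candidate currently matched to it

    def try_assign(rel, visited):
        for ev in candidates_of[rel]:
            if ev in visited:
                continue
            visited.add(ev)
            # Take a free evidence, or re-route its current owner to another one
            if ev not in owner or try_assign(owner[ev], visited):
                owner[ev] = rel
                return True
        return False

    return sum(try_assign(rel, set()) for rel in candidates_of)

# Toy bipartite graph: three relationship candidates, two textual evidences
edges = {"a1": ["c1", "c2"], "a2": ["c1"], "a3": ["c2"]}
zeta = max_matching_size(edges)
print(zeta)
```

In the toy graph only two evidences exist, so at most two relationship candidates can receive a distinct evidence, and the lower bound would be set to 2.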
To find the optimal EMRG $(\bar{h}^{*}, y)$, we consider the following set of constraints for inference since the labels of the edges are known for the training data. For an edge $e$ with the gold label $k$, we have

$$\sum_{c \in C_e} \eta_{ec} \leq 1; \qquad \eta_{ec} \leq l_{ck}; \qquad \sum_{\hat{k} \in L} l_{c\hat{k}} \leq 1; \qquad \sum_{e \in B_c} \eta_{ec} \leq 1$$
We also include the soft constraints to avoid a textual evidence being overly reused by multiple relationships. In addition, we found it useful to enforce a minimal number of edges labeled with non-other relationship types by

$$\sum_{a \in A_r} \sum_{c \in C_a} \eta_{ac} \geq \zeta_r \tag{5.15}$$

where $A_r$ is the set of all non-other relationships and $\zeta_r$ denotes the minimal number of such edges, computed in the same way as for the constraint (5.14).