Bunescu and Pasca [2006] learn their model with a ranking algorithm, more specifically a Ranking SVM. This algorithm was first introduced in Joachims [2002] in the context of search engine analysis and was also used in later linking approaches, for instance in Dredze et al. [2010], Zheng et al. [2010] and Pilz and Paaß [2012]. Since full details on SVMs and Ranking SVMs are beyond the scope of this thesis, we provide only the basic ideas here and briefly point out how Ranking SVMs differ from standard SVMs. We assume background knowledge on SVMs and refer the reader to Cortes and Vapnik [1995] or Vapnik [2000] for further details. We will return to Ranking SVMs in Chapter 4, where we use this algorithm for general entity linking. For all ranking and classification models trained and

3.5 Semantic Labelling of Entities

evaluated in this thesis, we use the SVMLight implementation by Thorsten Joachims, which provides both standard classification and an adaptation for ranking.

A ranking approach to entity linking can be summarized as follows. For a mention $m$ and a set of $n$ candidates $e(m) = \{e_1(m), \ldots, e_n(m)\}$, the optimal result of a ranking algorithm is a ranking $r^* = \{r_1, \ldots, r_n\} \in \mathbb{R}^n$ that orders the $n$ candidate entities $e(m)$ according to how well they fit the mention (or the mention context). In our case, a ranking is considered correct if the correct underlying entity $e^+(m)$ is ranked at the top position. To describe the underlying technique, we use the description in Pilz and Paaß [2009], which closely follows that in Joachims [2002], but adapt the notation.

As in Joachims [2002], we start with a collection of entities $e = \{e_1, \ldots, e_{|W|}\}$. For a mention $m$ we want to determine a list of relevant entities in $e$, where the most relevant entities appear first. This corresponds to a ranking relation $r^*(m) \subseteq e \times e$ that fulfills the properties of a weak ordering, i.e. it is asymmetric and negatively transitive. If an entity $e_i$ is ranked higher than $e_j$ for an ordering $r$, i.e. $e_i <_r e_j$, then $(e_i, e_j) \in r$, otherwise $(e_i, e_j) \notin r$.

We have to measure the similarity between a proposed ranking $r(m)$ and the target ranking $r^*(m)$. One such measure is Kendall's $\tau$ (Kendall [1955]), which is a function of the number $n_e$ of concordant pairs in relation to all pairs. For two rankings $r_a$ and $r_b$, a pair $e_i \neq e_j$ is concordant if both rankings agree on its order, i.e. if either $(e_i, e_j) \in r_a \wedge (e_i, e_j) \in r_b$ or $(e_j, e_i) \in r_a \wedge (e_j, e_i) \in r_b$.
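To make the notion of concordant pairs concrete, the following minimal sketch counts concordant and discordant pairs for two rankings. It assumes strict rankings without ties, given as lists with the best item first, and it assumes the common form $\tau = (P - Q)/(P + Q)$ with $P$ concordant and $Q$ discordant pairs; the function name `kendall_tau` is ours and purely illustrative.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two strict rankings over the same items (best first).

    Assumption: no ties, and tau = (P - Q) / (P + Q), where P is the number
    of concordant and Q the number of discordant pairs.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = 0
    discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way.
        same_order = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

For instance, swapping the first two of three entities leaves two of the three pairs concordant, so the sketch yields $(2 - 1)/3$.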

Now assume we have a training set $D$ containing $n$ different i.i.d. mentions $m_i$ with target rankings

\[ D = (m_1, r_1^*), (m_2, r_2^*), \ldots, (m_n, r_n^*), \tag{3.9} \]

where $r_i^* \subseteq e \times e$ is a ranking on the entities at hand. To achieve a ranking close to the ground truth $r^*$, a learner selects a ranking function $f(m)$ based on the training set $D$ that maximizes the empirical $\tau_D$ (Kendall [1955]), which measures the similarity of two rankings on the training sample, i.e.

\[ \tau_D(f) = \frac{1}{n} \sum_{k=1}^{n} \tau\bigl(r_f(x(m, e_k(m))),\, r_k^*\bigr), \tag{3.10} \]

where $r_f(x(m, e_k(m)))$ is the ranking induced by the ranking function $f$ and $r_k^*$ is the target ranking.

Maximizing Eq. 3.10 is analogous to minimizing training error in classification, with the difference that the target is not a class label but a binary ordering relation. Thus, whereas in standard SVMs constraints are formulated over the offset from a separating hyperplane, Ranking SVMs impose different constraints, since the relative ordering of the examples additionally has to be modelled. Consider the class of linear ranking functions

\[ (e_i, e_j) \in f_w(m) \iff w \cdot x(m, e_i) > w \cdot x(m, e_j), \tag{3.11} \]

Chapter 3 Topic Models for Person Linking

where $x(m, e_i) \in \mathbb{R}^d$ is a vector of $d$ real-valued features that, for instance, describe the fit between candidate and mention, and $w \in \mathbb{R}^d$ is a weight vector of matching dimension. For the class of linear ranking functions in Eq. 3.11, maximizing the number of concordant pairs, i.e. maximizing Eq. 3.10, is equivalent to finding the weight vector $w$ such that the maximum number of the following inequalities hold:

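\[
\begin{aligned}
\forall (e_i, e_j) \in r_1^* &: \; w \cdot x(m_1, e_i) > w \cdot x(m_1, e_j) \\
&\;\;\vdots \\
\forall (e_i, e_j) \in r_n^* &: \; w \cdot x(m_n, e_i) > w \cdot x(m_n, e_j)
\end{aligned}
\tag{3.12}
\]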

Solving this problem exactly is NP-hard. As proposed in Joachims [2002], and just as in classification SVMs, the solution is approximated by introducing non-negative slack variables $\xi_{i,j,k}$ and minimizing the upper bound, i.e. the sum of slack variables $\sum \xi_{i,j,k}$. Regularizing the length of $w$ to maximize margins leads to the following optimization problem:

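\[
\begin{aligned}
\text{minimize:} \quad & V(w, \xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{|e|} \sum_{j=1}^{|e|} \sum_{k=1}^{n} \xi_{i,j,k} && (3.13) \\
\text{subject to:} \quad & \forall (e_i, e_j) \in r_1^* : \; w \cdot x(m_1, e_i) \geq w \cdot x(m_1, e_j) + 1 - \xi_{i,j,1} && (3.14) \\
& \qquad \vdots \\
& \forall (e_i, e_j) \in r_n^* : \; w \cdot x(m_n, e_i) \geq w \cdot x(m_n, e_j) + 1 - \xi_{i,j,n} \\
& \forall i \, \forall j \, \forall k : \; \xi_{i,j,k} \geq 0
\end{aligned}
\]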

The parameter C is the usual parameter capturing the trade-off between margin size and training error in terms of ne. As noted in Joachims [2002], this optimization problem is comparable to the ordinal regression approach in Herbrich et al. [2000]. Further, it is convex and has no local optima. By rearranging the constraints in Eq. 3.14 as

\[ w \cdot \bigl(x(m_k, e_i) - x(m_k, e_j)\bigr) \geq 1 - \xi_{i,j,k} \tag{3.15} \]

it becomes apparent that the optimization problem is equivalent to that of a classification SVM on pairwise difference vectors $x(m_k, e_i) - x(m_k, e_j)$. Due to this similarity, it can be solved using decomposition algorithms similar to those used for SVM classification.
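The reduction to pairwise difference vectors can be sketched in a few lines. The code below is only an illustration under stated assumptions: it builds one difference vector per preference pair and minimizes the hinge-loss form of Eq. 3.13 by plain subgradient descent, which is not the decomposition algorithm used by SVMLight. The names `pairwise_differences` and `train_ranker`, the data layout and the hyperparameter defaults are hypothetical.

```python
def pairwise_differences(features, labels):
    """Return x_i - x_j for every candidate pair with label y_i > y_j."""
    diffs = []
    for x_i, y_i in zip(features, labels):
        for x_j, y_j in zip(features, labels):
            if y_i > y_j:
                diffs.append([a - b for a, b in zip(x_i, x_j)])
    return diffs


def train_ranker(mentions, C=1.0, epochs=200, lr=0.01):
    """Subgradient descent on 0.5 * w.w + C * sum(max(0, 1 - w.d)).

    `mentions` is a list of (features, labels) tuples, one per mention,
    where features[i] is x(m, e_i) and labels[i] its target value y.
    This is a stand-in solver for illustration, not SVMLight's algorithm.
    """
    diffs = [d for feats, labels in mentions
             for d in pairwise_differences(feats, labels)]
    dim = len(diffs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)  # gradient of the regulariser 0.5 * w.w
        for d in diffs:
            margin = sum(w_t * d_t for w_t, d_t in zip(w, d))
            if margin < 1.0:  # constraint of Eq. 3.15 violated: slack > 0
                for t in range(dim):
                    grad[t] -= C * d[t]
        w = [w_t - lr * g_t for w_t, g_t in zip(w, grad)]
    return w
```

Each preference pair contributes exactly one training example to the classification problem, which is why the number of examples grows with the number of candidate pairs per mention rather than with the number of candidates.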

To formulate inference using such a ranking function, we first note that it can be shown that a learned ranking function fw∗(m) can always be represented as a linear combination of the feature vectors:

\[
\begin{aligned}
(e_i, e_j) \in f_{w^*}(m) &\iff w^* \cdot x(m, e_i) > w^* \cdot x(m, e_j) \\
&\iff \sum a^*_{k,l}\, x(m_k, e_l) \cdot x(m, e_i) > \sum a^*_{k,l}\, x(m_k, e_l) \cdot x(m, e_j),
\end{aligned}
\tag{3.16}
\]

where $w^*$ is the learned weight vector and the $a^*_{k,l}$ are derived from the values of the Lagrangian dual variables at the solution.

Figure 3.5: Example of two weight vectors $w_1$ and $w_2$ ranking four points (after Joachims [2002]). The margin $\delta$ is the distance between the closest two projections within all target rankings. For $w_1$ and $\delta_1$, these are the points 1 and 2, for $w_2$ and $\delta_2$ the points 1 and 4.

Further, we note that the learned ranking function $f_{w^*}(m)$ is used here to rank a set of candidates with respect to a mention $m$. Since we aim at the candidate with the highest rank, it is sufficient to sort these candidates by their value of

\[ \operatorname{rank}(x(m, e_i)) = w^* \cdot x(m, e_i) = \sum a^*_{k,l}\, x(m_k, e_l) \cdot x(m, e_i). \tag{3.17} \]

The final prediction $\hat{e}$ is then given by

\[ \hat{e} = \arg\max_{e_i \in e(m)} \operatorname{rank}(x(m, e_i)) = \arg\max_{e_i \in e(m)} w^* \cdot x(m, e_i). \tag{3.18} \]

An example ordering implied by a weight vector $w$ is illustrated in Fig. 3.5 (adapted from Joachims [2002]). The figure illustrates how a weight vector $w$ determines the ordering of four points in a two-dimensional example. For any weight vector $w$, the points are ordered by their projection onto $w$, which is equivalent to an ordering by the signed distance to a hyperplane with normal vector $w$. In the example in Fig. 3.5, this means that for $w_1$ the points are ordered (1,2,3,4), while $w_2$ implies the ordering (2,3,1,4) (Joachims [2002]).
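For a linear kernel, the inference step amounts to a dot product per candidate followed by a sort or an arg max. The following sketch illustrates this; the function names and the dictionary layout (entity identifier mapped to its feature vector $x(m, e_i)$) are our own assumptions, and the weights and features in the usage example are made up.

```python
def score(w, x):
    """Linear ranking score w . x(m, e_i)."""
    return sum(w_t * x_t for w_t, x_t in zip(w, x))


def rank_candidates(w, candidates):
    """Sort entity ids by descending score; `candidates` maps id -> x(m, e_i)."""
    return sorted(candidates, key=lambda e: score(w, candidates[e]), reverse=True)


def predict(w, candidates):
    """Return the entity id with the highest score (the arg max)."""
    return max(candidates, key=lambda e: score(w, candidates[e]))


# Made-up example with two features per candidate.
w_star = [2.0, -1.0]
candidates = {"e1": [0.2, 0.9], "e2": [0.8, 0.1], "e3": [0.5, 0.5]}
best = predict(w_star, candidates)
ordering = rank_candidates(w_star, candidates)
```

Here the scores are -0.5 for `e1`, 1.5 for `e2` and 0.5 for `e3`, so `e2` is predicted and the ordering is `e2`, `e3`, `e1`.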

While Ranking SVMs can, just like standard SVMs, be used with all kinds of kernels, a linear kernel has the advantage that feature weights can be extracted directly without additional computational effort. Bunescu and Pasca make use of this to learn the threshold for the decision on NIL candidates: they demonstrated that, with a linear kernel in the Ranking SVM, this threshold can be learned automatically from the weight of an indicator feature:

\[ x_{\mathrm{nil}}(m, e) = \mathbb{1}(e, \mathrm{NIL}). \tag{3.19} \]

This binary feature is active only for a NIL candidate, which needs to be provided for each mention so that the threshold can be learned from the available features. We may therefore create candidate sets $e(m) = \{e_i(m)\} \subset W \cup \{\mathrm{NIL}\}$ that cover all candidates in Wikipedia, $\{e_i(m)\} \subset W$, and add an artificial candidate NIL for each mention.
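The role of the NIL indicator as a learned threshold can be illustrated with a small sketch. We assume here, beyond what is stated above, that all other features of the artificial NIL candidate are zero; under this assumption its score equals the weight of $x_{\mathrm{nil}}$, so a Wikipedia candidate is only selected if its score exceeds that weight. The names `with_nil` and `link` and the example weights are hypothetical.

```python
NIL = "NIL"


def with_nil(candidates):
    """Append x_nil to every feature vector and add the artificial NIL candidate.

    Assumption: the NIL candidate has zeros for all regular features.
    `candidates` maps entity id -> feature vector and must be non-empty.
    """
    dim = len(next(iter(candidates.values())))
    extended = {e: list(x) + [0.0] for e, x in candidates.items()}
    extended[NIL] = [0.0] * dim + [1.0]
    return extended


def link(w, candidates):
    """Return the best candidate, or NIL if no score exceeds the weight of x_nil."""
    extended = with_nil(candidates)
    return max(extended,
               key=lambda e: sum(w_t * x_t for w_t, x_t in zip(w, extended[e])))


# Made-up weights: the last entry is the weight of x_nil, i.e. the threshold.
w = [1.0, 0.5, 0.6]
weak_match = link(w, {"e1": [0.3, 0.2]})    # score 0.4 is below 0.6
strong_match = link(w, {"e1": [0.7, 0.2]})  # score 0.8 is above 0.6
```

With these made-up values, the first mention is linked to NIL and the second to `e1`.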

To create training instances, we need to assign a position in the ranking to each training instance representing a mention-candidate pair. For our implementation of Bunescu and Pasca's method, we tried to contact the authors to find out how these target rankings were created for the training data, but without success. Since the paper does not indicate otherwise, we assume that the ranking used in Bunescu and Pasca [2006] is a weak ordering in which the correct candidate is assigned the top position and all other candidates, which do not represent the ground truth entity, share a place in the ordering. In practice, this ordering is realised through real-valued scalars $y \in \mathbb{R}$. These are assigned to each vector $x(m, e_i)$: a high value of $y$ indicates a leading position in the ranking, a low value of $y$ a late position. In our case, i.e. the case of a weak ordering, it suffices to choose a value $y \in \{-1, +1\}$. Then, for instance in the case of three candidates $e_1$, $e_2$ and $e_3$ for a mention $m$, we have

\[
\begin{aligned}
e_1 = e^+(m) &: \; y(x(m, e_1)) = +1 \\
e_2 \neq e^+(m) &: \; y(x(m, e_2)) = -1 \\
e_3 \neq e^+(m) &: \; y(x(m, e_3)) = -1
\end{aligned}
\]

which puts x(m, e1) at the leading position and lets x(m, e2) and x(m, e3) share the same but lower position.
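As an illustration of how such target values could be written out for training, the sketch below serialises the candidates of one mention in the SVMLight ranking input format, where each line has the form `<target> qid:<id> <index>:<value> ...` and candidates of the same mention share a `qid`. The function name, the data layout and the use of the trailing comment for the entity identifier are our own choices for this example and do not reproduce the original implementation.

```python
def to_svmlight_lines(mention_id, candidates, gold):
    """Serialise one mention's candidates as SVMLight ranking examples.

    `candidates` maps entity id -> feature vector x(m, e_i); `gold` is the
    id of the correct entity e+(m). The correct candidate gets target +1,
    all others -1. Feature indices are 1-based; zero values are omitted.
    """
    lines = []
    for entity, x in candidates.items():
        y = 1 if entity == gold else -1
        feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(x, start=1) if v != 0)
        lines.append(f"{y} qid:{mention_id} {feats} # {entity}")
    return lines


# Made-up example: mention 7 with two candidates, e1 being correct.
example = to_svmlight_lines(7, {"e1": [0.5, 0.0, 1.0], "e2": [0.0, 0.25, 0.0]}, "e1")
```

Because only the relative order of targets within one `qid` matters for ranking, the two values +1 and -1 are sufficient to express the weak ordering described above. To our knowledge, SVMLight's `svm_learn` is switched to preference ranking with the option `-z p`.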

Having described the model designs of WTC and WCC and the learner used by Pilz and Paaß [2009] as well as by Bunescu and Pasca, we will now experimentally compare these approaches for person name disambiguation in German.