
0.9122 0.4098

. Their dot product is

φ_n(x₁)^Tφ_n(x₂) = 0.8914· 0.9122 + 0.4532 · 0.4098 = 0.9988 which matches K_n(x₁, x₂).

If we start with the centered kernel matrix K̂ from Example 5.12, and then normalize it, we obtain the normalized and centered kernel matrix K̂_n

$$
\hat{K}_n =
\begin{pmatrix}
 1.00 & -0.44 & -0.61 &  0.80 & -0.77 \\
-0.44 &  1.00 &  0.98 & -0.89 & -0.24 \\
-0.61 &  0.98 &  1.00 & -0.97 & -0.03 \\
 0.80 & -0.89 & -0.97 &  1.00 & -0.22 \\
-0.77 & -0.24 & -0.03 & -0.22 &  1.00
\end{pmatrix}
$$

As noted earlier, the kernel value K̂_n(x_i, x_j) denotes the correlation between x_i and x_j in feature space, i.e., it is the cosine of the angle between the centered points φ(x_i) and φ(x_j).
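As a small illustration of the normalization step, the sketch below divides each kernel value by the geometric mean of the two corresponding diagonal entries, which is the usual definition of a normalized kernel. The function name and the 3 × 3 matrix are hypothetical and are not the data of Example 5.12.

```python
from math import sqrt

def normalize_kernel(K):
    """Return K_n with K_n[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    return [[K[i][j] / sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical symmetric kernel matrix (illustrative values only).
K = [[4.0, 2.0, -1.0],
     [2.0, 9.0, 3.0],
     [-1.0, 3.0, 16.0]]

K_n = normalize_kernel(K)
for row in K_n:
    print(" ".join(f"{v:6.3f}" for v in row))
```

After normalization every diagonal entry equals 1, so each point has unit length in feature space and the off-diagonal entries can be read as cosines.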

5.4 Kernels for Complex Objects

We conclude this chapter with some examples of kernels deﬁned for complex data like strings and graphs. The use of kernels for dimensionality reduction will be described in Section 7.3, for clustering in Section 13.2 and Chapter 16, and for classiﬁcation in Section 21.4.

5.4.1 Spectrum Kernel for Strings

Consider text or sequence data defined over an alphabet Σ. The l-spectrum feature map is the mapping φ : Σ* → R^{|Σ|^l} from the set of substrings over Σ to the |Σ|^l-dimensional space representing the number of occurrences of all possible substrings of length l, defined as

$$
\phi(x) = \bigl( \cdots, \#(\alpha), \cdots \bigr)^T_{\alpha \in \Sigma^l}
$$

where #(α) is the number of occurrences of the l-length string α in x.

The (full) spectrum map is an extension of the l-spectrum map, obtained by considering all lengths from l = 0 to l = ∞, leading to an infinite dimensional feature map φ : Σ* → R^∞

$$
\phi(x) = \bigl( \cdots, \#(\alpha), \cdots \bigr)^T_{\alpha \in \Sigma^*}
$$

where #(α) is the number of occurrences of the string α in x.

The (l-)spectrum kernel between two strings x_i, x_j is simply the dot product between their (l-)spectrum maps

K(x_i, x_j) = φ(x_i)^T φ(x_j)

A naive computation of the l-spectrum kernel takes O(|Σ|^l) time. However, for a given string x of length n, the vast majority of the l-length strings have an occurrence count of zero, which can be ignored. The l-spectrum map can be eﬀectively computed in O(n) time for a string of length n (assuming n≫ l), since there can be at most n− l + 1 substrings of length l, and the l-spectrum kernel can thus be computed in O(n + m) time for any two strings of length n and m, respectively.

The feature map for the (full) spectrum kernel is infinite dimensional, but once again, for a given string x of length n, the vast majority of the strings will have an occurrence count of zero. A straightforward implementation of the spectrum map for a string x of length n can be computed in O(n²) time, since x can have at most

$$
\sum_{l=1}^{n} (n - l + 1) = \frac{n(n+1)}{2}
$$

distinct non-empty substrings. The spectrum kernel can then be computed in O(n² + m²) time for any two strings of length n and m, respectively. However, a much more efficient computation is enabled via suffix trees (see Chapter 10), with a total time of O(n + m).
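The straightforward computation can be sketched as follows. This is a brute-force illustration, not the suffix-tree method: it enumerates all n(n + 1)/2 non-empty substring occurrences, and because each slice copies characters the actual running time is higher than the O(n²) count of substrings suggests. The function names and the short test strings are hypothetical.

```python
from collections import Counter

def spectrum_map(x):
    """Sparse full spectrum map: counts of every non-empty substring of x."""
    counts = Counter()
    n = len(x)
    for start in range(n):
        for end in range(start + 1, n + 1):
            counts[x[start:end]] += 1
    return counts

def spectrum_kernel(x, y):
    """Dot product of the two sparse maps; only common substrings contribute."""
    phi_x, phi_y = spectrum_map(x), spectrum_map(y)
    if len(phi_y) < len(phi_x):
        phi_x, phi_y = phi_y, phi_x      # iterate over the smaller map
    return sum(count * phi_y[alpha] for alpha, count in phi_x.items())

# Hypothetical toy strings: "ABA" and "AB" share the substrings A, B and AB.
print(spectrum_map("ABA"))
print(spectrum_kernel("ABA", "AB"))
```

Storing only the substrings that actually occur is what keeps the infinite dimensional feature map manageable: absent substrings have count zero and never enter the sum.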

Example 5.14: Consider sequences over the DNA alphabet Σ = {A, C, G, T}. Let x₁ = ACAGCAGTA, and let x₂ = AGCAAGCGAG. For l = 3, the feature space has dimensionality |Σ|^l = 4³ = 64. Nevertheless, we do not have to map the input points into the full feature space; we can compute the reduced 3-spectrum mapping by counting the number of occurrences for only the length 3 substrings that occur in each input sequence, as follows

φ(x₁) = (ACA : 1, AGC : 1, AGT : 1, CAG : 2, GCA : 1, GTA : 1)

φ(x₂) = (AAG : 1, AGC : 2, CAA : 1, CGA : 1, GAG : 1, GCA : 1, GCG : 1)

where the notation α : #(α) denotes that substring α has #(α) occurrences in x_i. We can then compute the dot product by considering only the common substrings, as follows

K(x₁, x₂) = 1 × 2 + 1 × 1 = 2 + 1 = 3

The first term in the dot product is due to the substring AGC, and the second is due to GCA, which are the only common length 3 substrings between x₁ and x₂.
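The 3-spectrum computation in this example can be reproduced with a sliding window over each string. The sketch below is one possible implementation; the function names are hypothetical, and the sparse maps are held in `Counter` objects so that only substrings that actually occur are stored.

```python
from collections import Counter

def l_spectrum_map(x, l):
    """Counts of all length-l substrings of x (at most len(x) - l + 1 of them)."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))

def l_spectrum_kernel(x, y, l):
    """Dot product of the sparse l-spectrum maps of x and y."""
    phi_x, phi_y = l_spectrum_map(x, l), l_spectrum_map(y, l)
    return sum(count * phi_y[alpha] for alpha, count in phi_x.items())

x1 = "ACAGCAGTA"
x2 = "AGCAAGCGAG"

print(sorted(l_spectrum_map(x1, 3).items()))
print(sorted(l_spectrum_map(x2, 3).items()))
print(l_spectrum_kernel(x1, x2, 3))  # 3, from AGC (1 x 2) and GCA (1 x 1)
```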

The full spectrum kernel can be computed by considering the occurrences of all common substrings over all possible lengths. For x₁ and x₂, the common substrings and their occurrence counts are given as

α             A   C   G   AG   CA   GC   AGC   GCA   AGCA
#(α) in x₁    4   2   2   2    2    1    1     1     1
#(α) in x₂    4   2   4   3    1    2    2     1     1

Note that GC is also a common substring: it occurs once in x₁ and twice in x₂. Thus, the full spectrum kernel value is given as

K(x₁, x₂) = 16 + 4 + 8 + 6 + 2 + 2 + 2 + 1 + 1 = 42

5.4.2 Diffusion Kernels on Graph Nodes

Let S be some symmetric similarity matrix between nodes of a graph G = (V, E). For instance, S can be the (weighted) adjacency matrix A (4.1) or the Laplacian matrix L = ∆ − A (or its negation), where ∆ is the degree matrix for an undirected graph G, defined as ∆(i, i) = d_i and ∆(i, j) = 0 for all i ≠ j, and d_i is the degree of node i.

Consider the similarity between any two nodes obtained by summing the product of the similarities over paths of length 2

$$
S^{(2)}(x_i, x_j) = \sum_{a=1}^{n} S(x_i, x_a)\, S(x_a, x_j) = S_i^T S_j
$$

where

$$
S_i = \bigl( S(x_i, x_1), S(x_i, x_2), \cdots, S(x_i, x_n) \bigr)^T
$$

denotes the vector representing the i-th row of S (and since S is symmetric, it also denotes the i-th column of S). Over all pairs of nodes the similarity matrix over paths of length 2, denoted S⁽²⁾, is thus given as the square of the base similarity matrix S

$$
S^{(2)} = S \times S = S^2
$$

In general, if we sum up the product of the base similarities over all l-length paths between two nodes, we obtain the l-length similarity matrix S^(l), which is simply the l-th power of S, i.e.,

S^(l) = S^l
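To make the path interpretation concrete, the sketch below computes S^l by repeated multiplication, taking S to be the adjacency matrix of a hypothetical three-node path graph v1 - v2 - v3 (this graph and the helper names are assumptions for illustration). With 0/1 weights, entry (i, j) of the l-th power counts the walks of length l between nodes i and j.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def path_similarity(S, l):
    """Return S^l for an integer path length l >= 1."""
    result = S
    for _ in range(l - 1):
        result = mat_mul(result, S)
    return result

# Hypothetical path graph v1 - v2 - v3 (adjacency matrix).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

A2 = path_similarity(A, 2)
A3 = path_similarity(A, 3)
print(A2)  # [[1, 0, 1], [0, 2, 0], [1, 0, 1]]
print(A3)  # [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
```

For example, A2[1][1] = 2 because v2 can be left and re-entered through either neighbor, while A2[0][1] = 0 because no walk of length 2 joins v1 and v2.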

Power Kernels Even path lengths lead to positive semi-deﬁnite kernels, but odd path lengths are not guaranteed to do so, unless the base matrix S is itself a positive semi-deﬁnite matrix. In particular, K = S² is a valid kernel. To see this, assume that the i-th row of S denotes the feature map for x_i, i.e., φ(x_i) = S_i. The kernel value between any two points is then a dot product in feature space

K(x_i, x_j) = S⁽²⁾(x_i, x_j) = S^T_iS_j = φ(x_i)^Tφ(x_j)

For a general path length l, let K = S^l. Consider the eigen-decomposition of S

$$
S = U \Lambda U^T = \sum_{i=1}^{n} u_i \lambda_i u_i^T
$$

where U is the orthogonal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of S

$$
U = \begin{pmatrix}
| & | & & | \\
u_1 & u_2 & \cdots & u_n \\
| & | & & |
\end{pmatrix}
\qquad
\Lambda = \begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{pmatrix}
$$

The eigen-decomposition of K can be obtained as follows

$$
K = S^l = \left( U \Lambda U^T \right)^l = U \Lambda^l U^T
$$

where we used the fact that eigenvectors of S and S^l are identical, and further that eigenvalues of S^l are given as (λ_i)^l (for all i = 1, · · · , n), where λ_i is an eigenvalue of S. For K = S^l to be a positive semi-definite matrix, all its eigenvalues must be non-negative, which is guaranteed for all even path lengths. Since (λ_i)^l will be negative if l is odd and λ_i is negative, odd path lengths lead to a positive semi-definite kernel only if S is positive semi-definite.
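The effect of the parity of l can be seen on a very small case. The sketch below uses a hypothetical 2 × 2 similarity matrix with eigenvalues +1 and −1 and evaluates the quadratic form zᵀ S^l z along the eigenvector of the negative eigenvalue; all names are assumptions for illustration. A single negative value is enough to show that a matrix is not positive semi-definite, whereas the non-negative values for even l are only consistent with, and do not by themselves prove, positive semi-definiteness.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def mat_pow(S, l):
    """Return S^l for an integer l >= 1."""
    result = S
    for _ in range(l - 1):
        result = mat_mul(result, S)
    return result

def quad_form(K, z):
    """Evaluate z^T K z."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

# Hypothetical similarity matrix: a single edge, eigenvalues +1 and -1.
S = [[0, 1],
     [1, 0]]
z = [1, -1]   # eigenvector for the eigenvalue -1

forms = {l: quad_form(mat_pow(S, l), z) for l in range(1, 5)}
print(forms)  # {1: -2, 2: 2, 3: -2, 4: 2}
```

The sign alternates exactly as (−1)^l predicts, so S and S³ fail to be positive semi-definite here while S² and S⁴ do not.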

Exponential Diffusion Kernel Instead of fixing the path length a priori, we can obtain a new kernel between nodes of a graph by considering paths of all possible lengths, but by damping the contribution of longer paths, which leads to the exponential diffusion kernel, defined as

$$
K = \sum_{l=0}^{\infty} \frac{1}{l!} \beta^l S^l
  = I + \beta S + \frac{1}{2!} \beta^2 S^2 + \frac{1}{3!} \beta^3 S^3 + \cdots
  = \exp\{\beta S\}
\qquad (5.15)
$$

where β is a damping factor, and exp{βS} is the matrix exponential. The series on the right hand side above converges for all β ≥ 0.

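Substituting S = UΛU^T = Σ_{i=1}^{n} u_i λ_i u_i^T in (5.15), so that S^l = UΛ^l U^T, and using the fact that UU^T = Σ_{i=1}^{n} u_i u_i^T = I, we have

$$
K = \sum_{i=1}^{n} u_i \left( 1 + \beta\lambda_i + \frac{1}{2!} \beta^2 \lambda_i^2 + \cdots \right) u_i^T
  = \sum_{i=1}^{n} u_i \exp\{\beta\lambda_i\}\, u_i^T
  = U \exp\{\beta\Lambda\}\, U^T
$$

where exp{βΛ} is the diagonal matrix whose i-th diagonal entry is exp{βλ_i}. Thus, the eigenvectors of K are the same as those for S, whereas its eigenvalues are given as exp{βλ_i}, where λ_i is an eigenvalue of S. Further, K is symmetric since S is symmetric, and its eigenvalues are real and non-negative, since the exponential of a real number is non-negative. K is thus a positive semi-definite kernel matrix.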

The complexity of computing the diﬀusion kernel is O(n³) corresponding to the complexity of computing the eigen-decomposition.
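As a cross-check on the definition, the kernel can also be approximated by summing the defining series directly. The sketch below does this with a fixed number of terms instead of the eigen-decomposition discussed above; the function names, the cutoff of 30 terms, and the two-node test matrix are assumptions for illustration. For the negated Laplacian of a single edge, the eigenvalues are 0 and −2, so the exact kernel has diagonal entries (1 + e^{−2β})/2 and off-diagonal entries (1 − e^{−2β})/2, which the code prints next to the series value.

```python
from math import exp

def mat_mul(A, B):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def exp_diffusion_kernel(S, beta, terms=30):
    """Approximate exp(beta * S) by the first `terms` terms of its power series."""
    n = len(S)
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]                  # (beta S)^0 / 0! = I
    for l in range(1, terms):
        # Turn (beta S)^(l-1) / (l-1)! into (beta S)^l / l!.
        term = [[beta * v / l for v in row] for row in mat_mul(term, S)]
        for i in range(n):
            for j in range(n):
                K[i][j] += term[i][j]
    return K

# Hypothetical two-node graph joined by one edge: S = A - Delta.
S = [[-1.0, 1.0],
     [1.0, -1.0]]
beta = 0.5

K = exp_diffusion_kernel(S, beta)
print(K[0][0], (1 + exp(-2 * beta)) / 2)   # series vs. closed form
print(K[0][1], (1 - exp(-2 * beta)) / 2)
```

The truncation is harmless here because the factorial in the denominator makes the terms shrink very quickly; for matrices with large entries an eigen-decomposition or a library matrix exponential would be the safer choice.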

Von Neumann Diffusion Kernel A related kernel based on powers of S is the von Neumann diffusion kernel, deﬁned as

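$$
K = \sum_{l=0}^{\infty} \beta^l S^l
\qquad (5.17)
$$

where β ≥ 0. Expanding the above, we have

$$
\begin{aligned}
K &= I + \beta S + \beta^2 S^2 + \beta^3 S^3 + \cdots \\
  &= I + \beta S \left( I + \beta S + \beta^2 S^2 + \cdots \right) \\
  &= I + \beta S K
\end{aligned}
$$

Rearranging the terms above we obtain a closed form expression for the von Neumann kernel

$$
\begin{aligned}
K - \beta S K &= I \\
(I - \beta S) K &= I \\
K &= (I - \beta S)^{-1}
\end{aligned}
\qquad (5.18)
$$

Plugging in the eigen-decomposition S = UΛU^T, and rewriting I = UU^T, we have

$$
\begin{aligned}
K &= \left( U U^T - U (\beta\Lambda) U^T \right)^{-1} \\
  &= \left( U (I - \beta\Lambda) U^T \right)^{-1} \\
  &= U (I - \beta\Lambda)^{-1} U^T
\end{aligned}
$$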

where (I − βΛ)⁻¹ is the diagonal matrix whose i-th diagonal entry is (1 − βλ_i)⁻¹. The eigenvectors of K and S are identical, but the eigenvalues of K are given as 1/(1 − βλ_i). For K to be a positive semi-definite kernel, all its eigenvalues should be non-negative, which in turn implies that

$$
\begin{aligned}
(1 - \beta\lambda_i)^{-1} &\ge 0 \\
1 - \beta\lambda_i &\ge 0 \\
\beta &\le 1/\lambda_i
\end{aligned}
$$

Furthermore, the inverse matrix (I − βΛ)⁻¹ exists only if

$$
\det(I - \beta\Lambda) = \prod_{i=1}^{n} (1 - \beta\lambda_i) \ne 0
$$

which implies that β ≠ 1/λ_i for all i. Thus, for K to be a valid kernel, we require that β < 1/λ_i for all i = 1, · · · , n. The von Neumann kernel is therefore guaranteed to be positive semi-definite if |β| < 1/ρ(S), where ρ(S) = max_i{|λ_i|} is called the spectral radius of S, defined as the largest eigenvalue of S in absolute value.
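The closed form (5.18) can be evaluated without an eigen-decomposition by inverting I − βS directly. The sketch below uses a small Gauss-Jordan routine written for this illustration; the function names, the pivot tolerance, and the two-node test matrix are assumptions. For the negated Laplacian of a single edge, ρ(S) = 2, so β = 0.2 satisfies |β| < 1/ρ(S), and the inverse of [[1 + β, −β], [−β, 1 + β]] has diagonal entries (1 + β)/(1 + 2β) and off-diagonal entries β/(1 + 2β).

```python
def invert(M):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    # Augment M with the identity: [M | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def von_neumann_kernel(S, beta):
    """Return (I - beta * S)^(-1); fails if beta equals 1/lambda_i for some i."""
    n = len(S)
    M = [[(1.0 if i == j else 0.0) - beta * S[i][j] for j in range(n)]
         for i in range(n)]
    return invert(M)

# Hypothetical two-node graph joined by one edge: S = A - Delta.
S = [[-1.0, 1.0],
     [1.0, -1.0]]
beta = 0.2

K = von_neumann_kernel(S, beta)
print(K[0][0], (1 + beta) / (1 + 2 * beta))   # inverse vs. closed form
print(K[0][1], beta / (1 + 2 * beta))
```

When β equals 1/λ_i for some eigenvalue, the matrix I − βS is singular and the routine raises an error, which mirrors the determinant condition stated above.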

Figure 5.2: Graph Diffusion Kernel (graph on the nodes v₁, v₂, v₃, v₄, v₅)

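Example 5.15: Consider the graph in Figure 5.2. Its adjacency matrix and degree matrix are given as

$$
A = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0
\end{pmatrix}
\qquad
\Delta = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 2
\end{pmatrix}
$$

The negated Laplacian matrix for the graph is therefore

$$
S = A - \Delta = \begin{pmatrix}
-2 &  0 &  1 &  1 &  0 \\
 0 & -2 &  1 &  0 &  1 \\
 1 &  1 & -3 &  1 &  0 \\
 1 &  0 &  1 & -3 &  1 \\
 0 &  1 &  0 &  1 & -2
\end{pmatrix}
$$

The eigenvalues of S are as follows

λ₁ = 0    λ₂ = −1.38    λ₃ = −2.38    λ₄ = −3.62    λ₅ = −4.62

and the eigenvectors of S are

$$
U = \begin{pmatrix}
0.45 & -0.63 &  0.00 &  0.63 &  0.00 \\
0.45 &  0.51 & -0.60 &  0.20 & -0.37 \\
0.45 & -0.20 & -0.37 & -0.51 &  0.60 \\
0.45 & -0.20 &  0.37 & -0.51 & -0.60 \\
0.45 &  0.51 &  0.60 &  0.20 &  0.37
\end{pmatrix}
$$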

Assuming β = 0.2, the exponential diffusion kernel matrix is given as

$$
K = \exp\{0.2\, S\} = \begin{pmatrix}
0.70 & 0.01 & 0.14 & 0.14 & 0.01 \\
0.01 & 0.70 & 0.13 & 0.03 & 0.14 \\
0.14 & 0.13 & 0.59 & 0.13 & 0.03 \\
0.14 & 0.03 & 0.13 & 0.59 & 0.13 \\
0.01 & 0.14 & 0.03 & 0.13 & 0.70
\end{pmatrix}
$$

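For the von Neumann diffusion kernel, we have

$$
(I - 0.2\Lambda)^{-1} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 0.78 & 0 & 0 & 0 \\
0 & 0 & 0.68 & 0 & 0 \\
0 & 0 & 0 & 0.58 & 0 \\
0 & 0 & 0 & 0 & 0.52
\end{pmatrix}
$$

For instance, since λ₂ = −1.38, the second diagonal entry is 1/(1 + 0.2 × 1.38) = 1/1.276 = 0.78. The von Neumann kernel is given as

$$
K = U (I - 0.2\Lambda)^{-1} U^T = \begin{pmatrix}
0.75 & 0.02 & 0.11 & 0.11 & 0.02 \\
0.02 & 0.74 & 0.10 & 0.03 & 0.11 \\
0.11 & 0.10 & 0.66 & 0.10 & 0.03 \\
0.11 & 0.03 & 0.10 & 0.66 & 0.10 \\
0.02 & 0.11 & 0.03 & 0.10 & 0.74
\end{pmatrix}
$$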
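Both kernel matrices in this example have the form K = U f(Λ) Uᵀ, with f(λ) = exp{βλ} for the exponential diffusion kernel and f(λ) = 1/(1 − βλ) for the von Neumann kernel. The sketch below rebuilds them from the eigenvalues and eigenvectors as printed in the example; the helper name `spectral_kernel` is hypothetical. Because the printed values are rounded to two decimals, the results should be expected to agree with the matrices in the example only up to rounding.

```python
from math import exp

# Eigenvalues and eigenvectors of S as printed in Example 5.15 (two decimals).
lam = [0.0, -1.38, -2.38, -3.62, -4.62]
U = [[0.45, -0.63,  0.00,  0.63,  0.00],
     [0.45,  0.51, -0.60,  0.20, -0.37],
     [0.45, -0.20, -0.37, -0.51,  0.60],
     [0.45, -0.20,  0.37, -0.51, -0.60],
     [0.45,  0.51,  0.60,  0.20,  0.37]]
beta = 0.2

def spectral_kernel(U, lam, f):
    """Return U diag(f(lam_1), ..., f(lam_n)) U^T."""
    n = len(U)
    return [[sum(U[i][k] * f(lam[k]) * U[j][k] for k in range(len(lam)))
             for j in range(n)]
            for i in range(n)]

K_exp = spectral_kernel(U, lam, lambda x: exp(beta * x))
K_vn = spectral_kernel(U, lam, lambda x: 1.0 / (1.0 - beta * x))

for name, K in (("exponential", K_exp), ("von Neumann", K_vn)):
    print(name)
    for row in K:
        print(" ".join(f"{v:5.2f}" for v in row))
```

Writing both kernels through one function highlights that they differ only in how each eigenvalue of S is transformed, while the eigenvectors are shared.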







5.5 Further Reading

Kernel methods have been extensively studied in machine learning and data mining.

For an in-depth introduction and more advanced topics see (Schölkopf and Smola, 2002) and (Shawe-Taylor and Cristianini, 2004). For applications of kernel methods in bioinformatics see (Schölkopf, Tsuda, and Vert, 2004).

Schölkopf, B. and Smola, A. J. (2002), Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA, USA: MIT Press.

Schölkopf, B., Tsuda, K., and Vert, J.-P. (2004), Kernel Methods in Computational Biology, Cambridge, MA, USA: MIT Press.

Shawe-Taylor, J. and Cristianini, N. (2004), Kernel Methods for Pattern Analysis, New York, NY, USA: Cambridge University Press.

5.6 Exercises

Q1. Prove that the dimensionality of the feature space for the inhomogeneous polynomial kernel of degree q is

$$
m = \binom{d + q}{q}
$$

Table 5.1: Dataset for Q2

i     x_i
x₁    (4, 2.9)
x₂    (2.5, 1)
x₃    (3.5, 4)
x₄    (2, 2.1)

Q2. Consider the data shown in Table 5.1. Assume the following kernel function: K(x_i, x_j) = ‖x_i − x_j‖². Compute the kernel matrix K.

Q3. Show that eigenvectors of S and S^l are identical, and further that eigenvalues of S^l are given as (λ_i)^l (for all i = 1, · · · , n), where λ_i is an eigenvalue of S, and S is some n × n symmetric similarity matrix.

Q4. The von Neumann diffusion kernel is a valid positive semi-definite kernel if |β| < 1/ρ(S), where ρ(S) is the spectral radius of S. Can you derive better bounds for cases when β > 0 and when β < 0?

Q5. Given the three points x₁ = (2.5, 1)^T, x₂ = (3.5, 4)^T, and x₃ = (2, 2.1)^T.

(a) Compute the kernel matrix for the Gaussian kernel assuming that σ² = 5.

(b) Compute the distance of the point φ(x₁) from the mean in feature space.

Chapter 6
