$= (0.9122, 0.4098)^T$. Their dot product is
$$\phi_n(x_1)^T \phi_n(x_2) = 0.8914 \cdot 0.9122 + 0.4532 \cdot 0.4098 = 0.9988$$
which matches $K_n(x_1, x_2)$.
If we start with the centered kernel matrix $\hat{K}$ from Example 5.12, and then normalize it, we obtain the normalized and centered kernel matrix $\hat{K}_n$
$$\hat{K}_n = \begin{pmatrix}
1.00 & -0.44 & -0.61 & 0.80 & -0.77 \\
-0.44 & 1.00 & 0.98 & -0.89 & -0.24 \\
-0.61 & 0.98 & 1.00 & -0.97 & -0.03 \\
0.80 & -0.89 & -0.97 & 1.00 & -0.22 \\
-0.77 & -0.24 & -0.03 & -0.22 & 1.00
\end{pmatrix}$$
As noted earlier, the kernel value $\hat{K}_n(x_i, x_j)$ denotes the correlation between $x_i$ and $x_j$ in feature space, i.e., it is the cosine of the angle between the centered points $\phi(x_i)$ and $\phi(x_j)$.
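The centering and normalization steps can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `center_kernel` and `normalize_kernel` and the random data are assumptions made here, and the normalization is taken to be the usual division of each entry by the square root of the product of the corresponding diagonal entries.

```python
import numpy as np

def center_kernel(K):
    """Center K in feature space: (I - 1/n 11^T) K (I - 1/n 11^T)."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def normalize_kernel(K):
    """Divide each entry K[i, j] by sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# Hypothetical data: a linear kernel on four random two-dimensional points.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
K = X @ X.T

# Normalized and centered kernel: cosines between the centered points.
Kn_hat = normalize_kernel(center_kernel(K))
```

For the linear kernel used here, each entry of `Kn_hat` equals the cosine of the angle between the two mean-centered input points, so the diagonal is all ones and every entry lies in [-1, 1].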
5.4 Kernels for Complex Objects
We conclude this chapter with some examples of kernels defined for complex data like strings and graphs. The use of kernels for dimensionality reduction will be described in Section 7.3, for clustering in Section 13.2 and Chapter 16, and for classification in Section 21.4.
5.4.1 Spectrum Kernel for Strings
Consider text or sequence data defined over an alphabet $\Sigma$. The $l$-spectrum feature map is the mapping $\phi : \Sigma^* \to \mathbb{R}^{|\Sigma|^l}$ from the set of strings over $\Sigma$ to the $|\Sigma|^l$-dimensional space representing the number of occurrences of all possible substrings of length $l$, defined as
$$\phi(x) = \big( \cdots, \#(\alpha), \cdots \big)^T_{\alpha \in \Sigma^l}$$
where $\#(\alpha)$ is the number of occurrences of the $l$-length string $\alpha$ in $x$.
The (full) spectrum map is an extension of the $l$-spectrum map, obtained by considering all lengths from $l = 0$ to $l = \infty$, leading to an infinite dimensional feature map $\phi : \Sigma^* \to \mathbb{R}^{\infty}$
$$\phi(x) = \big( \cdots, \#(\alpha), \cdots \big)^T_{\alpha \in \Sigma^*}$$
where $\#(\alpha)$ is the number of occurrences of the string $\alpha$ in $x$.
The ($l$-)spectrum kernel between two strings $x_i$, $x_j$ is simply the dot product between their ($l$-)spectrum maps
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
A naive computation of the $l$-spectrum kernel takes $O(|\Sigma|^l)$ time. However, for a given string $x$ of length $n$, the vast majority of the $l$-length strings have an occurrence count of zero, which can be ignored. The $l$-spectrum map can be effectively computed in $O(n)$ time for a string of length $n$ (assuming $n \gg l$), since there can be at most $n - l + 1$ substrings of length $l$, and the $l$-spectrum kernel can thus be computed in $O(n + m)$ time for any two strings of length $n$ and $m$, respectively.
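The sparse computation described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are chosen here, and a hash-based counter stands in for the full $|\Sigma|^l$-dimensional vector. Each slice costs $O(l)$, which is treated as constant under the assumption $n \gg l$.

```python
from collections import Counter

def spectrum_map(x, l):
    """Sparse l-spectrum map: counts of the length-l substrings occurring in x."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))

def spectrum_kernel(x, y, l):
    """Dot product of the sparse l-spectrum maps of the strings x and y."""
    phi_x, phi_y = spectrum_map(x, l), spectrum_map(y, l)
    # Only substrings that occur in both strings contribute to the dot product.
    return sum(count * phi_y[s] for s, count in phi_x.items() if s in phi_y)

# The two DNA strings used in Example 5.14, with l = 3.
print(spectrum_kernel("ACAGCAGTA", "AGCAAGCGAG", 3))  # 3
```

Storing only the substrings that actually occur keeps the work proportional to the string lengths rather than to the size of the feature space.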
The feature map for the (full) spectrum kernel is infinite dimensional, but once again, for a given string $x$ of length $n$, the vast majority of the strings will have an occurrence count of zero. A straightforward implementation of the spectrum map for a string $x$ of length $n$ can be computed in $O(n^2)$ time, since $x$ can have at most $\sum_{l=1}^{n} (n - l + 1) = n(n + 1)/2$ distinct non-empty substrings. The spectrum kernel can then be computed in $O(n^2 + m^2)$ time for any two strings of length $n$ and $m$, respectively. However, a much more efficient computation is enabled via suffix trees (see Chapter 10), with a total time of $O(n + m)$.
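The straightforward (non-suffix-tree) approach can be sketched as below. The function names are assumptions made for this illustration, the empty string is deliberately left out of the count, and the toy strings are hypothetical. The sketch enumerates quadratically many substrings; it does not attempt the linear-time suffix tree computation.

```python
from collections import Counter

def full_spectrum_map(x):
    """Counts of every non-empty substring of x (n(n + 1)/2 slices in total)."""
    n = len(x)
    return Counter(x[i:j] for i in range(n) for j in range(i + 1, n + 1))

def full_spectrum_kernel(x, y):
    """Dot product of the sparse full spectrum maps of the strings x and y."""
    phi_x, phi_y = full_spectrum_map(x), full_spectrum_map(y)
    # Only substrings common to both strings contribute.
    return sum(count * phi_y[s] for s, count in phi_x.items() if s in phi_y)

# Hypothetical toy strings: the common substrings are A, B, AB, BA and BAB.
print(full_spectrum_kernel("ABAB", "BAB"))  # 10
```

In the toy call the contributions are 2·1 for A, 2·2 for B, 2·1 for AB, 1·1 for BA, and 1·1 for BAB.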
Example 5.14: Consider sequences over the DNA alphabet $\Sigma = \{A, C, G, T\}$.
Let $x_1 = ACAGCAGTA$, and let $x_2 = AGCAAGCGAG$. For $l = 3$, the feature space has dimensionality $|\Sigma|^l = 4^3 = 64$. Nevertheless, we do not have to map the input points into the full feature space; we can compute the reduced 3-spectrum mapping by counting the number of occurrences for only the length 3 substrings that occur in each input sequence, as follows
φ(x1) = (ACA : 1, AGC : 1, AGT : 1, CAG : 2, GCA : 1, GTA : 1)
φ(x2) = (AAG : 1, AGC : 2, CAA : 1, CGA : 1, GAG : 1, GCA : 1, GCG : 1)
where the notation α : #(α) denotes that substring α has #(α) occurrences in xi. We can then compute the dot product by considering only the common substrings, as follows
$$K(x_1, x_2) = 1 \times 2 + 1 \times 1 = 2 + 1 = 3$$
The first term in the dot product is due to the substring AGC, and the second is due to GCA, which are the only common length 3 substrings between x1 and x2.
The full spectrum can be computed by considering the occurrences of all common substrings over all possible lengths. For x1 and x2, the common substrings and their occurrence counts are given as
α             A   C   G   AG   CA   GC   AGC   GCA   AGCA
#(α) in x1    4   2   2   2    2    1    1     1     1
#(α) in x2    4   2   4   3    1    2    2     1     1
Note that GC is necessarily a common substring, since it is contained in the common substring AGC. Thus, the full spectrum kernel value is given as
$$K(x_1, x_2) = 16 + 4 + 8 + 6 + 2 + 2 + 2 + 1 + 1 = 42$$
5.4.2 Diffusion Kernels on Graph Nodes
Let S be some symmetric similarity matrix between nodes of a graph G = (V, E).
For instance, S can be the (weighted) adjacency matrix A (4.1) or the Laplacian matrix $L = \Delta - A$ (or its negation), where $\Delta$ is the degree matrix for an undirected graph G, defined as $\Delta(i, i) = d_i$ and $\Delta(i, j) = 0$ for all $i \neq j$, and $d_i$ is the degree of node $i$.
Consider the similarity between any two nodes obtained by summing the product of the similarities over paths of length 2
$$S^{(2)}(x_i, x_j) = \sum_{a=1}^{n} S(x_i, x_a) S(x_a, x_j) = S_i^T S_j$$
where
$$S_i = \big( S(x_i, x_1), S(x_i, x_2), \cdots, S(x_i, x_n) \big)^T$$
denotes the vector representing the $i$-th row of S (and since S is symmetric, it also denotes the $i$-th column of S). Over all pairs of nodes the similarity matrix over paths of length 2, denoted $S^{(2)}$, is thus given as the square of the base similarity matrix S
$$S^{(2)} = S \times S = S^2$$
In general, if we sum up the product of the base similarities over all $l$-length paths between two nodes, we obtain the $l$-length similarity matrix $S^{(l)}$, which is simply the $l$-th power of S, i.e.,
$$S^{(l)} = S^l$$
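As a small numerical check of this identity, the sketch below compares the explicit sum over intermediate nodes with the matrix power. The three-node path graph and the helper name `path_similarity` are hypothetical choices made only for illustration.

```python
import numpy as np

# Hypothetical base similarity: adjacency matrix of the path graph 1 - 2 - 3.
S = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

def path_similarity(S, l):
    """l-length similarity matrix S^(l), computed as the l-th matrix power of S."""
    return np.linalg.matrix_power(S, l)

n = S.shape[0]
# Explicit sum over the intermediate node a, as in the definition of S^(2).
S2_explicit = np.array([[sum(S[i, a] * S[a, j] for a in range(n))
                         for j in range(n)]
                        for i in range(n)])
S2 = path_similarity(S, 2)
print(np.array_equal(S2, S2_explicit))  # True
```

With an unweighted adjacency matrix as S, entry (i, j) of `S2` counts the paths of length 2 between nodes i and j; for example, the middle node has two such paths back to itself.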
Power Kernels. Even path lengths lead to positive semi-definite kernels, but odd path lengths are not guaranteed to do so, unless the base matrix S is itself a positive semi-definite matrix. In particular, $K = S^2$ is a valid kernel. To see this, assume that the $i$-th row of S denotes the feature map for $x_i$, i.e., $\phi(x_i) = S_i$. The kernel value between any two points is then a dot product in feature space
$$K(x_i, x_j) = S^{(2)}(x_i, x_j) = S_i^T S_j = \phi(x_i)^T \phi(x_j)$$
For a general path length $l$, let $K = S^l$. Consider the eigen-decomposition of S
$$S = U \Lambda U^T = \sum_{i=1}^{n} u_i \lambda_i u_i^T$$
where U is the orthogonal matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues of S
$$U = \begin{pmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{pmatrix}
\qquad
\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$
The eigen-decomposition of K can be obtained as follows
$$K = S^l = \big( U \Lambda U^T \big)^l = U \Lambda^l U^T$$
where we used the fact that eigenvectors of S and $S^l$ are identical, and further that eigenvalues of $S^l$ are given as $(\lambda_i)^l$ (for all $i = 1, \cdots, n$), where $\lambda_i$ is an eigenvalue of S. For $K = S^l$ to be a positive semi-definite matrix, all its eigenvalues must be non-negative, which is guaranteed for all even path lengths. Since $(\lambda_i)^l$ will be negative if $l$ is odd and $\lambda_i$ is negative, odd path lengths lead to a positive semi-definite kernel only if S is positive semi-definite.
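The even/odd behaviour can be checked numerically. The following sketch assumes a hypothetical base matrix (the adjacency matrix of a three-node path graph, which has a negative eigenvalue) and a helper `is_psd` introduced here only for illustration.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """True if all eigenvalues of the symmetric matrix K are at least -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Hypothetical base similarity: adjacency matrix of the path graph 1 - 2 - 3,
# whose eigenvalues are -sqrt(2), 0 and sqrt(2), so S itself is not PSD.
S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
lam = np.linalg.eigvalsh(S)

psd_by_length = {}
for l in range(1, 5):
    K = np.linalg.matrix_power(S, l)
    # The eigenvalues of S^l are the l-th powers of the eigenvalues of S.
    assert np.allclose(np.sort(lam ** l), np.linalg.eigvalsh(K))
    psd_by_length[l] = is_psd(K)

print(psd_by_length)  # {1: False, 2: True, 3: False, 4: True}
```

The small tolerance in `is_psd` guards against eigenvalues that are zero in exact arithmetic but come out slightly negative in floating point.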
Exponential Diffusion Kernel. Instead of fixing the path length a priori, we can obtain a new kernel between nodes of a graph by considering paths of all possible lengths, but by damping the contribution of longer paths, which leads to the exponential diffusion kernel, defined as
$$K = \sum_{l=0}^{\infty} \frac{1}{l!} \beta^l S^l = I + \beta S + \frac{1}{2!} \beta^2 S^2 + \frac{1}{3!} \beta^3 S^3 + \cdots = \exp\{\beta S\} \qquad (5.15)$$
where $\beta$ is a damping factor, and $\exp\{\beta S\}$ is the matrix exponential. The series on the right hand side above converges for all $\beta \geq 0$.
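Substituting $S = U \Lambda U^T = \sum_{i=1}^{n} u_i \lambda_i u_i^T$ in Eq. (5.15), and using $U^T U = I$ so that $S^l = U \Lambda^l U^T$, we have
$$K = \sum_{l=0}^{\infty} \frac{1}{l!} \beta^l U \Lambda^l U^T = U \exp\{\beta \Lambda\} U^T = \sum_{i=1}^{n} u_i \exp\{\beta \lambda_i\} u_i^T \qquad (5.16)$$
where $\exp\{\beta \Lambda\}$ is the diagonal matrix whose $i$-th diagonal entry is $\exp\{\beta \lambda_i\}$.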
Thus, the eigenvectors of K are the same as those for S, whereas its eigenvalues are given as exp{βλi}, where λi is an eigenvalue of S. Further, K is symmetric since S is symmetric, and its eigenvalues are real and non-negative, since the exponential of a real number is non-negative. K is thus a positive semi-definite kernel matrix.
The complexity of computing the diffusion kernel is $O(n^3)$, corresponding to the complexity of computing the eigen-decomposition.
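The eigen-decomposition route can be sketched with NumPy as follows. The function name `exp_diffusion_kernel` is an assumption made here; the graph is the one from Example 5.15, with S taken to be its adjacency matrix minus its degree matrix.

```python
import numpy as np

def exp_diffusion_kernel(S, beta):
    """K = exp{beta * S} for a symmetric S, computed as U exp{beta * Lambda} U^T."""
    lam, U = np.linalg.eigh(S)           # eigenvalues and orthonormal eigenvectors
    # Scale column i of U by exp(beta * lam[i]), then multiply by U^T.
    return (U * np.exp(beta * lam)) @ U.T

# Adjacency matrix of the graph in Example 5.15.
A = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]], dtype=float)
S = A - np.diag(A.sum(axis=1))           # adjacency minus degree matrix
K = exp_diffusion_kernel(S, beta=0.2)
print(np.round(K, 2))
```

`np.linalg.eigh` is used because S is symmetric; it returns real eigenvalues and orthonormal eigenvectors, and the resulting K is symmetric with strictly positive eigenvalues.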
Von Neumann Diffusion Kernel. A related kernel based on powers of S is the von Neumann diffusion kernel, defined as
$$K = \sum_{l=0}^{\infty} \beta^l S^l \qquad (5.17)$$
where β ≥ 0. Expanding the above, we have
$$\begin{aligned}
K &= I + \beta S + \beta^2 S^2 + \beta^3 S^3 + \cdots \\
  &= I + \beta S (I + \beta S + \beta^2 S^2 + \cdots) \\
  &= I + \beta S K
\end{aligned}$$
Rearranging the terms above we obtain a closed form expression for the von Neumann kernel
$$\begin{aligned}
K - \beta S K &= I \\
(I - \beta S) K &= I \\
K &= (I - \beta S)^{-1}
\end{aligned} \qquad (5.18)$$
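Plugging in the eigen-decomposition $S = U \Lambda U^T$, and rewriting $I = U U^T$, we have
$$\begin{aligned}
K &= \big( U U^T - U (\beta \Lambda) U^T \big)^{-1} \\
  &= \big( U (I - \beta \Lambda) U^T \big)^{-1} \\
  &= U (I - \beta \Lambda)^{-1} U^T
\end{aligned}$$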
where $(I - \beta \Lambda)^{-1}$ is the diagonal matrix whose $i$-th diagonal entry is $(1 - \beta \lambda_i)^{-1}$. The eigenvectors of K and S are identical, but the eigenvalues of K are given as $1/(1 - \beta \lambda_i)$. For K to be a positive semi-definite kernel, all its eigenvalues should be non-negative, which in turn implies that
$$(1 - \beta \lambda_i)^{-1} \geq 0 \implies 1 - \beta \lambda_i \geq 0 \implies \beta \lambda_i \leq 1$$
For $\beta \geq 0$ this constrains only the positive eigenvalues, for which it gives $\beta \leq 1/\lambda_i$.
Furthermore, the inverse matrix $(I - \beta \Lambda)^{-1}$ exists only if
$$\det(I - \beta \Lambda) = \prod_{i=1}^{n} (1 - \beta \lambda_i) \neq 0$$
which implies that $\beta \neq 1/\lambda_i$ for all $i$. Thus, for K to be a valid kernel, we require that $\beta < 1/\lambda_i$ for every positive eigenvalue $\lambda_i$. The von Neumann kernel is therefore guaranteed to be positive semi-definite if $|\beta| < 1/\rho(S)$, where $\rho(S) = \max_i \{|\lambda_i|\}$ is called the spectral radius of S, defined as the largest eigenvalue of S in absolute value.
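The closed form in Eq. (5.18), together with the spectral radius condition, can be sketched as follows. The function name and the choice to raise an error when the condition fails are assumptions of this illustration; the graph is the one from Example 5.15, with S taken to be its adjacency matrix minus its degree matrix.

```python
import numpy as np

def von_neumann_kernel(S, beta):
    """K = (I - beta * S)^{-1} for a symmetric S, assuming |beta| < 1 / rho(S)."""
    rho = np.max(np.abs(np.linalg.eigvalsh(S)))   # spectral radius of S
    if rho > 0 and abs(beta) >= 1.0 / rho:
        raise ValueError("need |beta| < 1 / rho(S)")
    n = S.shape[0]
    return np.linalg.inv(np.eye(n) - beta * S)

# Adjacency matrix of the graph in Example 5.15.
A = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]], dtype=float)
S = A - np.diag(A.sum(axis=1))                    # adjacency minus degree matrix
K = von_neumann_kernel(S, beta=0.2)
print(np.round(K, 2))
```

The check on `rho` enforces the sufficient condition stated above; it is stricter than necessary when all eigenvalues of S have the same sign.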
Figure 5.2: Graph Diffusion Kernel (an undirected graph on the nodes v1, v2, v3, v4, v5)
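Example 5.15: Consider the graph in Figure 5.2. Its adjacency matrix and degree matrix are given as
$$A = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0
\end{pmatrix}
\qquad
\Delta = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 2
\end{pmatrix}$$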
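The negated Laplacian matrix for the graph is therefore
$$S = A - \Delta = \begin{pmatrix}
-2 & 0 & 1 & 1 & 0 \\
0 & -2 & 1 & 0 & 1 \\
1 & 1 & -3 & 1 & 0 \\
1 & 0 & 1 & -3 & 1 \\
0 & 1 & 0 & 1 & -2
\end{pmatrix}$$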
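The eigenvalues of S are as follows
$$\lambda_1 = 0 \qquad \lambda_2 = -1.38 \qquad \lambda_3 = -2.38 \qquad \lambda_4 = -3.62 \qquad \lambda_5 = -4.62$$
and the eigenvectors of S are
$$U = \begin{pmatrix}
0.45 & -0.63 & 0.00 & 0.63 & 0.00 \\
0.45 & 0.51 & -0.60 & 0.20 & -0.37 \\
0.45 & -0.20 & -0.37 & -0.51 & 0.60 \\
0.45 & -0.20 & 0.37 & -0.51 & -0.60 \\
0.45 & 0.51 & 0.60 & 0.20 & 0.37
\end{pmatrix}$$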
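Assuming $\beta = 0.2$, the exponential diffusion kernel matrix is given as
$$K = \exp\{0.2\, S\} = \begin{pmatrix}
0.70 & 0.01 & 0.14 & 0.14 & 0.01 \\
0.01 & 0.70 & 0.13 & 0.03 & 0.14 \\
0.14 & 0.13 & 0.59 & 0.13 & 0.03 \\
0.14 & 0.03 & 0.13 & 0.59 & 0.13 \\
0.01 & 0.14 & 0.03 & 0.13 & 0.70
\end{pmatrix}$$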
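For the von Neumann diffusion kernel, we have
$$(I - 0.2 \Lambda)^{-1} = \begin{pmatrix}
1.00 & 0 & 0 & 0 & 0 \\
0 & 0.78 & 0 & 0 & 0 \\
0 & 0 & 0.68 & 0 & 0 \\
0 & 0 & 0 & 0.58 & 0 \\
0 & 0 & 0 & 0 & 0.52
\end{pmatrix}$$
where the $i$-th diagonal entry is $1/(1 - 0.2 \lambda_i)$. The von Neumann kernel is given as
$$K = U (I - 0.2 \Lambda)^{-1} U^T = \begin{pmatrix}
0.75 & 0.02 & 0.11 & 0.11 & 0.02 \\
0.02 & 0.74 & 0.10 & 0.03 & 0.11 \\
0.11 & 0.10 & 0.66 & 0.10 & 0.03 \\
0.11 & 0.03 & 0.10 & 0.66 & 0.10 \\
0.02 & 0.11 & 0.03 & 0.10 & 0.74
\end{pmatrix}$$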
5.5 Further Reading
Kernel methods have been extensively studied in machine learning and data mining.
For an in-depth introduction and more advanced topics see (Schölkopf and Smola, 2002) and (Shawe-Taylor and Cristianini, 2004). For applications of kernel methods in bioinformatics see (Schölkopf, Tsuda, and Vert, 2004).
Schölkopf, B. and Smola, A. J. (2002), Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA, USA: MIT Press.
Schölkopf, B., Tsuda, K., and Vert, J.-P. (2004), Kernel Methods in Computational Biology, Cambridge, MA, USA: MIT Press.
Shawe-Taylor, J. and Cristianini, N. (2004), Kernel Methods for Pattern Analysis, New York, NY, USA: Cambridge University Press.
5.6 Exercises
Q1. Prove that the dimensionality of the feature space for the inhomogeneous polynomial kernel of degree $q$ is
$$m = \binom{d + q}{q}$$
i     xi
x1    (4, 2.9)
x2    (2.5, 1)
x3    (3.5, 4)
x4    (2, 2.1)
Table 5.1: Dataset for Q2
Q2. Consider the data shown in Table 5.1. Assume the following kernel function: $K(x_i, x_j) = \|x_i - x_j\|^2$. Compute the kernel matrix K.
Q3. Show that eigenvectors of S and $S^l$ are identical, and further that eigenvalues of $S^l$ are given as $(\lambda_i)^l$ (for all $i = 1, \cdots, n$), where $\lambda_i$ is an eigenvalue of S, and S is some $n \times n$ symmetric similarity matrix.
Q4. The von Neumann diffusion kernel is a valid positive semi-definite kernel if $|\beta| < \frac{1}{\rho(S)}$, where $\rho(S)$ is the spectral radius of S. Can you derive better bounds for the cases when $\beta > 0$ and when $\beta < 0$?
Q5. Given the three points $x_1 = (2.5, 1)^T$, $x_2 = (3.5, 4)^T$, and $x_3 = (2, 2.1)^T$.
(a) Compute the kernel matrix for the Gaussian kernel assuming that $\sigma^2 = 5$.
(b) Compute the distance of the point φ(x1) from the mean in feature space.
(c) Compute the dominant eigenvector and eigenvalue for the kernel matrix above.