In this section, we build on the work by Golub and van Loan (1996) and propose an iterative approach for solving the optimization problem in Eq. (2.12) that has a quadratic runtime cost per iteration (with respect to the number of instances). The approach is based on the conjugate gradient descent method (Golub and van Loan, 1996) for solving linear systems of equations defined by symmetric and positive definite matrices. First, we describe (in Section 2.5.1.1) a procedure for an approximate computation of the smallest value of the Lagrange multiplier satisfying the stationary constraints from Eq. (2.16). The procedure is based on the conjugate gradient descent method and has a quadratic runtime cost in the number of instances. For the optimal value of the Lagrange multiplier, the optimal solution to the problem in Eq. (2.12) is the solution of the following linear system
$$ (S - \mu_{\min} I)\, z = b . \tag{2.24} $$
As discussed in Section 2.4, the matrix $S$ is symmetric and $\mu_{\min} < \sigma_n \leq \sigma_{n-1} \leq \cdots \leq \sigma_1$. From here it then follows that the matrix $P = S - \mu_{\min} I$ is symmetric and positive definite, because its eigenvalues $\sigma_i - \mu_{\min}$ are all positive. Hence, we can apply the conjugate gradient descent method (Section 10.2, Golub and van Loan, 1996) to iteratively solve this system with a quadratic cost per iteration. In Section 2.5.1.2, we provide a brief review of this method and present a theoretical guarantee on the quality of the solution obtained in this way. In our review of the approach, we closely follow the exposition by Golub and van Loan (Chapter 10, 1996).
2.5.1.1 Iterative Computation of the Lagrange Multiplier
In this section, we propose a means to approximate the optimal Lagrange multiplier (defining the linear system in Eq. 2.24) in large scale problems. In order to compute the multiplier, we first need to derive the open interval containing this root of the secular equation. As shown in Section 2.4.2, the optimal multiplier lies in the open interval determined by the smallest eigenvalue of the matrix $S$. To obtain the smallest eigenvalue of the matrix $S$, we propose to use the power iteration algorithm (Golub and van Loan, 1996), which has a quadratic runtime cost per iteration. However, as we need the smallest eigenvalue and the power iteration algorithm computes the largest one, we apply the algorithm to the matrix $-S$.
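The eigenvalue computation described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this work. The text applies power iteration to $-S$; because power iteration converges to the eigenvalue of largest magnitude, the sketch additionally shifts the matrix by a constant `c` (an assumption introduced here, chosen as the maximum absolute row sum) so that the dominant eigenvalue of `c*I - S` is guaranteed to correspond to the smallest eigenvalue of `S`. The names `matvec`, `power_iteration`, and `smallest_eigenvalue` are hypothetical.

```python
import math
import random


def matvec(A, x):
    """Dense matrix-vector product; costs O(n^2) for an n x n matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def power_iteration(apply_op, n, iters=1000, tol=1e-13, seed=0):
    """Estimate the eigenvalue of largest magnitude of a symmetric operator."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    eig = math.inf
    for _ in range(iters):
        y = apply_op(x)  # one matrix-vector product per iteration
        new_eig = sum(a * b for a, b in zip(x, y))  # Rayleigh quotient x^T A x
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:  # x was mapped to zero by the operator
            return 0.0
        x = [v / norm for v in y]
        if abs(new_eig - eig) <= tol * max(1.0, abs(new_eig)):
            return new_eig
        eig = new_eig
    return eig


def smallest_eigenvalue(S):
    """Smallest eigenvalue of a symmetric matrix S via shifted power iteration."""
    n = len(S)
    # Assumption (not in the text): c bounds the spectral radius of S, so
    # c*I - S is positive semi-definite with dominant eigenvalue c - sigma_n.
    c = max(sum(abs(a) for a in row) for row in S)

    def shifted(x):
        Sx = matvec(S, x)
        return [c * xi - si for xi, si in zip(x, Sx)]

    return c - power_iteration(shifted, n)
```

Each iteration performs a single matrix-vector product, which is where the quadratic cost per iteration comes from. Without the shift, the iteration on $-S$ recovers the smallest eigenvalue of $S$ only when that eigenvalue is also the one of largest magnitude.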
Having computed the smallest eigenvalue of the matrix S, we have determined the interval of the secular root corresponding to the optimal Lagrange multiplier. In order to compute this multiplier we form a slightly different version of the secular equation,
$$ g(\mu) = z^\top (S - \mu I)^{-2} z - R^2 . $$
In our empirical evaluations (Section 2.9), the iterative algorithm described in Section 2.4.3 proved to be very fast and always converged to machine precision in a few iterations. To apply this algorithm with the conjugate gradient descent method and without an eigendecomposition of $S$, we need to be able to derive the coefficients, $p_t$ and $q_t$ ($t > 0$), of the surrogate quadratic function (see Section 2.4.3). For this, we need to be able to evaluate the secular equation and its derivative at any iteration. The first is simple to achieve using the conjugate gradient descent algorithm from the previous section. In particular, for the derivative of the secular equation at an estimate $\mu_t$ of $\mu_{\min}$ we have
$$ g'(\mu_t) = 2\, z_{\mu_t}^\top (S - \mu_t I)^{-1} z_{\mu_t} , $$
where $z_{\mu_t}$ is the solution of the linear system $P_{\mu_t} z = b$ with $P_{\mu_t} = S - \mu_t I$, obtained using the conjugate gradient descent method. Thus, by applying the conjugate gradient descent method one more time to solve the linear system $P_{\mu_t} \hat{z} = z_{\mu_t}$, one obtains the derivative of the secular equation at $\mu_t$ as $g'(\mu_t) = 2\, z_{\mu_t}^\top \hat{z}$. The described procedure has a quadratic runtime complexity stemming from the cost per iteration of the conjugate gradient descent method. Hence, for low-rank kernel matrices (or matrices with a fast decaying spectrum) we can use this approach to compute an approximation of the optimal multiplier for problem (2.12) in $O(n^2)$ time.
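The two linear solves described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the secular function is assumed to be evaluated as $\|z_{\mu}\|^2 - R^2$ with $z_{\mu}$ the solution of $(S - \mu I) z = b$, which is the form whose derivative matches the expression above. The helper names `cg_solve` and `secular_value_and_derivative` are hypothetical, and the compact conjugate gradient routine is the textbook variant.

```python
def matvec(A, x):
    """Dense matrix-vector product; costs O(n^2) for an n x n matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cg_solve(apply_P, b, tol=1e-12, max_iter=None):
    """Solve P z = b for symmetric positive definite P, given only x -> P x."""
    n = len(b)
    z = [0.0] * n
    r = list(b)               # residual for the starting point z_0 = 0
    g = list(r)               # first search direction
    rr = dot(r, r)
    stop = (tol ** 2) * rr    # relative tolerance on the squared residual norm
    for _ in range(max_iter if max_iter is not None else 10 * n):
        if rr <= stop:
            break
        Pg = apply_P(g)       # one matrix-vector product per iteration
        tau = rr / dot(g, Pg)
        z = [zi + tau * gi for zi, gi in zip(z, g)]
        r = [ri - tau * yi for ri, yi in zip(r, Pg)]
        rr_new = dot(r, r)
        g = [ri + (rr_new / rr) * gi for ri, gi in zip(r, g)]
        rr = rr_new
    return z


def secular_value_and_derivative(S, b, mu, R):
    """Return (g(mu), g'(mu)) using two conjugate gradient solves."""
    def apply_P(x):           # x -> (S - mu I) x, requires mu < smallest eigenvalue
        return [si - mu * xi for si, xi in zip(matvec(S, x), x)]

    z_mu = cg_solve(apply_P, b)        # (S - mu I) z_mu = b
    z_hat = cg_solve(apply_P, z_mu)    # (S - mu I) z_hat = z_mu
    return dot(z_mu, z_mu) - R * R, 2.0 * dot(z_mu, z_hat)
```

No eigendecomposition of $S$ is formed at any point; the matrix is touched only through matrix-vector products.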
2.5.1.2 Conjugate Gradient Descent
This section reviews the conjugate gradient descent approach (Chapter 10, Golub and van Loan, 1996) in the context of Section 2.4 and the linear system in Eq. (2.24). The approach is based on the observation that solving the linear system, $P z = b$, is equivalent to minimizing the quadratic form
$$ \Phi(z) = \frac{1}{2} z^\top P z - b^\top z . $$
The fact that $P$ is a symmetric and positive definite matrix implies that the minimal value of $\Phi(z)$ is attained by setting $z = P^{-1} b$. Thus, the simplest iterative method for solving the linear system in Eq. (2.24) is the gradient descent approach. The negative gradient of the quadratic form at step $t$ is given by the residual at that step, i.e.,
$$ r_t = b - P z_t = -\nabla \Phi(z_t) . $$
If the residual vector is non-zero then there exists a positive constant $\tau \in \mathbb{R}_+$ such that $z_{t+1} = z_t + \tau r_t$ and $\Phi(z_{t+1}) < \Phi(z_t)$. While simple and easy to implement, the gradient descent method can be inefficient when the condition number $\kappa(P) = (\sigma_1 - \mu_{\min}) / (\sigma_n - \mu_{\min})$ is large.
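The effect of the condition number on gradient descent can be illustrated with a small sketch. The text only states that a suitable step size $\tau$ exists; the sketch assumes the exact line search step $\tau = r_t^\top r_t / r_t^\top P r_t$, which minimizes $\Phi$ along the residual direction. The function name `gradient_descent` and the two diagonal test matrices are made up for this illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product; costs O(n^2) for an n x n matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gradient_descent(P, b, tol=1e-8, max_iter=100000):
    """Minimize Phi(z) = 0.5 z^T P z - b^T z by stepping along the residual."""
    z = [0.0] * len(b)
    for t in range(max_iter):
        r = [bi - yi for bi, yi in zip(b, matvec(P, z))]  # r_t = b - P z_t
        rr = dot(r, r)
        if rr ** 0.5 <= tol:
            return z, t
        # Assumption: exact line search along r_t (not specified in the text).
        tau = rr / dot(r, matvec(P, r))
        z = [zi + tau * ri for zi, ri in zip(z, r)]
    return z, max_iter


well_conditioned = [[1.0, 0.0], [0.0, 2.0]]    # condition number 2
ill_conditioned = [[1.0, 0.0], [0.0, 100.0]]   # condition number 100
_, steps_well = gradient_descent(well_conditioned, [1.0, 1.0])
_, steps_ill = gradient_descent(ill_conditioned, [1.0, 1.0])
print("iterations:", steps_well, "(kappa = 2) vs", steps_ill, "(kappa = 100)")
```

On these two systems the ill-conditioned one needs many more iterations to reach the same residual tolerance, which is the inefficiency the conjugate gradient descent method is designed to remove.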
To avoid this issue, the conjugate gradient descent method minimizes the quadratic form $\Phi(z)$ along a set of linearly independent directions $\{g_i\}_{i=1}^{t}$ that do not necessarily correspond to residuals $\{r_i\}_{i=1}^{t}$, with $t = 1, 2, \ldots, n$. The convergence is guaranteed in at most $n$ steps because that is the dimension of the problem and a solution can be written as a linear combination of at most $n$ linearly independent vectors. Similar to Golub and van Loan (1996), let us first consider the choice of a direction $g_t$. For this purpose, let us now take (we subsequently show that this can always be done)
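$$ z_t = z_0 + G_{t-1} \xi + \tau g_t , $$
where $G_{t-1}$ is a matrix with columns $\{g_i\}_{i=1}^{t-1}$, $\xi \in \mathbb{R}^{t-1}$, and $\tau \in \mathbb{R}$. Then, we have that
$$ \begin{aligned} \Phi(z_t) &= \Phi(z_0 + G_{t-1} \xi + \tau g_t) \\ &= \Phi(z_0 + G_{t-1} \xi) + \tau\, \xi^\top G_{t-1}^\top P g_t + \frac{\tau^2}{2} g_t^\top P g_t + \tau\, g_t^\top (P z_0 - b) \\ &= \Phi(z_0 + G_{t-1} \xi) + \tau\, \xi^\top G_{t-1}^\top P g_t + \frac{\tau^2}{2} g_t^\top P g_t - \tau\, g_t^\top r_0 . \end{aligned} $$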
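If $g_t \perp \operatorname{span}(\{P g_1, \ldots, P g_{t-1}\})$ then $\xi^\top G_{t-1}^\top P g_t = 0$ and the search for $z_t$ splits into two independent optimization problems,
$$ \begin{aligned} \min_{z \in z_0 + \operatorname{span}(\{g_1, \ldots, g_t\})} \Phi(z) &= \min_{\xi \in \mathbb{R}^{t-1},\, \tau \in \mathbb{R}} \Phi(z_0 + G_{t-1} \xi + \tau g_t) \\ &= \min_{\xi \in \mathbb{R}^{t-1},\, \tau \in \mathbb{R}} \left( \Phi(z_0 + G_{t-1} \xi) + \frac{\tau^2}{2} g_t^\top P g_t - \tau\, g_t^\top r_0 \right) \\ &= \min_{\xi \in \mathbb{R}^{t-1}} \Phi(z_0 + G_{t-1} \xi) + \min_{\tau \in \mathbb{R}} \left( \frac{\tau^2}{2} g_t^\top P g_t - \tau\, g_t^\top r_0 \right) . \end{aligned} $$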
From here it then follows that the solution to the first optimization problem minimizes the quadratic form over $z_0 + \operatorname{span}(\{g_1, \ldots, g_{t-1}\})$. On the other hand, the optimal solution to the second problem is
$$ \tau_t = \frac{g_t^\top r_0}{g_t^\top P g_t} . $$
Moreover, the fact that $g_t \perp \operatorname{span}(\{P g_1, \ldots, P g_{t-1}\})$ implies
$$ g_t^\top r_{t-1} = -g_t^\top (P z_{t-1} - b) = -g_t^\top (P z_0 + P G_{t-1} \xi - b) = g_t^\top r_0 . $$
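As a brief worked step, the optimal step size for the second problem follows from setting the derivative of the one-dimensional quadratic in $\tau$ to zero. The symbol $q$ is introduced only for this illustration and does not appear elsewhere in the text.

```latex
q(\tau) = \frac{\tau^2}{2}\, g_t^\top P g_t - \tau\, g_t^\top r_0 ,
\qquad
q'(\tau) = \tau\, g_t^\top P g_t - g_t^\top r_0 = 0
\;\Longrightarrow\;
\tau_t = \frac{g_t^\top r_0}{g_t^\top P g_t} ,
\qquad
q(\tau_t) = -\frac{(g_t^\top r_0)^2}{2\, g_t^\top P g_t} .
```

Since $P$ is positive definite, $q''(\tau) = g_t^\top P g_t > 0$ for any non-zero $g_t$, so $\tau_t$ is the unique minimizer. The minimal value $q(\tau_t)$ is strictly negative exactly when $g_t^\top r_0 \neq 0$, which is when the step along $g_t$ strictly decreases $\Phi$.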
Thus, direction $g_t$ should be chosen so that $g_t \perp \operatorname{span}(\{P g_1, \ldots, P g_{t-1}\})$ and $g_t^\top r_{t-1} \neq 0$. In Golub and van Loan (Section 10.2, 1996), the authors show that such conjugate directions can always be selected by setting
$$ g_t = r_{t-1} + \pi_t g_{t-1} . $$
Multiplying the latter equation with $g_{t-1}^\top P$ from the left and using the fact that the vectors $P g_{t-1}$ and $g_t$ are mutually orthogonal we obtain that
$$ \pi_t = -\frac{g_{t-1}^\top P r_{t-1}}{g_{t-1}^\top P g_{t-1}} . $$
Hence, the conjugate gradient descent can be performed by setting
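$$ z_t = z_{t-1} + \tau_t g_t = z_{t-1} + \frac{g_t^\top r_0}{g_t^\top P g_t} \left( r_{t-1} + \pi_t g_{t-1} \right) = z_{t-1} + \frac{g_t^\top r_{t-1}}{g_t^\top P g_t} \left( r_{t-1} - \frac{g_{t-1}^\top P r_{t-1}}{g_{t-1}^\top P g_{t-1}}\, g_{t-1} \right) . $$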
The conjugate gradient descent iteration in this form requires three matrix-vector multiplications. This is computationally inefficient and it can be improved by observing that
$$ r_{t-1} = b - P z_{t-1} = b - P (z_{t-2} + \tau_{t-1} g_{t-1}) = r_{t-2} - \tau_{t-1} P g_{t-1} . $$
From here it then follows that
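$$ \|r_{t-1}\|^2 = r_{t-1}^\top r_{t-1} = r_{t-1}^\top r_{t-2} - \tau_{t-1}\, r_{t-1}^\top P g_{t-1} . $$
Noting that $r_{t-1}^\top r_{t-2} = 0$ (e.g., see Theorem 10.2.3 in Golub and van Loan, 1996) we get
$$ \|r_{t-1}\|^2 = -\tau_{t-1}\, r_{t-1}^\top P g_{t-1} . $$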
On the other hand, from the definition of $\tau_{t-1}$ it follows that
$$ g_{t-1}^\top r_{t-2} = g_{t-1}^\top r_0 = \tau_{t-1}\, g_{t-1}^\top P g_{t-1} . $$
The latter expression implies that we can express $\pi_t$ as
$$ \pi_t = \frac{\|r_{t-1}\|^2}{g_{t-1}^\top r_{t-2}} . $$
Hence, we can now give a conjugate gradient descent iteration that requires only one matrix-vector multiplication,
$$ z_t = z_{t-1} + \frac{g_t^\top r_{t-1}}{g_t^\top P g_t} \left( r_{t-1} + \frac{\|r_{t-1}\|^2}{g_{t-1}^\top r_{t-2}}\, g_{t-1} \right) . $$
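The single matrix-vector form of the iteration can be sketched as follows. This is a minimal illustration that follows the expressions for $\tau_t$ and $\pi_t$ derived above. The function name `conjugate_gradient`, the stopping tolerance, the iteration cap, and the small test system are assumptions made for the example.

```python
def matvec(A, x):
    """Dense matrix-vector product; costs O(n^2) for an n x n matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(P, b, z0=None, tol=1e-10, max_iter=None):
    """Solve P z = b for symmetric positive definite P; returns (z, steps)."""
    n = len(b)
    z = list(z0) if z0 is not None else [0.0] * n
    r = [bi - yi for bi, yi in zip(b, matvec(P, z))]  # r_0 (one-off product)
    g = list(r)                                       # g_1 = r_0
    steps = 0
    # The cap 10 * n leaves slack for round-off; exact arithmetic needs n steps.
    for _ in range(max_iter if max_iter is not None else 10 * n):
        if dot(r, r) ** 0.5 <= tol:
            break
        Pg = matvec(P, g)          # the only matrix-vector product per iteration
        gr = dot(g, r)             # g_t^T r_{t-1}
        tau = gr / dot(g, Pg)      # tau_t = g_t^T r_{t-1} / g_t^T P g_t
        z = [zi + tau * gi for zi, gi in zip(z, g)]
        r = [ri - tau * yi for ri, yi in zip(r, Pg)]  # r_t = r_{t-1} - tau_t P g_t
        pi_next = dot(r, r) / gr   # pi_{t+1} = ||r_t||^2 / g_t^T r_{t-1}
        g = [ri + pi_next * gi for ri, gi in zip(r, g)]
        steps += 1
    return z, steps


P = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
z, steps = conjugate_gradient(P, b)
print(z, steps)
```

The residual is updated through the recurrence $r_t = r_{t-1} - \tau_t P g_t$, so the product $P g_t$ is reused and no further multiplication with $P$ is needed inside the loop.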
Having given an iterative solution that requires a single matrix-vector multiplication and, thus, has the quadratic runtime cost per iteration, we now review the theoretical properties of the method. First, we present a worst case bound on the approximation error of the approach expressed in terms of the number of iterations and condition number of the matrix defining the linear system in Eq. (2.24).
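Theorem 2.2. (Luenberger, 1973) Assume $P \in \mathbb{R}^{n \times n}$ is a symmetric and positive definite matrix and $b \in \mathbb{R}^n$. If the conjugate gradient descent method produces iterates $\{z_i\}$ and $\kappa = \kappa(P)$ then
$$ \|z^* - z_t\|_P \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^t \|z^* - z_0\|_P , $$
where $z^* = P^{-1} b$ and $\|z\|_P^2 = z^\top P z$.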
Corollary 2.3. The approximation error of the conjugate gradient descent method satisfies
$$ \|z_t - P^{-1} b\| \leq 2 \sqrt{\kappa} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^t \|z_0 - P^{-1} b\| . $$
Proof. This corollary is formulated as a self-study problem in Golub and van Loan (Problem 10.2.8, 1996). In order to show this claim, let us first observe that
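$$ \|z_t - P^{-1} b\|_P^2 = (z_t - P^{-1} b)^\top P\, (z_t - P^{-1} b) = \|P^{1/2} (z_t - P^{-1} b)\|^2 . $$
For the resulting expression, using the properties of the operator norm, we obtain
$$ \sqrt{\sigma_n - \mu_{\min}}\, \|z_t - P^{-1} b\| \leq \|P^{1/2} (z_t - P^{-1} b)\| \leq \sqrt{\sigma_1 - \mu_{\min}}\, \|z_t - P^{-1} b\| . $$
Hence, from Theorem 2.2 and the latter inequality (applied to both $z_t$ and $z_0$) it follows that
$$ \sqrt{\sigma_n - \mu_{\min}}\, \|z_t - P^{-1} b\| \leq \|z^* - z_t\|_P \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^t \|z^* - z_0\|_P \leq 2 \sqrt{\sigma_1 - \mu_{\min}} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^t \|z_0 - P^{-1} b\| . $$
Dividing both sides by $\sqrt{\sigma_n - \mu_{\min}}$ and using $\kappa = (\sigma_1 - \mu_{\min}) / (\sigma_n - \mu_{\min})$ gives the claimed bound.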
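The bound from Corollary 2.3 can be checked numerically on a small example. The sketch below is only a sanity check under stated assumptions: the matrix is taken to be diagonal so that its eigenvalues and condition number are known exactly, the eigenvalues and right-hand side are made up, and the helper name `cg_errors` is hypothetical. The step size uses the standard identity $g_t^\top r_{t-1} = \|r_{t-1}\|^2$.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cg_errors(eigs, b, steps):
    """Run CG on P = diag(eigs) from z_0 = 0; return ||z_t - P^{-1} b|| for each t."""
    z_star = [bi / d for bi, d in zip(b, eigs)]      # exact solution
    z = [0.0] * len(b)
    r = list(b)
    g = list(r)

    def error(v):
        return math.sqrt(sum((vi - si) ** 2 for vi, si in zip(v, z_star)))

    errors = [error(z)]
    for _ in range(steps):
        rr = dot(r, r)
        if rr > 0.0:                                 # skip once fully converged
            Pg = [d * gi for d, gi in zip(eigs, g)]  # product with a diagonal P
            tau = rr / dot(g, Pg)
            z = [zi + tau * gi for zi, gi in zip(z, g)]
            r = [ri - tau * yi for ri, yi in zip(r, Pg)]
            g = [ri + (dot(r, r) / rr) * gi for ri, gi in zip(r, g)]
        errors.append(error(z))
    return errors


eigs = [1.0, 2.0, 5.0, 10.0]                         # illustrative spectrum
kappa = max(eigs) / min(eigs)
rate = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
errors = cg_errors(eigs, [1.0] * len(eigs), steps=len(eigs))
bounds = [2.0 * math.sqrt(kappa) * rate ** t * errors[0] for t in range(len(errors))]
for t, (err, bound) in enumerate(zip(errors, bounds)):
    print(t, err, bound)
```

The printed errors stay below the corresponding bounds at every step. The bound is a worst-case guarantee, so the observed error can be much smaller, in particular once the number of steps reaches the number of distinct eigenvalues.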
From these two bounds, we conclude that the conjugate gradient descent method converges fast, i.e., in a small number of iterations, for well-conditioned matrices. Thus, for knowledge-based kernel principal component analysis with a well-conditioned matrix $P$ the approach can provide an efficient approximation of the optimal solution for the optimization of a quadratic form over a hypersphere of constant radius (described in Section 2.4). Besides these two results, Golub and van Loan (1996) give an upper bound on the number of required iterations for matrices that can be written as a sum of the identity and a low-rank matrix. The following theorem states that result more formally.
Theorem 2.4. (Golub and van Loan, 1996) Assume that $P = I + \bar{P} \in \mathbb{R}^{n \times n}$ is a symmetric and positive definite matrix and $\operatorname{rank}(\bar{P}) = r$. Then, the conjugate gradient descent method converges in at most $r + 1$ steps.
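Theorem 2.4 can be illustrated with a small numerical experiment. The sketch below builds a matrix of the form identity plus a rank-2 term in dimension 6 and counts conjugate gradient steps. The vectors in `U`, the right-hand side, the relative tolerance, and the function name `conjugate_gradient` are all made up for the example, and the routine is the textbook variant rather than the implementation used in this work.

```python
def matvec(A, x):
    """Dense matrix-vector product; costs O(n^2) for an n x n matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(P, b, tol=1e-8, max_iter=None):
    """Solve P z = b from z_0 = 0; returns (z, number of steps taken)."""
    n = len(b)
    z = [0.0] * n
    r = list(b)
    g = list(r)
    rr = dot(r, r)
    stop = (tol ** 2) * rr            # relative tolerance on ||r||^2
    steps = 0
    for _ in range(max_iter if max_iter is not None else 10 * n):
        if rr <= stop:
            break
        Pg = matvec(P, g)
        tau = rr / dot(g, Pg)
        z = [zi + tau * gi for zi, gi in zip(z, g)]
        r = [ri - tau * yi for ri, yi in zip(r, Pg)]
        rr_new = dot(r, r)
        g = [ri + (rr_new / rr) * gi for ri, gi in zip(r, g)]
        rr = rr_new
        steps += 1
    return z, steps


n = 6
# Two illustrative vectors; their outer products give a rank-2 matrix.
U = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]]
P = [[(1.0 if i == j else 0.0) + sum(u[i] * u[j] for u in U) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
z, steps = conjugate_gradient(P, b)
print("rank of low-rank part: 2, CG steps:", steps)
```

Here $P$ has only three distinct eigenvalues (the value 1 with multiplicity four and two larger ones), so the method reaches the tolerance within $r + 1 = 3$ steps even though the system has dimension 6.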
Thus, for low-rank kernel matrices the conjugate gradient descent method can provide an effective approximation of the optimal solution defining the knowledge-based kernel principal components. Having reviewed this approach and theoretical results giving insights into its effectiveness, we proceed to the next section where we derive knowledge-based kernel principal components using an approximate low-rank factorization of a kernel matrix.