We use canonical correlation analysis as a main building block of our proposed algorithm and as an important factor in the domain-pair selection experiments and analysis. In the following sections, we review regularized CCA and large-scale CCA to provide a background for the remaining chapters in the dissertation.
2.3.1 Regularized CCA
Canonical correlation analysis (CCA) is a multivariate statistical model that studies the interrelationships among sets of multiple dependent variables and multiple independent variables. It is the most generalized member of the family of multivariate statistical techniques [24]. It is related to factor analysis in the sense that it creates composites of variables, and is related to discriminant analysis in finding independent dimensions for each variable set. The goal of this analysis is to produce the maximum correlation between the dimensions. As a result, canonical correlation finds the optimum structure or dimensionality of each variable set that maximizes the relationship between independent and dependent variable sets.
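In other words, if we have $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{p \times n}$, CCA finds two projection vectors $w_x \in \mathbb{R}^{m}$ and $w_y \in \mathbb{R}^{p}$ that maximize the correlation coefficient:
$$\rho = \frac{w_x^T X Y^T w_y}{\sqrt{(w_x^T X X^T w_x)(w_y^T Y Y^T w_y)}} \tag{2.1}$$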
Since Equation 2.1 is not affected by re-scaling of $w_x$ and $w_y$ (the multiplication of these vectors by a constant $\alpha$ does not change the value of $\rho$), we can maximize $\rho$ as follows.
$$\max_{w_x, w_y} \; w_x^T X Y^T w_y \quad \text{subject to} \quad w_x^T X X^T w_x = 1, \;\; w_y^T Y Y^T w_y = 1 \tag{2.2}$$
It can be shown that solving Equation 2.2 is equivalent to finding the eigenvectors associated with the largest eigenvalues of the generalized eigenvalue problem in Equation 2.3, in which $\eta$ is the eigenvalue that corresponds to the eigenvector $w_x$.
$$X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta \, X X^T w_x \tag{2.3}$$
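The equivalence can be sketched with the standard Lagrangian argument. The multipliers $\mu_x$ and $\mu_y$ are notation introduced here for illustration, and the sketch assumes that $Y Y^T$ is invertible and that the optimal correlation is nonzero.

$$
\begin{aligned}
\mathcal{L} &= w_x^T X Y^T w_y - \tfrac{\mu_x}{2}\,(w_x^T X X^T w_x - 1) - \tfrac{\mu_y}{2}\,(w_y^T Y Y^T w_y - 1) \\
\partial \mathcal{L} / \partial w_x = 0 &\;\Rightarrow\; X Y^T w_y = \mu_x \, X X^T w_x \\
\partial \mathcal{L} / \partial w_y = 0 &\;\Rightarrow\; Y X^T w_x = \mu_y \, Y Y^T w_y
\end{aligned}
$$

Left-multiplying the first condition by $w_x^T$ and the second by $w_y^T$, and using the two constraints, gives $\mu_x = \mu_y = w_x^T X Y^T w_y = \rho$. Solving the second condition for $w_y = \rho^{-1} (Y Y^T)^{-1} Y X^T w_x$ and substituting into the first yields $X Y^T (Y Y^T)^{-1} Y X^T w_x = \rho^2 \, X X^T w_x$, which is Equation 2.3 with $\eta = \rho^2$.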
To compute multiple projection vectors, we can solve the optimization problem in Equation 2.4, in which matrix $W$ consists of multiple projection vectors.
$$\max_{W} \; \operatorname{Trace}\!\left(W^T X Y^T (Y Y^T)^{-1} Y X^T W\right) \quad \text{subject to} \quad W^T X X^T W = I \tag{2.4}$$
To avoid over-fitting of $\rho$ and the singularity of $X X^T$, a regularization term $\lambda I$, with $\lambda > 0$, is added to $X X^T$ in Equation 2.3. The regularized CCA therefore solves the generalized eigenvalue problem in Equation 2.5.
$$X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta \, (X X^T + \lambda I) \, w_x \tag{2.5}$$
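As an illustration, Equation 2.5 can be solved directly for small matrices. The following is a minimal NumPy sketch, not the implementation used in this dissertation. The function name `regularized_cca`, the default `lam`, and the toy data are assumptions made for the example. It reduces the generalized problem to a standard symmetric one through a Cholesky factorization of $X X^T + \lambda I$, and it assumes that $Y Y^T$ is invertible.

```python
import numpy as np

def regularized_cca(X, Y, lam=1e-3, k=1):
    """Top-k solutions of Equation 2.5 for X (m x n) and Y (p x n).

    Columns are samples and are assumed to be mean-centered.
    Returns (eta, W), where W has shape (m, k).
    """
    m = X.shape[0]
    B = X @ X.T + lam * np.eye(m)               # XX^T + lambda*I (right-hand side)
    Cxy = X @ Y.T
    A = Cxy @ np.linalg.solve(Y @ Y.T, Cxy.T)   # XY^T (YY^T)^{-1} YX^T
    A = (A + A.T) / 2                           # remove round-off asymmetry

    # Reduce A w = eta B w to a standard symmetric problem using B = L L^T.
    L = np.linalg.cholesky(B)
    C = np.linalg.solve(L, np.linalg.solve(L, A).T).T   # L^{-1} A L^{-T}
    eta, V = np.linalg.eigh((C + C.T) / 2)

    top = np.argsort(eta)[::-1][:k]             # largest eigenvalues first
    W = np.linalg.solve(L.T, V[:, top])         # back-transform: w = L^{-T} v
    return eta[top], W

# Hypothetical toy data: Y depends linearly on X, plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
Y = rng.standard_normal((3, 5)) @ X + 0.5 * rng.standard_normal((3, 200))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

lam = 1e-2
eta, W = regularized_cca(X, Y, lam=lam, k=2)
```

With this normalization the returned columns satisfy $W^T (X X^T + \lambda I) W = I$, which is the regularized counterpart of the constraint in Equation 2.4.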
Sun et al. solve the regularized CCA problem by using a least squares formulation of it together with the Least Angle Regression algorithm [67].
Figure 2: L-CCA algorithm as presented in [44]
2.3.2 Large-Scale CCA
Computing CCA can be very resource-intensive, especially with traditional approaches that require QR decompositions or singular value decompositions of large data matrices. To avoid these time- and memory-consuming operations, Lu and Foster developed an iterative algorithm that can approximate CCA on very large datasets [44]. They provide an error analysis for the case of a finite number of iterations and prove that the algorithm converges to the exact CCA solution as the number of iterations goes to infinity.
This approach relies on LING, a gradient-based least squares algorithm that can work on large-scale matrices. As we have seen in the previous section, CCA can be computed as an iterative least squares problem. To compute CCA, L-CCA first projects one of the data matrices onto a small, randomly generated matrix to reduce its size. Then, a QR decomposition of this smaller matrix is calculated. After that, CCA is computed iteratively by applying LING, in each iteration, to the reduced-size QR decompositions of the original data matrices. After every run of LING, a QR decomposition is calculated for numerical stability. A summary of this algorithm, as presented in [44], is shown in Figure 2.
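The steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the L-CCA algorithm of [44]: an exact least squares solve (`np.linalg.lstsq`) is swapped in for LING, the names `project` and `iterative_cca` are hypothetical, and the data are stored with samples in rows.

```python
import numpy as np

def project(A, B):
    """Project the columns of B onto the column space of A (exact least squares)."""
    coef, *_ = np.linalg.lstsq(A, B, rcond=None)
    return A @ coef

def iterative_cca(X, Y, k, n_iter=30, seed=0):
    """X is (n x p1), Y is (n x p2). Returns orthonormal n x k bases Xr, Yr."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((X.shape[1], k))   # small random matrix
    Xr, _ = np.linalg.qr(X @ G)                # reduce X to k columns, then QR
    for _ in range(n_iter):
        Yr, _ = np.linalg.qr(project(Y, Xr))   # least squares step, then QR
        Xr, _ = np.linalg.qr(project(X, Yr))   # least squares step, then QR
    return Xr, Yr

# Hypothetical toy data with two strongly shared latent dimensions.
rng = np.random.default_rng(1)
Z = rng.standard_normal((500, 2))
X = np.hstack([Z + 0.1 * rng.standard_normal((500, 2)), rng.standard_normal((500, 4))])
Y = np.hstack([Z + 0.1 * rng.standard_normal((500, 2)), rng.standard_normal((500, 3))])
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

Xr, Yr = iterative_cca(X, Y, k=2)
approx = np.linalg.svd(Xr.T @ Yr, compute_uv=False)   # estimated correlations

# Reference values from full QR decompositions of both data matrices.
Qx, _ = np.linalg.qr(X)
Qy, _ = np.linalg.qr(Y)
exact = np.linalg.svd(Qx.T @ Qy, compute_uv=False)[:2]
```

In this sketch the singular values of `Xr.T @ Yr` approximate the top canonical correlations. How quickly they approach the reference values depends on the gap between the retained and the discarded correlations, which is large in this toy example.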
The LING algorithm relies on the intuition that the least squares projection of the response onto the independent variables can be divided (column-wise) into two smaller orthogonal components, one related to the top singular vectors of the data and the other to the bottom ones. It computes the first component using randomized SVD and the second using a gradient descent algorithm.
To be more specific, consider the least squares problem $Y = X\beta$. Then $X\beta^{*} = X (X^T X)^{-1} X^T Y$ is the projection of $Y$ onto the column space of $X$. Calculating $(X^T X)^{-1}$ takes a long time for a large $X \in \mathbb{R}^{n \times p}$. To calculate $X\beta^{*}$ without calculating $(X^T X)^{-1}$, Lu and Foster rely on splitting the singular vectors of $X$.
If $U_1$ contains the top $k_{pc}$ left singular vectors of $X$, and $U_2$ contains the remaining $p - k_{pc}$ left singular vectors, $X\beta^{*}$ can be divided into two orthogonal components as in Equation 2.6. They calculate the first term using randomized SVD, since $k_{pc} < p$. Let $Y_r = Y - U_1 U_1^T Y$. Then the second term can be calculated using gradient descent for $Y_r = X\beta_r$.
$$X\beta^{*} = U_1 U_1^T Y + U_2 U_2^T Y \tag{2.6}$$
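The decomposition in Equation 2.6 can be checked numerically with a small sketch. It is an illustration under stated assumptions rather than the LING implementation of [44]: an exact SVD is swapped in for the randomized SVD, the fixed step size (one over the squared largest singular value) is a conservative choice made here, and the name `ling_projection` and the toy data are hypothetical.

```python
import numpy as np

def ling_projection(X, Y, k_pc, n_iter=1000):
    """Approximate X @ beta_star for X (n x p) and Y (n x q)."""
    # Top k_pc left singular vectors (exact SVD swapped in for randomized SVD).
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U1 = U[:, :k_pc]
    top = U1 @ (U1.T @ Y)            # first term of Equation 2.6
    Yr = Y - top                     # residual left for the second term

    # Gradient descent on ||Yr - X beta_r||^2, starting from zero.
    beta_r = np.zeros((X.shape[1], Y.shape[1]))
    step = 1.0 / s[0] ** 2           # assumed safe step: 1 / sigma_max^2
    for _ in range(n_iter):
        grad = X.T @ (X @ beta_r - Yr)
        beta_r -= step * grad
    return top + X @ beta_r          # both terms of Equation 2.6

# Hypothetical toy data with two dominant directions.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
X[:, :2] *= 3.0
Y = rng.standard_normal((200, 3))

approx = ling_projection(X, Y, k_pc=2)
exact = X @ np.linalg.lstsq(X, Y, rcond=None)[0]   # direct least squares fit
```

Because $Y_r$ is orthogonal to $U_1$, the gradient descent iterates only have to fit the part of $Y$ that lies along the remaining singular vectors, which is the motivation for removing the top component first.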
Lu and Foster provide an error bound for the LING algorithm and, based on it, an error bound for L-CCA [44].
2.3.3 CCA in Recommender Systems
CCA has been used in the literature for single-domain recommenders with various resources, and to find the correlation between the content (such as text or images) of resources in cross-domain recommender systems. To the best of our knowledge, it has not yet been used in a pure, rating-based, cross-domain collaborative filtering setting. For example, Faridani used CCA to predict hotel ratings from textual comments about the hotels and their sentiment analysis [17]. Elkahky et al. use CCA as a baseline user modeling approach for their proposed recommendation system [15]. They provide content-based cross-domain recommendations in the domains of apps, news, movies, and TV shows using a multi-view deep learning model. Ohkushi used kernel CCA in a context-aware setting to find the relationship between music pieces and human motion in order to recommend music to users [51]. Yang et al. [79] proposed a feature learning algorithm that uses CCA to infer features of semantic information in the data. However, they have not yet applied their model to recommender systems.