REVIEWS - REVISTA DE DIDÁCTICAS ESPECÍFICAS

As we have seen in the discussion about scree plots and their use in selecting the number of components, more often than not there is no ‘elbow’. Another popular way of choosing the number of PCs is the so-called 95 per cent rule. The idea is to pick the index κ such that the first κ PC scores contribute 95 per cent to total variance. Common to both ways of choosing the number of PCs is that they are easy, but both are ad hoc, and neither has a mathematical foundation.
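The 95 per cent rule is easy to state in code. The following is a minimal sketch, assuming the data are held in an n × d NumPy array with observations in rows; the function name `kappa_by_variance` and the simulated data are illustrative and do not come from the text.

```python
import numpy as np

def kappa_by_variance(X, threshold=0.95):
    """Smallest kappa whose first kappa PCs reach `threshold` of total variance.

    X is an n x d array with observations in rows (an assumed layout).
    """
    S = np.cov(X, rowvar=False)               # d x d sample covariance matrix
    eigvals = np.linalg.eigvalsh(S)[::-1]     # eigenvalues, largest first
    cum_share = np.cumsum(eigvals) / eigvals.sum()
    # first position at which the cumulative share reaches the threshold
    kappa = int(np.searchsorted(cum_share, threshold)) + 1
    return kappa, cum_share

# illustrative data: five independent variables with very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])
kappa, cum_share = kappa_by_variance(X)
```

The threshold is a free parameter here, which is exactly the ad hoc element noted above: nothing in the calculation singles out 0.95.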

In this section I explain the Probabilistic Principal Component Analysis of Tipping and Bishop (1999), which selects the number of components in a natural way, namely, by maximising the likelihood.

Tipping and Bishop (1999) consider a model-based framework for the population in which the d-dimensional random vector X satisfies

$$X = AZ + \mu + \epsilon, \tag{2.31}$$

where $Z$ is a $p$-dimensional random vector with $p \le d$, $A$ is a $d \times p$ matrix, $\mu \in \mathbb{R}^d$ is the mean of $X$, and $\epsilon$ is $d$-dimensional random noise. Further, $X$, $Z$ and $\epsilon$ satisfy G1 to G4.

G1: The vector $\epsilon$ is independent Gaussian random noise with $\epsilon \sim N(0, \sigma^2 I_{d\times d})$.

G2: The vector $Z$ is independent Gaussian with $Z \sim N(0, I_{p\times p})$.

G3: The vector $X$ is multivariate normal with $X \sim N(\mu, AA^T + \sigma^2 I_{d\times d})$.

G4: The conditional random vectors $X|Z$ and $Z|X$ are multivariate normal with means and covariance matrices derived from the means and covariance matrices of $X$, $Z$ and $\epsilon$.

In this framework, $A$, $p$ and $Z$ are unknown. Without further assumptions, we cannot find all these unknowns. From a PC perspective, model (2.31) is interpreted as the approximation of the centred vector $X - \mu$ by the first $p$ principal components, and $\epsilon$ is the error arising from this approximation. The vector $Z$ is referred to as the unknown source or the vector of latent variables. The latent variables are not of interest in their own right for this analysis; instead, the goal is to determine the dimension $p$ of $Z$. The term latent variables is used in Factor Analysis. We return to the results of Tipping and Bishop in Section 7.4.2 and in particular in Theorem 7.6.
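To make the model concrete, the following sketch simulates data from (2.31) under G1 and G2 and compares the sample covariance matrix with the covariance matrix stated in G3. All numerical values (the sizes, the noise level and the randomly drawn loading matrix and mean) are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n, sigma = 5, 2, 20000, 0.5       # illustrative values, not from the text

A = rng.normal(size=(d, p))             # hypothetical d x p matrix A
mu = rng.normal(size=d)                 # hypothetical mean vector

Z = rng.normal(size=(n, p))             # G2: Z ~ N(0, I_p), one row per observation
eps = sigma * rng.normal(size=(n, d))   # G1: eps ~ N(0, sigma^2 I_d)
X = Z @ A.T + mu + eps                  # model (2.31), applied row by row

C = A @ A.T + sigma**2 * np.eye(d)      # covariance matrix of X according to G3
S = np.cov(X, rowvar=False)             # sample covariance matrix of the simulated data

max_abs_err = np.max(np.abs(S - C))     # shrinks as n grows
```

In the simulation $A$, $p$ and $Z$ are known by construction; in the analysis of real data only $X$ is observed, which is why further assumptions are needed.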

2.8 Principal Component Analysis, the Number of Components and Regression

Principal Component Analysis requires no assumptions about the distribution of the random vector $X$ other than finite first and second moments. If Gaussian assumptions are made or known to hold, they invite the use of powerful techniques such as likelihood methods.

For data $X$ satisfying (2.31) and G1 to G4, Tipping and Bishop (1999) show the connection between the likelihood of the data and the principal component projections. The multivariate normal likelihood function is defined in (1.16) in Section 1.4. Tipping and Bishop’s key idea is that the dimension of the latent variables is the value $p$ which maximises the likelihood.

For details and properties of the likelihood function, see chapter 7 of Casella and Berger (2001).

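Theorem 2.26 [Tipping and Bishop (1999)] Assume that the data $X = [X_1\ X_2\ \cdots\ X_n]$ satisfy (2.31) and G1 to G4. Let $S$ be the sample covariance matrix of $X$, and let $\hat\Gamma\hat\Lambda\hat\Gamma^T$ be the spectral decomposition of $S$. Let $\hat\Lambda_p$ be the diagonal $p \times p$ matrix of the first $p$ eigenvalues $\hat\lambda_j$ of $S$, and let $\hat\Gamma_p$ be the $d \times p$ matrix of the first $p$ eigenvectors of $S$.

Put $\theta = (A, \mu, \sigma)$. Then the log-likelihood function of $\theta$ given $X$ is

$$\log L(\theta \mid X) = -\frac{n}{2}\left\{ d\log(2\pi) + \log[\det(C)] + \operatorname{tr}(C^{-1}S) \right\},$$

where $C = AA^T + \sigma^2 I_{d\times d}$, and $L$ is maximised by

$$\hat A_{(ML,p)} = \hat\Gamma_p\left(\hat\Lambda_p - \hat\sigma^2_{(ML,p)} I_{p\times p}\right)^{1/2}, \quad \hat\sigma^2_{(ML,p)} = \frac{1}{d-p}\sum_{j>p}\hat\lambda_j \quad\text{and}\quad \hat\mu = \overline{X}. \tag{2.32}$$

To find the number of latent variables, Tipping and Bishop propose to use

$$p^* = \operatorname{argmax}_p\, \log L(p).$$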

In practice, Tipping and Bishop find the maximiser $p^*$ of the likelihood as follows: Determine the principal components of $X$. For $p < d$, use the first $p$ principal components, and calculate $\hat\sigma^2_{(ML,p)}$ and $\hat A_{(ML,p)}$ as in (2.32). Put

$$C_p = \hat A_{(ML,p)}\hat A^T_{(ML,p)} + \hat\sigma^2_{(ML,p)} I_{d\times d}, \tag{2.33}$$

and use $C_p$ instead of $C$ in the calculation of the log likelihood. Once the log likelihood has been calculated for all values of $p$, find its maximiser $p^*$.
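The steps just described can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `ppca_loglik_profile`, the layout of the data (observations in rows) and the use of the covariance matrix with divisor n (so that the formula is the Gaussian log likelihood evaluated at the sample mean) are assumptions made here, and the data are simulated.

```python
import numpy as np

def ppca_loglik_profile(X):
    """Log likelihood at the estimates (2.32) for p = 1, ..., d-1.

    X is an n x d array with observations in rows (an assumed layout).
    Returns (loglik, p_star), where loglik[p-1] is the value for p.
    """
    n, d = X.shape
    S = np.cov(X, rowvar=False, bias=True)       # covariance with divisor n (assumption)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]            # sort eigenpairs, largest eigenvalue first
    lam, Gamma = eigvals[order], eigvecs[:, order]

    loglik = np.empty(d - 1)
    for p in range(1, d):
        sigma2 = lam[p:].mean()                  # average of the discarded eigenvalues
        scale = np.sqrt(np.maximum(lam[:p] - sigma2, 0.0))
        A = Gamma[:, :p] * scale                 # Gamma_p (Lambda_p - sigma2 I)^(1/2)
        C_p = A @ A.T + sigma2 * np.eye(d)       # matrix C_p of (2.33)
        _, logdet = np.linalg.slogdet(C_p)
        trace_term = np.trace(np.linalg.solve(C_p, S))   # tr(C_p^{-1} S)
        loglik[p - 1] = -0.5 * n * (d * np.log(2 * np.pi) + logdet + trace_term)
    return loglik, int(np.argmax(loglik)) + 1

# illustrative data: a two-dimensional signal in six dimensions plus noise
rng = np.random.default_rng(2)
n, d = 400, 6
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d)) + 0.3 * rng.normal(size=(n, d))
loglik, p_star = ppca_loglik_profile(X)
```

In this sketch the plug-in value of the log likelihood cannot decrease as $p$ grows (Jensen's inequality applied to the discarded eigenvalues), so it is worth inspecting the whole profile `loglik` and not only `p_star`.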

How does the likelihood approach relate to PCs? The matrix $A$ in the model (2.31) is not specified. For fixed $p \le d$, the choice based on principal components is $A = \hat\Gamma_p\hat\Lambda_p^{1/2}$, whose left inverse is $\hat\Lambda_p^{-1/2}\hat\Gamma_p^T$. With this $A$ and ignoring the error term, the observations $X_i$ and the $p$-dimensional vectors $Z_i$ are related by

$$Z_i = \hat\Lambda_p^{-1/2}\hat\Gamma_p^T X_i \quad\text{for } i = 1,\dots,n.$$

From $W_i^{(p)} = \hat\Gamma_p^T(X_i - \overline{X})$, it follows that

$$Z_i = \hat\Lambda_p^{-1/2} W_i^{(p)} + \hat\Lambda_p^{-1/2}\hat\Gamma_p^T\overline{X},$$

so the $Z_i$ are closely related to the PC scores. The extra term $\hat\Lambda_p^{-1/2}\hat\Gamma_p^T\overline{X}$ accounts for the lack of centring in Tipping and Bishop’s approach. The dimension which maximises the likelihood is therefore a natural candidate for the dimension of the latent variables.
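The relation between the $Z_i$ and the PC scores can be checked numerically. The sketch below uses simulated data with a non-zero mean; the sizes and variable names are illustrative assumptions, and rows of the arrays hold the transposed vectors $X_i$, $W_i^{(p)}$ and $Z_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, p = 200, 4, 2                          # illustrative sizes
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) + 5.0   # rows are observations

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1][:p]        # indices of the p largest eigenvalues
Lam_p, Gamma_p = eigvals[order], eigvecs[:, order]

xbar = X.mean(axis=0)
W = (X - xbar) @ Gamma_p                     # PC scores W_i^(p), one row per observation
Z = (X @ Gamma_p) / np.sqrt(Lam_p)           # Z_i = Lambda_p^(-1/2) Gamma_p^T X_i

# scaled PC scores plus a constant shift coming from the sample mean
Z_from_scores = W / np.sqrt(Lam_p) + (xbar @ Gamma_p) / np.sqrt(Lam_p)
```

The two arrays `Z` and `Z_from_scores` agree up to rounding, and the sample covariance matrix of the rows of `Z` is the $p \times p$ identity, in line with G2.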

Our next result, which follows from Theorem 2.26, explicitly shows the relationship between $C_p$ of (2.33) and $S$.

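Corollary 2.27 Assume $X = [X_1\ X_2\ \cdots\ X_n]$ and $S$ satisfy the assumptions of Theorem 2.26. Let $\hat A_{(ML,p)}$ and $\hat\sigma_{(ML,p)}$ be the maximisers of the likelihood as in (2.32). Then, for $p \le d$,

$$\hat\Gamma^T C_p \hat\Gamma = \begin{pmatrix} \hat\Lambda_p & 0_{p\times(d-p)} \\ 0_{(d-p)\times p} & \hat\sigma^2_{(ML,p)} I_{(d-p)\times(d-p)} \end{pmatrix}.$$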

The proof of the corollary is deferred to the Problems at the end of Part I. The corollary states that $S$ and $C_p$ agree in their upper $p \times p$ submatrices. In the lower right part of $C_p$, the eigenvalues of $S$ have been replaced by sums; thus $C_p$ could be regarded as a stabilised or robust form of $S$.
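A short sketch of why $S$ and $C_p$ are related in this way uses only (2.32) and (2.33). It writes $\hat\sigma^2$ for $\hat\sigma^2_{(ML,p)}$ and relies on the orthogonality of the eigenvector matrix $\hat\Gamma$ of $S$, so that $\hat\Gamma^T\hat\Gamma_p$ consists of $I_{p\times p}$ stacked on a zero block. It is meant as an outline of the argument, not as a full proof.

```latex
\begin{align*}
C_p &= \hat\Gamma_p\bigl(\hat\Lambda_p - \hat\sigma^2 I_{p\times p}\bigr)\hat\Gamma_p^T
       + \hat\sigma^2 I_{d\times d}, \\
\hat\Gamma^T C_p\,\hat\Gamma
    &= \begin{pmatrix} \hat\Lambda_p - \hat\sigma^2 I_{p\times p} & 0 \\ 0 & 0 \end{pmatrix}
       + \hat\sigma^2 I_{d\times d}
     = \begin{pmatrix} \hat\Lambda_p & 0 \\ 0 & \hat\sigma^2 I_{(d-p)\times(d-p)} \end{pmatrix}, \\
\hat\Gamma^T S\,\hat\Gamma &= \hat\Lambda .
\end{align*}
```

Comparing the last two lines, the first $p$ diagonal entries coincide, and each of the remaining $d-p$ eigenvalues $\hat\lambda_j$ of $S$ is replaced by their average $\hat\sigma^2 = \frac{1}{d-p}\sum_{j>p}\hat\lambda_j$.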

A similar framework to that of Tipping and Bishop (1999) has been employed by Minka (2000), who integrates Bayesian ideas and derives a rigorous expression for an approximate likelihood function.

We consider two data sets and calculate the number of principal components with the method of Tipping and Bishop (1999). One of the data sets could be approximately Gaussian, whereas the other is not.

Example 2.17 We continue with the seven-dimensional abalone data. The first eigenvalue contributes more than 97 per cent to total variance, and the PC₁ data (not shown) look reasonably normal, so Tipping and Bishop’s assumptions apply.

In the analysis I show results for the raw data only; however, I use all 4,177 observations, as well as subsets of the data. In each case I calculate the log likelihood as a function of the index $p$ at the maximiser $\hat\theta_p = (\hat A_{(ML,p)}, \hat\mu, \hat\sigma_{(ML,p)})$ as in Theorem 2.26 and (2.33).

Figure 2.20 shows the results with $p \le 6$ on the x-axis. Going from the left to the right panels, we look at the results of all 4,177 observations, then at the first 1,000, then at the first 100, and finally at observations 1,001 to 1,100. The range of the log likelihood (on the y-axes) varies as a consequence of the different values of $n$ and actual observations.

In all four plots, the maximiser is $p^* = 1$. One might think that this is a consequence of the high contribution to variance of the first PC. A similar analysis based on the scaled data also leads to $p^* = 1$ as the optimal number of principal components. These results suggest that because the variables are highly correlated, and because PC₁ contributes 90 per cent (for the scaled data) or more to variance, PC₁ captures the essence of the likelihood, and it therefore suffices to represent the data by their first principal component scores.

Figure 2.20 Maximum log likelihood versus index of PCs on the x-axis for the abalone data of Example 2.17. All 4,177 observations are used in the left panel, the first 1,000 are used in the second panel, and 100 observations each are used in the two right panels.

Table 2.7 Indices $p^*$ for the Number of PCs of the Raw and Scaled Breast Cancer Data from Example 2.18

                Raw Data                     Scaled Data
Observations    1:569    1:300    270:569    1:569    1:300    270:569
p*              29       29       29         1        1        3

The abalone data set has a relatively small number of dimensions and is approximately normal. The method of Tipping and Bishop (1999) has produced appropriate results here. In the next example we return to the thirty-dimensional breast cancer data.

Example 2.18 We calculate the dimension $p^*$ for the breast cancer data. The top right panel of Figure 2.8 shows that the density estimate of the scaled PC₁ scores has one large and one small mode and is right-skewed, so the PC₁ data deviate considerably from the normal distribution. The scree plot of Figure 2.5 does not provide any information about a suitable number of PCs. Figure 2.10, which compares the raw and scaled data, shows the three large variables in the raw data.

We calculate the maximum log likelihood for $1 \le p \le 29$, as described in Example 2.17, for the scaled and raw data and for the whole sample as well as two subsets of the sample, namely, the first 300 observations and the last 300 observations, respectively. Table 2.7 shows the index $p^*$ for each set of observations.

In contrast to the preceding example, the raw data select $p^* = 29$ for the full sample and the two subsamples, whereas the scaled data select $p^* = 1$ for the full sample and the first 300 observations and $p^* = 3$ for the last 300 observations. Neither choice is convincing, which strongly suggests that Tipping and Bishop’s method is not appropriate here.

In the last example we used non-normal data. In this case, the raw and scaled data lead to completely different best dimensions – the largest and smallest possible value of p. Although it is not surprising to obtain different values for the ‘best’ dimensions, these extremes could be due to the strong deviation from normality of the data. The results also indicate that other methods of dimension selection need to be explored which are less dependent on the distribution of the data.
