CONCLUSIONES - REVISTA DE DIDÁCTICAS ESPECÍFICAS

We begin with summary statistics that are available from an analysis of the covariance matrix.

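Definition 2.4 Let $X \sim (\mu, \Sigma)$. Let $r$ be the rank of $\Sigma$, and for $k \le r$, let $\lambda_k$ be the eigenvalues of $\Sigma$. For $\kappa \le r$, let $W^{(\kappa)}$ be the $\kappa$th principal component vector. The proportion of total variance or the contribution to total variance explained by the $k$th principal component score $W_k$ is

\[
\frac{\lambda_k}{\sum_{j=1}^{r} \lambda_j} = \frac{\lambda_k}{\operatorname{tr}(\Sigma)}.
\]

The cumulative contribution to total variance of the $\kappa$-dimensional principal component vector $W^{(\kappa)}$ is

\[
\frac{\sum_{k=1}^{\kappa} \lambda_k}{\sum_{j=1}^{r} \lambda_j} = \frac{\sum_{k=1}^{\kappa} \lambda_k}{\operatorname{tr}(\Sigma)}.
\]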

A scree plot is a plot of the eigenvalues $\lambda_k$ against the index $k$. For data, the (sample) proportion of and contribution to total variance are defined analogously using the sample covariance matrix $S$ and its eigenvalues $\lambda_k$.
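The two ratios in the definition can be computed directly from a list of eigenvalues. The following is a minimal sketch, not a reference implementation: the function name `variance_contributions` and the eigenvalues in the usage lines are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
from itertools import accumulate


def variance_contributions(eigenvalues):
    """Return (proportions, cumulative) contributions to total variance."""
    # Sort in decreasing order, the usual convention for principal components.
    lam = sorted(eigenvalues, reverse=True)
    # The sum of the eigenvalues equals the trace of the covariance matrix.
    total = sum(lam)
    if total <= 0:
        raise ValueError("total variance must be positive")
    proportions = [value / total for value in lam]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative


# Hypothetical eigenvalues, for illustration only (not taken from any data set).
props, cum = variance_contributions([4.0, 2.0, 1.0, 1.0])
```

Plotting `props` against the index 1, 2, ... gives the standardised scree plot, and plotting `cum` against the same index gives the cumulative curve.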

It may be surprising to use the term variance in connection with the eigenvalues of $\Sigma$ or $S$. Theorems 2.5 and 2.6 establish the relationship between the eigenvalues $\lambda_k$ and the variance of the scores $W_k$ and thereby justify this terminology.
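A brief sketch of why the terminology is reasonable, under two assumptions that are not spelled out in this passage: the $k$th score is $W_k = \eta_k^{T}(X - \mu)$, and $\eta_k$ (a symbol introduced here for illustration) is the unit-length eigenvector of $\Sigma$ belonging to $\lambda_k$. The cited theorems give the precise statements.

```latex
\[
\operatorname{var}(W_k)
  = \operatorname{var}\bigl(\eta_k^{T}(X - \mu)\bigr)
  = \eta_k^{T} \Sigma\, \eta_k
  = \lambda_k\, \eta_k^{T} \eta_k
  = \lambda_k ,
\qquad
\sum_{k=1}^{r} \operatorname{var}(W_k) = \sum_{k=1}^{r} \lambda_k = \operatorname{tr}(\Sigma).
\]
```

Under these assumptions each eigenvalue is the variance of one score, and the trace is the total variance that the ratios in Definition 2.4 divide by.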

Table 2.4 Variables of the Swiss bank notes data from Example 2.5

1   Length of the bank notes
2   Height of the bank notes, measured on the left
3   Height of the bank notes, measured on the right
4   Distance of inner frame to the lower border
5   Distance of inner frame to the upper border
6   Length of the diagonal

Scree is the accumulation of rock fragments at the foot of a cliff or hillside and is derived from the Old Norse word skrītha, meaning a landslide, to slip or to slide; see Partridge (1982). Scree plots, or plots of the eigenvalues $\lambda_k$ against their index $k$, tell us about the distribution of the eigenvalues and, in light of Theorem 2.5, about the decrease in variance of the scores. Of particular interest is the ratio of the first eigenvalue to the trace of $\Sigma$ or $S$.

The actual size of the eigenvalues may not be important, so the proportion of total variance provides a convenient standardisation of the eigenvalues.

Scree plots may exhibit an elbow or a kink. Folklore has it that the index $\kappa$ at which an elbow appears is the number of principal components that adequately represent the data, and this $\kappa$ is interpreted as the dimension of the reduced or principal component data. However, the existence of elbows is not guaranteed. Indeed, as the dimension increases, elbows do not usually appear. Even if an elbow is visible, there is no real justification for using its index as the dimension of the PC data. The words knee or kink also appear in the literature instead of elbow.

Example 2.5 The Swiss bank notes data of Flury and Riedwyl (1988) contain six variables measured on 100 genuine and 100 counterfeit old Swiss 1,000-franc bank notes. The variables are shown in Table 2.4.

A first inspection of the data (which I do not show here) reveals that the values of the largest variable are 213 to 217 mm, whereas the smallest two variables (4 and 5) have values between 7 and 12 mm. Thus the largest variable is about twenty times bigger than the smallest.

Table 2.1 shows the eigenvalues and eigenvectors of the sample covariance matrix which is given in (2.5). The left panel of Figure 2.4 shows the size of the eigenvalues on the y-axis against their index on the x-axis. We note that the first eigenvalue is large compared with the second and later ones.

The lower curve in the right panel shows, on the y-axis, the contribution to total variance, that is, the standardised eigenvalues, and the upper curve shows the cumulative contribution to total variance, both as percentages, against the index on the x-axis. The largest eigenvalue contributes well over 60 per cent of the total variance, and this percentage may be more useful than the actual size of $\lambda_1$. In applications, I recommend using a combination of both curves as done here.

For these data, an elbow at the third eigenvalue is visible, which may lead to the conclusion that three PCs are required to represent the data. This elbow is visible in the lower curve in the right subplot but not in the cumulative upper curve.

Our second example looks at two thirty-dimensional data sets of very different origins.

Figure 2.4 Swiss bank notes of Example 2.5; eigenvalues (left) and simple and cumulative contributions to variance (right) against the number of PCs, given as percentages.

Figure 2.5 Scree plots (top) and cumulative contributions to total variance (bottom) for the breast cancer data (black dots) and the Dow Jones returns (red diamonds) of Example 2.6, in both cases against the index on the x-axis.

Example 2.6 The breast cancer data of Blake and Merz (1998) consist of 569 records and thirty variables. The Dow Jones returns consist of thirty stocks on 2,528 days over the period from January 1991 to January 2001. Of these, twenty-two stocks are still in the 2012 Dow Jones 30 Index.

The breast cancer data arise from two groups, 212 malignant and 357 benign cases, and for each record this status is known. We are not interested in this status here but focus on the sample covariance matrix.

The Dow Jones observations are the ‘daily returns’, the differences of log prices taken on consecutive days.
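As a small illustration of this definition, daily returns can be obtained from a price series as follows. This is a sketch only: the function name `log_returns` and the prices are hypothetical and are not taken from the Dow Jones data.

```python
import math


def log_returns(prices):
    """Differences of log prices taken on consecutive days."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    logs = [math.log(p) for p in prices]
    # One return per pair of consecutive days, so n prices give n - 1 returns.
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


# Hypothetical closing prices of a single stock over four days.
returns = log_returns([100.0, 102.0, 101.0, 101.0])
```

Applied to each of the thirty stocks in turn, a calculation of this kind yields one thirty-dimensional observation per day.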

Figure 2.5 shows the contributions to variance in the top panel and the cumulative contributions to variance in the lower panel against the index of the PCs on the x-axis. The curves with black dots correspond to the breast cancer data, and those with red diamonds correspond to the Dow Jones returns.

The eigenvalues of the breast cancer data decrease more quickly than those of the Dow Jones returns. For the breast cancer data, the first PC accounts for 44 per cent of total variance, and the second for 19 per cent. For k = 10, the cumulative contribution to variance amounts to more than 95 per cent, and at k = 17 to just over 99 per cent. This rapid increase suggests that principal components 18 and above may be negligible. For the Dow Jones returns, the first PC accounts for 25.5 per cent of variance, the second for 8 per cent, and the first ten PCs together for about 60 per cent; to achieve 95 per cent of variance, the first twenty-six PCs are required.
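Statements such as "the first twenty-six PCs are required to achieve 95 per cent" amount to finding the smallest index at which the cumulative contribution reaches a chosen level. A minimal sketch of that calculation follows; the function name `components_needed` and the eigenvalues are hypothetical and are not the eigenvalues of either data set.

```python
from itertools import accumulate


def components_needed(eigenvalues, threshold):
    """Smallest number of PCs whose cumulative contribution reaches `threshold`.

    `threshold` is a fraction of total variance, for example 0.95.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must lie in (0, 1]")
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    # Walk through the partial sums of the ordered eigenvalues.
    for kappa, partial in enumerate(accumulate(lam), start=1):
        if partial >= threshold * total:
            return kappa
    return len(lam)


# Hypothetical eigenvalues, for illustration only.
example = [5.0, 3.0, 1.0, 1.0]
k75 = components_needed(example, 0.75)
k95 = components_needed(example, 0.95)
```

A rule of this kind is itself one of the ad hoc schemes for choosing the number of PCs, since the level 0.95 is picked by the user.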

The two data sets have the same number of variables and share the lack of an elbow in their scree plots. The absence of an elbow is more common than its presence. Researchers and practitioners have used many different schemes for choosing the number of PCs to represent their data. We will explore two dimension-selection approaches in Sections 2.8.1 and 10.8, respectively, which are more objective than some of the available ad hoc methods for choosing the number of principal components.
