THE WORM - INFORMATION PROCESSING, CHANCE AND PROBABILITY


Figure 2.7 3D score plots from the wine recognition data of Example 2.8 with different colours for the three cultivars. (left): W^(k), with axes PC₁, PC₂ and PC₃; (right): PC₁, PC₃ and PC₄.

The principal component projections P_•k are d × n matrices. For each index k ≤ d, the ith column of P_•k represents the contribution of the ith observation in the direction of η_k. It is convenient to display P_•k, separately for each k, in the form of parallel coordinate plots with the variable number on the x-axis. As we shall see in Theorem 2.12 and Corollary 2.14, the principal component projections ‘make up the data’ in the sense that we can reconstruct the data arbitrarily closely from these projections.
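As a minimal sketch of how such projections could be computed, the Python code below assumes that P_•k = η_k W_•k, where W_•k = η_kᵀ(X − X̄) is the row of kth scores of the centred data. The toy data and the helper name `pc_projection` are hypothetical and serve only as an illustration. The sketch also forms the sum of the projections over all k ≤ d, which should recover the centred data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: d = 4 variables (rows), n = 50 observations (columns).
d, n = 4, 50
X = rng.normal(size=(d, n)) * np.array([[3.0], [2.0], [1.0], [0.5]])

# Centre each variable and form the d x d sample covariance matrix.
X_centred = X - X.mean(axis=1, keepdims=True)
S = X_centred @ X_centred.T / (n - 1)

# Eigen-decomposition, reordered so that the largest eigenvalue comes first.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]


def pc_projection(k):
    """Return the d x n projection for the k-th PC (k is 1-based)."""
    eta_k = eigvecs[:, k - 1]          # k-th eigenvector, shape (d,)
    scores_k = eta_k @ X_centred       # k-th scores, shape (n,)
    return np.outer(eta_k, scores_k)   # assumed form eta_k * W_.k, shape (d, n)


P1 = pc_projection(1)

# Summing the projections over all k gives back the centred data,
# because the eigenvectors form an orthonormal basis.
reconstruction = sum(pc_projection(k) for k in range(1, d + 1))
```

Each column of `P1` is one curve in a parallel coordinate plot: the row index plays the role of the variable number on the x-axis.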

The kth principal component projections show how the eigenvector η_k has been modified by the kth scores W_•k, and it is therefore natural to look at the distribution of these scores, here in the form of density estimates. The shape of the density provides valuable information about the distribution of the scores. I use the MATLAB software curvdatSM of Marron (2008), which calculates non-parametric density estimates based on Gaussian kernels with suitably chosen bandwidths.
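The idea of a Gaussian kernel density estimate can be sketched in a few lines of Python. This is not the curvdatSM software and makes no claim about how that software chooses its bandwidth. The function name `gaussian_kde_1d`, the normal-reference rule of thumb used as the default bandwidth, and the simulated scores are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kde_1d(scores, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `scores` on `grid`."""
    scores = np.asarray(scores, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = scores.size
    if bandwidth is None:
        # Assumed default: normal-reference rule of thumb, 1.06 * sd * n^(-1/5).
        bandwidth = 1.06 * scores.std(ddof=1) * n ** (-1 / 5)
    # One standardised distance per (grid point, observation) pair.
    z = (grid[:, None] - scores[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels centred at the observations.
    return kernel.sum(axis=1) / (n * bandwidth)


# Hypothetical bimodal scores standing in for the scores of one PC.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 50)])

grid = np.linspace(scores.min() - 5.0, scores.max() + 5.0, 400)
density = gaussian_kde_1d(scores, grid)

# A density estimate should integrate to approximately one.
area = density.sum() * (grid[1] - grid[0])
```

A smaller bandwidth produces a rougher estimate with more local modes, and a larger one smooths them away, so the choice of bandwidth affects whether a shape such as bimodality is visible.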

Example 2.9 We continue with our parallel analysis of the breast cancer data and the Dow Jones returns. Both data sets have thirty variables, but the Dow Jones returns have about five times as many observations as the breast cancer data. We now explore parallel coordinate plots and estimates of density of the first and second principal component projections for both data sets.

The top two rows in Figure 2.8 refer to the breast cancer data, and the bottom two rows refer to the Dow Jones returns. The left column of the plots shows the principal component projections, and the right column shows density estimates of the scores. Rows one and three refer to PC₁, and rows two and four refer to PC₂.

In both data sets, all entries of the first eigenvector have the same sign; we can verify this in the projection plots in the first and third rows, where the curve of each observation is either positive for every variable or negative for every variable. This behaviour is unusual and could be exploited in a later analysis: it allows us to split the data into two groups, the positives and the negatives. No single variable stands out; the largest weight for both data sets is about 0.26. Example 6.9 of Section 6.5.1 looks at splits of the first PC scores for the breast cancer data. For the Dow Jones returns, a split into ‘positive’ and ‘negative’ days does not appear to lead to anything noteworthy.
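The link between a same-sign first eigenvector and curves that never cross zero can be checked numerically. The sketch below uses simulated data driven by one common factor, not the breast cancer or Dow Jones data, and assumes that the first projection has the form η₁ W_•1; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: d = 6 variables driven by one common factor,
# n = 300 observations, stored as a d x n matrix.
d, n = 6, 300
loadings = rng.uniform(0.8, 1.2, size=(d, 1))
factor = rng.normal(size=(1, n))
X = loadings * factor + 0.5 * rng.normal(size=(d, n))

X_centred = X - X.mean(axis=1, keepdims=True)
S = X_centred @ X_centred.T / (n - 1)

# eigh returns eigenvalues in ascending order, so the last column of the
# eigenvector matrix belongs to the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
eta1 = eigvecs[:, -1]
if eta1.sum() < 0:      # the sign of an eigenvector is arbitrary
    eta1 = -eta1

# Do all entries of the first eigenvector share one sign?
same_sign = bool(np.all(eta1 > 0))

# First scores and first projection (assumed form eta_1 * W_.1).
scores1 = eta1 @ X_centred
P1 = np.outer(eta1, scores1)

# Split the observations by the sign of their first score.
positive_group = np.flatnonzero(scores1 > 0)
negative_group = np.flatnonzero(scores1 < 0)

# With a same-sign eigenvector, every entry of a column of P1 has the
# sign of that observation's score, so no curve crosses zero.
never_crosses_zero = bool(np.all(np.sign(P1) == np.sign(scores1)))
```

The split into `positive_group` and `negative_group` uses only the sign of the first score, which is why the grouping can be read directly off the projection plot.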

The projection plots of the second eigenvectors show the more common pattern, with positive and negative entries, which show the opposite effects of the variables. A closer inspection of the second eigenvector of the Dow Jones returns shows that the variables 3, 13, 16, 17 and 23 have negative weights. These variables correspond to the five information technology (IT) companies in the list of stocks, namely, AT&T, Hewlett-Packard, Intel Corporation, IBM and Microsoft. With the exception of AT&T, the IT companies have the largest four weights (in absolute value). Thus PC₂ clearly separates the IT stocks from all others.

Figure 2.8 Projection plots (left) and density estimates (right) of PC₁ scores (rows 1 and 3) and PC₂ scores (rows 2 and 4) of the breast cancer data (top two rows) and the Dow Jones returns (bottom two rows) from Example 2.9.

It is interesting to see the much larger spread of scores of the breast cancer data, both for PC₁ and for PC₂; the spread is visible in the y-values of the projection plots and in the range of x-values of the density plots. In Example 2.6 we have seen that the first two PCs of the breast cancer data contribute more than 60 per cent to the variance, whereas the corresponding PCs of the Dow Jones returns only make up about 33 per cent of variance. Parallel coordinate views of subsequent PC projections may sometimes contain extra useful information. In these two data sets they do not.
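The contribution of the first few PCs to the variance is the sum of the corresponding eigenvalues of the covariance matrix divided by the sum of all eigenvalues. The short sketch below illustrates the arithmetic with made-up eigenvalues; they are not the eigenvalues of either data set.

```python
import numpy as np

# Made-up eigenvalues of a covariance matrix, largest first (illustration only).
eigvals = np.array([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])

# Proportion of total variance contributed by each PC.
proportion = eigvals / eigvals.sum()

# Cumulative contribution of the first k PCs, for k = 1, ..., d.
cumulative = np.cumsum(proportion)

# Contribution of the first two PCs to the total variance.
first_two = cumulative[1]
```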

The plots in the right column of Figure 2.8 show the scores and their non-parametric density estimates, which I produced with the curvdatSM software of Marron (2008). The score of each observation is given by its value on the x-axis; for easier visual inspection, the actual values of the scores are displayed at random heights y as coloured dots, and each observation is represented by the same colour in the two plots in one row: the outlier at the right end in the PC₁ breast cancer density plot corresponds to the most positive curve in the corresponding projection plot.

An inspection of the density estimates shows that the PC₁ scores of the breast cancer data deviate substantially from the normal density; the bimodal and right skewed shape of this density could reflect the fact that the data consist of benign and malignant observations, and we can infer that the distribution of the first scores is not Gaussian. The other three density plots look symmetric and reasonably normal. For good accounts of non-parametric density estimation, the interested reader is referred to Scott (1992) and Wand and Jones (1995).

As mentioned at the beginning of Section 2.4, visual inspections of the principal components help to see what is going on, and we have seen that suitable graphical representations of the PC data may lead to new insight. Uncovering clusters, finding outliers or deducing that the data may not be Gaussian: all of these findings aid our understanding and inform subsequent analyses.

The next section complements the visual displays of this section with theoretical properties of principal components.

2.5 Properties of Principal Components
