quality
Recall that the overall quality of the PCA biplot can be expressed as a weighted average of the p axis predictivities. It is shown below that the overall quality can also be expressed as the weighted average of the n sample predictivities, the weights being proportional to the respective samples’ total sum of squares:
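\[
\psi_i = \frac{[\hat{X}\hat{X}']_{ii}}{[XX']_{ii}}
\;\Longrightarrow\; \psi_i [XX']_{ii} = [\hat{X}\hat{X}']_{ii}
\;\Longrightarrow\; \sum_{i=1}^{n} \psi_i [XX']_{ii} = \sum_{i=1}^{n} [\hat{X}\hat{X}']_{ii}
\;\Longrightarrow\; \sum_{i=1}^{n} \psi_i [XX']_{ii} = \operatorname{tr}\{\hat{X}\hat{X}'\}.
\]
The overall quality measure can now be expressed as:
\[
\frac{\operatorname{tr}(\hat{X}'\hat{X})}{\operatorname{tr}(X'X)} = \sum_{i=1}^{n} \psi_i \frac{[XX']_{ii}}{\operatorname{tr}(X'X)}. \tag{3.4.14}
\]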
Unlike the expression of the overall quality in terms of the axis predictivities, the expression of the overall quality in terms of the sample predictivities will not simplify to that of an arithmetic average of the individual predictivities when the PCA biplot is constructed from the standardised measurements.
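Equation (3.4.14) can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic standardised data matrix (not the University data set); the function name `pca_biplot_approximation` is illustrative only. It compares the overall quality with the weighted and arithmetic averages of the sample predictivities.

```python
import numpy as np

def pca_biplot_approximation(X, r):
    """Project the rows of X onto the space spanned by the first r right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vr = Vt[:r].T
    return X @ Vr @ Vr.T

# Synthetic stand-in data (an assumption, NOT the University data set): 25 samples, 6 variables.
rng = np.random.default_rng(0)
raw = rng.normal(size=(25, 6)) @ rng.normal(size=(6, 6))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)   # standardised measurements

X_hat = pca_biplot_approximation(X, r=2)
row_ss = np.sum(X**2, axis=1)                 # [XX']_ii, each sample's total sum of squares
psi = np.sum(X_hat**2, axis=1) / row_ss       # sample predictivities
weights = row_ss / np.trace(X.T @ X)          # weights appearing in (3.4.14)

overall_quality = np.trace(X_hat.T @ X_hat) / np.trace(X.T @ X)
weighted_average = np.sum(weights * psi)
arithmetic_average = psi.mean()

print(overall_quality, weighted_average, arithmetic_average)
```

In this sketch the weights differ from sample to sample even though the columns are standardised, which is why the weighted average, and not the arithmetic average, reproduces the overall quality.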
The fact that the overall quality of the PCA biplot is equal to a weighted average of the n sample predictivities implies that a high overall quality does not necessarily suggest that the measurements of all n samples are accurately approximated in the biplot. Similarly, a low overall quality does not necessarily suggest that the measurements of all n samples are poorly approximated in the biplot. The average (weighted or arithmetic) of a number of very high sample predictivities together with a few low sample predictivities can still be very high and, similarly, the average of a number of very low sample predictivities together with a few high sample predictivities can still be very low. The sample predictivities (as well as the axis predictivities) of a PCA biplot with low overall quality should therefore be considered before discarding the biplot: useful information can still be gathered regarding the samples (and variables) whose measurements are accurately approximated in the biplot.
Using equation (3.4.14), it can be shown that the increase in the overall quality of the PCA biplot resulting from an increase in its dimensionality can be expressed as the weighted average of the increases in the n sample predictivities resulting from the increase in dimensionality, the weights being proportional to the respective samples’ total sum of squares. Consider for instance the increase in the overall quality of the PCA biplot resulting from increasing the dimension of the biplot from r to r+1:
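\[
\alpha_{(r,r+1)} = \alpha_{r+1} - \alpha_{r}
= \sum_{i=1}^{n} \psi_{i,r+1} \frac{[XX']_{ii}}{\operatorname{tr}(X'X)} - \sum_{i=1}^{n} \psi_{i,r} \frac{[XX']_{ii}}{\operatorname{tr}(X'X)}
= \sum_{i=1}^{n} (\psi_{i,r+1} - \psi_{i,r}) \frac{[XX']_{ii}}{\operatorname{tr}(X'X)}.
\]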
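The decomposition of the increase in overall quality can be illustrated with a short numerical sketch. It assumes NumPy and synthetic centred data; the helper `predictivities_and_quality` is a hypothetical name introduced only for this example.

```python
import numpy as np

def predictivities_and_quality(X, r):
    """Sample predictivities and overall quality of the r-dimensional PCA biplot of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vr = Vt[:r].T
    X_hat = X @ Vr @ Vr.T
    psi = np.sum(X_hat**2, axis=1) / np.sum(X**2, axis=1)
    alpha = np.sum(s[:r]**2) / np.sum(s**2)    # equals tr(X_hat'X_hat) / tr(X'X)
    return psi, alpha

# Synthetic centred data (an assumption for illustration only).
rng = np.random.default_rng(0)
raw = rng.normal(size=(25, 6)) @ rng.normal(size=(6, 6))
X = raw - raw.mean(axis=0)

r = 2
psi_r, alpha_r = predictivities_and_quality(X, r)
psi_r1, alpha_r1 = predictivities_and_quality(X, r + 1)

weights = np.sum(X**2, axis=1) / np.trace(X.T @ X)
quality_increase = alpha_r1 - alpha_r
weighted_increase = np.sum(weights * (psi_r1 - psi_r))

print(quality_increase, weighted_increase)
```

Each individual increase psi_r1 - psi_r is non-negative, because adding a dimension can only enlarge the projected sum of squares of a sample.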
Some of the above-mentioned concepts regarding sample predictivities will now be illustrated using the University data set. The sample predictivities of four universities, namely Yale University (Yale), the University of Chicago (UChicago), the University of California, Berkeley (UCBerkeley) and Purdue University (Purdue), corresponding to each of the possible dimensionalities of the PCA biplot constructed from the standardised measurements of the University data set, are provided in Table 3.19. Table 3.20 contains the overall quality of the r-dimensional PCA biplot for each r∈[1∶6].

Table 3.19: The sample predictivities of Yale University (Yale), the University of Chicago (UChicago), the University of California, Berkeley (UCBerkeley) and Purdue University (Purdue) corresponding to the PCA biplot of the University data set constructed from the standardised measurements.
             Dim 1    Dim 2    Dim 3    Dim 4    Dim 5    Dim 6
Yale         0.9386   0.9410   0.9524   0.9994   1.0000   1.0000
UChicago     0.0098   0.4422   0.4526   0.7082   0.9509   1.0000
UCBerkeley   0.1744   0.3003   0.8258   0.9750   0.9837   1.0000
Purdue       0.9694   0.9915   0.9968   0.9988   1.0000   1.0000
Table 3.20: The overall qualities (in %) of the PCA biplot of the University data set constructed from the standardised measurements.
                      Dim 1    Dim 2    Dim 3    Dim 4    Dim 5    Dim 6
Overall quality (%)   76.87    89.98    94.76    97.49    99.56    100.00
The very small sample predictivity of the University of California, Berkeley, corresponding to the one-dimensional PCA biplot (0.1744) implies that the vector from the origin to the point representing this university in the measurement space lies at a very large angle to its projection onto the one-dimensional PCA biplot space. The almost zero sample predictivity of the University of Chicago corresponding to the one-dimensional PCA biplot (0.0098) implies that the vector from the origin to the point representing this university in the six-dimensional measurement space lies almost orthogonal to the one-dimensional PCA biplot space. The very high sample predictivity of Yale University associated with the one-dimensional PCA biplot (0.9386) implies that the angle between the vector from the origin to the point representing the university in the measurement space and the projection of this vector onto the biplot space is very small, suggesting that the point representing Yale University in the measurement space lies very close to the one-dimensional PCA biplot space. The same can be said about the point representing Purdue University.

The overall quality of the one-dimensional PCA biplot is moderately high, accounting for 76.87% of the total sample variance associated with the vector of measured variables. The very low sample predictivities associated with the University of Chicago and the University of California, Berkeley, and the very high sample predictivity associated with Purdue University, show that samples in a PCA biplot with moderately high overall quality can be either very poorly or very accurately approximated. The low sample predictivity of the University of Chicago corresponding to the three-dimensional PCA biplot (0.4526), which has an overall quality of 94.76%, confirms that a sample in a PCA biplot with very high overall quality can still be poorly represented.
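The angle interpretation can be made concrete. Because the biplot approximation is an orthogonal projection, the sample predictivity of a sample equals the squared cosine of the angle between the sample's vector in the measurement space and its projection onto the biplot space. The sketch below, assuming NumPy, converts the one-dimensional sample predictivities of Table 3.19 into angles; the function name `angle_to_biplot_space` is illustrative only.

```python
import numpy as np

# One-dimensional sample predictivities taken from Table 3.19.
psi_dim1 = {"Yale": 0.9386, "UChicago": 0.0098, "UCBerkeley": 0.1744, "Purdue": 0.9694}

def angle_to_biplot_space(psi):
    """Angle (degrees) between a sample vector and its projection, using psi = cos^2(theta)."""
    return float(np.degrees(np.arccos(np.sqrt(psi))))

angles = {name: angle_to_biplot_space(psi) for name, psi in psi_dim1.items()}
for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

A predictivity near zero corresponds to an angle near 90 degrees (almost orthogonal to the biplot space), while a predictivity near one corresponds to an angle near zero.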
Consider the two-dimensional PCA biplot constructed from the standardised measurements of the University data set provided in Figure 3.10. The sample predictivities associated with the University of California, Berkeley and the University of Chicago corresponding to this PCA biplot are low, while the sample predictivity associated with Yale University is very high. That is, the University of California, Berkeley and the University of Chicago are poorly represented in the two-dimensional PCA biplot in Figure 3.10, while Yale University is very accurately represented. Upon visual inspection of the biplot, the University of California, Berkeley seems to differ substantially from Yale University with respect to the variable Top10. In reality, however, these two universities have identical measurements on this variable. From the biplot it also seems as if the University of California, Berkeley and the University of Chicago are very similar with respect to the variable Top10, whereas in reality these two universities differ substantially with respect to this variable. These examples confirm that the position of a sample relative to another in a PCA biplot is meaningless if one or both of these samples are associated with low sample predictivities.

[Figure 3.10 image: biplot with predictive axes SAT, Top10, Accept, SFRatio, Expenses and Grad, showing the 25 universities as labelled points.]
Figure 3.10: The two-dimensional PCA biplot constructed from the standardised measurements of the University data set.
3.5 Summary
The conclusions drawn from the PCA biplot are meaningless if the PCA biplot does not accurately represent reality. Measures of the quality of the different aspects of the PCA biplot are therefore required in order to evaluate to what extent the relationships and predictions suggested by a PCA biplot can be trusted to be representative of reality.
The overall quality of the PCA biplot measures the overall accuracy of the approximations of the elements of the matrix X that are read off the predictive biplot axes. The overall quality is therefore a very crude quality measure which does not provide sufficient information regarding the quality of the representation of the individual samples and variables in the PCA biplot. It is for example possible that some samples (and/or variables) in a PCA biplot with low overall quality are accurately predicted by the PCA biplot and, similarly, that some samples (and/or variables) in a PCA biplot with high overall quality are poorly predicted by the PCA biplot. Since conclusions drawn about samples and variables that are poorly represented in the PCA biplot are likely to be erroneous, measures of the quality of the individual samples and variables are required so that those samples and variables that are poorly represented can be identified.
The sample predictivity of a sample is a measure that quantifies the overall accuracy of the approximations of that sample’s measurements that are read off from the predictive biplot axes. The overall quality of the PCA biplot can be expressed as a weighted average of the sample predictivities. The sample predictivities corresponding to the PCA biplot constructed from the last few principal components, which account for a negligible proportion of the variability in the data set, can be used to identify samples that are likely to deviate substantially from the correlation structure of the bulk of the data set. The axis predictivity of a biplot axis quantifies the predictive ability of that individual biplot axis. The overall quality can also be expressed as a weighted average of the axis predictivities. The adequacy of a biplot axis, on the other hand, measures how much the biplot axis departs from the corresponding Cartesian axis in the measurement space: the less the biplot axis departs from the corresponding Cartesian axis, the higher the adequacy of that biplot axis. The adequacy of a biplot axis is not a trustworthy measure of the predictive ability of that biplot axis, but it can in some circumstances provide useful information about it, since it is a lower bound for the corresponding axis predictivity.
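The relationships between adequacy, axis predictivity and overall quality summarised above can be checked numerically. The sketch below assumes NumPy and synthetic centred data, and it assumes that the adequacy of the j-th axis is the j-th diagonal element of V_r V_r', where V_r holds the first r right singular vectors of X; that definition is not restated in this section, so it is an assumption here.

```python
import numpy as np

# Synthetic centred data (an assumption for illustration): 40 samples, 5 variables.
rng = np.random.default_rng(2)
raw = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))
X = raw - raw.mean(axis=0)

r = 2
_, s, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:r].T
X_hat = X @ Vr @ Vr.T

# Assumed definition of adequacy: diag(Vr Vr').
adequacy = np.sum(Vr**2, axis=1)
# Axis predictivity: diag(X_hat'X_hat) / diag(X'X).
axis_predictivity = np.sum(X_hat**2, axis=0) / np.sum(X**2, axis=0)

# Overall quality as a weighted average of the axis predictivities,
# with weights proportional to each variable's sum of squares.
column_weights = np.sum(X**2, axis=0) / np.sum(X**2)
overall_quality = np.sum(s[:r]**2) / np.sum(s**2)
weighted_axis_average = np.sum(column_weights * axis_predictivity)

for j in range(X.shape[1]):
    print(f"axis {j}: adequacy={adequacy[j]:.4f}, predictivity={axis_predictivity[j]:.4f}")
```

Under this definition, every adequacy printed is at most the corresponding axis predictivity, which is the lower-bound property referred to in the text.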
All four of the quality measures discussed in this chapter are scale dependent. When the measured variables have widely differing standard deviations and the PCA biplot is constructed from the unstandardised measurements, the first few principal components are usually dominated by the variables with standard deviations that are very large compared to those of the other variables. The biplot axes representing those variables then typically have large adequacies and axis predictivities compared to the other biplot axes, and the overall quality consequently tends to be overly optimistic.
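The scale dependence can be illustrated with a small experiment. The sketch below, assuming NumPy, generates synthetic uncorrelated variables, inflates the standard deviation of the first one by a factor of 100, and compares the one-dimensional quality measures of the unstandardised and standardised analyses; the helper name `quality_measures` and the data are hypothetical.

```python
import numpy as np

def quality_measures(X, r):
    """Overall quality and axis predictivities of the r-dimensional PCA biplot of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vr = Vt[:r].T
    X_hat = X @ Vr @ Vr.T
    overall = np.sum(s[:r]**2) / np.sum(s**2)
    axis_pred = np.sum(X_hat**2, axis=0) / np.sum(X**2, axis=0)
    return overall, axis_pred

# Synthetic data (an assumption): six uncorrelated variables, the first on a much larger scale.
rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 6))
raw[:, 0] *= 100.0

centred = raw - raw.mean(axis=0)
standardised = centred / centred.std(axis=0, ddof=1)

quality_raw, axis_raw = quality_measures(centred, r=1)
quality_std, axis_std = quality_measures(standardised, r=1)

print("unstandardised:", quality_raw, axis_raw)
print("standardised:  ", quality_std, axis_std)
```

In the unstandardised analysis the first principal component essentially reproduces the large-variance variable, so the overall quality looks excellent even though the remaining variables are barely represented.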
Each of the quality measures associated with the PCA biplot is defined as a ratio of sums of squared values. The fact that the approximation to X which is produced by the PCA biplot, X̂, is the orthogonal projection of X onto the subspace spanned by the first r right singular vectors of X, ensures that the decomposition of X into X̂ and X − X̂ exhibits both Type A and Type B orthogonality. These two orthogonality properties validate all four of the quality measures that were discussed in this chapter as quality measures.
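The two orthogonality properties can be verified numerically. The sketch below assumes NumPy and synthetic centred data, and it assumes that Type A orthogonality means X̂'(X − X̂) = 0 (so that X'X = X̂'X̂ + (X − X̂)'(X − X̂)) and that Type B orthogonality means X̂(X − X̂)' = 0 (so that XX' = X̂X̂' + (X − X̂)(X − X̂)'); these definitions are not restated in this section and are therefore assumptions here.

```python
import numpy as np

# Synthetic centred data (an assumption for illustration): 25 samples, 6 variables.
rng = np.random.default_rng(4)
raw = rng.normal(size=(25, 6)) @ rng.normal(size=(6, 6))
X = raw - raw.mean(axis=0)

r = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:r].T
X_hat = X @ Vr @ Vr.T      # orthogonal projection onto the first r right singular vectors
E = X - X_hat              # residual matrix

type_a = X_hat.T @ E       # assumed Type A: p x p matrix of zeros
type_b = X_hat @ E.T       # assumed Type B: n x n matrix of zeros

print(np.abs(type_a).max(), np.abs(type_b).max())
```

When both cross-product matrices vanish, the diagonal elements of X'X and of XX' split into a fitted part and a residual part, which is what allows the axis and sample predictivities to be read as proportions between zero and one.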
Chapter 4 - CVA and the CVA biplot
4.1 Introduction
In Chapters 2 and 3 data sets were graphically represented by means of PCA biplots. The PCA biplot is, however, not designed to represent the group structure underlying data sets comprising samples partitioned into a number of predefined groups. In fact, the group membership of the individual samples does not play any role in the construction of the PCA biplot. At most the PCA biplot can suggest possible differences between the groups by using different plotting characters and/or colours to represent samples belonging to different groups and imposing an α-bag or convex hull for each of the groups, as in the example provided at the end of Chapter 2. When a graphical representation of the group structure underlying a data set is desired, it would be more appropriate to represent the data set by means of a Canonical Variate Analysis (CVA) biplot (Gower and Hand, 1996; Gardner-Lubbe et al., 2008; Gower et al., 2011), which is designed specifically for this purpose.
As its name indicates, the CVA biplot is based on the statistical analysis, CVA. CVA is a linear dimension reduction technique which is used to investigate the dissimilarities (and similarities) amongst groups as measured by the Mahalanobis distance metric and is concerned with both discrimination between the groups as well as classification of new observations of unknown origin. CVA, like PCA, is an MDS technique, but differs from PCA in that the distance metric which stands at the centre of the technique is not the Pythagorean distance metric, but rather the Mahalanobis distance metric.
In this chapter, CVA will be discussed from three different perspectives, namely (1) as equivalent to Linear Discriminant Analysis (LDA) for the multigroup case; (2) as a special case of Canonical Correlation Analysis (CCA); and (3) as a two-stage procedure where the first stage consists of the transformation of the measurement vectors to a space in which the group centroids are optimally separated and the second stage consists of a least squares approximation in that space. Each of these three methods of defining CVA will be discussed in detail in this chapter. The construction of the CVA biplot is, however, easiest to understand from the point of view of the two-stage approach to CVA.
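The two-stage view of CVA can be sketched in a few lines. The code below is an illustrative sketch only, assuming NumPy, synthetic data with three groups, a pooled within-group covariance matrix W, and a size-weighted analysis; the function name `cva_two_stage` and all variable names are hypothetical, and the formulation developed later in this chapter may differ in detail. Stage 1 transforms the measurements with W^(-1/2), so that Pythagorean distance in the transformed space equals Mahalanobis distance in the original space; stage 2 is a least squares (PCA-type) approximation of the group means in that space.

```python
import numpy as np

def cva_two_stage(X, groups, r=2):
    """Weighted CVA as a two-stage procedure (illustrative sketch)."""
    labels = np.unique(groups)
    Xc = X - X.mean(axis=0)                                   # centre at the overall mean
    sizes = np.array([np.sum(groups == g) for g in labels])
    means = np.vstack([Xc[groups == g].mean(axis=0) for g in labels])

    # Pooled within-group covariance matrix W.
    W = np.zeros((X.shape[1], X.shape[1]))
    for g, m in zip(labels, means):
        D = Xc[groups == g] - m
        W += D.T @ D
    W /= len(X) - len(labels)

    # Stage 1: transformation with the symmetric inverse square root of W.
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    # Stage 2: least squares approximation of the size-weighted transformed group means.
    Z = np.sqrt(sizes)[:, None] * (means @ W_inv_sqrt)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    M = W_inv_sqrt @ Vt[:r].T                                 # canonical transformation
    return Xc @ M, means @ M, means, W, M

# Synthetic data (an assumption): three groups of unequal size in four dimensions.
rng = np.random.default_rng(3)
sizes = [20, 30, 25]
centres = np.array([[0, 0, 0, 0], [3, 1, 0, -1], [-1, 2, 2, 0]], dtype=float)
X = np.vstack([c + rng.normal(size=(n, 4)) for c, n in zip(centres, sizes)])
groups = np.repeat([0, 1, 2], sizes)

scores, canonical_means, means, W, M = cva_two_stage(X, groups, r=2)

# Mahalanobis distance between two group means in the original space ...
diff = means[0] - means[1]
mahalanobis = np.sqrt(diff @ np.linalg.solve(W, diff))
# ... and Pythagorean distance between the same means in the canonical space.
canonical = np.linalg.norm(canonical_means[0] - canonical_means[1])
print(mahalanobis, canonical)
```

With three groups the centred group means span at most two dimensions, so in this example the two-dimensional canonical space reproduces the Mahalanobis distances between the group means exactly.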
CVA can be either weighted or unweighted depending on whether the different sizes of the groups are taken into account in the analysis or not. Accordingly the CVA biplot can also be either weighted or unweighted. The construction of the weighted and two different types of unweighted CVA biplots will be discussed in this chapter.
An important assumption of CVA is that the within-group covariance matrices of all the groups are identical. It is very important to check whether this assumption is appropriate for the data to be investigated prior to performing CVA or constructing a CVA biplot. If tests suggest that the assumption of equal covariance matrices is not appropriate for the data at hand, the data should not be analysed by means of CVA. A more appropriate analysis for such data is Analysis of Distance (AOD). The group structure of the data can then be graphically represented by means of an AOD biplot (Gardner et al., 2005). If the assumption of identical within-group covariance matrices is appropriate for the data set at hand, it is important to test whether all the prespecified groups are in fact different prior to performing CVA or constructing a CVA biplot. The appropriate hypotheses to be tested prior to performing CVA or constructing a CVA biplot will be discussed in Section 4.2.5. The close relationship between CVA and Multivariate Analysis of Variance (MANOVA) is also highlighted in that section.