So far we considered data with a relatively small number of variables – at most thirty for the breast cancer data and the Dow Jones returns – and in each case the number of variables was considerably smaller than the number of observations. These data sets belong to the classical domain, for which the sample size is much larger than the dimension. Classical limit theorems apply, and we understand the theory for the n > d case.
In high dimension, the space becomes emptier as the dimension increases. The simple example in Figure 2.12 tries to give an idea how data ‘spread out’ when the number of dimensions increases, here from two to three. We consider 100 independent and identically distributed points from the uniform distribution in the unit cube. The projection of these points onto the (x, y)-plane is shown in the left panel of Figure 2.12. The points seem to cover the unit square quite well. In the right panel we see the same 100 points with their third dimension; many bare patches have now appeared between the points. If we generated 100 points within the four-dimensional unit volume, the empty regions would further increase.
Figure 2.12 Distribution of 100 points in 2D and 3D unit space.
As the number of dimensions becomes very large, the space becomes very thinly populated, point clouds tend to disappear, and generally it is quite ‘lonely’ in high-dimensional spaces.
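The thinning-out described above can be made concrete with a small simulation. The following is a minimal sketch, not part of the original analysis: it assumes NumPy, an arbitrarily chosen seed, and the mean nearest-neighbour distance as one possible measure of ‘emptiness’. As in Figure 2.12, the same 100 points are used throughout, and only the number of coordinates that are kept changes.

```python
import numpy as np

def mean_nearest_neighbour_distance(points):
    """Average Euclidean distance from each point to its closest neighbour."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore the distance of a point to itself
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n = 100
dims = (2, 3, 4, 10, 50)

# 100 i.i.d. uniform points in the 50-dimensional unit cube; the first d
# coordinates of each point give a uniform point in the d-dimensional cube.
points_full = rng.uniform(size=(n, max(dims)))

results = {}
for d in dims:
    results[d] = mean_nearest_neighbour_distance(points_full[:, :d])
    print(f"d = {d:2d}: mean nearest-neighbour distance = {results[d]:.3f}")
```

Because adding a coordinate can only increase the distance between two points, the printed values grow with d: each point becomes more isolated even though the number of points is unchanged.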
The term high-dimensional is not clearly defined. In some applications, anything beyond a handful of dimensions is regarded as high-dimensional. Generally, we will think of high-dimensional as much higher, and the thirty dimensions we have met so far I regard as a moderate number of dimensions. Of course, the relationship between d and n plays a crucial role. Data with a large number of dimensions are called high-dimensional data (HDD). We distinguish different groups of HDD which are characterised by
1. d is large but smaller than n;
2. d is large and larger than n: the high-dimension low sample size data (HDLSS); and
3. the data are functions of a continuous variable d: the functional data.
Our applications involve all these types of data. In functional data, the observations are curves rather than consisting of individual variables. Example 2.16 deals with functional data, where the curves are mass spectrometry measurements.
Theoretical advances for HDD focus on large n and large d. In the research reported in Johnstone (2001), both n and d grow, with d growing as a function of n. In contrast, Hall, Marron, and Neeman (2005) and Jung and Marron (2009) focus on the non-traditional case of a fixed sample size n and let d → ∞. We look at some of these results in Section 2.7 and return to more asymptotic results in Section 13.5.
2.6 Standardised Data and High-Dimensional Data
A Word of Caution. Principal Component Analysis is an obvious candidate for summarising HDD into a smaller number of components. However, care needs to be taken when the dimension is large, especially when d > n, because the rank r of the covariance matrix S satisfies
r ≤ min{d, n}.
For HDLSS data, this statement implies that one cannot obtain more than n principal components. The rank serves as an upper bound for the number of derived variables that can be constructed with Principal Component Analysis; variables with large variance are ‘in’, whereas those with small variance are ‘not in’. If this criterion is not suitable for particular HDLSS data, then either Principal Component Analysis needs to be adjusted, or other methods such as Independent Component Analysis or Projection Pursuit could be used.
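The rank bound can be checked numerically. The sketch below rests on assumptions that are not in the text: NumPy, synthetic Gaussian data in place of real HDLSS data, and the convention that rows are observations and columns are variables. Because the sample mean is subtracted when S is formed, generic data give rank n − 1, which is consistent with the ranks reported in Examples 2.14 and 2.15.

```python
import numpy as np

rng = np.random.default_rng(1)    # seed chosen arbitrarily
n, d = 17, 66                     # sample size and dimension, with d > n

# Synthetic stand-in for HDLSS data: rows = observations, columns = variables
X = rng.normal(size=(n, d))

S = np.cov(X, rowvar=False)       # d x d sample covariance matrix
rank_S = np.linalg.matrix_rank(S)

print("shape of S:", S.shape)
print("rank of S: ", rank_S, "<= min(d, n) =", min(d, n))
```

Only rank_S eigenvalues of S are non-zero, so at most that many principal components carry any variance.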
We look at two HDLSS data sets and start with the smaller one.
Example 2.14 The Australian illicit drug market data of Gilmour et al. (2006) contain monthly counts of events recorded by key health, law enforcement and drug treatment agencies in New South Wales, Australia. These data were collected across different areas of the three major stakeholders. The combined count or indicator data consist of seventeen separate data series collected over sixty-six months between January 1997 and June 2002. The series are listed in Table 3.2, Example 3.3, in Section 3.3, as the split of the data into the two groups fits more naturally into the topic of Chapter 3. In the current analysis, this partition is not relevant. Heroin, cocaine and amphetamine are the quantities of main interest in this data set. The relationship between these drugs over a period of more than five years has given rise to many analyses, some of which are used to inform policy decisions. Figure 1.5 in Section 1.2.2 shows a parallel coordinate plot of the data with the series numbers on the x-axis.
In this analysis we consider each of the seventeen series as an observation, and the sixty-six months represent the variables. An initial principal component analysis shows that the two series break and enter dwelling and steal from motor vehicles are on a much larger scale than the others and dominate the first and second PCs. The scaling of Definition 2.15 is not appropriate for these data because the mean and covariance matrix naturally pertain to the months. For this reason, we scale each series and call the new data the scaled (indicator) data.
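The text does not spell out the formula used to scale each series. One plausible reading, offered here only as an assumption, is to centre each series (each observation) and divide it by its own sample standard deviation, so that the scaling acts on the rows rather than on the month variables. The sketch below uses NumPy and synthetic counts; the function name scale_rows and the data are hypothetical and are not the illicit drug market data.

```python
import numpy as np

def scale_rows(X):
    """Centre each row and divide it by its own sample standard deviation.

    Assumed form of the scaling; rows are series (observations) and
    columns are months (variables).
    """
    X = np.asarray(X, dtype=float)
    row_means = X.mean(axis=1, keepdims=True)
    row_sds = X.std(axis=1, ddof=1, keepdims=True)
    return (X - row_means) / row_sds

rng = np.random.default_rng(2)    # seed chosen arbitrarily

# Synthetic stand-in: 17 series over 66 months on very different scales
scales = np.logspace(0, 3, 17)[:, None]
X_raw = rng.poisson(lam=5.0, size=(17, 66)) * scales
X_scaled = scale_rows(X_raw)

print("range of raw row standard deviations:   ",
      X_raw.std(axis=1, ddof=1).min(), X_raw.std(axis=1, ddof=1).max())
print("range of scaled row standard deviations:",
      X_scaled.std(axis=1, ddof=1).min(), X_scaled.std(axis=1, ddof=1).max())
```

After this scaling every series has mean zero and unit variance, so no single series can dominate the leading PCs merely because it is recorded on a larger scale.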
Figure 2.13 shows the raw data (top) and the scaled data (bottom). For the raw data I have excluded the two observations, break and enter dwelling and steal from motor vehicles, because they are on a much bigger scale and therefore obscure the remaining observations. For the scaled data we observe that the spread of the last ten to twelve months is larger than that of the early months.
Figure 2.13 Illicit drug market data of Example 2.14 with months as variables: (top) raw data; (bottom) scaled data.
Figure 2.14 (Top) Cumulative contributions to variance of the raw (black) and scaled (blue) illicit drug market data of Example 2.14; (bottom) weights of the first eigenvector of the scaled data with the variable number on the x-axis.
Because d = 66 is much larger than n = 17, there are at most seventeen PCs. The analysis shows that the rank of S is 16, so r < n < d. For the raw data, the first PC scores account for 99.45 per cent of total variance, and the first eigenvalue is more than 200 times larger than the second. Furthermore, the weights of the first eigenvector are almost uniformly distributed over all sixty-six dimensions, so they do not offer much insight into the structure of the data. For this reason, we analyse the scaled data.
Figure 2.14 displays the cumulative contribution to total variance of the raw and scaled data in the top plot. For the scaled data, the first PC scores account for less than half the total variance, and the first ten PCs account for about 95 per cent of total variance. The first eigenvalue is about three times larger than the second; the first eigenvector shows an interesting pattern which is displayed in the lower part of Figure 2.14: for the first forty-eight months the weights have small positive values, whereas at month forty-nine the sign is reversed, and all later months have negative weights. This pattern is closely linked to the Australian heroin shortage in early 2001, which is analysed in Gilmour et al. (2006). It is interesting that the first eigenvector shows this phenomenon so clearly.
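The cumulative contributions to total variance discussed above are obtained from the eigenvalues of S. The following sketch shows one way to compute them and the eigenvector weights for data with d > n. It assumes NumPy, data stored with observations in rows, and a singular value decomposition of the centred data in place of an eigendecomposition of the d × d matrix S; the function names and the synthetic data are illustrative only.

```python
import numpy as np

def pca_via_svd(X):
    """Eigenvalues of S (descending) and eigenvectors (columns) for n x d data."""
    Xc = X - X.mean(axis=0)                       # centre each variable
    _, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = sing ** 2 / (X.shape[0] - 1)    # eigenvalues of S
    return eigenvalues, Vt.T

def cumulative_contribution(eigenvalues):
    """Cumulative percentage of total variance of the first k PCs, k = 1, 2, ..."""
    return 100 * np.cumsum(eigenvalues) / eigenvalues.sum()

rng = np.random.default_rng(3)                    # seed chosen arbitrarily
X = rng.normal(size=(17, 66))                     # synthetic stand-in, n = 17, d = 66

eigenvalues, eigenvectors = pca_via_svd(X)
cumulative = cumulative_contribution(eigenvalues)

print("first eigenvalue / second eigenvalue:", eigenvalues[0] / eigenvalues[1])
print("cumulative contributions (per cent): ", np.round(cumulative, 1))
print("weights of the first eigenvector:    ", np.round(eigenvectors[:, 0], 2))
```

The sign of an eigenvector is arbitrary, so a pattern such as the sign change at month forty-nine is meaningful only as a contrast between early and late months, not in its absolute orientation.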
Our next HDLSS example is much bigger, in terms of both variables and samples.
Example 2.15 The breast tumour (gene expression) data of van’t Veer et al. (2002) consist of seventy-eight observations and 4,751 gene expressions. Typically, gene expression data contain intensity levels or expression indices of genes which are measured for a large number of genes. In bioinformatics, the results of pre-processed gene microarray experiments are organised in an Nc × Ng ‘expression index’ matrix which consists of n = Nc chips or slides and d = Ng genes or probesets. The number of genes may vary from a few hundred to many thousands, whereas the number of chips ranges from below 100 to maybe 200 to 400. The chips are the observations, and the genes are the variables. The data are often accompanied by survival times or binary responses. The latter show whether the individual has survived beyond a certain time. Gene expression data belong to the class of high-dimension low sample size data.
Genes are often grouped into subgroups, and within these subgroups one wants to find genes that are ‘differentially expressed’ and those which are non-responding. Because of the very large number of genes, a first step in many analyses is dimension reduction. Later steps in the analysis are concerned with finding genes that are responsible for particular diseases or are good predictors of survival times.
Figure 2.15 The breast tumour data of Example 2.15: cumulative contributions to variance against the index (top) and histogram of weights of the first eigenvector (bottom).
The breast tumour data of van’t Veer et al. (2002) are given as log10 transformations. The data contain survival times in months as well as binary responses regarding survival. Patients who left the study or metastasised before the end of five years were grouped into the first class, and those who survived five years formed the second class. Of the seventy-eight patients, forty-four survived the critical five years.
The top panel of Figure 2.15 shows the cumulative contribution to variance. The rank of the covariance matrix is 77, so smaller than n. The contributions to variance increase slowly, starting with the largest single variance contribution of 16.99 per cent. The first fifty PCs contribute about 90 per cent to total variance. The lower panel of Figure 2.15 shows the weights of the first eigenvector in the form of a histogram. In a principal component analysis, all eigenvector weights are non-zero. For the 4,751 genes, the weights of the first eigenvector range from −0.0591 to 0.0706, and as we can see in the histogram, most are very close to zero.
A comparison of the first four eigenvectors based on the fifty variables with the highest absolute weights for each vector shows that PC1 has no ‘high-weight’ variables in common with PC2 or PC3, whereas PC2 and PC3 share three such variables. All three eigenvectors share some ‘large’ variables with the fourth. Figure 2.16 also deals with the first four principal components in the form of 2D score plots. The blue scores correspond to the forty-four patients who survived five years, and the black scores correspond to the other group. The blue and black point clouds overlap in these score plots, indicating that the first four PCs cannot separate the two groups. However, it is interesting that there is an outlier, observation 54 marked in red, which is clearly separate from the other points in all but the last plot.
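The comparison of ‘high-weight’ variables across eigenvectors can be sketched as follows. This is an illustration under stated assumptions rather than the computation used for the breast tumour data: it relies on NumPy, on synthetic Gaussian data of the same size (n = 78, d = 4,751), and on hypothetical function names. For each eigenvector it selects the fifty variables with the largest absolute weights and counts how many are shared by each pair of eigenvectors.

```python
import numpy as np

def top_weight_indices(eigenvector, k=50):
    """Indices of the k entries with the largest absolute weights."""
    return set(np.argsort(np.abs(eigenvector))[-k:].tolist())

def shared_high_weight_variables(eigenvectors, k=50):
    """Pairwise counts of shared 'high-weight' variables (eigenvectors in columns)."""
    tops = [top_weight_indices(v, k) for v in eigenvectors.T]
    m = len(tops)
    counts = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(m):
            counts[i, j] = len(tops[i] & tops[j])
    return counts

rng = np.random.default_rng(4)            # seed chosen arbitrarily
X = rng.normal(size=(78, 4751))           # synthetic stand-in for the gene data
Xc = X - X.mean(axis=0)                   # centre each variable
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

first_four = Vt[:4].T                     # first four eigenvectors as columns
overlap = shared_high_weight_variables(first_four, k=50)
print(overlap)
```

The diagonal entries equal k by construction; the off-diagonal entries give the number of variables that two eigenvectors have in common among their fifty largest weights.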
The principal component analysis has reduced the total number of variables from 4,751 to 77 PCs. This reduction is merely a consequence of the HDLSS property of the data, and a further reduction in the number of variables may be advisable. In a later analysis we will examine how many of these seventy-seven PCs are required for reliable prediction of the time to metastasis.
A little care may be needed to distinguish the gene expression breast tumour data from the thirty-dimensional breast cancer data which we revisited earlier in this section. Both data sets deal with breast cancer, but they are very different in content and size. We refer to the smaller one simply as the breast cancer data and call the HDLSS data the breast tumour (gene expression) data.
Figure 2.16 Score plots of the breast tumour data of Example 2.15: (top row) PC1 scores (x-axis) against PC2–PC4 scores; (bottom row) PC2 scores against PC3 and PC4 scores (left and middle) and PC3 scores against PC4 scores (right).
Our third example deals with functional data from bioinformatics. In this case, the data are measurements on proteins or, more precisely, on the simpler peptides rather than genes.
Example 2.16 The ovarian cancer proteomics data of Gustafsson (2011)¹ are mass spectrometry profiles or curves from a tissue sample of a patient with high-grade serous ovarian cancer. Figure 2.17 shows an image of the tissue sample stained with haematoxylin and eosin, with the high-grade cancer regions marked. The preparation and analysis of this and similar samples are described in chapter 6 of Gustafsson (2011), and Gustafsson et al. (2011) describe matrix-assisted laser desorption–ionisation imaging mass spectrometry (MALDI-IMS), which allows acquisition of mass data for proteins or the simpler peptides used here. For an introduction to mass spectrometry–based proteomics, see Aebersold and Mann (2003).
The observations are the profiles, which are measured at 14,053 regularly spaced points given by their (x, y)-coordinates across the tissue sample. At each grid point, the counts – detections of peptide ion species at instrument-defined mass-to-charge m/z intervals – are recorded. There are 1,331 such intervals, and their midpoints are the variables of these data.
Because the ion counts are recorded in adjoining intervals, the profiles may be regarded as discretisations of curves. For MALDI-IMS, the charge z is one and thus could be ignored. However, in proteomics, it is customary to use the notation m/z, and despite the simplification z = 1, we use the mass-to-charge terminology.
Figure 2.18 shows two small subsets of the data, with the m/z values on the x-axis. The top panel shows the twenty-one observations or profiles indexed 400 to 420, and the middle panel shows the profiles indexed 1,000 to 1,020 of 14,053. The plots show that the number and position of the peaks vary across the tissue. The plots in the bottom row are ‘zoom-ins’ of the two figures above for m/z values in the range 200 to 240. The left panel shows the zoom-ins of observations 400 to 420, and the right panel corresponds to observations 1,000 to 1,020. The peaks differ in size, and some of the smaller peaks that are visible in the right panel are absent in the left panel. There are many m/z values which have zero counts.
¹ For these data, contact the author directly.
Figure 2.17 Image of tissue sample with regions of ovarian cancer from Example 2.16 (image label: ‘High-grade Cancer’).
Figure 2.18 Mass-spectrometry profiles from the ovarian cancer data of Example 2.16 with m/z values on the x-axis and counts on the y-axis. Observations 400 to 420 (top) with their zoom-ins (bottom left) and observations 1,000 to 1,020 (middle) with their zoom-ins (bottom right).
Figure 2.19 Score plots of the ovarian cancer data of Example 2.16: (top row) PC1 scores (x-axis) against PC2–PC4 scores; (bottom row) PC2 scores against PC3 and PC4 scores (left and middle) and PC3 scores against PC4 scores (right).
The rank of the covariance matrix of all observations agrees with d = 1,331. The eigenvalues decrease quickly; the tenth is about 2 per cent of the size of the first, and the one-hundredth is about 0.03 per cent of the first. The first principal component score contributes 44.8 per cent to total variance, the first four contribute 79 per cent, the first ten result in 89.9 per cent, and the first twenty-five contribute just over 95 per cent. It is clear from these numbers that the first few PCs contain most of the variability of the data.
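Statements such as ‘the first twenty-five contribute just over 95 per cent’ answer the question of how many PCs are needed to reach a given proportion of total variance. A minimal sketch of this calculation is given below; it assumes NumPy, the function name components_needed is hypothetical, and the eigenvalues are made up for illustration and are not those of the ovarian cancer data.

```python
import numpy as np

def components_needed(eigenvalues, target_percent):
    """Smallest k such that the first k PCs reach target_percent of total variance."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = 100 * np.cumsum(eigenvalues) / eigenvalues.sum()
    # index of the first cumulative value that is >= target_percent
    k = int(np.searchsorted(cumulative, target_percent)) + 1
    return min(k, len(eigenvalues))   # guard against rounding at 100 per cent

# Made-up eigenvalues with cumulative contributions 40, 70, 90 and 100 per cent
illustrative_eigenvalues = [4.0, 3.0, 2.0, 1.0]

for target in (50, 75, 95):
    k = components_needed(illustrative_eigenvalues, target)
    print(f"{target} per cent of total variance needs the first {k} PCs")
```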
Figure 2.19 shows 2D score plots of the first four principal component scores of the raw data. As noted earlier, these four PCs contribute about 80 per cent to total variance. PC1 is shown on the x-axis against PC2 to PC4 in the top row, and the remaining combinations of score plots are shown in the lower panels.
The score plots exhibit interesting shapes which deviate strongly from Gaussian shapes. In particular, the PC1 and PC2 data are highly skewed. Because the PC data are centred, the figures show that a large proportion of the observations have very small negative values for both the PC1 and PC2 scores. The score plots appear as connected regions. From these figures, it is not clear whether the data split into clusters, and if so, how many. We return to these data in Section 6.5.3, where we examine this question further and find partial answers.