activation waveforms, however this is common between the two types. The following example is based on the application of PCA to biomechanical data (knee flexion angle) from two groups, to determine which waveform features of variance differ between them. In order, the steps involve:
1. Applying PCA to the combined (group 1 + group 2) dataset
2. Selecting the number of principal components to retain
3. Interpretation of features represented by PCs by assessing loading vectors, extreme plots and reconstruction of original data using calculated PCs
4. Extraction and interpretation of subject and group PC scores
5. Summative interpretation of findings
(1) Applying PCA to combined dataset
Before the equations outlined in section 1.1.3 are applied, the raw biomechanical waveforms are normalised to 101 points (Figure 2-9). Data from the individual subjects of all tested groups are then combined into a single matrix, formatted as described in equation 1.0. The same procedure applies whether there is a single group or multiple groups of data.
Figure 2-9 - Typical biomechanical waveform data for two test groups (green & blue)
normalised to 101 points (0-100%) of the gait cycle.
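As an illustration of this preparation step, the following minimal sketch resamples waveforms of differing length to 101 points and stacks both groups into one matrix. It assumes linear interpolation and a layout of one row per subject and one column per time point; neither is stated in the text, and the names `normalise_to_101` and `fake_trial`, along with the synthetic data, are hypothetical.

```python
import numpy as np

def normalise_to_101(waveform, n_points=101):
    """Linearly resample one waveform to n_points samples spanning 0-100% of the cycle.

    Linear interpolation is an assumption; the source does not name a method.
    """
    waveform = np.asarray(waveform, dtype=float)
    old_axis = np.linspace(0.0, 100.0, num=waveform.size)
    new_axis = np.linspace(0.0, 100.0, num=n_points)
    return np.interp(new_axis, old_axis, waveform)

def fake_trial(n_samples, peak):
    """Synthetic stand-in for a raw trial (not study data)."""
    t = np.linspace(0.0, 1.0, n_samples)
    return peak * np.sin(np.pi * t) ** 2

# Raw trials of differing length for two hypothetical groups.
group1_raw = [fake_trial(n, 60.0) for n in (87, 112, 95)]
group2_raw = [fake_trial(n, 50.0) for n in (103, 99, 120)]

# One row per subject, one column per time point; both groups share one matrix.
X = np.vstack([normalise_to_101(w) for w in group1_raw + group2_raw])

# Group labels are kept aside and only used after the PC scores are computed.
labels = np.array([1] * len(group1_raw) + [2] * len(group2_raw))

print(X.shape)  # (6, 101)
```

Keeping the group labels in a separate array reflects the point that PCA is applied to the combined dataset without reference to group membership.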
(2) Selecting the number of principal components to retain
As additional PCs are calculated for a given waveform, a higher cumulative percentage of the variance from the original data is represented (Figure 2-10). The PCs representing
the highest variance, which come first in order, usually contain the most relevant information. However, the highest-variance components are not necessarily the most discriminatory. Ultimately, PCA is used as a data reduction tool, so it is desirable to retain as few PCs (new discrete variables for analysis) as possible whilst still representing salient waveform information. It is therefore necessary to decide on a cut-off, or stopping rule, that balances informative variation against noise. In the example (Figure 2-10), four PCs cumulatively represent over 95% of the variance in the data, and visual inspection of the scree plot, which has been used in previous studies to decide PC retention, shows that adding more components does not substantially improve the variance explained (Robertson et al., 2013). Whilst PC4 represents only 5.19% of the variance of the original waveforms, this is still considered adequate to detect group differences over noise, although this judgement can become subjective (Chau, 2001).
To increase the objectivity of PC retention, previous investigators have defined cut-off rules based on eigenvalues. Kaiser’s rule, one of the first and most commonly adopted, retains factors (PCs) with eigenvalues greater than 1 (Kaiser, 1960). However, this rule has been criticised because it tends to retain a high number of components, many of which carry no meaningful information (Jackson, 1993). A retention rule adopted by Jones (2004, 2006) is to retain PCs with factor loadings outside the threshold range of -0.71 to 0.71. Because 0.71² ≈ 0.5, each retained PC accounts for at least 50% of the variance at one or more points in the waveform.
Figure 2-10 - Bar plot of variance explained (%) for each PC added to the model, with scree plot
representing cumulative variance (red line). Four PCs cumulatively explain >95% variance.
Within this study, a cut-off of 95% cumulative representation of variance was used, for two reasons. First, upon reconstructing the original waveform data using 95% of the represented variance, many of the prominent structures in the original data can be seen for most parameters (see point 3), so if meaningful group differences exist they are likely to have been captured. Second, particularly for biomechanical and muscle activation waveforms, three to five PCs typically represent around 95% of the variance or more per parameter. This is a suitable number of variables for downstream comparison-of-means testing (e.g. ANOVAs). Raising the cut-off increases the number of variables, which either elevates the likelihood of finding a significant difference by chance or, once the substantial corrections required for multiple testing are applied, makes real differences very difficult to detect.
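The cumulative-variance stopping rule can be sketched as follows. This is a minimal illustration on synthetic data, assuming the PCs are obtained from the eigenvalues of the sample covariance matrix S of the mean-centred data; the function name `components_for_cutoff` and the synthetic waveforms are hypothetical and are not taken from the study.

```python
import numpy as np

def components_for_cutoff(X, cutoff=0.95):
    """Return the smallest number of PCs whose cumulative explained variance
    reaches the cutoff, plus the per-PC and cumulative variance fractions."""
    Xc = X - X.mean(axis=0)                    # mean-centre each time point
    S = np.cov(Xc, rowvar=False)               # covariance across time points
    eigvals = np.linalg.eigvalsh(S)[::-1]      # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    explained = eigvals / eigvals.sum()        # fraction of variance per PC
    cumulative = np.cumsum(explained)
    # First index at which the cumulative fraction reaches the cutoff.
    k = min(int(np.searchsorted(cumulative, cutoff)) + 1, cumulative.size)
    return k, explained, cumulative

# Synthetic waveforms (not study data): three modes of variation plus noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 101)
modes = np.vstack([np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
weights = rng.normal(size=(20, 3)) * np.array([10.0, 5.0, 2.5])
X = 30.0 + weights @ modes + rng.normal(scale=0.5, size=(20, 101))

k, explained, cumulative = components_for_cutoff(X, cutoff=0.95)
print(k, np.round(100.0 * cumulative[:k], 2))
```

The `explained` and `cumulative` arrays correspond to the bars and the cumulative line of a plot such as Figure 2-10, so the same output can be used for visual inspection alongside the fixed cut-off.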
(3) Interpretation of features represented by PCs by assessing loading vectors, extreme plots and PC reconstructions
To recap, the matrix U is an orthogonal transformation that realigns the original data into a new coordinate system. The new coordinates are the PCs, which are aligned with the directions of variation in the data matrix. The columns of U are the eigenvectors of S, often referred to as PC loading vectors. In the presented example, the loading vectors of the first three PCs of the knee flexion angle are shown in Figure 2-11. The loading vectors are themselves waveforms with the same number of points as the original waveforms, so they can be plotted against the same horizontal axis, and the vertical axis depicts
individual sample coefficients of each vector. Where the vector is close to zero, very little of the variance at that instant is represented by the PC; conversely, the regions where the vector lies farthest from zero can be interpreted as the regions of variance best captured by that PC.
Figure 2-11 - Loading vectors for PC1, 2 and 3 of the knee flexion angle.
As explained in section 1.1.3 (equation 1.6), the PCA model is defined as $\mathbf{Z} = [\mathbf{X} - \bar{\mathbf{X}}]\mathbf{U}$. Since the eigenvector matrix is both orthogonal and normalized (orthonormal), it is possible to rearrange the equation so that $\mathbf{X} = \mathbf{Z}\mathbf{U}^{t} + \bar{\mathbf{X}}$, allowing reconstruction of the
original waveform data from the individually calculated PC loading vectors. These reconstructions can be used to aid interpretation of the feature of variance captured by each PC. Figure 2-12 depicts an example where all individual subject waveforms (Figure 2-9) are reconstructed using PC1, PC2 and PC3. The variance represented by a PC is interpreted visually as the shape of the overall data in the regions of the waveform where the reconstructed curves do not overlap. For example, in Figure 2-12, PC1 represents the overall magnitude of the waveform, PC2 the degree of flexion during 70 – 95% of the gait cycle, and PC3 the range of knee flexion during 0 – 60% of the gait cycle. Extreme plots can aid interpretation of the PCs in a similar way. For an individual component, this involves plotting each point of the loading vector multiplied by the mean PC score, and by the mean PC score ±1 standard deviation (Figure 2-12). Showing all three curves within the same plot allows visualisation and interpretation of what the data look like when reconstructed, with single curves representing the standard deviation of the grouped reconstructed data rather than all individual waveforms.
Figure 2-12 - (A) Reconstructions of individual subject waveform data using PC loading vectors.
(B) Extreme plots of reconstructed data using PC loading vectors and mean (solid) and ±1 std (dotted) PC scores
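The single-PC reconstructions and extreme curves described above can be sketched as follows. This is an illustrative example on synthetic data, not the study's implementation. It assumes that the mean waveform is added back to each curve, following the reconstruction equation, and that U comes from an eigendecomposition of the covariance matrix S. The function names are hypothetical.

```python
import numpy as np

def fit_pca(X):
    """Return the mean waveform and the loading vectors (columns of U), PC1 first."""
    mean = X.mean(axis=0)
    S = np.cov(X - mean, rowvar=False)
    eigvals, U = np.linalg.eigh(S)             # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder so PC1 comes first
    return mean, U[:, order]

def reconstruct_from_pc(X, mean, U, k):
    """Reconstruct every subject's waveform from the k-th PC alone (0-based k)."""
    u = U[:, k]
    z = (X - mean) @ u                         # PC scores, one per subject
    return np.outer(z, u) + mean               # rank-one version of Z U^t + mean

def extreme_curves(X, mean, U, k):
    """Curves at the mean PC score and at the mean score -/+ 1 standard deviation."""
    u = U[:, k]
    z = (X - mean) @ u
    sd = z.std(ddof=1)
    centre = mean + z.mean() * u               # z.mean() is ~0 for centred data
    return centre - sd * u, centre, centre + sd * u

# Synthetic waveforms (not study data): a common shape with a per-subject offset.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 101)
X = (30.0 + 25.0 * np.sin(np.pi * t) ** 2
     + rng.normal(scale=5.0, size=(12, 1))
     + rng.normal(scale=1.0, size=(12, 101)))

mean, U = fit_pca(X)
recon_pc1 = reconstruct_from_pc(X, mean, U, 0)
low, centre, high = extreme_curves(X, mean, U, 0)

# Using every PC recovers the original data, because U is orthonormal.
full = (X - mean) @ U @ U.T + mean
print(np.allclose(full, X))  # True
```

Plotting `low`, `centre` and `high` against 0-100% of the gait cycle would give a plot of the kind shown in panel (B), while the rows of `recon_pc1` correspond to the individual reconstructions in panel (A).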
(4) Extraction and interpretation of subject and group PC scores
Individual transformed observations are referred to as PC scores and are represented by the individual elements of each column of Z. PC scores are produced by combining the eigenvector coefficients with the mean-corrected original data points. For example, to generate the first PC score for subject i, the calculation is:
$$z_{1i} = (x_{i1} - \bar{x}_{1})u_{11} + (x_{i2} - \bar{x}_{2})u_{21} + \cdots + (x_{ip} - \bar{x}_{p})u_{p1}$$
The PC score, 𝑧1𝑖, is simply a linear combination of each mean-corrected time sample of that subject’s waveform and the PC loading vector coefficients. Thus, every observation (subject) included in the model is assigned one PC score per calculated PC, which represents the distance of the shape of that subject’s waveform feature from the mean PC feature (calculated from the whole dataset). Since the scores are centred on zero, the more dissimilar the shape of a subject’s feature is from the mean, the further their PC score lies from 0, in either the positive or the negative direction. This is more apparent when studying the subject waveform reconstructions, in which the most extreme reconstructed individual waveforms are assigned the highest or lowest PC scores. The advantage of scores that represent each subject waveform’s contribution to the PC is that clustering analysis, regression or comparison-of-means testing (e.g. t-tests) can then be used to statistically compare or relate waveform features among different test groups. Once the PC scores are calculated, each score can be matched to its class (group) label and downstream testing applied.
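To make the score calculation concrete, the sketch below computes all PC scores as the matrix product of the mean-centred data and U, then checks the first score of one subject against the explicit term-by-term sum. The data are synthetic and the variable names are illustrative assumptions. Rows of the score matrix are taken to be subjects and columns to be PCs, which is an assumed layout.

```python
import numpy as np

# Synthetic waveforms (not study data): 10 subjects x 101 time points.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 101)
X = (30.0 + 25.0 * np.sin(np.pi * t) ** 2
     + rng.normal(scale=5.0, size=(10, 1))
     + rng.normal(scale=1.0, size=(10, 101)))

mean = X.mean(axis=0)                          # mean waveform over all subjects
S = np.cov(X - mean, rowvar=False)             # covariance matrix S
eigvals, U = np.linalg.eigh(S)                 # ascending order from eigh
order = np.argsort(eigvals)[::-1]              # reorder so PC1 comes first
eigvals, U = eigvals[order], U[:, order]

# All scores at once: row i holds subject i, column 0 holds PC1.
Z = (X - mean) @ U

# The same PC1 score for one subject, written out as the explicit sum over
# time samples j of (x_ij - mean_j) * u_j1.
i = 0
z_1i = sum((X[i, j] - mean[j]) * U[j, 0] for j in range(X.shape[1]))
print(np.isclose(z_1i, Z[i, 0]))  # True

# Scores are centred on zero, so their sign and size express how far a
# subject's feature lies from the mean feature.
print(np.allclose(Z.mean(axis=0), 0.0, atol=1e-9))  # True
```

In this layout, column k of `Z` is the set of scores that would be passed, together with the group labels, to the downstream statistical test.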
(5) Summative interpretation and presentation of findings
Within this study, it was of interest to determine which features of the biomechanical and muscle activation waveforms differed statistically between knee FCD and control subjects. Due to the low sample size, Mann-Whitney U tests were applied to the PC scores to test for differences between groups (p≤0.05). Within the biomechanics results section of chapter 2, the interpretation of the biomechanical waveform features captured by each PC is reported alongside the Mann-Whitney U test outcome (p-values).
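A between-group comparison of PC scores of this kind could be sketched as below, using `scipy.stats.mannwhitneyu`. The scores, group sizes and the helper name `compare_groups` are hypothetical; a two-sided test and the 0.05 threshold are assumed from the description above, and no correction for multiple testing is shown.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_groups(Z, labels, alpha=0.05):
    """Two-sided Mann-Whitney U test on each retained PC's scores.

    Returns one tuple per PC: (PC number, U statistic, p-value, significant?).
    """
    results = []
    for k in range(Z.shape[1]):
        a = Z[labels == 1, k]
        b = Z[labels == 2, k]
        stat, p = mannwhitneyu(a, b, alternative="two-sided")
        results.append((k + 1, float(stat), float(p), bool(p <= alpha)))
    return results

# Hypothetical PC scores for two groups of eight subjects (not study data).
# PC1 scores are fully separated between groups; PC2 scores are interleaved.
pc1 = np.array([3.1, 2.4, 2.9, 3.6, 2.2, 4.0, 3.3, 2.7,
                -3.0, -2.5, -3.4, -2.1, -2.8, -3.9, -2.3, -3.2])
pc2 = np.array([1.0, 4.0, 5.0, 8.0, 9.0, 12.0, 13.0, 16.0,
                2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 14.0, 15.0])
Z = np.column_stack([pc1, pc2])
labels = np.array([1] * 8 + [2] * 8)

results = compare_groups(Z, labels)
for pc, stat, p, significant in results:
    print(f"PC{pc}: U = {stat:.1f}, p = {p:.4f}, significant = {significant}")
```

Only the PCs flagged as significant would then be carried forward to the colour coded extreme plots.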
For features that differed statistically between groups, an extreme plot with the mean and ±1 standard deviation curves was also presented (Figure 2-13). The extreme plots are colour coded according to the PC score group differences. For instance, if a significant difference (p≤0.05) was found between group A (green), with a mean PC score of +2, and group B (blue), with a mean PC score of -2, the +1 std curve is colour coded green and the -1 std curve blue, to signify how each group deviated from the mean curve for the given feature. This is depicted in Figure 2-13, which shows the individual subject PC3 reconstructions of the knee flexion angle presented previously, in the form of a colour coded extreme plot.
Figure 2-13 – Conversion of individual subject PC waveform reconstructions from each group
(A) to extreme plots representing the mean (grey), +1 standard deviation (green) and -1 standard deviation (blue) PC waveform reconstructions colour coded to represent group differences corresponding to group mean PC scores.