5. EVALUACIÓN FINANCIERA DEL PROYECTO
5.10. Conclusiones del estudio financiero
The number of spectral channels is usually substantially greater than the number of spectra collected so it seems intuitive to directly project along the spectral domain to reduce the dimensionality. For unsupervised pattern detection any useful dimensionality reduction technique must preserve the chemical differentiation present in the original data. In a tissue image the chemical differentiation would be expected to follow the histology of the tissue section, as different types of tissue have individual molecular signatures.
Algorithm 2.1 details how to spectrally decorrelate using random projections: Algorithm 2.1: spectral random projection
Data: a spectral image Xm×n (with m spectral channels and n pixels), number of random vectors, k
(k < m)
Result: Randomly projected data, Rk×n
1 draw a random matrix Ωk×mfrom an i.i.d normal distribution N (0, 1) ; 2 Form a new matrix of reduced dimensionality: Rk×n= ΩX
As the sampling is random there is no a priori way of knowing what chemical information will be captured by a particular projection but by taking many projections all of the spectral information can be captured. Each random projection samples over the whole m/z domain so every projection should capture a unique subset of the chemical information present. From Figure 2.2 it was seen that for vectors of 100,000 elements a very narrow distribution of angles between random projection vectors was seen so, as the spectra in this example have 130,000 elements, there is no need to orthogonalise or sub-select the projection vectors used in this case. To asses whether the random projections preserved chemical differentiation from this mass
2.7. SPECTRAL RANDOM PROJECTIONS 47 spectrometry image a visual assessment of several scores was made.
Figure 2.3: Examples of the spatial patterns from random projections strongly suggesting the chemical differentiation is preserved through the process. The distributions visible correspond to the anatomical features visible in Figure 2.1,
The number of spectral channels is usually substantially greater than the number of spectra collected so it seems intuitive to directly project along the spectral domain to reduce the dimensionality.
To asses whether the random projections preserved chemical differentiation from this mass spectrometry image a visual assessment of several scores was made. In a tissue image the chemical differentiation would be expected to follow the histology of the tissue section, as different types of tissue have individual molecular signatures. Figure 2.3 shows six score plots from different random projections of the MSI data, there is clear spatial patterning visible that corresponds to tissue types visible in the schematic and ion image from Figure 2.1. This strongly suggests that molecular differentiation is preserved through the projection.
However, it is not recommended that manual inspection of the loadings is attempted. Individually the projections are unlikely to be directly interpretable due to their random nature but the collection of projec- tions contains the information present within the image. As the RP vectors are chosen from a zero-mean Gaussian they contain some values that are negative, accordingly the scores also have both positive and neg- ative values, this makes physical interpretation of the scores difficult. What can be understood is that each projection corresponds to a random orientation of the projection subspace within the original dimensionality, or a different view of the data. It is worth noting that the random matrix generated changes with every analysis so it is not possible to directly compare projections from separate datasets without pre-generating the random projection vectors.
2.7.1
Preservation of Spectral Magnitudes
Figure 2.4: Image magnitude recovery from 500 random projections. Top row: the l1and l2norm calculated
for each spectrum in the original data and displayed as an image. Bottom row: the l1and l2norm calculated
for each set of projections and arranged as an image. Comparing the rows, the l2 norm was successfully
recovered but the l1norm was not.
The most common vector magnitude measurement encountered in MSI is the l1 norm applied to spectra
which is equal to the ’total ion current’ (under the physical assumption that spectra don’t contain negative values), further investigations of the use of norms for MSI contained in Chapter 4. Random projections have been shown to preserve the l2 norm, not the l1, so direct recovery of the TIC is not possible from the
projections, as illustrated in Figure 2.4. The TIC is often used for normalisation of mass spectrometry images, but a comparative study of l1 and l2 normalisation for mass spectrometry imaging found little difference in
image results using the l2norm apart from some bias introduced by its sensitivity towards large magnitude
peaks[53].
2.7.2
Unsupervised learning
The preservation of distance metrics by random projections means they can be used for automated data mining, in this case image segmentation. Following projection, the pixels were automatically assigned into clusters using the popular k-means algorithm [2, 108, 210] and each cluster coloured to segment the image. The
2.7. SPECTRAL RANDOM PROJECTIONS 49
Figure 2.5: Segmenting the fixed rat brain image (into 5 regions using kmeans) as a function of increasing the number of random projections. The clustering extracts the tissue anatomy and becomes stable once 50 or more projections are used
segmentation maps following different numbers of random projections are shown in Figure 2.5. Importantly, this method was applied to a complete mass spectrometry image without requiring any of the traditional data processing pipelines of peak detection and feature extraction. The relative spectral characteristics required by this clustering are preserved following projection, but the number of variables to consider for each pixel is massively reduced, providing an immediate computational time saving and avoiding the curse of dimensionality.
A range of numbers of projections were trialled in order to understand the effect that this has on the clustering quality and consistency. To understand the time saving afforded by dimensionality reduction the time required to perform k-means into 5 clusters on each of these projected datasets was recorded, and is show in Figure 2.6. The segmentation algorithm used here, k-means, is linear in the number of dimensions
Figure 2.6: Time required for k-means segmentation (5 clusters, error bars show standard deviation of 5 repeats) is approximately a liner a function of the dimensionality
included and this directly translates to the computational speed up. The absolute time required to achieve a stable clustering varies between runs as the algorithm is initialised randomly and so takes different numbers of iterations to stabilise. As expected, the overall time is linear in the number of projections, with some small run-to-run fluctuations. The original data dimensionality was the number of m/z channels, i.e. 130,000, so extrapolating the time required for k-means gives an estimated of 2 hours for the calculations. As the raw
2.8. SPATIAL RANDOM PROJECTIONS 51