
summations multiplied by a scalar value, known as a “score”. (95) With this technique it is possible to significantly reduce the number of variables in a dataset whilst still maintaining almost all of the spectral information; for example, in an average Raman spectral measurement, data is collected between 400–1800 cm−1, with measurements taken approximately every 3 cm−1, thus resulting in ∼500 variables in the original data set, which can be reduced to ∼20 using PCA.

Further analysis can then be applied to these PCs to organise them into groups, or clusters, representing different pathologies; to achieve this, techniques such as LDA are often utilised. The main advantage of this tool is that it provides a simpler representation of the data and allows for faster classification algorithms to be designed.
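The reduction from ∼500 variables to ∼20 scores can be illustrated with a minimal sketch. It assumes PCA is computed through a singular value decomposition of the mean-centred data, which is one standard route; the function name `pca_scores`, the synthetic spectra, and the peak positions are hypothetical and stand in for real Raman measurements.

```python
import numpy as np

def pca_scores(spectra, n_components):
    """Project spectra (rows = measurements, columns = wavenumbers) onto the leading PCs."""
    centred = spectra - spectra.mean(axis=0)
    # SVD of the centred data: rows of vt are the PC loadings, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    loadings = vt[:n_components]
    scores = centred @ loadings.T
    # Fraction of the total variance retained by the selected PCs.
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return scores, loadings, explained

# Synthetic stand-in for Raman data (assumption): 60 spectra sampled every 3 cm^-1
# between 400 and 1800 cm^-1, built from five Gaussian peaks plus a little noise.
rng = np.random.default_rng(0)
wavenumbers = np.arange(400, 1801, 3)
peaks = np.stack([np.exp(-0.5 * ((wavenumbers - c) / 15.0) ** 2)
                  for c in (620, 1004, 1250, 1450, 1660)])
weights = rng.normal(size=(60, 5))
spectra = weights @ peaks + 0.01 * rng.normal(size=(60, wavenumbers.size))

scores, loadings, explained = pca_scores(spectra, n_components=20)
print(spectra.shape, "->", scores.shape)
print("fraction of variance retained:", explained)
```

Each spectrum is now described by 20 scores rather than one value per wavenumber, and those scores can be passed to a classifier in place of the raw spectra.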

4.6.2 Linear discriminant analysis

LDA, also known as Fisher’s discriminant analysis, is a supervised multivariate technique used to optimise class separability by finding the direction that provides the best separation for two or more groups of data. LDA is often applied to PC scores to further reduce the dimensionality of the data set. Like PCA, it works by finding linear combinations of the input variables; unlike PCA, which maximises the total variance of the dataset, LDA finds the component vectors that maximise the variance between classes relative to the variance within each class, as shown in Figure 4.3. (64) By maximising this ratio, LDA provides the best linear separation between the groups, thus improving classification results.

Figure 4.3: Representation of LDA, which finds the component axes that maximise class separation

So, in short, PCA is an “unsupervised” algorithm, which means it “ignores” class labels, and its goal is to find the directions (PCs) that maximise the variance in a dataset. In contrast, LDA is “supervised” and computes the directions (linear discriminants, LDs) that maximise the separation between multiple classes. In general, dimensionality reduction not only reduces the computational cost of a given classification task, but can also help to avoid overfitting by minimising the error in parameter estimation.
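The contrast between the two methods can be sketched for the two-class case, where the Fisher direction is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The function name `fisher_lda_direction` and the two-dimensional synthetic "scores" are assumptions made for illustration; they are constructed so that the direction of largest variance carries no class information.

```python
import numpy as np

def fisher_lda_direction(scores, labels):
    """Unit vector maximising between-class over within-class scatter (two classes, labels 0/1)."""
    class_a = scores[labels == 0]
    class_b = scores[labels == 1]
    mean_a, mean_b = class_a.mean(axis=0), class_b.mean(axis=0)
    # Within-class scatter: scatter of each class about its own mean, summed over classes.
    s_w = (class_a - mean_a).T @ (class_a - mean_a) + (class_b - mean_b).T @ (class_b - mean_b)
    w = np.linalg.solve(s_w, mean_b - mean_a)
    return w / np.linalg.norm(w)

# Synthetic two-dimensional scores (assumption): large spread along the first axis,
# class separation only along the second axis.
rng = np.random.default_rng(1)
n = 200
healthy = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.5], size=(n, 2))
cancerous = rng.normal(loc=[0.0, 2.0], scale=[5.0, 0.5], size=(n, 2))
scores = np.vstack([healthy, cancerous])
labels = np.repeat([0, 1], n)

ld = fisher_lda_direction(scores, labels)

# First PC for comparison: the direction of maximum total variance, computed without labels.
centred = scores - scores.mean(axis=0)
pc1 = np.linalg.svd(centred, full_matrices=False)[2][0]

print("first PC direction:", pc1)
print("LD direction:      ", ld)
```

In this constructed example the first PC follows the high-variance axis, whereas the LD follows the axis along which the two classes actually differ.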

4.6.3 Cross validation

Cross validation is often used to estimate how accurately a diagnostic classification model will perform. This is achieved by assessing the results of the statistical algorithm when applied to a validation set of data. The most common method used for assessing Raman spectra models is leave-one-out cross validation. Leave-one-out is based on using a single spectrum as the validation set, with the remaining spectra used as the training set for the algorithm. This is repeated to test each spectrum in the dataset iteratively, and can be used to determine how accurate the model is at predicting the pathological status of the sample. Alternative variations include k-fold cross validation, blind testing, and double blind testing. K-fold cross validation involves partitioning the dataset into k subsets (folds) of roughly equal size; each fold is used once for validation while the remaining k − 1 folds are used for training. Leave-one-out is the special case in which k equals the number of spectra. Blind and double blind testing are based on concealing pathological information from the data in order to remove possible observer bias.
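The leave-one-out procedure described above can be sketched as follows. The classifier here is a simple nearest-class-mean rule chosen only to keep the example short; it is an assumption standing in for a PC-LDA model, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def nearest_mean_predict(train_x, train_y, test_x):
    """Assign test_x to the class whose training mean is closest (stand-in classifier)."""
    classes = np.unique(train_y)
    means = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    distances = np.linalg.norm(means - test_x, axis=1)
    return classes[np.argmin(distances)]

def leave_one_out_predictions(x, y):
    """Predict each sample using a model trained on all of the other samples."""
    predictions = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # every sample except the i-th
        predictions[i] = nearest_mean_predict(x[keep], y[keep], x[i])
    return predictions

# Synthetic scores (assumption): 30 samples per class, three variables each.
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0.0, 1.0, size=(30, 3)),
               rng.normal(3.0, 1.0, size=(30, 3))])
y = np.repeat([0, 1], 30)

predictions = leave_one_out_predictions(x, y)
accuracy = (predictions == y).mean()
print("leave-one-out accuracy:", accuracy)
```

Because the held-out spectrum never contributes to the class means it is tested against, the resulting accuracy estimates performance on unseen data rather than on the training set.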

                     Classified as:
                     Cancerous   Healthy
Cancerous samples    TP          FN
Healthy samples      FP          TN

Table 4.1: Basic demonstration of how samples may be classified following PC-LDA with a cross validation approach. TP, true positive; FN, false negative; FP, false positive; TN, true negative.

Diagnostic classification results are often presented in terms of sensitivity and specificity, which together provide a good representation of the performance of the algorithm. Based on the values shown in Table 4.1 for a basic classifier, the sensitivity and specificity values can be calculated as follows:

• Sensitivity:

The proportion of things being looked for that are found, i.e. the proportion of cancerous samples that the model correctly classifies as cancerous:

\[ \text{sensitivity} = \frac{TP}{TP + FN} \tag{4.16} \]

where TP is the number of true positives and FN is the number of false negatives.

• Specificity:

The proportion of things not being looked for that are not found, i.e. the proportion of healthy samples that the model correctly classifies as healthy:

\[ \text{specificity} = \frac{TN}{TN + FP} \tag{4.17} \]

where TN is the number of true negatives and FP is the number of false positives.
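As a worked illustration of Equations 4.16 and 4.17, the sketch below counts the four outcomes of Table 4.1 from a list of true and predicted labels and then evaluates both ratios. The function names and the ten example labels are hypothetical and are not results from this work.

```python
def confusion_counts(true_labels, predicted_labels, positive="cancerous"):
    """Count TP, FN, FP and TN, treating `positive` as the class being looked for."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def sensitivity_specificity(tp, fn, fp, tn):
    """Equations 4.16 and 4.17."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: five cancerous and five healthy samples.
truth = ["cancerous"] * 5 + ["healthy"] * 5
predicted = (["cancerous"] * 4 + ["healthy"]          # one cancerous sample missed
             + ["healthy"] * 3 + ["cancerous"] * 2)   # two healthy samples flagged

tp, fn, fp, tn = confusion_counts(truth, predicted)
sensitivity, specificity = sensitivity_specificity(tp, fn, fp, tn)
print(tp, fn, fp, tn)            # 4 1 2 3
print(sensitivity, specificity)  # 0.8 0.6
```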

4.7 Summary

Raman micro-spectroscopy is a complex modality that can be applied for disease classification. The physics of Raman spectroscopy, and the design of Raman micro-spectrometers, have been discussed in Chapters 2 and 3, respectively. However, due to the presence of noise, unwanted background signals, as well as calibration errors, it is necessary to apply a range of numerical methods to produce Raman spectra that are reliable and reproducible. This chapter presents various pre-processing tools that are available for calibration, noise reduction, and normalisation. Additionally, baseline correction methods and algorithms for the removal of unwanted contamination signals are discussed in Section 4.4. These techniques are essential to produce spectra that can be used for the accurate diagnosis of disease pathologies.

Following pre-processing, Raman data collected from different biological samples can be classified using multivariate statistical algorithms. The most common algorithms applied for classification of Raman spectra involve PCA and LDA, as discussed in Section 4.6, with sensitivity and specificity results often calculated based on a leave-one-out cross validation approach. Furthermore, in order to produce reliable classification algorithms, it is important to record Raman spectra from biological samples that have been prepared using suitable sample preparation methods, as discussed in further detail in Chapter 5.

Chapter 5