CHAPTER 2. Phraseological characterization of proverbs and routine formulas
2.1 Stylistic and lexical-semantic aspects
In this section we present the statistical methods used to analyse the data collected in the experiments. The methods are grouped into sections corresponding to the techniques used to collect the data. For more in-depth explanations of each technique, refer to the data analysis part of Chapter 1. All of the following analysis was performed using the R statistical computing language unless specified otherwise.
2.5.1 FT-IR Data Analysis
FT-IR data was imported and processed using the R statistical programming environment. As each well was measured three times, the data was averaged by well and assembled into one data set for further analysis. Processed data was subjected to PCA (Section 1.7.1) and LDA-PC (Section 1.7.2). The LDA-PC was performed using the R package “adegenet” [Jombart and Ahmed, 2011].
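The replicate averaging step can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis; the function name `average_by_well`, the well identifiers and the absorbance values are all hypothetical, and the sketch assumes that every replicate spectrum shares the same wavenumber grid.

```python
from collections import defaultdict


def average_by_well(readings):
    """Average replicate spectra that share a well identifier.

    readings: iterable of (well_id, spectrum) pairs, where spectrum is a
    list of absorbance values on a common wavenumber grid (assumption).
    Returns a dict mapping well_id to the element-wise mean spectrum.
    """
    grouped = defaultdict(list)
    for well_id, spectrum in readings:
        grouped[well_id].append(spectrum)

    averaged = {}
    for well_id, spectra in grouped.items():
        n = len(spectra)
        # zip(*spectra) walks the replicates point by point.
        averaged[well_id] = [sum(values) / n for values in zip(*spectra)]
    return averaged


# Hypothetical data: well "A1" measured three times, well "B1" once.
readings = [
    ("A1", [0.10, 0.20, 0.30]),
    ("A1", [0.20, 0.30, 0.40]),
    ("A1", [0.30, 0.40, 0.50]),
    ("B1", [1.00, 1.00, 1.00]),
]
averaged = average_by_well(readings)
```

The resulting dictionary holds one row per well and can be assembled into a single samples-by-variables matrix for PCA or LDA-PC.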
2.5.2 NMR Data Analysis
NMR data was first processed using our custom software ProcNMR (see Chapter 4). The standard processing procedure included exponential line broadening by 0.3 Hz, Fourier transformation, automatic phasing and referencing to the TSP signal. We also performed quality control by measuring the TSP mid-peak width in Hz. A width of over 1.5 Hz was taken as an indication of badly calibrated readings. The data was then binned either uniformly, using bins 0.05 ppm wide, or according to a custom binning pattern. The custom binning pattern was developed from the data obtained in the experiments and included the peaks that did not vary in chemical shift. The binned data was then subjected to multivariate analysis. The raw data was also retained for visual inspection.
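The quality-control threshold and the uniform binning can be illustrated with a short Python sketch. It is not the ProcNMR implementation: the function names are hypothetical, and summing the intensities within each bin is an assumption, since the text does not state whether bins were summed or integrated in another way.

```python
import math


def passes_qc(tsp_width_hz, limit_hz=1.5):
    """Return True when the TSP peak width does not exceed the 1.5 Hz limit."""
    return tsp_width_hz <= limit_hz


def bin_spectrum(ppm, intensity, width=0.05):
    """Sum intensities into uniform chemical-shift bins of the given width.

    Returns a list of (bin_start_ppm, summed_intensity) sorted by shift.
    Summation within a bin is an assumption made for this sketch.
    """
    bins = {}
    for shift, value in zip(ppm, intensity):
        # The small offset guards against floating-point error at bin edges.
        index = math.floor(shift / width + 1e-9)
        bins[index] = bins.get(index, 0.0) + value
    return [(index * width, total) for index, total in sorted(bins.items())]


# Hypothetical spectrum with five points.
ppm = [0.01, 0.02, 0.06, 0.09, 0.12]
intensity = [1.0, 2.0, 3.0, 4.0, 5.0]
binned = bin_spectrum(ppm, intensity)
```

A custom binning pattern would replace the uniform index calculation with a lookup against a list of bin boundaries.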
Principal Component Analysis was the most regularly employed technique for the NMR data analysis. Before the analysis, the data was mean-centred and each variable was normalised by its standard deviation. Pareto scaling was also tried but did not significantly improve the results compared to scaling by the standard deviation. The first two principal components were plotted as a scatter plot. In some cases further principal components were plotted for better visualization of the data.
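The two scaling options compared above differ only in the divisor applied after mean-centring: the standard deviation for scaling to unit variance, and its square root for Pareto scaling. The Python sketch below illustrates this under the assumption that the sample standard deviation (n - 1 denominator) is used; the function name and the example matrix are hypothetical.

```python
import math
import statistics


def scale_columns(matrix, method="auto"):
    """Mean-centre each column and divide by a scaling factor.

    method="auto"   divides by the sample standard deviation.
    method="pareto" divides by the square root of the standard deviation.
    Constant columns are only centred, to avoid division by zero.
    """
    if method not in ("auto", "pareto"):
        raise ValueError("method must be 'auto' or 'pareto'")

    scaled_columns = []
    for column in zip(*matrix):
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)
        if sd == 0:
            divisor = 1.0
        elif method == "auto":
            divisor = sd
        else:
            divisor = math.sqrt(sd)
        scaled_columns.append([(value - mean) / divisor for value in column])

    # Transpose back to samples in rows, variables in columns.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical matrix: three samples, two variables on different scales.
matrix = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
auto_scaled = scale_columns(matrix, "auto")
pareto_scaled = scale_columns(matrix, "pareto")
```

After scaling to unit variance both variables carry equal weight, whereas Pareto scaling leaves the second variable with a larger spread.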
Hierarchical Cluster Analysis (HCA) was another multivariate technique employed in the analysis of NMR data. Since the first two or three principal components in PCA did not always account for the majority of the variance, HCA was employed to investigate the data further. The technique was performed using the Euclidean distance metric and the complete linkage method. The results were plotted as a dendrogram.
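Complete linkage defines the distance between two clusters as the largest Euclidean distance between any pair of their members, and at each step the two closest clusters are merged. The following Python sketch is a naive illustration of that rule, not the routine used for the analysis; the function name and the example points are hypothetical.

```python
import math


def complete_linkage(points):
    """Agglomerative clustering with Euclidean distance and complete linkage.

    Returns a list of merges (members_a, members_b, height) in merge order,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Complete linkage: the farthest pair defines the cluster distance.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical data: two well-separated pairs of samples.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = complete_linkage(points)
```

The merge heights increase from the tight pairs to the final join, which is what produces the long top branches of a dendrogram when groups are well separated.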
Multiple Dataset Integration (MDI) was used for modelling the data including the time-related information. The source code for the software was obtained from “http://github.com/smason/mdipp” and compiled for the Linux operating system. The data was mean-centred prior to analysis. Simulations were run for 100,000 iterations and the clustering agreement matrices were plotted for inspection. Analysis was performed on infected RBC data before and after the subtraction of control RBC data.
2.5.3 Image Analysis
As noted above, the image data was cleaned and presented as a standard data matrix with samples in the rows and variables in the columns. All data manipulation and permutation testing programs were custom written for this work in the R statistical programming environment.
The data modelling was performed using PLS-DA (Section 1.7.3) models, with 10-fold cross-validation for hyperparameter fitting and a 20-model ensemble for testing and predictions (Fig. 2.5). The data set $X$ was first randomly split into a training set $X_{tr}$ and a test set $X_{ts}$, keeping the ratio of samples in each group as close to the starting ratio as possible. In this case there was a 5:4 ratio of “fast” to “slow” drug samples, and both resulting data sets kept a ratio close to 5:4. Next, a 10-fold cross-validation procedure was applied in order to select the number of components $\hat{n}$ to be used in the final model. The training data set $X_{tr}$ was randomly split into a training subset $X'_{tr}$ and a validation subset $X_v$. A PLS-DA model was then fitted to $X'_{tr}$ 10 times, each time with a different number of components $n \in \{1, \dots, 10\}$. The resulting models were tested on $X_v$ and the $Q^2$ metric was calculated for each model. The procedure was repeated ten times and the resulting $Q^2$ values were collected into a $10 \times 10$ matrix. The mean $Q^2$ value was then calculated for each value of $n$, and the $n$ with the highest mean $Q^2$ was selected as $\hat{n}$ for the final model. A PLS-DA model with $\hat{n}$ components was then fitted to $X_{tr}$ and tested on $X_{ts}$; $Q^2$ was calculated and stored with the model for further use. The whole procedure was repeated 20 times, resulting in 20 models fitted to different splits of the data. Predictions were then made using the whole ensemble of 20 models, combining the 20 predictions by averaging the predicted probabilities of the sample being in the “fast” group.
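The bookkeeping around the PLS-DA fits (stratified splitting, scoring, selection of the number of components and ensemble averaging) can be sketched in Python as below. This is an illustration under stated assumptions, not the R code written for this work: the PLS-DA fitting itself is omitted, all function names are hypothetical, the test fraction of 0.2 is an arbitrary example value, and Q2 is computed with the common definition 1 - PRESS/TSS, which may differ in detail from the one given in Section 1.7.

```python
import random
from statistics import fmean


def stratified_split(labels, test_fraction, rng):
    """Split sample indices so each class keeps roughly its original share."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


def q2(observed, predicted):
    """Q2 = 1 - PRESS / TSS on held-out samples (assumed definition)."""
    mean_observed = fmean(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - press / tss


def select_n_components(q2_matrix):
    """Pick the n with the highest mean Q2.

    q2_matrix[r][k] holds Q2 for repeat r and n = k + 1 components.
    """
    mean_q2 = [fmean(column) for column in zip(*q2_matrix)]
    best_k = max(range(len(mean_q2)), key=mean_q2.__getitem__)
    return best_k + 1


def ensemble_probability(probabilities):
    """Average the per-model probabilities of the "fast" class."""
    return fmean(probabilities)


# Hypothetical labels in the 5:4 ratio described in the text.
labels = ["fast"] * 50 + ["slow"] * 40
train_idx, test_idx = stratified_split(labels, 0.2, random.Random(0))

# Hypothetical Q2 values: two repeats, three candidate values of n.
n_hat = select_n_components([[0.1, 0.5, 0.3], [0.2, 0.6, 0.4]])
score = q2([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
p_fast = ensemble_probability([0.8, 0.6, 0.7])
```

In the full procedure, `stratified_split` would be called once per ensemble member and again inside the cross-validation loop, with a PLS-DA fit between the split and the `q2` call.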
After processing, the image data was subjected to permutation testing by fitting PLS-DA models to the data with random permutations of the labels, as explained in Section 1.7.4. A total of 1000 permutations were run on each dataset, with 10-fold cross-validation for selection of the number of PLS-DA components and a 20-model ensemble used for prediction. An empirical distribution of the model “goodness-of-prediction” metric $Q^2$ was constructed, and a p-value for the $Q^2$ of the correctly labelled data was calculated. The models fitted to the correctly labelled data were then used for classification of the MMV data. The group membership of each sample was predicted by averaging the predictions of the 20 models.
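The permutation test reduces to comparing the Q2 obtained with the true labels against the Q2 values obtained with shuffled labels. The Python sketch below illustrates this step only; the function names and numbers are hypothetical, and the add-one correction in the p-value is a common convention assumed here rather than a detail taken from Section 1.7.4.

```python
import random


def permute_labels(labels, rng):
    """Return a randomly shuffled copy of the class labels."""
    return rng.sample(labels, len(labels))


def permutation_p_value(observed_q2, permuted_q2):
    """Empirical p-value: share of permuted Q2 values at least as large
    as the observed one, with an add-one correction (assumption)."""
    exceed = sum(1 for value in permuted_q2 if value >= observed_q2)
    return (exceed + 1) / (len(permuted_q2) + 1)


# Hypothetical values; the real test used 1000 permutations per dataset.
rng = random.Random(1)
labels = ["fast"] * 5 + ["slow"] * 4
shuffled = permute_labels(labels, rng)
p_value = permutation_p_value(0.6, [0.1, -0.2, 0.65, 0.3])
```

Each shuffled label vector would be passed through the same cross-validation and ensemble procedure as the real labels before its Q2 is added to the empirical distribution.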
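[Figure 2.5 schematic. Outer block (repeated 20 times): $X$ is split into $X_{tr}$ and $X_{ts}$. Inner block (repeated 10 times): $X_{tr}$ is split into $X'_{tr}$ and $X_v$; for $n = \{1..10\}$, Fit($X'_{tr}$, $n_i$) and Test($X_v$) give $Q^2_n$. The selected $\hat{n}$ is used in Fit($X_{tr}$, $\hat{n}$) and Test($X_{ts}$), giving $Q^2$ and the stored Model.]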
Figure 2.5: A schematic illustration of the model training, testing and validation approach employed for image analysis. Each block of operations contained in a dotted square is repeated a number of times given in the top right corner of the square.