This section presents the performance evaluation metrics used to assess the effectiveness of unsupervised feature selection methods.
3.3.2.1 Root Mean Square Error (RMSE)
The Root Mean Square Error (RMSE) [148] has been utilised as a standard statistical metric to evaluate the performance of models in different research areas [149]. It provides a complete picture of the distribution of error. The RMSE can be expressed as:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{n}} \tag{3.11}$$
where $n$ is the number of samples, and $y_i$ and $y'_i$ are the expected and predicted outputs, respectively.
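As an illustration only, the RMSE definition above can be sketched in a few lines of Python. The function name `rmse` and the example values are assumptions made for this sketch and do not come from the study.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between expected and predicted outputs.

    Hypothetical helper for illustration; `actual` and `predicted` are
    equal-length sequences of numbers.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    # Sum of squared differences, divided by n, then square-rooted.
    squared_error = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(squared_error / n)


if __name__ == "__main__":
    # Made-up values: the errors are 3 and 4, so RMSE = sqrt((9 + 16) / 2).
    print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Because the errors are squared before averaging, a single large error raises the RMSE more than several small errors of the same total size.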
3.3.2.2 Pearson Correlation Coefficient (PCC)
The Pearson Correlation Coefficient (PCC) is an evaluation metric that is utilised to assess the performance of predictive models. The PCC evaluates the strength of the relationship between two variables. It can be calculated as:
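$$\mathrm{PCC} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} \tag{3.12}$$
where $x_i$ and $y_i$ are the values of the two quantitative variables, $n$ is the number of samples, and PCC indicates the linear association between them. A PCC value equal to 1 indicates a perfect positive linear correlation.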
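The following minimal sketch computes the PCC with the standard computational form of the Pearson formula (sums of products and squares scaled by the number of samples). The function name `pearson_correlation` and the sample data are hypothetical and are used only for illustration.

```python
import math


def pearson_correlation(x, y):
    """Pearson Correlation Coefficient of two equal-length sequences.

    Hypothetical helper for illustration only.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    if denominator == 0:
        # A constant variable has zero variance, so the PCC is undefined.
        raise ValueError("PCC is undefined when a variable is constant")
    return numerator / denominator


if __name__ == "__main__":
    # Made-up data: y is an exact linear function of x.
    print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value near -1 indicates an equally strong but inverse linear relationship, which the sketch returns when one of the sequences is reversed.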
3.3.2.3 Theil’s U Statistics
Theil's U statistic [150] is an accuracy measure that evaluates the prediction performance of a model. It can be calculated using the following formula:
$$U = \frac{\mathrm{RMSE}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2}} \times \frac{1}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} {y'_i}^2}} \tag{3.13}$$
where $y_i$ and $y'_i$ are the actual and the corresponding forecast values, respectively, and $n$ is the number of samples. The RMSE is calculated using Eq. 3.11. A value of U closer to 0 indicates better prediction performance.
3.3.2.4 Mean Absolute Deviation (MAD)
The Mean Absolute Deviation (MAD) is the average absolute error of the predictive model. The MAD can be calculated from the following formula:
$$\mathrm{MAD} = \frac{\sum_{i=1}^{n} |y_i - y'_i|}{n} \tag{3.14}$$
where $y_i$ is the actual value, $y'_i$ is the predicted value, and $n$ represents the number of samples.
3.3.2.5 Mean Absolute Percentage Error (MAPE)
The Mean Absolute Percentage Error (MAPE) estimates the average absolute percentage error of the predictive model. The MAPE is formulated as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|y_i - y'_i|}{|y_i|} \times 100 \tag{3.15}$$
where $y_i$ is the actual value, $y'_i$ is the predicted value, and $n$ represents the number of samples.
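The MAD and MAPE definitions differ only in whether each absolute error is divided by the actual value. The sketch below shows both side by side; the function names and the numbers are assumptions chosen for illustration. Note that the MAPE is undefined when any actual value is zero, which the sketch treats as an error.

```python
def mean_absolute_deviation(actual, predicted):
    """MAD: average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """MAPE: average absolute error relative to the actual value, in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(y == 0 for y in actual):
        # Division by |y_i| is impossible for a zero actual value.
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = (abs(y - y_hat) / abs(y) for y, y_hat in zip(actual, predicted))
    return 100.0 * sum(ratios) / len(actual)


if __name__ == "__main__":
    # Made-up values: absolute errors are 10, 20 and 0.
    actual = [100.0, 200.0, 400.0]
    predicted = [110.0, 180.0, 400.0]
    print(mean_absolute_deviation(actual, predicted))
    print(mean_absolute_percentage_error(actual, predicted))
```

In this example the first two errors differ in absolute size (10 and 20) but are both 10% of their actual values, which is why the MAPE is useful when the target values span different scales.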
3.3.2.6 Coefficient of Determination ($q^2$)
The Coefficient of Determination ($q^2$) is a statistical metric based on the proportion of variability in a data set. If the value of $q^2$ is close to 1, it means that a model has been successfully constructed; on the other hand, negative $q^2$ values suggest that a model ineffectively approximates the predicted values [151]. The $q^2$ metric can be calculated from the following formula:
$$q^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{3.16}$$
where $y_i$ and $y'_i$ are the actual and the corresponding forecast values, respectively, $n$ is the number of samples, and $\bar{y}$ is the mean of all actual values in the prediction data set.
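To make the behaviour of $q^2$ concrete, the sketch below evaluates it for three made-up predictions of the same data: a perfect one, one that always predicts the mean, and one that is worse than predicting the mean. The function name `q_squared` and the values are hypothetical.

```python
def q_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    total_ss = sum((y - mean_actual) ** 2 for y in actual)
    if total_ss == 0:
        # All actual values are identical, so the ratio is undefined.
        raise ValueError("q2 is undefined when the actual values are constant")
    return 1.0 - residual_ss / total_ss


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0]
    print(q_squared(actual, [1.0, 2.0, 3.0]))  # perfect prediction
    print(q_squared(actual, [2.0, 2.0, 2.0]))  # always predicts the mean
    print(q_squared(actual, [3.0, 2.0, 1.0]))  # worse than the mean
```

The three cases give a value of 1, a value of 0, and a negative value, respectively, which matches the interpretation that negative $q^2$ values indicate a model that performs worse than simply predicting the mean of the actual values.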
3.3.2.7 Mean Square Error (MSE)
The Mean Square Error (MSE) represents the average of the squared estimation errors of the predictive model; it therefore measures the prediction performance of the model. The MSE can be expressed as:
$$\mathrm{MSE} = \frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{n} \tag{3.17}$$
where $n$ is the number of samples, and $y_i$ and $y'_i$ are the expected and the predicted values, respectively. The MSE can also be calculated from the RMSE, since $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$.
3.4 Summary
In this chapter, the prediction methods, data sets, statistical validation techniques, and performance evaluation metrics used in this study to assess the proposed methods are presented. The MISO and MIMO regression tasks are performed using SVR and MSVR, respectively. The effectiveness of the unsupervised feature selection methods, including the proposed methods, is tested on a total of six different data sets. The RV144 Vaccine data set consists of 100 plasma samples, of which 20 are placebo and 80 are vaccine-injected samples. Each data sample has twenty antibody features related to IgG subclass and antigen specificity. The goal of using this data set is to reveal the relationships between antibody features and their effector functions. The peptide binding affinity data sets consist of three different tasks: Tasks 1 and 3 contain nona-peptides with a total of 5787 amino acid descriptors, and Task 2 consists of octa-peptides with a total of 5144 amino acid descriptors. The goal of using these data sets is to predict peptide binding affinity values from the given amino acid descriptors. The GSE40279 data set contains 473034 CpG biomarkers (features) from the whole blood of 656 individuals (samples) aged 11 to 101. The goal of using this data set is to identify age-related CpG dinucleotides (features) and reveal the associations between them and chronological age. In this study, the k-fold cross-validation technique is used for model error estimation. In addition, seven different evaluation metrics, namely RMSE, MSE, MAPE, MAD, $q^2$, U, and PCC, are used to assess the prediction performance of the predictive models.
K-Means Based Unsupervised Feature Selection
In this chapter, a K-means based unsupervised feature selection framework for regression problems is proposed. First, the K-means algorithm is described along with its advantages and disadvantages. Then, the proposed K-means based unsupervised feature selection framework, designed specifically for regression problems, is presented. Next, existing K-means based feature selection methods are reviewed. The final section presents the results of applying the proposed method to the RV144 Vaccine, peptide binding affinity, GSE44763, and GSE40279 data sets, compared with state-of-the-art unsupervised feature selection techniques as well as the baseline (the entire feature set).
4.1 Introduction
Clustering can be defined as a way to group data naturally. K-means [152] is a classic unsupervised learning algorithm that aims to find a user-defined number of clusters, each represented by a centroid. The K-means algorithm is practical, simple, and typically fast [153]. The process of the K-means algorithm consists of the following steps:
(i) A centroid is defined for each cluster; thus, a total of k centroids are defined.
(ii) Each data point is assigned to the closest centroid.
(iii) Centroid positions are recomputed.
(iv) Steps (ii) and (iii) are repeated until no more moves are possible for the centroids.
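Steps (i) to (iv) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the squared Euclidean distance, the random choice of initial centroids from the data, the `max_iter` safeguard, and the toy points are all choices made for this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Basic K-means; returns (centroids, labels).

    `initial_centroids`, `max_iter` and `seed` are illustrative parameters.
    """
    # Step (i): define k centroids (given, or sampled from the data points).
    if initial_centroids is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)
    centroids = [tuple(float(c) for c in centroid) for centroid in centroids]

    labels = None
    for _ in range(max_iter):
        # Step (ii): assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step (iv): stop once assignments are stable, because the
        # centroids can no longer move.
        if new_labels == labels:
            break
        labels = new_labels
        # Step (iii): recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    # Toy data: two well-separated groups of three points each.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, assignment = k_means(data, k=2, initial_centroids=[(0, 0), (10, 10)])
    print(centers)
    print(assignment)
```

The result depends on the initial centroids chosen in step (i), so in practice the algorithm is often run several times with different initialisations.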