
Robust PCA (RPCA) aims to determine PCs (eigenvectors) that are not influenced by outliers. Much research has been carried out on robustifying PCA over the years (Croux and Haesbroeck, 2000; Hubert et al., 2005; Candès et al., 2011; Feng et al., 2012). Existing methods can be categorized according to the dimensionality of the data: some are appropriate for high-dimensional data (Xu et al., 2010; Candès et al., 2011; Feng et al., 2012) and some are better suited to low-dimensional data (Croux and Ruiz-Gazen, 2005; Hubert et al., 2005). We are concerned with 3D point cloud data, where the number of dimensions is considerably smaller than the number of observations (points). Hence we are interested in an efficient method for low-dimensional data.

Roughly, these methods fall into two groups: (i) those that try to find a robust estimate of the covariance matrix, and (ii) those based on Projection Pursuit (PP) (Friedman and Tukey, 1974), such as Li and Chen (1985) and Hubert et al. (2002), which maximize a robust estimate of univariate variance to obtain consecutive directions onto which the data are projected. Covariance-matrix-based methods are limited when there are insufficient data to robustly estimate a high-dimensional covariance matrix, whereas PP-based methods are qualitatively robust and inherit the robustness characteristics of the adopted estimators (Feng et al., 2012). The first group replaces the classical covariance matrix with a robust estimator such as an M-estimator (Maronna, 1976; Campbell, 1980). Croux and Haesbroeck (2000) suggested using high-breakdown robust estimators such as the MCD to derive the covariance matrix. Croux and Ruiz-Gazen (1996) proposed a robust PCA in which the PCs are defined as projections of the data onto directions maximizing the robust scale Qn. Spherical PCA and elliptical PCA were also proposed as robust PCA variants by Locantore et al. (1999). Another way of obtaining a robust PCA is to replace the LS cost function by a robust cost function such as the Least Trimmed Squares (LTS) estimator (Rousseeuw, 1984; Rousseeuw and Leroy, 2003) or an M-estimator (Maronna, 2005).
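The projection-pursuit idea described above can be illustrated with a minimal Python sketch. It is not the algorithm of any of the cited papers. As simplifying assumptions, it uses the median absolute deviation (MAD) as the robust scale instead of Qn, the coordinate-wise median as the centre, and only the directions through the centre and each data point as candidates. The function names are hypothetical.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation: a simple robust univariate scale."""
    values = list(values)
    med = median(values)
    return median(abs(v - med) for v in values)

def first_robust_direction(points):
    """Return the candidate unit direction whose projections have the largest MAD."""
    dim = len(points[0])
    # Assumption: coordinate-wise median as a cheap robust centre.
    centre = [median(p[k] for p in points) for k in range(dim)]
    centred = [[p[k] - centre[k] for k in range(dim)] for p in points]
    best_dir, best_scale = None, -1.0
    for c in centred:
        norm = math.sqrt(sum(x * x for x in c))
        if norm == 0.0:
            continue  # a point sitting on the centre defines no direction
        v = [x / norm for x in c]
        projections = [sum(a * b for a, b in zip(q, v)) for q in centred]
        scale = mad(projections)
        if scale > best_scale:
            best_dir, best_scale = v, scale
    return best_dir, best_scale

# Eleven points spread along x, plus one gross outlier far away along y.
data = [(float(x), 0.1 if (x + 5) % 2 == 0 else -0.1) for x in range(-5, 6)]
data.append((0.0, 100.0))
direction, scale = first_robust_direction(data)
```

In this toy data set the classical variance is dominated by the single point at y = 100, whereas the MAD-based search still returns a direction close to the x axis. Further components would be obtained by repeating the search in the orthogonal complement of the directions already found.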

Hubert et al. (2005) proposed a version of robust PCA that combines a robust estimator of the covariance matrix with PP to take advantage of both approaches. In this thesis, we choose this method because it yields accurate estimates for outlier-free datasets and more robust estimates for contaminated data, is able to detect exact-fit situations, and has the further advantage of outlier diagnostics and classification (Hubert et al., 2005), all of which are beneficial to our purpose.

The RPCA (Hubert et al., 2005) involves the following steps. First, the data are pre-processed so that the transformed data lie in a subspace whose dimension is less than the number of observations $n$, without loss of information. Reducing the data space to the affine subspace spanned by the $n$ observations is especially useful when the original dimension $m \ge n$, but even when $m < n$, the observations may span less than the whole $m$-dimensional space (Hubert et al., 2005). A useful way of reducing the data space is to use the SVD of the mean-centred data matrix. Second, the $h$ 'least outlying' data points, where $n/2 < h < n$, are identified. For this, a measure of outlyingness is computed by projecting all the data points onto many univariate directions, each of which passes through two individual data points. In order to keep the computation time down, the data set is compressed to PCs defining potential directions. Then every point $p_i$ is scored on each direction $v$ by its robustly standardized projection, and its outlyingness (Stahel, 1981; Donoho, 1982) is the largest such value over all directions:

\[ w_i = \max_{v} \frac{|p_i v^T - c_{MCD}(p_i v^T)|}{\Sigma_{MCD}(p_i v^T)}, \quad i = 1, \ldots, n \qquad (2.34) \]

where $p_i v^T$ denotes the projection of the $i$th observation onto the direction $v$, and $c_{MCD}$ and $\Sigma_{MCD}$ are the MCD-based mean and scatter, respectively, along the univariate direction $v$. The FMCD estimators are used as the robust estimators of the mean and scatter in Eq. (2.34). In the next step, the $h$ ($h > n/2$) observations with the smallest outlyingness values are used to construct a robust covariance matrix $\Sigma_h$. A larger $h$ can give a more accurate RPCA, whereas a smaller $h$ gives more robust results. Users can fix it according to their own objectives and their knowledge of the particular data. We use $h = \lceil 0.5 \times n \rceil$ in our algorithms. Then the method projects the observations onto the $d$-dimensional subspace spanned by the eigenvectors of $\Sigma_h$ that correspond to its $d$ largest eigenvalues, and computes the mean and the covariance matrix by means of the reweighted MCD estimator, with weights based on the robust distance of every point. The eigenvectors of this covariance matrix from the reweighted observations are the final robust PCs, and the MCD mean serves as a robust mean. The resulting robust PCA is location and orthogonal invariant.
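The outlyingness step and the selection of the h least outlying points can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation of Hubert et al. (2005): it uses every pair of points as a direction (feasible only for small n), computes the exact univariate MCD by scanning contiguous windows of the sorted projections rather than calling FMCD, omits the consistency factor of the MCD scale, and skips directions with zero robust scale instead of handling the exact-fit case. It stops before the covariance and reweighting steps. All function names are hypothetical.

```python
import math
from itertools import combinations

def univariate_mcd(values, h):
    """Mean and standard deviation of the h sorted neighbours with least variance."""
    z = sorted(values)
    best_mean, best_var = None, float("inf")
    for start in range(len(z) - h + 1):
        window = z[start:start + h]
        mean = sum(window) / h
        var = sum((x - mean) ** 2 for x in window) / h
        if var < best_var:
            best_mean, best_var = mean, var
    return best_mean, math.sqrt(best_var)

def outlyingness(points, h):
    """Largest robustly standardized projection of each point over all pair directions."""
    n = len(points)
    w = [0.0] * n
    for a, b in combinations(range(n), 2):
        diff = [pa - pb for pa, pb in zip(points[a], points[b])]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm == 0.0:
            continue  # duplicate points define no direction
        v = [c / norm for c in diff]
        proj = [sum(pc * vc for pc, vc in zip(p, v)) for p in points]
        centre, scale = univariate_mcd(proj, h)
        if scale < 1e-12:
            continue  # assumption: exact-fit directions are simply skipped here
        for i in range(n):
            w[i] = max(w[i], abs(proj[i] - centre) / scale)
    return w

def least_outlying(points):
    """Outlyingness values and the indices of the h = ceil(0.5 * n) smallest ones."""
    n = len(points)
    h = math.ceil(0.5 * n)
    w = outlyingness(points, h)
    order = sorted(range(n), key=lambda i: w[i])
    return w, order[:h]

# 4 x 4 grid close to the plane z = 0, plus one point far off the plane.
cloud = [(i / 3, j / 3, 0.01 * ((7 * i + 3 * j) % 5 - 2))
         for i in range(4) for j in range(4)]
cloud.append((0.5, 0.5, 10.0))
w, subset = least_outlying(cloud)
```

The covariance matrix of the points indexed by `subset` would then play the role of $\Sigma_h$. For real point clouds the number of pairs grows quadratically with n, so a random sample of directions would normally replace the exhaustive loop.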

An extra advantage of the RPCA algorithm (Hubert et al., 2005) is that it can identify outliers. There are two types of outliers. The first is the orthogonal outlier, which lies away from the subspace spanned by the first $d$ PCs (in our case $d = 2$, for a plane) and is identified by the Orthogonal Distance (OD), the distance between the observation $p_i$ and its projection $\hat{p}_i$ onto the $d$-dimensional PCA subspace. For $p_i$ it is defined as:

\[ OD_i = \|p_i - \hat{p}_i\| = \|p_i - \hat{\mu}_p - t_i L^T\|, \quad i = 1, \ldots, n \qquad (2.35) \]

where $\hat{\mu}_p$ is the robust centre of the data, $L$ is the robust loading (PC) matrix, which contains the robust PCs as its columns, and $t_i = (p_i - \hat{\mu}_p) L$ is the $i$th robust score. The other type of outlier is identified by the Score Distance (SD), which is measured within the PCA subspace and is defined as:

\[ SD_i = \sqrt{\sum_{j=1}^{d} \frac{t_{ij}^2}{l_j}}, \quad i = 1, \ldots, n \qquad (2.36) \]

where $l_j$ is the $j$th eigenvalue of the robust covariance matrix $\Sigma_{MCD}$ and $t_{ij}$ is the $(i, j)$th element of the score matrix:

\[ T_{n,d} = (P_{n,m} - 1_n c_{MCD}) L_{m,d}, \qquad (2.37) \]

where $P_{n,m}$ is the data matrix, $1_n$ is the column vector with all $n$ components equal to 1, $c_{MCD}$ is the robust centre, and $L_{m,d}$ is the matrix whose columns are the robust PCs. OD and SD are sketched in Figure 2.6b. The cut-off value for the score distance is $\sqrt{\chi^2_{d,0.975}}$, and that for the orthogonal distance is obtained from a scaled version of $\chi^2$, namely $g_1 \chi^2_{g_2}$, which gives a good approximation of the unknown distribution of the squared ODs (Box, 1954); $g_1$ and $g_2$ are two parameters estimated by the method of moments (Nomikos and MacGregor, 1995). The reader is referred to Hubert et al. (2005) for more information about the RPCA algorithm.
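As a rough illustration of Eqs. (2.35) to (2.37), the following Python sketch computes both distances for given robust estimates. It assumes that the robust centre, the loading matrix (stored as m rows of d entries, with orthonormal columns) and the eigenvalues are already available. The closed-form score cut-off is valid only for d = 2, where the $\chi^2_2$ quantile equals $-2\ln(1 - 0.975)$. The moment matching uses the plain sample mean and variance of the squared ODs, which is one simple reading of "method of moments" and not necessarily the estimator used in the cited works. Function names are hypothetical.

```python
import math

def distances(points, centre, loadings, eigenvalues):
    """Orthogonal and score distances of row-vector points (Eqs. 2.35 and 2.36)."""
    m, d = len(loadings), len(eigenvalues)
    od, sd = [], []
    for p in points:
        centred = [p[k] - centre[k] for k in range(m)]
        # Score t_i = (p_i - centre) L
        t = [sum(centred[k] * loadings[k][j] for k in range(m)) for j in range(d)]
        # Fitted part t_i L^T, still relative to the centre
        fitted = [sum(t[j] * loadings[k][j] for j in range(d)) for k in range(m)]
        od.append(math.sqrt(sum((centred[k] - fitted[k]) ** 2 for k in range(m))))
        sd.append(math.sqrt(sum(t[j] ** 2 / eigenvalues[j] for j in range(d))))
    return od, sd

def sd_cutoff_plane(prob=0.975):
    """Score-distance cut-off for d = 2 only: chi-square(2) has a closed-form quantile."""
    return math.sqrt(-2.0 * math.log(1.0 - prob))

def match_scaled_chi2(squared_od):
    """Moment matching for g1 * chi2(g2): mean = g1 * g2 and variance = 2 * g1**2 * g2."""
    n = len(squared_od)
    mean = sum(squared_od) / n
    var = sum((x - mean) ** 2 for x in squared_od) / n
    return var / (2.0 * mean), 2.0 * mean ** 2 / var

# Hypothetical robust fit: the plane z = 0 through the origin.
centre = (0.0, 0.0, 0.0)
loadings = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
eigenvalues = [4.0, 1.0]
od, sd = distances([(2.0, 1.0, 3.0)], centre, loadings, eigenvalues)
```

The orthogonal cut-off itself requires a quantile of $g_1 \chi^2_{g_2}$ with a generally non-integer $g_2$, for which a statistics library would be needed; that step is left out here.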

In Figures 2.6(b and c), we illustrate the orthogonal and score outliers based on 30 artificial 3D points (Figure 2.6a), including 6 outliers, projected onto the fitted plane in Figure 2.6c. Points 25, 26 and 27, marked as green points in Figure 2.6c, are essentially in the plane, as their orthogonal distances are low, although they are distant from the mean within the plane (large score distance). In Figure 2.6b, they are identified as good leverage points. Points 28, 29 and 30 (red points) exceed the cut-off value of the orthogonal distance and so are treated as orthogonal outliers. Projecting these points onto the plane shows their score distances. Note that point 29 has a low score distance, so it would not be identified as an outlier without the orthogonal distance. In Figure 2.6c, points 28 and 30 have large orthogonal and score distances and are treated as bad leverage points, as shown in Figure 2.6b.
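The four regions of the diagnostic plot can be expressed as a small decision rule. The sketch below assumes that both cut-off values have already been computed. The label "regular observation" for points below both cut-offs and the numeric values in the example are illustrative assumptions, not taken from the figure.

```python
def classify_point(od, sd, od_cutoff, sd_cutoff):
    """Label a point from its orthogonal distance (od) and score distance (sd)."""
    orthogonal = od > od_cutoff   # far from the fitted subspace
    leverage = sd > sd_cutoff     # far from the centre within the subspace
    if orthogonal and leverage:
        return "bad leverage point"
    if orthogonal:
        return "orthogonal outlier"
    if leverage:
        return "good leverage point"
    return "regular observation"

# Hypothetical distances and cut-offs, for illustration only.
labels = [
    classify_point(od=0.1, sd=5.0, od_cutoff=1.0, sd_cutoff=2.7),
    classify_point(od=3.0, sd=0.5, od_cutoff=1.0, sd_cutoff=2.7),
    classify_point(od=3.0, sd=5.0, od_cutoff=1.0, sd_cutoff=2.7),
]
```

With this rule, a point such as 29 (large OD, small SD) is flagged only because the orthogonal distance is checked in addition to the score distance.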

Figure 2.6 (a) Scatter plot of the data; outlier detection: (b) diagnostic plot of orthogonal distance versus score distance, and (c) fitted plane. Green points are distant in terms of score distance and red points are orthogonal outliers.
