4. ANALYSIS OF RESULTS
4.5.1 STATISTICAL ANALYSIS OF FLAVOUR INTENSITY
In this section, we propose six techniques for robust local planar surface fitting in laser scanning 3D point cloud data. The techniques fall into three algorithms according to the statistical approach used: (i) a diagnostics approach, (ii) a robust version of PCA, and (iii) a combination of diagnostics and robust PCA. Robust estimators of the mean vector (simply, the mean) and covariance matrix from Fast-MCD and DetMCD are used in all three algorithms. The proposed algorithms, namely diagnostic PCA, robust PCA, and diagnostic robust PCA, use the robust mean vector and covariance matrix to compute the robust distance for finding outliers and to determine the 'outlyingness' measure w_i in robust PCA. The workflow for the proposed algorithms is shown in Figure 3.1. Each stage of the workflow is described in the following sections.
[Figure 3.1: flowchart. Boxes: estimation of robust mean and covariance matrix for the local neighbourhood (Section 3.3.1); calculation of robust distance for finding outliers (Section 3.3.2); calculation of robust distance and outlyingness measures (Sections 3.3.2 and 3.4.2); implementation (Section 3.4) as Diagnostic PCA (Section 3.4.1), Robust PCA (Section 3.4.2) and Diagnostic Robust PCA (Section 3.4.3).]
Figure 3.1 Workflow for the proposed algorithms.
3.3.1 Derivation of Robust Mean Vector and Covariance Matrix
In this chapter, we are interested in fitting local planar surfaces to 3D point cloud data. We represent a point cloud of n points in three dimensions as an n × 3 matrix P = (p_1, . . . , p_n)^T, whose ith observation is p_i = (p_i1, p_i2, p_i3). As discussed in Section 2.3.2 of Chapter 2, the Minimum Covariance Determinant (MCD) estimator (Rousseeuw, 1984) is a high-breakdown estimator of the mean and covariance matrix. The MCD estimator searches for the h (h > n/2) observations whose covariance matrix has the lowest determinant. Computing the MCD is not easy, because it requires an exhaustive search over all subsets of h points (written as h-subsets) among the n points. Since the MCD estimator has many good theoretical properties, including better statistical efficiency, affine equivariance, a bounded influence function, and a breakdown point of 50%, we use the MCD approach to derive the robust mean and covariance matrix. Although the MCD of Rousseeuw (1984) was computationally very intensive, Rousseeuw and Driessen (1999) and, more recently, Hubert et al. (2012) developed two versions of the MCD that are more efficient and significantly faster than the classical MCD without losing its good statistical properties. We illustrate the stages of the MCD algorithm in Figure 3.2.
[Figure 3.2: flowchart. Boxes: input: point cloud data; draw a random subset of size h, with (n + m + 1)/2 ≤ h < n; compute the mean (c) and covariance matrix (Σ) from the h-subset and the Mahalanobis Distances (MDs) for all n points; find the h points that have the smallest MDs; C-step (2 times): based on the h points, compute c and Σ, calculate the MDs for all points, sort the MDs in increasing order and find the h points that have the smallest MDs; repeat the C-steps 500 times; collect the 10 h-subsets with the lowest determinant of Σ from the 500 iterations; from these h-subsets, perform C-steps until convergence; output: based on the clean subset (which has the lowest determinant), compute the robust c and Σ.]
Figure 3.2 Minimum Covariance Determinant (MCD) algorithm workflow.
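As a compact restatement of the MCD criterion described above, the estimator can be sketched as follows. The index set H, the subset mean c_H and the subset covariance Σ_H are symbols introduced here for illustration only, each p_i is treated as a column vector, and the consistency and small-sample correction factors usually applied to the final covariance are omitted.

\[
H^{*} = \arg\min_{H \subset \{1,\dots,n\},\ |H| = h} \det\left(\Sigma_{H}\right),
\qquad
c_{H} = \frac{1}{h}\sum_{i \in H} p_i,
\qquad
\Sigma_{H} = \frac{1}{h}\sum_{i \in H} (p_i - c_{H})(p_i - c_{H})^{T}
\]

Under these assumptions the robust mean and covariance are c = c_{H*} and Σ = Σ_{H*}. The exhaustive search mentioned in the text corresponds to evaluating det(Σ_H) for every one of the \binom{n}{h} possible h-subsets, which is what makes the exact estimator impractical for large n.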
In the proposed algorithms, both Fast-MCD (Rousseeuw and Driessen, 1999) and Deterministic MCD (Hubert et al., 2012) are used to obtain the robust mean vector and covariance matrix. Fast-MCD (FMCD) is a resampling algorithm that avoids complete enumeration in order to estimate the MCD efficiently for large amounts of data. To get an outlier-free initial subset of size m + 1, where m is the number of variables, many initial random subsets need to be drawn, which is computationally intensive. Rousseeuw and Driessen (1999) fixed the number of iterations at 500 to get a good sample while keeping the computation time at an acceptable level. To reduce the computation time, FMCD also uses only two C-steps for each of the 500 initial subsets, and uses selective iteration and nested extensions (when n is large, say more than 600) as two further steps. It then keeps the 10 results with the lowest determinant. From these 10 subsets, C-steps are performed until convergence to get the final h-subset. We use this h-subset to obtain the final FMCD-based robust mean vector and covariance matrix. In addition to the advantages of the MCD, the FMCD algorithm allows exact-fit situations, i.e. when more than h observations lie on a hyperplane (Rousseeuw and Driessen, 1999).
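The C-step (concentration step) used above can be illustrated with a minimal sketch. It is not the FMCD implementation: it follows a single random starting h-subset until the subset stops changing, and omits the 500 restarts, the selection of the 10 best results, nested extensions and exact-fit handling. The function names (`mean_cov`, `solve_and_det`, `c_step`, `iterate_c_steps`) and the synthetic data are hypothetical, chosen only for illustration.

```python
import random

def mean_cov(points):
    """Sample mean vector and covariance matrix of a list of m-dimensional points."""
    n, m = len(points), len(points[0])
    c = [sum(p[j] for p in points) / n for j in range(m)]
    cov = [[sum((p[a] - c[a]) * (p[b] - c[b]) for p in points) / (n - 1)
            for b in range(m)] for a in range(m)]
    return c, cov

def solve_and_det(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting; return (x, det A)."""
    m = len(A)
    M = [A[i][:] + [b[i]] for i in range(m)]
    det = 1.0
    for k in range(m):
        piv = max(range(k, m), key=lambda r: abs(M[r][k]))
        if M[piv][k] == 0.0:
            raise ValueError("singular covariance (exact-fit case not handled here)")
        if piv != k:
            M[k], M[piv] = M[piv], M[k]
            det = -det
        det *= M[k][k]
        for r in range(k + 1, m):
            f = M[r][k] / M[k][k]
            for col in range(k, m + 1):
                M[r][col] -= f * M[k][col]
    x = [0.0] * m
    for k in reversed(range(m)):
        x[k] = (M[k][m] - sum(M[k][j] * x[j] for j in range(k + 1, m))) / M[k][k]
    return x, det

def c_step(points, subset):
    """One C-step: estimate (c, Sigma) from `subset`, then keep the h points
    with the smallest squared Mahalanobis distances. Also returns det(Sigma)."""
    c, cov = mean_cov([points[i] for i in subset])
    _, det = solve_and_det(cov, [0.0] * len(c))
    md2 = []
    for p in points:
        d = [p[j] - c[j] for j in range(len(c))]
        x, _ = solve_and_det(cov, d)          # x = Sigma^{-1} d
        md2.append(sum(dj * xj for dj, xj in zip(d, x)))
    order = sorted(range(len(points)), key=lambda i: md2[i])
    return sorted(order[:len(subset)]), det

def iterate_c_steps(points, subset, max_iter=100):
    """Repeat C-steps until the h-subset stops changing (or max_iter is reached)."""
    subset, dets = sorted(subset), []
    for _ in range(max_iter):
        new_subset, det = c_step(points, subset)
        dets.append(det)                      # determinant for the current subset
        if new_subset == subset:
            break
        subset = new_subset
    return subset, dets

# Synthetic data (an assumption): a nearly planar patch plus points lifted off the plane.
rng = random.Random(0)
clean = [[rng.uniform(-1, 1), rng.uniform(-1, 1), 0.05 * rng.uniform(-1, 1)]
         for _ in range(40)]
lifted = [[rng.uniform(-1, 1), rng.uniform(-1, 1), 2.0 + 0.05 * rng.uniform(-1, 1)]
          for _ in range(10)]
points = clean + lifted
n, m = len(points), 3
h = (n + m + 1) // 2                          # smallest h allowed in Figure 3.2
start = rng.sample(range(n), h)               # one random starting h-subset
final_subset, dets = iterate_c_steps(points, start)
```

The list `dets` records the covariance determinant of each successive h-subset. A C-step cannot increase this determinant, which is why repeating it converges. A single random start may still converge to a contaminated subset, which is the reason FMCD uses many starts and keeps the best results.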
Recently, Hubert et al. (2012) introduced a deterministic algorithm for the MCD (DetMCD) to obtain the robust mean vector (location) and covariance matrix (scatter). FMCD draws many random (m + 1)-subsets to obtain at least one outlier-free subset, whereas DetMCD uses the same iteration step but draws no random subsets: it starts from only a few well-chosen, easily computed initial estimators (h-subsets) and then performs C-steps until convergence. DetMCD couples aspects of both the FMCD and the orthogonalized Gnanadesikan and Kettenring estimators (Maronna and Zamar, 2002). The algorithm is almost affine equivariant and is permutation invariant (the result does not depend on the order of the data), whereas FMCD is not permutation invariant. The authors claimed that DetMCD is much faster than FMCD and at least as robust. The reader is referred to Hubert et al. (2012) for more details about DetMCD.
3.3.2 Computation of Robust Distance
We use the well-known distance-based multivariate outlier detection technique for 3D point cloud data, where the distance takes into account the shape (covariance) as well as the centre of the data. The robust distance is used to find outliers in the data sampled for fitting a plane. The Mahalanobis Distance (MD) in Eq. (2.31) of Chapter 2 is the most popular multivariate measure of the distance of an observation from the mean of the data.
Although it is possible to detect a single outlier by means of MD, this approach is no longer sufficient for multiple outliers because of the well-known masking effect (Rousseeuw and Driessen, 1999). Masking occurs when an outlying subset goes undetected because of the presence of another, usually adjacent, subset (Hadi and Simonoff, 1993). When the classical mean vector and covariance matrix are replaced by robust counterparts, a robust method yields a tolerance ellipse that captures the covariance structure of the majority of the dataset. Rousseeuw and van Zomeren (1990) used the Minimum Volume Ellipsoid (MVE) based mean vector and covariance matrix, but the MVE is known to have zero efficiency because of its low rate of convergence. We use two versions of the robust distance in Eq. (2.33) of Chapter 2, based on the FMCD and DetMCD mean vectors and covariance matrices, namely FRD (Fast-MCD based Robust Distance) and DetRD (DetMCD based Robust Distance). FRD and DetRD for the ith point are defined respectively as:

\[ \mathrm{FRD}_i = \sqrt{(p_i - c_{\mathrm{FMCD}})^{T}\, \Sigma_{\mathrm{FMCD}}^{-1}\, (p_i - c_{\mathrm{FMCD}})}, \quad i = 1, \dots, n \tag{3.1} \]

\[ \mathrm{DetRD}_i = \sqrt{(p_i - c_{\mathrm{DetMCD}})^{T}\, \Sigma_{\mathrm{DetMCD}}^{-1}\, (p_i - c_{\mathrm{DetMCD}})}, \quad i = 1, \dots, n. \tag{3.2} \]
The cut-off value for identifying outliers is to some extent arbitrary and mainly depends on knowledge about the data. Rousseeuw and van Zomeren (1990) and Rousseeuw and Driessen (1999) showed that the squared robust distance approximately follows a Chi-square (χ²) distribution with m (number of variables) degrees of freedom. The authors argued that observations whose Mahalanobis distance or robust distance (FRD or DetRD) is ≥ $\sqrt{\chi^2_{m,0.975}}$ can be identified as outliers.
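The distance and cut-off rule above can be sketched in a few lines. This is an illustrative sketch only: the centre `c` and covariance `sigma` below are made-up placeholder values, not outputs of FMCD or DetMCD, and the helper names (`solve`, `robust_distance`, `flag_outliers`) are hypothetical. The Chi-square quantiles are hard-coded standard table values for m = 2 and m = 3.

```python
import math

# Upper 97.5% quantiles of the chi-square distribution (standard table values)
# for m = 2 and m = 3 degrees of freedom.
CHI2_0975 = {2: 7.378, 3: 9.348}

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [A[i][:] + [b[i]] for i in range(m)]
    for k in range(m):
        piv = max(range(k, m), key=lambda r: abs(M[r][k]))
        if M[piv][k] == 0.0:
            raise ValueError("singular covariance matrix")
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, m):
            f = M[r][k] / M[k][k]
            for col in range(k, m + 1):
                M[r][col] -= f * M[k][col]
    x = [0.0] * m
    for k in reversed(range(m)):
        x[k] = (M[k][m] - sum(M[k][j] * x[j] for j in range(k + 1, m))) / M[k][k]
    return x

def robust_distance(p, c, sigma):
    """sqrt((p - c)^T sigma^{-1} (p - c)), the form of Eqs. (3.1) and (3.2)."""
    d = [pj - cj for pj, cj in zip(p, c)]
    x = solve(sigma, d)                       # x = sigma^{-1} d
    return math.sqrt(sum(dj * xj for dj, xj in zip(d, x)))

def flag_outliers(points, c, sigma):
    """Return the distances and a flag per point: distance >= sqrt(chi2_{m,0.975})."""
    cutoff = math.sqrt(CHI2_0975[len(c)])
    rds = [robust_distance(p, c, sigma) for p in points]
    return rds, [rd >= cutoff for rd in rds]

# Placeholder estimates (assumed): data spread more widely along the first axis.
c = [0.0, 0.0, 0.0]
sigma = [[4.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
points = [[2.0, 0.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
rds, flags = flag_outliers(points, c, sigma)
```

With these placeholder values, the second and third points lie at the same Euclidean distance from the centre, but only the second is flagged, because the covariance allows a larger spread along the first axis. This is the sense in which the distance accounts for the shape of the data as well as its centre.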
To show the performance of MD, FRD and DetRD for multivariate outlier detection, we generate 30 points in two dimensions that follow a linear pattern, as shown in Figure 3.3. We deliberately move one point in Figure 3.3a and five points in Figure 3.3b away from the majority pattern to create single and multiple outliers, respectively. Ellipses are drawn based on the MD, FRD and DetRD values: first, outliers are identified using the Chi-square criterion; then the respective covariance matrices are derived without the outliers and used to generate the ellipses, so that the effect of the outliers can be explored. All the methods succeed in identifying a single outlier (Figure 3.3a), as the outlier falls outside the ellipses. In Figure 3.3b, MD fails in the presence of multiple outliers, as it includes them in the ellipse. The ellipses computed for MD, with one or more outliers present, are significantly changed or distorted by the outliers. This is the well-known masking effect. The ellipses for FRD and DetRD are not significantly changed by the presence of the outliers, and both methods identify all five outlying points without the ellipse directions being affected (Figure 3.3b).
Figure 3.3 Outlier (red point) detection by MD, FRD and DetRD, in the presence of: (a) a single outlier, and (b) multiple and clustered outliers.