The content of Chapter 2, in which we provide an extension of the classical Mahalanobis distance for functional data, is summarized in Berrendero et al. (2018b). In Chapter 3 we analyze a variable selection technique for linear regression with functional predictors. The first part of the chapter is focused on scalar response problems and it can be found in Berrendero et al. (2018a). The second part includes an extension to stationary functional time series and it is available in Bueno-Larraz and Klepsch (2018). The results of Chapter 4 about functional logistic regression are included in a preliminary draft currently in progress.
Chapter 2
Functional Mahalanobis distance
2.1 Introduction
The classical (finite-dimensional) Mahalanobis distance and its applications
Let X be a random variable taking values in Rdwith non-singular covariance matrix Σ. In many practical situations it is required to measure the distance between two points x1, x2 ∈ Rd when considered as two possible observations drawn from X. Clearly, the
usual (square) Euclidean distance kx1− x2k2= (x1− x2)0(x1− x2) := hx1− x2, x1− x2i
is not a suitable choice since it disregards the standard deviations and the covariances of the components of xi(given a column vector x ∈ Rdwe denote by x0 the transpose of
x). Instead, the most popular alternative is perhaps the classical Mahalanobis distance, M (x1, x2), defined as
M (x1, x2) = (x1− x2)0Σ−1(x1− x2)
1/2
. (2.1)
Very often the interest is focused on studying “how extreme” a point x is within the distribution of X; this is typically evaluated in terms of M (x, m), where m stands for the vector of means of X.
This distance is named after the Indian statistician P. C. Mahalanobis (1893-1972) who first proposed and analyzed this concept (Mahalanobis, 1936) in the setting of Gaussian distributions. Nowadays, some popular applications of the Mahalanobis distance arise in the following fields (the list of references is far from exhaustive):
Supervised classification: this use of M is mostly linked to Gaussian models. It is easy to show that in a supervised classification problem of discriminating among k Gaussian ho- moscedastic populations with (mean, covariance) parameters (m1, Σ), . . . , (mk, Σ) and
prior probabilities π1, . . . , πk, the optimal (Bayes) rule to classify a coming observation
x is just to assign x to population j defined by M2(x, mj) − 2 log πj = max 1≤i≤k M 2(x, m i) − 2 log πi (2.2)
Outliers detection: Robust estimators of M (x, mj) are proposed in Rousseeuw and
van Zomeren (1990) in order to identify outliers and leverage points. In Penny (1996) Mahalanobis distance, combined with jackknife methods, is also used as a tool for outliers detection.
Multivariate depth measures: the function D(x) = 1/(1 + M2(x, m)) has been used as an assessment of “how deep” is the point x into the a population with mean and covariance parameters m and Σ, respectively. See, for example, Zuo and Serfling (2000) for a discussion on the relative merits of this proposal when compared with other popular depth measures.
Hypothesis testing : It is a well-known result in multivariate analysis that, whenever X has a Gaussian distribution in Rd, that is X ∼ Nd(m0, Σ), with a non-singular
covariance matrix Σ, we have
nM2(¯x, m0) ∼ χ2d, (2.3)
where ¯x denotes the vector of sample means of a random sample x1, . . . , xn from X,
and χ2dstands for the distribution chi-square with d degrees of freedom. Of course this result can be used to get a significance test for H0 : m = m0 versus H1: m 6= m0.
When Σ is unknown we can estimate it with the usual empirical estimator from the sample x1, . . . , xn. In that case we have
n(¯x − m0)0S−1(¯x − m0) ∼ Td,n−12 , (2.4)
where Tp,n−12 denotes the Hotelling distribution with d and n − 1 degrees of freedom. A two-sample version of this statistic (and the corresponding test) is also well-known; see, e.g., Rencher (2012, Ch. 5) for additional details on Hotelling’s statistic.
Goodness of fit : while this application is perhaps less relevant than those mentioned in the previous paragraphs, the paper by Mardia (1975) shows some interesting connec- tions between the moments of Hotelling’s statistic and some statistics commonly used in testing multivariate normality.
On the difficulties of defining a Mahalanobis-type distance for functional data
The framework of this thesis is Functional Data Analysis (FDA). In other words, we deal with statistical problems involving functional data. Thus our sample is made of trajectories x1(t), . . . , xn(t) in L2[0, 1] drawn from a second order stochastic process
X(t), t ∈ [0, 1] with m(t) = E[X(t)]. The inner product and the norm in L2[0, 1] are denoted by h·, ·i2and k · k2, respectively. We will henceforth assume that the covariance function K(s, t) = cov(X(s), X(t)) is continuous and positive definite. The function
2.1 Introduction
K defines a linear operator K : L2[0, 1] → L2[0, 1], called covariance operator (already defined in Equation (1.4)), given by
Kf (t) = Z 1
0
K(t, s)f (s)ds.
The aim of this chapter is to extend the notion of the multivariate (finite-dimensional) Mahalanobis distance (2.1) to the functional case when x1, x2 ∈ L2[0, 1]. Clearly, in
view of (2.1), the inverse K−1of the functional operator K should play some role in this extension if we want to keep a close analogy with the multivariate case. Unfortunately, such a direct approach utterly fails since, typically, K is not invertible in general as an operator, in the sense that there is no linear continuous operator K−1 such that K−1K = KK−1 = I, the identity operator.
To see the reason for this crucial difference between the finite and the infinite-dimensional cases, let us recall that some elementary linear algebra yields the following representa- tions for Σx and Σ−1y,
Σx = d X i=1 λi(ei0x)ei, Σ−1y = d X i=1 1 λi (ei0y)ei (2.5)
where λ1, . . . , λd are the, strictly positive, eigenvalues of Σ and {ei, . . . , ed} the corre-
sponding orthonormal basis of eigenvectors.
In the functional case, the classical Karhunen-Loève Theorem (see, e.g., Ash and Gard- ner (2014)) provides X(t) =P
jZjej(t) (in L2 uniformly on t) where {ej} is the basis
of orthonormal eigenfunctions of K and the Zj = hX, eji2 are uncorrelated random
variables with var(Zj) = λj, the eigenvalue of K corresponding to ej. Then, we have
Kx = K ∞ X i=1 hx, eii2ei
Note that the continuity of K(s, t) impliesR1
0
R1
0 K(s, t)
2dsdt < ∞, thus K is in fact a
compact, Hilbert-Schmidt operator. In addition, it is easy to check thatR01R01K(s, t)2dsdt = P∞
i=1λ2i so that, in particular, the sequence {λi} converges to zero very quickly. As a
consequence, there is no hope of keeping a direct analogy with (2.5) since K−1x = ∞ X i=1 1 λi hx, eii2ei (2.6)
will not define in general a continuous operator with a finite norm. Still, for some particular functions x = x(t) the series in (2.6) might be convergent. Hence we could use it formally to define the following template which, suitably modified, could lead
to a general, valid definition for a Mahalanobis-type distance between two functions x and m, f M (x, m) = ∞ X i=1 hx − m, eii22 λi !1/2 , (2.7)
for all x, m ∈ L2[0, 1] such that the series in (2.7) is finite. We are especially con- cerned with the case where x is a trajectory from a stochastic process X(t) and m is the corresponding mean function. As we will see later on, this entails some especial difficulties.
The organization of this chapter
In the next section the connection of the Mahalanobis distance with RKHS theory is introduced, together with the proposed definition. In Section 2.3 some properties of the proposed distance are presented and compared with those of the original multivari- ate definition. Then, a consistent estimator is analyzed in Section 2.4. Finally, some numerical outputs corresponding to different statistical applications can be found in Section 2.5.