
Chapter 10

INFERENCE USING MULTIVARIATE DATA

Samuel S. Wilks (1906-1964)

American statistician. Wilks first studied architecture, but received his PhD in mathematics. Later, he worked with Karl Pearson at University College, London and with John Wishart at Cambridge. He returned to Princeton as a statistics professor where he stayed throughout his career.

His work included the construction of multivariate generalizations of the analysis of variance and of the multiple correlation coefficient. He was one of the founders of the Institute of Mathematical Statistics (1935), and editor of the journal Annals of Mathematical Statistics for eleven years.

10.1 INTRODUCTION

In this chapter we will present an introduction to inference in multivariate models. It is assumed that the reader is familiar with the basic concepts of inference, and the objective of this chapter is to review results which will be needed for later study. More in-depth presentations can be found in Anderson (1984), Mardia et al. (1979) or Seber (1983).

The first topic to be covered is the estimation of parameters in multivariate normal models using maximum likelihood. Second, the likelihood ratio method is presented as a general procedure for obtaining tests with good properties in large samples. There are other procedures for constructing multivariate tests which are not covered here, but which can be found in Anderson (1984). The next section presents a test for the mean vector of a multivariate normal population. This test is then generalized to test the equality of the mean vectors of several multivariate normal populations with the same covariance matrix, which is the principal tool used in the multivariate analysis of variance. A specific case of this test is the test for outliers, which can be formulated as a test of whether an observation comes from a distribution whose mean differs from that of the rest of the data. Finally, tests of normality are presented together with possible transformations for achieving it.

10.2 Maximum Likelihood Estimation

The maximum likelihood method, due to Fisher, takes as estimates those parameter values which maximize the probability that the model generates the observed sample. To clarify this idea, suppose that we have a simple random sample of $n$ elements of a $p$-dimensional random variable, $x$, with density function $f(x\mid\theta)$, where $\theta=(\theta_1,\ldots,\theta_r)'$ is a vector of parameters whose dimension we assume satisfies $r\le pn$. Letting $X=(x_1,\ldots,x_n)$ be the sample data, the joint density function of the sample is, by the independence of the observations:

$$f(X\mid\theta)=\prod_{i=1}^{n}f(x_i\mid\theta).$$

When the parameter $\theta$ is known, this function determines the probability of appearance of each sample. In estimation, the sample is known but $\theta$ is unknown. We may then consider $\theta$ as variable and $X$ as fixed, and we obtain a function which we will call the likelihood function, $\ell(\theta\mid X)$, or $\ell(\theta)$:

$$\ell(\theta\mid X)=\ell(\theta)=\prod_{i=1}^{n}f(x_i\mid\theta),\qquad X\ \text{fixed},\ \theta\ \text{variable}. \tag{10.1}$$

The maximum likelihood estimator, or MLE, is the value of $\theta$ which maximizes the likelihood function $\ell(\theta)$. Supposing that this function is differentiable and that its maximum does not occur on the boundary of its domain, the maximum is obtained by solving the system of equations:

$$\frac{\partial\ell(\theta)}{\partial\theta_1}=0,\quad\ldots,\quad\frac{\partial\ell(\theta)}{\partial\theta_r}=0.$$

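The vector $\hat\theta$ that satisfies this system of equations corresponds to a maximum of $\ell(\theta)$ if the Hessian matrix of second derivatives, $H$, evaluated at $\hat\theta$, is negative definite:

$$H(\hat\theta)=\left(\frac{\partial^2\ell(\theta)}{\partial\theta_i\,\partial\theta_j}\right)_{\theta=\hat\theta}\quad\text{negative definite.}$$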

In this case, $\hat\theta$ is the maximum likelihood estimator, or ML estimator, of $\theta$. In practice, it is usually easier to work with the logarithm of the likelihood function:

$$L(\theta)=\ln\ell(\theta), \tag{10.2}$$

which we will call the support function. Since the logarithm is a monotonic transformation, both functions attain their maximum at the same point, but working with the support has three advantages. First, we go from the product of densities (10.1) to the sum of their logarithms, and the resulting expression tends to be simpler than that of the likelihood, which makes it easier to obtain the maximum. Second, on taking logarithms the multiplicative constants of the density function, which are irrelevant for the maximum, become additive and disappear on differentiation. Third, minus twice the support function provides a general measure for judging the fit of a model to the data, called the deviance:

$$D(\theta)=-2L(\theta).$$

The deviance $D(\theta)$ measures the discrepancy between the model and the data: the greater the support $L(\theta)$, the greater the agreement between the value of the parameter and the data, and the smaller the deviance. The deviance appears naturally as a global measure of the fit of a model to the data.
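As a minimal sketch of these definitions, the following code evaluates the support and the deviance on a grid of parameter values and picks the maximizer. The exponential density $f(x\mid\theta)=\theta e^{-\theta x}$, the sample values and the grid are assumptions chosen only for illustration; they are not taken from the text.

```python
import math

def support(theta, xs):
    """Support L(theta) = n*log(theta) - theta*sum(x) for an exponential density."""
    return len(xs) * math.log(theta) - theta * sum(xs)

def deviance(theta, xs):
    """Deviance D(theta) = -2 L(theta)."""
    return -2.0 * support(theta, xs)

xs = [0.5, 1.2, 0.3, 2.0, 1.0]            # hypothetical sample
grid = [k / 100 for k in range(1, 501)]    # candidate values of theta in (0, 5]

# The grid point with the largest support is also the one with the smallest deviance.
theta_hat = max(grid, key=lambda t: support(t, xs))
```

For this density the maximizer can also be found analytically as the reciprocal of the sample mean, so the grid search is only a check; in models without a closed form the same idea is applied with a numerical optimizer.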

For distributions whose range of possible values is known a priori and does not depend on any parameter, it can be proved (see, for example, Casella and Berger, 1990) that, under very general conditions on the probability model, the maximum likelihood (ML) method provides estimators which are:

1. Asymptotically unbiased.

2. Asymptotically normally distributed.

3. Asymptotically of minimum variance (efficient).

4. If there exists a sufficient statistic for the parameter, the ML estimator is also sufficient.

5. Invariant in the following sense: if $\hat\theta$ is the ML estimator of $\theta$, and $g(\theta)$ is a function of the parameters, then $g(\hat\theta)$ is the ML estimator of $g(\theta)$.
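The invariance property can be illustrated with a short worked derivation. The exponential density $f(x\mid\theta)=\theta e^{-\theta x}$ used here is an assumption chosen for simplicity, not an example from the text: the rate is estimated from the first-order condition, the second derivative confirms a maximum, and the ML estimator of the mean $g(\theta)=1/\theta$ follows without a second maximization.

```latex
\begin{aligned}
L(\theta) &= n\ln\theta-\theta\sum_{i=1}^{n}x_i,\\
\frac{dL}{d\theta} &= \frac{n}{\theta}-\sum_{i=1}^{n}x_i=0
  \;\Longrightarrow\; \hat\theta=\frac{1}{\bar x},\\
\frac{d^2L}{d\theta^2} &= -\frac{n}{\theta^2}<0,\\
g(\theta) &= \frac{1}{\theta}
  \;\Longrightarrow\; \widehat{g(\theta)}=g(\hat\theta)=\bar x .
\end{aligned}
```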

10.3 Estimation of the p-dimensional normal.

Let $x_1,\ldots,x_n$ be a random sample, where $x_i\sim N_p(\mu,V)$. We are going to find the ML estimators of the unknown parameters $\mu$ and $V$. The first step is to write the joint density function of the observations, which, using the expression of the multivariate normal density studied in Chapter 8, is:

$$f(X\mid\mu,V)=\prod_{i=1}^{n}|V|^{-1/2}(2\pi)^{-p/2}\exp\left\{-\frac12(x_i-\mu)'V^{-1}(x_i-\mu)\right\}$$

and, leaving out constants, the support function will be:

$$L(\mu,V\mid X)=-\frac n2\log|V|-\frac12\sum_{i=1}^{n}(x_i-\mu)'V^{-1}(x_i-\mu).$$

Observe that the quadratic forms in this expression are non-negative because the matrix $V$ is positive definite, so the second term is never positive; the first term is also negative whenever $|V|>1$. This function measures the support that the sample gives to each possible value of the parameters: the greater the function, the greater the agreement between the parameters and the data. We are going to express this function more conveniently. Letting $\bar x=\sum_{i=1}^{n}x_i/n$ be the sample mean, writing $(x_i-\mu)=(x_i-\bar x+\bar x-\mu)$ and expanding the quadratic form, we obtain

$$\sum_{i=1}^{n}(x_i-\mu)'V^{-1}(x_i-\mu)=\sum_{i=1}^{n}(x_i-\bar x)'V^{-1}(x_i-\bar x)+n(\bar x-\mu)'V^{-1}(\bar x-\mu),$$

since the cross products vanish because $\sum_{i=1}^{n}(x_i-\bar x)=0$. Concentrating on the first term of this decomposition, since a scalar is equal to its trace:

$$\begin{aligned}
\operatorname{tr}\left(\sum_{i=1}^{n}(x_i-\bar x)'V^{-1}(x_i-\bar x)\right)
&=\sum_{i=1}^{n}\operatorname{tr}\left[(x_i-\bar x)'V^{-1}(x_i-\bar x)\right]\\
&=\sum_{i=1}^{n}\operatorname{tr}\left[V^{-1}(x_i-\bar x)(x_i-\bar x)'\right]
=\operatorname{tr}\left(V^{-1}\sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)'\right),
\end{aligned}$$

and letting:

$$S=\frac1n\sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)' \tag{10.3}$$

be the sample covariance matrix, and substituting in the support function:

$$L(\mu,V\mid X)=-\frac n2\log|V|-\frac n2\operatorname{tr}V^{-1}S-\frac n2(\bar x-\mu)'V^{-1}(\bar x-\mu). \tag{10.4}$$

This is the standard expression for the support in samples from a multivariate normal distribution. Observe that this function depends on the sample only through the values $\bar x$ and $S$, which are therefore sufficient statistics for $\mu$ and $V$. All samples that give the same values of $\bar x$ and $S$ lead to the same inferences about the parameters.
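The equivalence between the support written as a sum over observations and the form (10.4), which uses only the sample mean and covariance matrix, can be checked numerically. The sketch below does so for bivariate data; the data, the trial values of the mean vector and covariance matrix, and the function names are all hypothetical.

```python
import math

def inv2(V):
    """Determinant and inverse of a 2x2 matrix."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    inv = [[V[1][1] / det, -V[0][1] / det],
           [-V[1][0] / det, V[0][0] / det]]
    return det, inv

def quad_form(d, A):
    """Quadratic form d' A d for a 2-vector d."""
    return sum(d[i] * A[i][j] * d[j] for i in range(2) for j in range(2))

def support_direct(X, mu, V):
    """Support as a sum of quadratic forms over the observations."""
    det, Vinv = inv2(V)
    q = sum(quad_form([x[0] - mu[0], x[1] - mu[1]], Vinv) for x in X)
    return -0.5 * len(X) * math.log(det) - 0.5 * q

def support_sufficient(X, mu, V):
    """Support via (10.4): depends on the data only through xbar and S."""
    det, Vinv = inv2(V)
    n = len(X)
    xbar = [sum(x[k] for x in X) / n for k in range(2)]
    S = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in X) / n
          for j in range(2)] for i in range(2)]
    trace = sum(Vinv[i][j] * S[j][i] for i in range(2) for j in range(2))
    d = [xbar[0] - mu[0], xbar[1] - mu[1]]
    return -0.5 * n * (math.log(det) + trace + quad_form(d, Vinv))

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]   # hypothetical data
mu = [2.0, 3.0]                                        # trial mean vector
V = [[2.0, 0.5], [0.5, 1.0]]                           # trial covariance matrix
```

Both functions should return the same value for any positive definite trial matrix, which is what makes the sample mean and covariance matrix sufficient here.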

In order to obtain the estimator of the mean vector of the population, note that since $V^{-1}$ is positive definite, $(\bar x-\mu)'V^{-1}(\bar x-\mu)\ge 0$. As this term appears with a minus sign, the value of $\mu$ which maximizes the support function is the one which makes this term as small as possible. The term becomes zero by taking:

$$\hat\mu=\bar x \tag{10.5}$$

and we conclude that $\bar x$ is the maximum likelihood estimator of $\mu$. Substituting this estimator in the support function, the support depends only on $V$. To obtain the maximum of the function with respect to $V$, we add the constant $\frac n2\log|S|$ and write the support as:

$$L(V\mid X)=\frac n2\log|V^{-1}S|-\frac n2\operatorname{tr}V^{-1}S. \tag{10.6}$$

This expression is useful because, written in this way, the function does not depend on the units of measurement of the variables. It is also easy to prove (see exercise 10.1) that it is invariant to non-singular linear transformations of the variables. Letting $\lambda_i$ be the eigenvalues of the matrix $V^{-1}S$, we have:

$$L(V\mid X)=\frac n2\sum\log\lambda_i-\frac n2\sum\lambda_i=\frac n2\sum(\log\lambda_i-\lambda_i).$$

This expression shows that the support is a sum of functions of the form $\log x-x$. The derivative of this function with respect to $x$ is $1/x-1$, which vanishes at $x=1$, and its second derivative, $-1/x^2$, is negative, so the function has a maximum at $x=1$. Therefore, $L(V\mid X)$ is maximum when all the eigenvalues of $V^{-1}S$ are equal to one, which, in turn, implies that $V^{-1}S=I$. This is achieved by taking

$$\hat V=S \tag{10.7}$$

as the maximum likelihood estimator of $V$.

The ML estimators of $\mu$ and $V$ are then $\bar x$ and $S$. It can be shown that, as in the univariate case, $\bar x\sim N_p(\mu,\frac1nV)$. Furthermore, $nS$ follows a Wishart distribution $W_p(n-1,V)$. The estimator $S$ is biased, but $\frac{n}{n-1}S$ is an unbiased estimator of $V$. These estimators have the asymptotic properties of maximum likelihood estimators: consistency, efficiency and asymptotic normality. In exercise 10.2 we present a more classical way of obtaining these estimators by taking the derivative of the support function.
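The estimators of this section are simple to compute. The following sketch, with hypothetical data and function names, obtains the sample mean vector, the ML covariance estimator (dividing by the sample size) and its unbiased rescaling.

```python
def mean_vector(X):
    """Sample mean vector of the rows of X."""
    n, p = len(X), len(X[0])
    return [sum(row[k] for row in X) / n for k in range(p)]

def ml_covariance(X):
    """ML estimator S of (10.3): sums of cross products divided by n."""
    n, p = len(X), len(X[0])
    xbar = mean_vector(X)
    return [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in X) / n
             for j in range(p)] for i in range(p)]

def unbiased_covariance(X):
    """Rescales S by n / (n - 1) to remove the bias."""
    n = len(X)
    return [[n / (n - 1) * s for s in row] for row in ml_covariance(X)]

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]   # hypothetical data
xbar = mean_vector(X)
S = ml_covariance(X)
S_unbiased = unbiased_covariance(X)
```

The difference between the two covariance estimators is a factor that tends to one as the sample size grows, which is why both share the same asymptotic properties.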

10.4 The likelihood ratio method

In this section, we will go over the general methodology for constructing tests using the likelihood ratio method and apply it to the case of normal populations. Often, we wish to check whether a given sample comes from a distribution with certain known parameters. For example, in quality control, samples of elements are taken and a multivariate variable is measured on them. We then wish to check whether the process is in a state of control, which means that the samples come from a normal population with fixed values of the parameters. In other cases, it is of interest to check whether or not several samples come from the same population. For example, we may want to check whether different markets are equally profitable, or whether different medicines produce similar effects. Finally, if we have based our inference on the hypothesis of normality, it is advisable to check whether or not this hypothesis agrees with the observed data.

In order to test hypotheses about vector parameters we can apply the theory of the likelihood ratio test. This theory provides statistical tests with certain optimal properties for large sample sizes. Given a $p$-dimensional vector parameter, $\theta$, which takes values in $\Omega$ (where $\Omega$ is a subset of $\mathbb{R}^p$), suppose that we wish to test the hypothesis:

$$H_0:\theta\in\Omega_0,$$

which states that $\theta$ is contained within a region $\Omega_0$ of the parameter space, versus the alternative hypothesis:

$$H_1:\theta\in\Omega-\Omega_0,$$

which supposes that $\theta$ is not restricted to the region $\Omega_0$. In order to test this hypothesis, we check its ability to predict the observed data, and to do that, we compare the probabilities of obtaining them under both hypotheses. To compute these probabilities we need a value for the vector parameter, which is unknown. The likelihood ratio method solves this problem by taking, under each hypothesis, the value compatible with it which makes the observed sample most likely. More specifically:

1. The maximum likelihood of obtaining the observed sample under $H_0$ is found as follows. If $\Omega_0$ determines a unique value for the parameters, $\theta=\theta_0$, then the likelihood of the sample for this $\theta_0$ is calculated. If $\Omega_0$ permits several values, we choose the value of the parameter which maximizes the likelihood of obtaining the sample. Since the likelihood of the observed sample is proportional to the joint density of the observations, we find the likelihood function by substituting the available data in this density. By calculating the maximum of this function in $\Omega_0$, we obtain the maximum likelihood value compatible with $H_0$, which we denote by $f(H_0)$.

2. The maximum likelihood of obtaining the observed sample under $H_1$ is calculated by finding the absolute maximum of the function over the entire parameter space. Strictly speaking, it should be calculated in the set $\Omega-\Omega_0$, but it is simpler to do it over the whole space, since the results are generally the same. The reason is that, usually, $H_0$ imposes restrictions on the parameter space, whereas $H_1$ assumes that these restrictions do not exist. The value of the likelihood function at its maximum, which is attained at the ML estimate of the parameters, will be denoted by $f(H_1)$.

Next we compare $f(H_0)$ and $f(H_1)$. To eliminate constants and make the comparison invariant to changes in the scale of the variables, we form their quotient, which we call the likelihood ratio (RV):

$$RV=\frac{f(H_0)}{f(H_1)} \tag{10.8}$$

By construction $RV\le 1$, and we reject $H_0$ when $RV$ is small enough.

The rejection region for $H_0$ will consequently be defined by:

$$RV\le a,$$

where $a$ is determined by imposing the condition that the significance level of the test be $\alpha$. To calculate the value $a$ we need to know the distribution of $RV$ when $H_0$ is true, which tends to be quite difficult in practice. Nevertheless, when the sample size is large and $H_0$ is true, twice the difference of the supports between the alternative and the null hypotheses, defined by:

$$\lambda=-2\ln RV=2\,(L(H_1)-L(H_0)),$$

where $L(H_i)=\log f(H_i)$, $i=0,1$, is asymptotically distributed as a $\chi^2$ with degrees of freedom equal to the difference between the dimensions of the spaces $\Omega$ and $\Omega_0$. Intuitively, we reject $H_0$ when the support of the data for $H_1$ is significantly greater than for $H_0$, and for large samples the significance of the difference is judged with the $\chi^2$ distribution. Using the definition of the deviance, this statistic can be interpreted as the difference between the deviances for $H_0$ and for $H_1$:

$$\lambda=D(H_0)-D(H_1).$$

It frequently happens that the dimension of $\Omega$ is $p$ and the dimension of $\Omega_0$ is $p-r$, where $r$ denotes the number of linear restrictions on the vector of parameters. Then the number of degrees of freedom of the difference of supports, $\lambda$, is:

$$g=\mathrm{df}(\lambda)=\dim(\Omega)-\dim(\Omega_0)=p-(p-r)=r,$$

equal to the number of linear restrictions imposed by $H_0$.
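The asymptotic test can be sketched in a few lines once the maximized supports under both hypotheses are available. The function names below are hypothetical, and the chi-square tail probability is computed with a standard recurrence for the regularized incomplete gamma function, which is a choice made here for self-containment rather than something taken from the text.

```python
import math

def chi2_sf(x, df):
    """P(chi2 with df degrees of freedom > x), for integer df >= 1.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) with z = x / 2,
    starting from Q(1/2, z) = erfc(sqrt(z)) or Q(1, z) = exp(-z).
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(support_h0, support_h1, restrictions):
    """Returns lambda = 2 (L(H1) - L(H0)) and its asymptotic p-value."""
    lam = 2.0 * (support_h1 - support_h0)
    return lam, chi2_sf(lam, restrictions)

# Hypothetical maximized supports, with r = 2 restrictions under H0.
lam, p_value = likelihood_ratio_test(-12.0, -9.0, 2)
```

The null hypothesis would be rejected at level alpha when the returned p-value falls below alpha, which is equivalent to the likelihood ratio being small enough.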

10.5 Testing the mean of a normal population

We take a sample $(x_1,\ldots,x_n)$ from a population $N_p(\mu,V)$. We want to test the hypothesis:

$$H_0:\mu=\mu_0,$$

against the alternative:

$$H_1:\mu\ne\mu_0.$$

In order to construct a likelihood ratio test, we calculate the maximum of the likelihood function under $H_0$ and under $H_1$. The support function is:

$$L(\mu,V\mid X)=-\frac n2\log|V|-\frac12\sum_{i=1}^{n}(x_i-\mu)'V^{-1}(x_i-\mu).$$

We need to obtain the ML estimators of $\mu$ and $V$ under $H_0$ and under $H_1$. From section 10.2 we know that, under $H_1$, these estimators are $\bar x$ and $S$, and substituting in (10.4) we have that the support for $H_1$ is:

$$L(H_1)=-\frac n2\log|S|-\frac{np}2.$$

Under $H_0$ the estimator of $\mu$ is directly $\mu_0$, and operating on the quadratic form as we saw in section 10.2.2 (taking traces and using the linearity of the trace) we can write the support as:

$$L(V\mid X)=-\frac n2\log|V|-\frac n2\operatorname{tr}V^{-1}S_0, \tag{10.9}$$

where

$$S_0=\frac1n\sum_{i=1}^{n}(x_i-\mu_0)(x_i-\mu_0)'. \tag{10.10}$$

If we add the constant $\frac n2\log|S_0|$ to expression (10.9), we obtain an expression analogous to (10.6). Thus, we conclude that $S_0$ is the ML estimator of $V$ under $H_0$. Replacing $V$ with $S_0$ in (10.9), the support for $H_0$ is

$$L(H_0)=-\frac n2\log|S_0|-\frac{np}2$$

and the difference of the supports is

$$\lambda=2(L(H_1)-L(H_0))=n\log\frac{|S_0|}{|S|}. \tag{10.11}$$

Thus, we reject $H_0$ when the support for $H_1$ is significantly greater than for $H_0$. This implies that the generalized variance under $H_0$, $|S_0|$, is significantly greater than under $H_1$, $|S|$. The asymptotic distribution of $\lambda$ is a $\chi^2$ whose degrees of freedom are equal to the difference of the dimensions of the spaces in which the parameters vary under the two hypotheses. The dimension of the parameter space under $H_0$ is $p+p(p-1)/2=p(p+1)/2$, the number of distinct terms in $V$, and the dimension of the parameter space under $H_1$ is $p+p(p+1)/2$. The difference is $p$, which is the number of degrees of freedom of the $\chi^2$.

In this case, we can obtain the exact distribution of the likelihood ratio without resorting to the asymptotic distribution. In Appendix 10.2 we prove that:

$$\frac{|S_0|}{|S|}=1+\frac{T^2}{n-1}, \tag{10.12}$$

where the statistic

$$T^2=(n-1)(\bar x-\mu_0)'S^{-1}(\bar x-\mu_0)$$

follows a Hotelling's $T^2$ distribution with $p$ and $n-1$ degrees of freedom. Using the relationship between the $T^2$ and the $F$ distributions, we can calculate the percentiles of $T^2$. Since the difference of the supports is a monotonic function of $T^2$, we can use this statistic directly instead of the likelihood ratio, and we reject $H_0$ when $T^2$ is large enough. Observe that from (10.11) and (10.12) we can write

$$\lambda=n\log\left(1+\frac{T^2}{n-1}\right),$$

which is consistent with the asymptotic distribution, since for large $n$, $\log(1+a/n)\approx a/n$, and thus $\lambda\approx T^2$, which we know has an asymptotic $\chi^2_p$ distribution.

Example: An industrial process manufactures elements whose quality characteristics are measured by a vector of three variables, $x$. When the process is in a state of control, the mean values of the variables must be $(12, 4, 2)$. In order to check that the process is working properly, a sample of twenty elements is taken and their characteristics are measured. The sample mean is

$$\bar x=(11.5,\ 4.3,\ 1.2)$$

and the covariance matrix of these three variables is

$$S=\begin{pmatrix}10&4&-5\\4&12&-3\\-5&-3&4\end{pmatrix}$$

(the numerical values have been simplified for ease of calculation). If we look at each variable separately, the statistic

$$t=(\bar x-\mu)\sqrt n/\hat s$$

follows a Student's $t$ distribution with $n-1$ degrees of freedom, and the values of $t$ for the three variables are $t_1=(11.5-12)\sqrt{20}/\sqrt{20\times10/19}=-0.69$; $t_2=(4.3-4)\sqrt{20}/\sqrt{20\times12/19}=0.38$; and $t_3=(1.2-2)\sqrt{20}/\sqrt{20\times4/19}=-1.74$. Looking at each variable separately, we find no significant differences between the sample means and those of the process in control, and we would conclude that there is no evidence that the process is out of control. If we now look at the differences by using Hotelling's test

$$T^2=19(\bar x-\mu_0)'S^{-1}(\bar x-\mu_0)=14.52.$$

To judge the size of this statistic we use the $F$ distribution

$$F_{3,17}=\frac{20-3}{3}\,\frac{T^2}{19}=4.33$$

and since the 5% critical value of the $F_{3,17}$ distribution is approximately 3.2, we reject the hypothesis that the process is under control.

In order to understand the reasons for this discrepancy between the multivariate and the univariate tests, note that the multivariate test takes into account the correlations between the individual discrepancies. The correlation matrix of the sample data, obtained from the covariance matrix, is

$$R=\begin{pmatrix}1&0.37&-0.79\\0.37&1&-0.43\\-0.79&-0.43&1\end{pmatrix}$$

and shows that the correlation between the first variable and the third is negative. This means that if we observe a value below the mean in the first variable, we expect a value above the mean in the third. In the sample just the opposite happens: both the first and the third sample means are below their target values, and this suggests a displacement of the mean of the process.
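The calculations of this example can be reproduced with the sketch below, which uses the sample mean, target mean and covariance matrix given above. The helper `solve` is a small Gaussian elimination routine written here for self-containment; it is an illustrative choice, and any linear algebra library would serve equally well.

```python
import math

def solve(A, b):
    """Solves A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

n, p = 20, 3
mu0 = [12.0, 4.0, 2.0]
xbar = [11.5, 4.3, 1.2]
S = [[10.0, 4.0, -5.0], [4.0, 12.0, -3.0], [-5.0, -3.0, 4.0]]

d = [a - b for a, b in zip(xbar, mu0)]

# Univariate t statistics, with corrected variance n * s_ii / (n - 1).
t = [d[i] * math.sqrt(n) / math.sqrt(n * S[i][i] / (n - 1)) for i in range(p)]

# Hotelling's T^2 = (n - 1) d' S^{-1} d, and its F transformation.
T2 = (n - 1) * sum(di * wi for di, wi in zip(d, solve(S, d)))
F = (n - p) / p * T2 / (n - 1)

# Difference of supports written as a function of T^2.
lam = n * math.log(1.0 + T2 / (n - 1))
```

Solving the linear system instead of inverting the covariance matrix avoids forming the inverse explicitly, which is usually preferable numerically.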

10.6 Testing the covariance matrix of a normal population

The likelihood ratio test can be applied to test the covariance matrix in a similar way to the method studied for mean vectors. We are going to look at four tests on the covariance matrix of normal variables. In the first, the null hypothesis states that this matrix takes a given fixed value. In the second, the matrix is diagonal, so the variables are uncorrelated. In the third, the variables are in addition assumed to have the same variance; this is the sphericity test, where we assume that the covariance matrix is $\sigma^2I$. In the fourth we assume partial sphericity: the covariance matrix can be decomposed as a matrix of rank $m<p$ plus $\sigma^2I$. If $m=0$ this test reduces to that of sphericity.

10.6.1 Testing a specific value

Suppose that we want to test the hypothesis:

$$H_0:V=V_0,$$

against the alternative:

$$H_1:V\ne V_0.$$

In order to construct a likelihood ratio test, we compute the maximum of the support under $H_0$ and under $H_1$, using the expression:

$$L(\mu,V\mid X)=-\frac n2\log|V|-\frac n2\operatorname{tr}V^{-1}S-\frac n2(\bar x-\mu)'V^{-1}(\bar x-\mu).$$

Under $H_0$, the value of $V$ is $V_0$ and $\mu$ is estimated by $\bar x$, so that:

$$L(H_0)=-\frac n2\log|V_0|-\frac n2\operatorname{tr}V_0^{-1}S.$$

Under $H_1$, the estimators are $\bar x$ and $S$, so that, as we saw in the previous section:

$$L(H_1)=-\frac n2\log|S|-\frac{np}2.$$

The difference of supports is

$$\lambda=2(L(H_1)-L(H_0))=n\log\frac{|V_0|}{|S|}+n\operatorname{tr}V_0^{-1}S-np. \tag{10.13}$$

We see that the test consists of comparing $V_0$, a theoretical value, with $S$, and that the comparison is made using both the determinant and the trace. The asymptotic distribution of $\lambda$ is a $\chi^2$, with degrees of freedom equal to the difference of the dimensions of the spaces in which the parameters vary under the two hypotheses, which is $p(p+1)/2$, the number of distinct terms in $V$.

In particular, this test is useful for testing whether $V_0=I$. The statistic (10.13) then reduces to

$$\lambda=-n\log|S|+n\operatorname{tr}S-np.$$

10.6.2 Test of independence

Another interesting test is that of independence, where we assume that the matrix $V_0$ is diagonal. That is:

$$H_0:V\ \text{diagonal},$$

against the alternative:

$$H_1:V\ \text{not diagonal}.$$

Then the maximum likelihood estimate of $V_0$ is $\hat V_0=\operatorname{diag}(S)$, where $\operatorname{diag}(S)$ is the diagonal matrix whose terms $s_{ii}$ are the diagonal terms of $S$, and the statistic (10.13) reduces to

$$\lambda=n\log\frac{\prod s_{ii}}{|S|}+n\operatorname{tr}\hat V_0^{-1}S-np,$$

and since $\operatorname{tr}\hat V_0^{-1}S=\operatorname{tr}\hat V_0^{-1/2}S\hat V_0^{-1/2}=\operatorname{tr}R=p$, where $R$ is the sample correlation matrix, and $|S|=|R|\prod s_{ii}$, the test reduces to:

$$\lambda=-n\log|R| \tag{10.14}$$

which is usually written in terms of the eigenvalues of $R$. Letting $\lambda_i$ be those eigenvalues, an equivalent form of this test is:

$$\lambda=-n\sum_{i=1}^{p}\log\lambda_i$$

and its asymptotic distribution is a $\chi^2$ with degrees of freedom equal to $p(p+1)/2-p=p(p-1)/2$.


10.6.3 Sphericity test

An important case of the above test is when all of the variables have the same variance and are uncorrelated. In this case, we gain nothing from analyzing them jointly since they have no information in common. This test is equivalent to assuming that the matrix $V_0$ is scalar, in other words, $V_0=\sigma^2I$, and is called a sphericity test since the distribution of the variables has level surfaces which are spheres: there is total symmetry in all directions of the space. The test is

$$H_0:V=\sigma^2I,$$

as opposed to:

$$H_1:V\ne\sigma^2I.$$

Replacing $V_0=\sigma^2I$ in the support under $H_0$ used to obtain (10.13), we have

$$L(H_0)=-\frac{np}2\log\sigma^2-\frac{n}{2\sigma^2}\operatorname{tr}S$$

and taking the derivative with respect to $\sigma^2$ it is immediately seen that the ML estimator is $\hat\sigma^2=\operatorname{tr}S/p$, the average of the variances. The support $L(H_1)$ is the same as in the above test and the difference is

$$\lambda=n\log\frac{\hat\sigma^{2p}}{|S|}+n\operatorname{tr}S/\hat\sigma^2-np. \tag{10.15}$$

Substituting $\hat\sigma^2=\operatorname{tr}S/p$, the test reduces to:

$$\lambda=np\log\hat\sigma^2-n\log|S|$$

and is asymptotically distributed as a $\chi^2$ with $p(p+1)/2-1=(p+2)(p-1)/2$ degrees of freedom.

10.6.4 (*)Test of partial sphericity

The fourth test we will study is called the partial sphericity test because it assumes that the covariance matrix has dependency in an $m$-dimensional space, while sphericity holds in the complementary space of dimension $p-m$. Thus the structure of dependencies between the variables can be explained as a function of $m$ variables, as we will see in the factor model. Note that there is no need to test whether a square matrix of order $p$ has rank $m<p$ because, in that case, the matrix must have exactly $p-m$ null eigenvalues. If this is the case, we will see it by calculating its eigenvalues, since if this condition holds in the population it must also hold in every sample. Nevertheless, it makes sense to test whether the matrix has $m$ relatively large eigenvalues, which correspond to $m$ informative directions, and $p-m$ small and equal eigenvalues, which correspond to the non-informative directions. This is partial sphericity. The test is:

$$H_0:V=B+\sigma^2I,\quad\operatorname{rank}(B)=m,$$

tested against:

$$H_1:V\ \text{general}.$$

It can be proved using the same principles (see Anderson, 1963) that, letting $\lambda_i$ be the eigenvalues of $S$, the test statistic is:

$$\lambda=-n\sum_{i=m+1}^{p}\log\lambda_i+n(p-m)\log\frac{\sum_{j=m+1}^{p}\lambda_j}{p-m} \tag{10.16}$$

and it follows asymptotically a $\chi^2$ distribution with $(p-m+2)(p-m-1)/2$ degrees of freedom. We see that if $m=0$ this test reduces to that of sphericity (10.15), and that if, in addition, the variables are standardized, so that $\sum_{j=1}^{p}\lambda_j=p$, the second term disappears and the test reduces to (10.14).
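Given the eigenvalues of the sample covariance matrix, the statistic (10.16) and its degrees of freedom are simple to compute. The sketch below assumes the eigenvalues have already been obtained by some other means; the function name and the numerical values are hypothetical.

```python
import math

def partial_sphericity_stat(eigenvalues, n, m):
    """Statistic (10.16) and its degrees of freedom.

    The m largest eigenvalues are set aside and the remaining p - m
    are tested for equality.
    """
    lam = sorted(eigenvalues, reverse=True)
    tail = lam[m:]
    k = len(tail)                                   # k = p - m
    stat = (-n * sum(math.log(v) for v in tail)
            + n * k * math.log(sum(tail) / k))
    df = (k + 2) * (k - 1) // 2
    return stat, df

# Hypothetical eigenvalues: two large ones and three that are nearly equal.
eigs = [5.0, 3.0, 1.1, 1.0, 0.9]
stat, df = partial_sphericity_stat(eigs, n=50, m=2)
```

The statistic is zero when the remaining eigenvalues are exactly equal and grows as they spread out, since it compares their arithmetic and geometric means. With `m=0` the same function returns the sphericity statistic.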

10.6.5 Improving the adjustments to the asymptotic distribution

The approximation of the distribution of the likelihood ratio statistic, $\lambda$, by its asymptotic $\chi^2$ distribution when the sample size is not large can be improved by introducing correction factors. Box (1949) and Bartlett (1954) showed that the approximations improve if, in the earlier calculations, we replace $n$ with $n_c$, where $n_c$ is less than $n$ and depends on $p$ and on the test. For example, Box (1949) proved that the test of independence improves if we replace $n$ with $n_c=n-(2p+11)/6$. These corrections can be important if the sample size is small, $p$ is large, and the statistic obtained is close to the critical value. They will not be important, however, if $p/n$ is small and the resulting statistic is clearly conclusive in either direction. The interested reader can turn to Muirhead (1982) for further reading.

Example: Test whether we can accept that the covariance matrix of the quality measurements in exercise 10.1 has the form $\sigma^2I$. If it does not, test whether the variables, although of different variance, are independent.

The estimate of $\sigma^2$ under the null hypothesis is $\hat\sigma^2=\operatorname{tr}S/p=(10+12+4)/3=8.67$. In addition, it can be checked that $|S|=146$. Thus

$$\lambda=60\log 8.67-20\log 146=29.92,$$

which must be compared with a $\chi^2$ with $(3+2)(3-1)/2=5$ degrees of freedom. The value obtained is clearly significant, so we reject that the variables have the same variance and are uncorrelated.

To carry out the test of independence we transform the variables, dividing each one by its standard deviation. That is, we move to new variables $z_1=x_1/\sqrt{10}$, $z_2=x_2/\sqrt{12}$, $z_3=x_3/\sqrt{4}$. In order to find the new covariance matrix, letting $D$ be the diagonal matrix with elements $(1/\sqrt{10},1/\sqrt{12},1/\sqrt{4})$, we have:

$$V_z=DV_xD'=\begin{pmatrix}1&0.3651&-0.7906\\0.3651&1&-0.4330\\-0.7906&-0.4330&1\end{pmatrix}=R_x$$

and now the test is

$$\lambda=-20\log 0.304=23.8,$$

which we must compare with a $\chi^2$ with 3 degrees of freedom, thus clearly rejecting the hypothesis of independence.
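The two statistics of this example can be reproduced from the covariance matrix of the quality measurements with the following sketch. The helper `det3` is a hypothetical name for a cofactor expansion of a 3x3 determinant; everything else follows the sphericity and independence statistics derived above.

```python
import math

def det3(A):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

n, p = 20, 3
S = [[10.0, 4.0, -5.0], [4.0, 12.0, -3.0], [-5.0, -3.0, 4.0]]

det_S = det3(S)
sigma2_hat = sum(S[i][i] for i in range(p)) / p          # tr(S) / p

# Sphericity: lambda = n p log(sigma2_hat) - n log|S|, with (p+2)(p-1)/2 df.
lam_sphericity = n * p * math.log(sigma2_hat) - n * math.log(det_S)
df_sphericity = (p + 2) * (p - 1) // 2

# Independence: lambda = -n log|R|, where |R| = |S| / prod(s_ii), with p(p-1)/2 df.
det_R = det_S / math.prod(S[i][i] for i in range(p))
lam_independence = -n * math.log(det_R)
df_independence = p * (p - 1) // 2
```

Small differences from the figures in the text are to be expected, because the code keeps the unrounded estimate of the common variance and the unrounded determinant of the correlation matrix.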

10.7 Testing equality of several means: the Multivariate Analysis of Variance

Assume that we have observed a sample of size $n$ of a $p$-dimensional variable which can be stratified into $G$ classes or groups, so that there are $n_1$ observations of group 1, ..., $n_G$ of group $G$. An important problem is to test whether the means of the $G$ classes or groups are equal. We will solve this by applying the likelihood ratio test. The hypothesis to be tested is:

$$H_0:\mu_1=\mu_2=\ldots=\mu_G=\mu,$$

where, additionally, $V$ is a positive definite matrix, identical in all the groups. The alternative hypothesis is:

$$H_1:\text{not all the }\mu_i\text{ are equal},$$

with the same conditions for $V$.

The likelihood function under $H_0$ of a homogeneous normal sample was calculated in section 10.2, and we know that its maximum is reached when $\hat\mu=\bar x$ and $\hat V=S$. Substituting these estimates in the support function we have that

$$L(H_0)=-\frac n2\log|S|-\frac{np}2. \tag{10.17}$$

Under $H_1$, the $n$ observed vectors are subdivided into $n_1$ of group 1, ..., $n_G$ of group $G$. The likelihood function under $H_1$ is

$$f(\mu_1,\ldots,\mu_G,V\mid X)=|V|^{-n/2}(2\pi)^{-np/2}\exp\left\{-\frac12\sum_{g=1}^{G}\sum_{h=1}^{n_g}(x_{hg}-\mu_g)'V^{-1}(x_{hg}-\mu_g)\right\},$$

where $x_{hg}$ is the $h$-th vector of variables of group $g$, and $\mu_g$ is the mean of that group.

The maximization of this function in the parameter space defined by $H_1$ is carried out using the procedure studied in 10.2. The estimate of the mean of each group is the sample mean, $\hat\mu_g=\bar x_g$, and the estimate of the common covariance matrix is obtained using:

$$\begin{aligned}
\sum_{g=1}^{G}\sum_{h=1}^{n_g}(x_{hg}-\bar x_g)'V^{-1}(x_{hg}-\bar x_g)
&=\operatorname{tr}\left(\sum_{g=1}^{G}\sum_{h=1}^{n_g}(x_{hg}-\bar x_g)'V^{-1}(x_{hg}-\bar x_g)\right)\\
&=\sum_{g=1}^{G}\sum_{h=1}^{n_g}\operatorname{tr}\left(V^{-1}(x_{hg}-\bar x_g)(x_{hg}-\bar x_g)'\right)
=\operatorname{tr}\left(V^{-1}W\right),
\end{aligned}$$

where

$$W=\sum_{g=1}^{G}\sum_{h=1}^{n_g}(x_{hg}-\bar x_g)(x_{hg}-\bar x_g)' \tag{10.18}$$

is the matrix of sums of squares within the groups. Substituting in the likelihood function and taking logarithms, we obtain

$$L(V\mid X)=\frac n2\log|V^{-1}|-\frac n2\operatorname{tr}V^{-1}W/n$$

and, according to the results of 10.2, the common covariance matrix of the groups when they have different means is estimated by:

$$\hat V=S_w=\frac1nW, \tag{10.19}$$

where $W$ is given by (10.18). Substituting these expressions in the support function we have

$$L(H_1)=-\frac n2\log|S_w|-\frac{np}2. \tag{10.20}$$

The difference of supports is:

$$\lambda=n\log\frac{|S|}{|S_w|} \tag{10.21}$$

and we reject $H_0$ when this difference is large, that is, when the variability assuming a common mean, measured by $|S|$, is much greater than the variability when we allow the group means to differ, measured by $|S_w|$. Its distribution is, asymptotically, a $\chi^2_g$, where the degrees of freedom, $g$, are obtained as the difference between the dimensions of the two parameter spaces. $H_0$ determines a region $\Omega_0$ where we have to estimate the $p$ components of the vector of common means and the covariance matrix, a total of $p+p(p+1)/2$ parameters. Under the hypothesis $H_1$ we have to estimate $G$ vectors of means as well as the covariance matrix, which entails $Gp+p(p+1)/2$ parameters. The difference is $g$:

$$g=\dim(\Omega)-\dim(\Omega_0)=p(G-1), \tag{10.22}$$

which will be the degrees of freedom of the asymptotic distribution.

The $\chi^2_g$ approximation to the distribution of the likelihood ratio statistic can be improved for small sample sizes. It can be proved that the statistic:

$$\lambda_0=m\log\frac{|S|}{|S_w|}, \tag{10.23}$$

where

$$m=(n-1)-(p+G)/2,$$

asymptotically follows a $\chi^2_g$ distribution, where $g$ is given by (10.22), and that in small samples the approximation is better than the one obtained by taking $m=n$.

The multivariate analysis of variance

This test is the multivariate generalization of the analysis of variance, and it can be arrived at in two ways: through the likelihood ratio, as above, or through the decomposition of variability described next. Let the total variability of the data be:

$$T=\sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)', \tag{10.24}$$

which measures the deviations with respect to a common mean. We are going to decompose the matrix $T$ as the sum of two matrices. The first, $W$, is the within-groups variability, or matrix of deviations with respect to the means of each group, and is given by (10.18). The second measures the between-groups variability, explained by the differences between means, and will be denoted by $B$. This decomposition generalizes the classic decomposition of the analysis of variance to the vector case. In order to obtain it we add and subtract the group means in the expression of $T$:

$$T=\sum_{g=1}^{G}\sum_{h=1}^{n_g}(x_{hg}-\bar x+\bar x_g-\bar x_g)(x_{hg}-\bar x+\bar x_g-\bar x_g)'$$

and expanding, it is easy to see that the cross products cancel and the result is:

$$T=B+W, \tag{10.25}$$

where $T$ is given by (10.24), $W$ by (10.18) and $B$ is calculated by:

$$B=\sum_{g=1}^{G}n_g(\bar x_g-\bar x)(\bar x_g-\bar x)'.$$

The decomposition (10.25) can be expressed as

Total Variability (T) = Explained Variability (B) + Residual Variability (W)

which is the usual decomposition of the analysis of variance.

In order to test whether the means are equal we can compare the sizes of the matrices $T$ and $W$. Their size is measured by the determinant, and the test is based on the ratio $|T|/|W|$. The exact distribution of this ratio was studied by Wilks and can be approximated by an $F$ distribution (see appendix 10.4). For moderate sample sizes, the test is similar to the likelihood ratio test (10.23), which can also be written as:

$$\lambda_0=m\log\frac{|T|}{|W|}=m\log\frac{|W+B|}{|W|}=m\log|I+W^{-1}B|. \tag{10.26}$$

To compute (10.26), note that $|I+A|=\prod(1+\lambda_i)$, where $\lambda_i$ are the eigenvalues of $A$, so this statistic reduces to

$$\lambda_0=m\sum\log(1+\lambda_i),$$

where $\lambda_i$ are the eigenvalues of the matrix $W^{-1}B$.
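The decomposition of the total variability and the corrected statistic can be sketched for a small two-group, two-variable data set. The data and the helper names below are hypothetical; the code only illustrates how the within, between and total matrices are built and how the statistic and its degrees of freedom follow from them.

```python
import math

def mean(rows):
    """Mean vector of a list of observations."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def scatter(rows, center):
    """Sum over rows of (x - center)(x - center)'."""
    p = len(center)
    return [[sum((r[i] - center[i]) * (r[j] - center[j]) for r in rows)
             for j in range(p)] for i in range(p)]

def add(A, B):
    """Elementwise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

groups = [                                    # hypothetical data, G = 2 groups
    [[1.0, 2.0], [2.0, 3.0], [3.0, 2.5]],
    [[4.0, 5.0], [5.0, 5.5], [6.0, 7.0]],
]
all_rows = [r for g in groups for r in g]
n, p, G = len(all_rows), 2, len(groups)
xbar = mean(all_rows)

T = scatter(all_rows, xbar)                   # total variability
W = [[0.0, 0.0], [0.0, 0.0]]                  # within-groups variability
B = [[0.0, 0.0], [0.0, 0.0]]                  # between-groups variability
for g in groups:
    W = add(W, scatter(g, mean(g)))
    # n_g (xbar_g - xbar)(xbar_g - xbar)': scatter of n_g copies of the group mean
    B = add(B, scatter([mean(g)] * len(g), xbar))

m = (n - 1) - (p + G) / 2                     # small-sample correction factor
lam0 = m * math.log(det2(T) / det2(W))        # corrected statistic
df = p * (G - 1)                              # degrees of freedom
```

For more than two variables the determinant helper would be replaced by a general routine, but the construction of the three matrices is unchanged.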

Example: We are going to apply this test to see if detectable differences are observed in small samples in the MEDIFIS data, between the body measurements of men and women in Table A.5. In the sample there are 15 women (sex variable = 0) and 12 men (sex = 1). The first step in the analysis is to calculate the means and covariance matrices of each group, separately, and for the data set. The following table shows measurements for each variable, for the whole sample, and for groups of men and women.

ht wt f tl arml bwth crd kn−al total 168.78 63.89 38.98 73.46 45.85 57.24 43.09 women 161.73 55.60 36.83 70.03 43.33 56.63 41.06 men 177.58 74.25 41.67 77.75 49.00 58.00 45.62

The covariance matrices, computed dividing by n − 1, for women, for men and for the two groups combined are as follows (only the lower triangle is shown).

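For women:

$$
\widehat{S}_M=\begin{bmatrix}
37.64 & & & & & & \\
22.10 & 80.40 & & & & & \\
6.38 & 7.36 & 1.92 & & & & \\
15.65 & 12.94 & 3.06 & 7.41 & & & \\
9.49 & 14.39 & 1.49 & 3.99 & 9.42 & & \\
2.75 & 7.20 & 0.76 & 1.17 & 2.559 & 2.94 & \\
9.02 & 9.31 & 1.98 & 4.53 & 1.12 & 0.95 & 3.78
\end{bmatrix}
$$
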

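For men:

$$
\widehat{S}_H=\begin{bmatrix}
45.53 & & & & & & \\
48.84 & 74.20 & & & & & \\
9.48 & 9.63 & 2.79 & & & & \\
14.34 & 19.34 & 2.09 & 12.57 & & & \\
14.86 & 19.77 & 3.23 & 6.18 & 6.77 & & \\
9.45 & 9.90 & 1.86 & 2.36 & 3.02 & 3.13 & \\
8.92 & 5.23 & 2.31 & 1.21 & 1.84 & 2.63 & 6.14
\end{bmatrix}
$$
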


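and for the set of men and women together, we calculate it as a weighted mean of these two matrices,

$$
\widehat{S}_T=(14\,\widehat{S}_M+11\,\widehat{S}_H)/25,
$$

which gives us

$$
\widehat{S}_T=\begin{bmatrix}
41.11 & & & & & & \\
33.86 & 77.67 & & & & & \\
7.476 & 8.36 & 2.30 & & & & \\
15.07 & 15.76 & 2.63 & 9.68 & & & \\
11.85 & 16.76 & 2.25 & 4.95 & 8.25 & & \\
5.70 & 8.390 & 1.24 & 1.70 & 2.76 & 3.03 & \\
8.98 & 7.52 & 2.13 & 3.07 & 1.44 & 1.70 & 4.82
\end{bmatrix}
$$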
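As a quick arithmetic check of the weighted mean, the sketch below pools only the leading 2×2 blocks (ht, wt) of the two group matrices. The variable names are illustrative assumptions; the numbers are copied from the matrices for women and men.

```python
import numpy as np

# Leading 2x2 blocks (ht, wt) of the covariance matrices for women and men.
S_women = np.array([[37.64, 22.10],
                    [22.10, 80.40]])
S_men = np.array([[45.53, 48.84],
                  [48.84, 74.20]])
n_women, n_men = 15, 12

# Weighted mean with weights n_g - 1, that is (14 S_M + 11 S_H) / 25.
S_pooled = ((n_women - 1) * S_women + (n_men - 1) * S_men) / (n_women + n_men - 2)
print(np.round(S_pooled, 2))
```

The result agrees with the corresponding entries of the combined matrix (41.11, 33.86 and 77.67) up to rounding in the last digit.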







We now calculate the likelihood ratio as a quotient of the average variabilities under the two hypotheses. Under H₀, the covariance matrix estimated assuming a common mean, S, leads to the average variability

$$
VP(H_0)=|S|^{1/7}=5.77,
$$

whereas

$$
VP(H_1)=|S_w|^{1/7}=4.67,
$$

and thus the test statistic is

$$
27\left((27-1)-(7+7)/2\right)\log(5.77/4.67)=108.5,
$$

which must be compared with a χ² distribution with 7 degrees of freedom; there is no doubt that the differences are significant.

10.8 Tests for outliers

The test of equality of means can be applied, as a particular case, to test whether an observation in a sample of normal data is an outlier. The null hypothesis is that all of the data come from the same normal population. The alternative hypothesis is that the suspicious observation has been generated by another, unknown population. To characterize this alternative population we can assume (1) that the mean is different, or (2) that the mean is the same and the variance is different. If we assumed that both the mean and the covariance matrix are different we would have an identification problem, as it is impossible to estimate both the mean and the variability from a single observation. It can be proved that the tests which assume a different mean or a different variance are similar (see Peña and Guttman, 1993), and here we will look at the simplest case of a different mean with the same covariance matrix. To apply this test to a suspicious observation, x_i, we establish:

$$
H_0: E(x_i)=\mu
$$

compared to

$$
H_1: E(x_i)=\mu_i\neq\mu.
$$

The likelihood function under H₀ is (10.17). Under H₁, since the estimate of µ_i is x_i, the estimate of the covariance matrix is

$$
S_{(i)}=\frac{1}{n-1}W_{(i)},
$$

where

$$
W_{(i)}=\sum_{h=1,\,h\neq i}^{n}(x_h-\bar{x}_{(i)})(x_h-\bar{x}_{(i)})'
$$

is the residual sum of squares matrix and x̄_(i) is the mean of the observations, both computed after eliminating the observation x_i.

The difference in supports is, particularizing (10.26):

$$
\lambda=n\log\frac{|T|}{|W_{(i)}|},
$$

and it is shown in Appendix 10.3 that

$$
\frac{|T|}{|W_{(i)}|}=1+\frac{1}{n}D^2(x_i,\bar{x}_{(i)}),
$$

where

$$
D^2(x_i,\bar{x}_{(i)})=(x_i-\bar{x}_{(i)})'S_{(i)}^{-1}(x_i-\bar{x}_{(i)}) \tag{10.27}
$$

is the Mahalanobis distance between the observation and the mean of the remaining data, computed without including it. Thus, to carry out the test we calculate the Mahalanobis distance (10.27), which, if H₀ is true, is distributed for large samples as a χ²_p.
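The relationship between the determinant ratio and the Mahalanobis distance can be verified numerically. The following NumPy sketch uses simulated data with one shifted observation; the function name `loo_mahalanobis` is an illustrative assumption, and S_(i) is computed as W_(i)/(n − 1), as defined above.

```python
import numpy as np

def loo_mahalanobis(X, i):
    """Return D^2(x_i, xbar_(i)) and W_(i), both computed without observation i."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rest = np.delete(X, i, axis=0)       # the n - 1 remaining observations
    mean_rest = rest.mean(axis=0)
    resid = rest - mean_rest
    W_i = resid.T @ resid
    S_i = W_i / (n - 1)
    d = X[i] - mean_rest
    D2 = float(d @ np.linalg.solve(S_i, d))
    return D2, W_i

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
X[0] += 4.0                              # make observation 0 suspicious
n = X.shape[0]

D2, W_i = loo_mahalanobis(X, 0)
centered = X - X.mean(axis=0)
T = centered.T @ centered
ratio = np.linalg.det(T) / np.linalg.det(W_i)
print(np.isclose(ratio, 1 + D2 / n))
```

The identity is exact, so it holds for every observation, whether or not it is an outlier.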

In practice, when we want to detect outliers we calculate the distances D²(x_i, x̄_(i)) and compare them with the 0.95 or 0.99 percentile from tables of the maximum of a χ²_p. The problem is that if there is more than one outlier the power of the test can be quite low, as the parameter estimates may be contaminated. A more advisable procedure whenever we work with samples that might be heterogeneous is to first identify all suspicious observations using the procedures indicated in Chapter 3, and then test them one by one. That is, we order the suspicious observations by D²(x_i, x̄_(i)) and test whether the closest one can be incorporated into the sample. If this incorporation is rejected, the procedure terminates and all the suspicious observations are declared outliers. Otherwise, the observation is incorporated into the sample, the parameters and Mahalanobis distances are recalculated, and the procedure is repeated with the remaining excluded observations.
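The one-by-one procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name, the way the suspicious set is supplied, and the cutoff 13.8 (close to the 0.999 quantile of a χ² with 2 degrees of freedom) are assumptions; in practice the cutoff would be taken from tables of the maximum of a χ²_p.

```python
import numpy as np

def sequential_outlier_test(X, suspects, threshold):
    """Test suspicious observations one by one, starting with the closest."""
    X = np.asarray(X, dtype=float)
    remaining = list(suspects)
    suspect_set = set(remaining)
    clean = [h for h in range(X.shape[0]) if h not in suspect_set]
    while remaining:
        base = X[clean]
        mean = base.mean(axis=0)
        # Covariance of the clean set, dividing by its number of observations.
        S = np.cov(base, rowvar=False, bias=True)
        d2 = [float((X[i] - mean) @ np.linalg.solve(S, X[i] - mean))
              for i in remaining]
        k = int(np.argmin(d2))
        if d2[k] > threshold:
            break                        # closest suspect rejected: stop
        clean.append(remaining.pop(k))   # incorporate it and re-estimate
    return sorted(remaining), sorted(clean)

# Deterministic toy data: 20 points on the unit circle plus two suspects.
angles = 2 * np.pi * np.arange(20) / 20
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([circle, [[0.5, 0.5], [10.0, 10.0]]])

outliers, clean = sequential_outlier_test(X, suspects=[20, 21], threshold=13.8)
print(outliers)  # [21]
```

In this toy example the point (0.5, 0.5) is at squared distance 1 from the clean set and is incorporated, while the point (10, 10) remains far above the cutoff and is declared an outlier.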


10.9 Normality Tests

The most often used methods in multivariate analysis assume joint normality in the observations and, when we have enough data, this hypothesis should be tested.

Univariate normality

The normality of univariate distributions can be tested using the χ², Kolmogorov-Smirnov or Shapiro-Wilk tests, or with tests based on the coefficients of asymmetry and kurtosis,

$$
A=\frac{m_3}{m_2^{3/2}};\qquad K=\frac{m_4}{m_2^{2}},
$$

where

$$
m_h=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^h.
$$

Asymptotically, with normal data, it can be proved that

$$
A\sim N(0;\,6/n);\qquad K\sim N(3;\,24/n),
$$

and thus the variable

$$
X^2=\frac{nA^2}{6}+\frac{n(K-3)^2}{24}
$$

will be distributed, if the hypothesis is true, as a χ² with 2 degrees of freedom. We reject the hypothesis of normality if X² > χ²₂(α).
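A minimal NumPy sketch of this test follows. The function name is an illustrative assumption. The p-value uses the fact that the upper tail probability of a χ² with 2 degrees of freedom at the value x is exp(−x/2), so no statistical tables are needed.

```python
import numpy as np

def skew_kurt_normality_test(x):
    """Return (A, K, X2, p_value) for the asymmetry and kurtosis test."""
    x = np.asarray(x, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2, m3, m4 = (np.mean(dev ** h) for h in (2, 3, 4))
    A = m3 / m2 ** 1.5                   # coefficient of asymmetry
    K = m4 / m2 ** 2                     # coefficient of kurtosis
    X2 = n * A ** 2 / 6 + n * (K - 3) ** 2 / 24
    p_value = float(np.exp(-X2 / 2))     # upper tail of a chi-square with 2 df
    return A, K, X2, p_value

rng = np.random.default_rng(3)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

for name, sample in (("normal", normal_sample), ("exponential", skewed_sample)):
    A, K, X2, p_value = skew_kurt_normality_test(sample)
    print(f"{name}: A={A:.2f} K={K:.2f} X2={X2:.1f} p={p_value:.3g}")
```

Since the exponential distribution has asymmetry coefficient 2, the statistic for that sample is far above the 0.95 quantile of the χ²₂ (5.99) and normality is rejected.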

Multivariate normality

Multivariate normality implies normality of the univariate marginal distributions, but marginal normality does not guarantee the multivariate normality of the data. To test joint normality there are several possible tests, and here we will only comment on the multivariate generalization of the asymmetry and kurtosis tests (see Justel, Peña and Zamar (1997) for a generalization of the Kolmogorov-Smirnov test to the multivariate case).

Defining the multivariate asymmetry and kurtosis coefficients as in Section 3.6:

$$
A_p=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}d_{ij}^{3},\qquad K_p=\frac{1}{n}\sum_{i=1}^{n}d_{ii}^{2},
$$

where d_ij = (x_i − x̄)′S⁻¹(x_j − x̄), it can be proved that, asymptotically,

$$
nA_p/6\sim\chi^2_f\quad\text{with}\quad f=\frac{1}{6}p(p+1)(p+2),
$$

$$
K_p\sim N\left(p(p+2);\,8p(p+2)/n\right).
$$

The power of this test is not high unless we have a very large sample size.
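The multivariate coefficients can be computed as in the following NumPy sketch, with d_ij taken between observations i and j. The function name is illustrative, and the covariance matrix S is computed dividing by n, which is an assumption here; Section 3.6 gives the definition used in the text.

```python
import numpy as np

def mardia_coefficients(X):
    """Return (A_p, K_p, skew_stat, f, kurt_z) for data X (n x p)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                  # assumption: divisor n
    D = centered @ np.linalg.solve(S, centered.T)  # D[i, j] = d_ij
    A_p = np.sum(D ** 3) / n ** 2
    K_p = np.sum(np.diag(D) ** 2) / n
    f = p * (p + 1) * (p + 2) / 6
    skew_stat = n * A_p / 6                        # approx. chi-square with f df
    kurt_z = (K_p - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)  # approx. N(0, 1)
    return A_p, K_p, skew_stat, f, kurt_z

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
A_p, K_p, skew_stat, f, kurt_z = mardia_coefficients(X)
print(f"A_p={A_p:.3f}  K_p={K_p:.2f}  (expected K_p about {3 * 5})")
print(f"n*A_p/6={skew_stat:.1f} on f={f:.0f} df,  z for K_p={kurt_z:.2f}")
```

The matrix D has n × n entries, so for very large samples it may be preferable to accumulate the sums in blocks.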

Two frequent cases in practice in which the hypothesis of joint normality is rejected are:

(1) The marginal distributions are approximately symmetric, and the relationships between the variables are linear, but there are atypical values that cannot be explained using the hypothesis of normality. In this case, if we eliminate (or exclude using a robust estimator) the outliers, the joint normality is not rejected and the methods based on normality can yield good results.

(2) Some of the marginal distributions are asymmetric and there are non-linear relationships between the variables. A simple solution, and one which works well in many cases is to transform the variables in order to obtain symmetry and linear relationships.

10.9.1 Transformations

For scalar variables Box and Cox (1964) suggested the following family of transformations for obtaining normality:

$$
x^{(\lambda)}=\begin{cases}
\dfrac{(x+m)^{\lambda}-1}{\lambda} & (\lambda\neq 0)\quad (x>-m)\\[6pt]
\ln(x+m) & (\lambda=0)\quad (m>0)
\end{cases}
$$

where λ is the transformation parameter, which is estimated from the data, and the constant m is chosen so that x + m is always positive. Thus, m will be zero if we work with positive data; otherwise it is taken equal to the absolute value of the most negative value observed. Assuming m = 0, this family includes the logarithmic, square root and inverse transformations. When λ > 1, the transformation produces a greater separation or dispersion of the large values of x, the more so the greater the value of λ, whereas when λ < 1 the effect is the opposite: the larger values of x tend to concentrate and the smaller values (x < 1) to disperse.
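The family can be written as a small function. This is a sketch with an illustrative function name; it requires x + m to be strictly positive, as the definition does.

```python
import numpy as np

def box_cox(x, lam, m=0.0):
    """Box-Cox transformation of x with parameter lam and shift m."""
    shifted = np.asarray(x, dtype=float) + m
    if np.any(shifted <= 0):
        raise ValueError("x + m must be strictly positive")
    if lam == 0:
        return np.log(shifted)           # logarithmic case
    return (shifted ** lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox(x, 0.5))   # square root, up to a linear change
print(box_cox(x, 0))     # logarithm
print(box_cox(x, -1))    # inverse, written as 1 - 1/x
```

For λ = 1 the transformation is simply x − 1, which leaves the shape of the distribution unchanged.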

These transformations are very useful for marginal distributions. To study how to determine the value of the parameter for a scalar variable, we assume that m = 0 and that there is a value of λ which transforms the variable to normality. The relationship between the model for the original data, x, and for the transformed data, x^(λ), is:

$$
f(x)=f(x^{(\lambda)})\left|\frac{dx^{(\lambda)}}{dx}\right|, \tag{10.28}
$$

and as:

$$
\frac{dx^{(\lambda)}}{dx}=\frac{\lambda x^{\lambda-1}}{\lambda}=x^{\lambda-1}
$$


and assuming that x^(λ) is N(µ, σ²) for a certain value of λ, the density function of the original variable is:

$$
f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(\frac{x^{\lambda}-1}{\lambda}-\mu\right)^{2}\right\}x^{\lambda-1}.
$$

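Therefore, the joint density function of X = (x_1, ..., x_n), due to the independence of the observations, will be:

$$
f(X)=\frac{1}{\sigma^{n}\left(\sqrt{2\pi}\right)^{n}}\left(\prod_{i=1}^{n}x_i^{\lambda-1}\right)\exp\left\{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(\frac{x_i^{\lambda}-1}{\lambda}-\mu\right)^{2}\right\}, \tag{10.29}
$$

and the support function is: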

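$$
L\left(\lambda;\mu,\sigma^{2}\right)=-\frac{n}{2}\ln\sigma^{2}-\frac{n}{2}\ln 2\pi+(\lambda-1)\sum_{i=1}^{n}\ln x_i-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(\frac{x_i^{\lambda}-1}{\lambda}-\mu\right)^{2}.
$$

To maximize this function for a fixed λ, the values of σ² and µ which maximize the likelihood (or the support) are obtained by differentiating and setting the derivatives to zero:

$$
\hat{\sigma}^{2}(\lambda)=\frac{1}{n}\sum_{i=1}^{n}\left(x_i^{(\lambda)}-\hat{\mu}(\lambda)\right)^{2},
$$

$$
\hat{\mu}(\lambda)=\bar{x}^{(\lambda)}=\frac{\sum_{i=1}^{n}x_i^{(\lambda)}}{n}=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i^{\lambda}-1}{\lambda}\right).
$$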

By substituting these values in the likelihood we obtain what is called the concentrated likelihood function in λ. Disregarding the constants, its expression is:

$$
L(\lambda)=-\frac{n}{2}\ln\hat{\sigma}(\lambda)^{2}+(\lambda-1)\sum_{i=1}^{n}\ln x_i. \tag{10.30}
$$

The procedure for obtaining λ̂ consists of calculating L(λ) for different values of λ. The value which maximizes this function is the ML estimator of the transformation parameter.
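The search over λ can be sketched as a simple grid evaluation of (10.30) with m = 0. The grid limits and step, the function names and the simulated lognormal data (for which the logarithm, λ = 0, is the normalizing transformation) are assumptions made for illustration.

```python
import numpy as np

def concentrated_loglik(x, lam):
    """Concentrated support L(lam) of (10.30), for positive data and m = 0."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Use the logarithm when lam is numerically zero to avoid dividing by zero.
    z = np.log(x) if abs(lam) < 1e-12 else (x ** lam - 1.0) / lam
    sigma2_hat = np.mean((z - z.mean()) ** 2)
    return -0.5 * n * np.log(sigma2_hat) + (lam - 1.0) * np.sum(np.log(x))

def estimate_lambda(x, grid=None):
    """Return the grid value that maximizes L(lam), and all the L values."""
    if grid is None:
        grid = np.linspace(-2.0, 2.0, 81)   # assumed grid with step 0.05
    values = np.array([concentrated_loglik(x, lam) for lam in grid])
    return float(grid[np.argmax(values)]), values

rng = np.random.default_rng(4)
x = np.exp(rng.normal(loc=1.0, scale=0.5, size=400))   # lognormal sample
lam_hat, values = estimate_lambda(x)
print(lam_hat)
```

A finer grid around the maximum, or a one-dimensional optimizer, could be used to refine the estimate; plotting L(λ) against λ also shows how flat the function is near its maximum.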

For multivariate normality we assume the existence of a vector of parameters λ = (λ₁, ..., λ_p) which produces multivariate normality, where λ_j is the transformation applied to component j of the vector. Applying an analysis similar to the univariate case, the concentrated multivariate support function in the vector of transformation parameters is:

$$
L(\lambda)=-\frac{n}{2}\ln\left|\hat{\Sigma}\right|+\sum_{j=1}^{p}\left[(\lambda_j-1)\sum_{i=1}^{n}\ln x_{ij}\right],
$$

where the parameters have been estimated by applying the usual formulas to the transformed data:

$$
\hat{\mu}=\frac{1}{n}\sum_{i=1}^{n}x_i^{(\lambda)},
$$