
Chapter 10

INFERENCE USING MULTIVARIATE DATA

Samuel S. Wilks (1906-1964)

American statistician. Wilks first studied architecture, but received his PhD in mathematics. Later, he worked with Karl Pearson at University College, London and with John Wishart at Cambridge. He returned to Princeton as a statistics professor where he stayed throughout his career.

His work included the construction of multivariate generalizations of the analysis of variance and of the multiple correlation coefficient. He was one of the founders of the Institute of Mathematical Statistics (1935), and editor of the journal Annals of Mathematical Statistics for eleven years.

10.1 INTRODUCTION

In this chapter we will present an introduction to inference in multivariate models. It is assumed that the reader is familiar with the basic concepts of inference, and the objective of this chapter is to review results which will be needed for later study. More in-depth presentations can be found in Anderson (1984), Mardia et al. (1979) or Seber (1983).

The first topic to be covered is the estimation of parameters in multivariate normal models using maximum likelihood. Second, the likelihood ratio method is presented as a general procedure for obtaining tests with good properties in large samples. There are other procedures for constructing multivariate tests which are not covered here, but which can be found in Anderson (1984). The next section presents a test for the mean vector of a multivariate normal population. This test is then generalized to test the equality of the mean vectors of several multivariate normal populations with the same covariance matrix, which is the principal tool used in the multivariate analysis of variance. A specific case of this test is the test for outliers, which can be formulated as a test of whether an observation comes from a distribution with a mean differing from that of the rest of the data. Finally, tests of normality are presented together with possible transformations for obtaining it.

10.2 Maximum Likelihood Estimation

The maximum likelihood method, due to Fisher, takes as estimates those parameter values which maximize the likelihood that the model will generate the observed sample. In order to clarify this idea, suppose that we have a simple random sample of $n$ elements of a $p$-dimensional random variable, $\mathbf{x}$, with density function $f(\mathbf{x} \mid \boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_r)'$ is a vector of parameters whose dimension we assume satisfies $r \le pn$. Letting $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be the sample data, the joint density function of the sample is, from the independence of the observations:

$$f(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} f(\mathbf{x}_i \mid \boldsymbol{\theta}).$$

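When the parameter $\boldsymbol{\theta}$ is known, this function determines the probability of appearance of each sample. In estimation, the sample is known but $\boldsymbol{\theta}$ is unknown. We may consider $\boldsymbol{\theta}$ as variable and $\mathbf{X}$ as fixed, and we then obtain a function which we will call the likelihood function, $\ell(\boldsymbol{\theta} \mid \mathbf{X})$, or $\ell(\boldsymbol{\theta})$:

$$\ell(\boldsymbol{\theta} \mid \mathbf{X}) = \ell(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(\mathbf{x}_i \mid \boldsymbol{\theta}), \qquad \mathbf{X} \text{ fixed};\ \boldsymbol{\theta} \text{ variable.} \tag{10.1}$$

The maximum likelihood estimator, or MLE, is the value of $\boldsymbol{\theta}$ which maximizes the likelihood function $\ell(\boldsymbol{\theta})$. Supposing that this function is differentiable and that its maximum does not occur on the boundary of its domain, the maximum is obtained by solving the system of equations:

$$\frac{\partial \ell(\boldsymbol{\theta})}{\partial \theta_1} = 0, \quad \ldots, \quad \frac{\partial \ell(\boldsymbol{\theta})}{\partial \theta_r} = 0.$$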

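The vector $\hat{\boldsymbol{\theta}}$ that satisfies this system of equations corresponds to a maximum of $\ell(\boldsymbol{\theta})$ if the Hessian matrix of second derivatives, $\mathbf{H}$, evaluated at $\hat{\boldsymbol{\theta}}$, is negative definite:

$$\mathbf{H}(\hat{\boldsymbol{\theta}}) = \left( \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_i \, \partial \theta_j} \right)_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} \quad \text{negative definite.}$$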

In this case, $\hat{\boldsymbol{\theta}}$ is the maximum likelihood estimator, or ML estimator, of $\boldsymbol{\theta}$. In practice, it is usually easier to work with the logarithm of the likelihood function:

$$L(\boldsymbol{\theta}) = \ln \ell(\boldsymbol{\theta}) \tag{10.2}$$

which we will call the support function. Since the logarithm is a monotonic transformation, both functions attain their maximum at the same value of $\boldsymbol{\theta}$, but working with the support has three advantages. First, we go from the product of densities (10.1) to the sum of their logarithms, and the resulting expression tends to be simpler than that of the likelihood, which makes it easier to obtain the maximum. Second, by taking logarithms the multiplicative constants of the density function, which are irrelevant for the maximum, become additive and disappear on differentiation. Third, twice the support function with the sign changed provides a general measure for judging the fit of a model to the data, called the deviance:

$$D(\boldsymbol{\theta}) = -2L(\boldsymbol{\theta}).$$

The deviance $D(\boldsymbol{\theta})$ measures the discrepancy between the model and the data. The greater the support, $L(\boldsymbol{\theta})$, the greater the concordance between the value of the parameter and the data, and the smaller the deviance. The deviance appears naturally as a global measure of the fit of a model to the data.
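As a minimal sketch of these definitions, the following Python code evaluates the support and the deviance for a univariate normal model. The sample `xs` and the function names `support` and `deviance` are illustrative assumptions, not part of the text; the deviance is taken as minus twice the support, following the verbal definition above.

```python
import math

def support(mu, sigma2, xs):
    """Support (log-likelihood) of an i.i.d. N(mu, sigma2) sample xs."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def deviance(mu, sigma2, xs):
    """Deviance: twice the support with the sign changed."""
    return -2.0 * support(mu, sigma2, xs)

# Hypothetical sample, chosen only for illustration.
xs = [4.1, 5.3, 4.8, 6.0, 5.2]
n = len(xs)
mu_hat = sum(xs) / n                                  # ML estimate of the mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # ML estimate of the variance

# Moving away from the ML estimates lowers the support and raises the deviance.
for mu, s2 in [(mu_hat + 0.3, sigma2_hat), (mu_hat, 1.5 * sigma2_hat)]:
    print(support(mu_hat, sigma2_hat, xs) >= support(mu, s2, xs),
          deviance(mu_hat, sigma2_hat, xs) <= deviance(mu, s2, xs))
```

Working on the log scale also avoids the numerical underflow that the raw product of densities would suffer for large samples.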

For distributions whose range of possible values is known a priori and does not depend on any parameter, it can be proved (see, for example, Casella and Berger, 1990) that, under very general conditions on the probability model, the maximum likelihood (ML) method provides estimators with the following properties:

1. They are asymptotically unbiased.

2. They are asymptotically normally distributed.

3. They are asymptotically of minimum variance (efficient).

4. If there exists a sufficient statistic for the parameter, the ML estimator is a function of that statistic.

5. They are invariant in the following sense: if $\hat{\boldsymbol{\theta}}$ is the ML estimator of $\boldsymbol{\theta}$, and $g(\boldsymbol{\theta})$ is a function of the parameters, then $g(\hat{\boldsymbol{\theta}})$ is the ML estimator of $g(\boldsymbol{\theta})$.

10.3 Estimation of the p-dimensional normal

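Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be a random sample, where $\mathbf{x}_i \sim N_p(\boldsymbol{\mu}, \mathbf{V})$. We are going to find the ML estimators of the unknown parameters $\boldsymbol{\mu}$ and $\mathbf{V}$. The first step is to build the joint density function of the observations, which, using the expression of the multivariate normal density studied in Chapter 8, is:

$$f(\mathbf{X} \mid \boldsymbol{\mu}, \mathbf{V}) = \prod_{i=1}^{n} |\mathbf{V}|^{-1/2} (2\pi)^{-p/2} \exp\left\{ -\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})' \mathbf{V}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\}$$

and, leaving out constants, the support function will be:

$$L(\boldsymbol{\mu}, \mathbf{V} \mid \mathbf{X}) = -\frac{n}{2} \log|\mathbf{V}| - \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})' \mathbf{V}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}).$$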

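Observe that the quadratic form is non-negative because the matrix $\mathbf{V}$ is positive definite, so this term always lowers the support, and that the term in $\log|\mathbf{V}|$ is also negative whenever $|\mathbf{V}| > 1$. This function tells us the support for possible values of the parameters, given the sample. The greater this function is, the greater the concordance between the parameters and the data. We are going to express this function more conveniently. Letting $\bar{\mathbf{x}} = \sum_{i=1}^{n} \mathbf{x}_i / n$ be the sample mean, writing $(\mathbf{x}_i - \boldsymbol{\mu}) = (\mathbf{x}_i - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu})$ and expanding the quadratic form, we obtain

$$\sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})' \mathbf{V}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})' \mathbf{V}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) + n (\bar{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{V}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}),$$

since the cross-product terms vanish because $\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) = \mathbf{0}$. Concentrating on the first term in this decomposition, since a scalar is equal to its trace: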

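$$\operatorname{tr}\left( \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})' \mathbf{V}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \right) = \sum_{i=1}^{n} \operatorname{tr}\left[ (\mathbf{x}_i - \bar{\mathbf{x}})' \mathbf{V}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \right] = \sum_{i=1}^{n} \operatorname{tr}\left[ \mathbf{V}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) (\mathbf{x}_i - \bar{\mathbf{x}})' \right] = \operatorname{tr}\left( \mathbf{V}^{-1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) (\mathbf{x}_i - \bar{\mathbf{x}})' \right),$$

and letting: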

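$$\mathbf{S} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) (\mathbf{x}_i - \bar{\mathbf{x}})' \tag{10.3}$$

be the sample covariance matrix, and substituting in the support function:

$$L(\boldsymbol{\mu}, \mathbf{V} \mid \mathbf{X}) = -\frac{n}{2} \log|\mathbf{V}| - \frac{n}{2} \operatorname{tr} \mathbf{V}^{-1} \mathbf{S} - \frac{n}{2} (\bar{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{V}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}) \tag{10.4}$$

This is the standard expression for the support in samples from a multivariate normal distribution. Observe that this function depends on the sample only through the values $\bar{\mathbf{x}}$ and $\mathbf{S}$, which will therefore be sufficient statistics for $\boldsymbol{\mu}$ and $\mathbf{V}$. All of the samples that provide the same values of $\bar{\mathbf{x}}$ and $\mathbf{S}$ will lead to the same inferences with respect to the parameters.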
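The decomposition of the quadratic form used to reach (10.4) can be checked numerically. The sketch below does so for $p = 2$; the data, the trial values of $\boldsymbol{\mu}$ and $\mathbf{V}$, and the helper names `inv2` and `quad` are assumptions made only for this illustration.

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, a):
    """Quadratic form u' A u for a 2-vector u and a 2x2 matrix A."""
    return sum(u[i] * a[i][j] * u[j] for i in range(2) for j in range(2))

# Hypothetical sample and trial parameter values.
xs = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0]]
mu = [2.0, 2.0]
V = [[2.0, 0.5], [0.5, 1.0]]

n = len(xs)
xbar = [sum(x[j] for x in xs) / n for j in range(2)]
# Sample covariance matrix with divisor n, as in (10.3).
S = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in xs) / n
      for j in range(2)] for i in range(2)]

Vinv = inv2(V)
# Left-hand side: sum of quadratic forms around mu.
lhs = sum(quad([x[0] - mu[0], x[1] - mu[1]], Vinv) for x in xs)
# Right-hand side: n tr(V^-1 S) + n (xbar - mu)' V^-1 (xbar - mu).
tr_VinvS = sum(Vinv[i][j] * S[j][i] for i in range(2) for j in range(2))
rhs = n * tr_VinvS + n * quad([xbar[0] - mu[0], xbar[1] - mu[1]], Vinv)
print(abs(lhs - rhs) < 1e-9)
```

Both sides agree up to rounding for any choice of `mu` and positive definite `V`, which is why the sample enters the support only through the mean vector and the covariance matrix.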

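In order to obtain the estimator of the mean vector of the population, note that since $\mathbf{V}^{-1}$ is positive definite, $(\bar{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{V}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}) \ge 0$. As this term appears with a minus sign, the value of $\boldsymbol{\mu}$ which maximizes the support function is the one which makes this term as small as possible. The term becomes zero by taking:

$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} \tag{10.5}$$

and we conclude that $\bar{\mathbf{x}}$ is the maximum likelihood estimator of $\boldsymbol{\mu}$. Substituting this estimator in the support function, the result depends only on $\mathbf{V}$. To obtain the maximum of the function with respect to $\mathbf{V}$, we add the constant $\frac{n}{2} \log|\mathbf{S}|$ and write the support as:

$$L(\mathbf{V} \mid \mathbf{X}) = \frac{n}{2} \log|\mathbf{V}^{-1} \mathbf{S}| - \frac{n}{2} \operatorname{tr} \mathbf{V}^{-1} \mathbf{S} \tag{10.6}$$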

This expression is useful because, written in this way, the function does not depend on the units of measurement of the variables. It is also easy to prove (see exercise 10.1) that it is invariant to non-singular linear transformations of the variables. Letting $\lambda_i$ be the eigenvalues of the matrix $\mathbf{V}^{-1} \mathbf{S}$ we have:

$$L(\mathbf{V} \mid \mathbf{X}) = \frac{n}{2} \sum \log \lambda_i - \frac{n}{2} \sum \lambda_i = \frac{n}{2} \sum (\log \lambda_i - \lambda_i).$$

This expression shows that the support is a sum of functions of the form $\log x - x$. Taking the derivative of this function with respect to $x$, it is clear that the function has a maximum at $x = 1$. Therefore, $L(\mathbf{V} \mid \mathbf{X})$ is maximum if all the eigenvalues of $\mathbf{V}^{-1} \mathbf{S}$ are equal to one, which, in turn, implies that $\mathbf{V}^{-1} \mathbf{S} = \mathbf{I}$. This is achieved by taking

$$\hat{\mathbf{V}} = \mathbf{S} \tag{10.7}$$

as the maximum likelihood estimator of $\mathbf{V}$.
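The maximization step can be written out explicitly. The symbol $h$ below is introduced only for this remark and does not appear elsewhere in the chapter:

```latex
h(x) = \log x - x, \qquad
h'(x) = \frac{1}{x} - 1 = 0 \;\Longrightarrow\; x = 1, \qquad
h''(x) = -\frac{1}{x^{2}} < 0 \quad (x > 0).
```

Since $h$ is strictly concave on $x > 0$, the point $x = 1$ is its unique maximum, with $h(1) = -1$. Each of the $p$ eigenvalues therefore contributes at most $-n/2$ to (10.6), whose maximum value is $-np/2$.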

The ML estimators of $\boldsymbol{\mu}$ and $\mathbf{V}$ are then $\bar{\mathbf{x}}$ and $\mathbf{S}$. It can be shown that, as in the univariate case, $\bar{\mathbf{x}} \sim N_p(\boldsymbol{\mu}, \frac{1}{n}\mathbf{V})$. Furthermore, $n\mathbf{S}$ follows a Wishart distribution $W_p(n-1, \mathbf{V})$. The estimator $\mathbf{S}$ is biased, but $\frac{n}{n-1}\mathbf{S}$ is an unbiased estimator of $\mathbf{V}$. These estimators have the asymptotic properties of maximum likelihood estimators: consistency, efficiency and asymptotic normality. In exercise 10.2 we present a more classical way of obtaining these estimators by taking derivatives of the support function.
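A minimal sketch of these estimators in plain Python follows. The function name `ml_estimates` and the small data set are assumptions for illustration only.

```python
def ml_estimates(xs):
    """Return (xbar, S, S_unbiased) for a list of p-dimensional observations.

    S uses the divisor n (the ML estimator); S_unbiased rescales it by n/(n-1).
    """
    n, p = len(xs), len(xs[0])
    xbar = [sum(x[j] for x in xs) / n for j in range(p)]
    S = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in xs) / n
          for j in range(p)] for i in range(p)]
    S_unbiased = [[n / (n - 1) * s for s in row] for row in S]
    return xbar, S, S_unbiased

# Hypothetical data: n = 3 observations of p = 2 variables.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
xbar, S, S_unbiased = ml_estimates(xs)
print(xbar, S, S_unbiased)
```

The two covariance estimates differ only by the factor $n/(n-1)$, so the distinction matters mainly in small samples.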

10.4 The likelihood ratio method

In this section, we will go over the general methodology for constructing tests using the likelihood ratio method and we will apply it to the case of normal populations. Often, we wish to check whether a given sample comes from a distribution with certain known parameters. For example, in quality control, samples of elements are taken and a multivariate variable is measured. We then wish to check whether the process is in a state of control, which supposes that the samples come from a normal population with fixed values of the parameters. In other cases, it is of interest to check whether or not several samples come from the same population. For example, we may want to check whether different markets are equally profitable, or whether different medicines produce similar effects. Finally, if we have based our inference on the hypothesis of normality, it is advisable to check whether or not this hypothesis is in accordance with the observed data.

In order to test hypotheses about vector parameters we can apply the theory of the likelihood ratio test. This theory provides statistical tests with certain optimal properties for large sample sizes. Given a $p$-dimensional vector parameter, $\boldsymbol{\theta}$, which takes values in $\Omega$ (where $\Omega$ is a subset of $\mathbb{R}^p$), suppose that we wish to test the hypothesis:

$$H_0 : \boldsymbol{\theta} \in \Omega_0,$$

which establishes that $\boldsymbol{\theta}$ is contained within a region $\Omega_0$ of the parameter space, versus the alternative hypothesis:

$$H_1 : \boldsymbol{\theta} \in \Omega - \Omega_0,$$

which supposes that $\boldsymbol{\theta}$ is not restricted to the region $\Omega_0$. In order to test the hypothesis, we check its ability to predict the observed data, and to do that we compare the probabilities of obtaining them under both hypotheses. To compute these probabilities we need a value for the vector parameter, which is unknown. The likelihood ratio method solves this problem by taking, under each hypothesis, the compatible value which makes the observed sample most likely. More specifically:

1. The maximum likelihood of obtaining the observed sample under $H_0$ is found as follows. If $\Omega_0$ determines a unique value for the parameters, $\boldsymbol{\theta} = \boldsymbol{\theta}_0$, then the likelihood of the sample for this $\boldsymbol{\theta}_0$ is calculated. If $\Omega_0$ permits several values, we choose the value of the parameter which maximizes the likelihood of obtaining the sample. Since the likelihood of the observed sample is proportional to the joint density of the observations, we find the likelihood function by substituting the available data in this density. By calculating the maximum of this function in $\Omega_0$, we obtain the maximum likelihood value compatible with $H_0$, which we represent by $f(H_0)$.

2. The maximum likelihood of obtaining the observed sample under $H_1$ is calculated by finding the absolute maximum of the likelihood function over the entire parameter space. Strictly speaking, it should be calculated in the set $\Omega - \Omega_0$, but it is simpler to do it over the whole space since the results are generally the same. The reason for this is that, usually, $H_0$ imposes restrictions on the parameter space, whereas $H_1$ assumes that these restrictions do not exist. The value of the likelihood function at its maximum, which corresponds to the ML estimates of the parameters, will be denoted by $f(H_1)$.

Next we compare $f(H_0)$ and $f(H_1)$. To eliminate constants and make the comparison invariant to changes in the scale of the variables, we use their quotient, which we call the likelihood ratio ($RV$):

$$RV = \frac{f(H_0)}{f(H_1)} \tag{10.8}$$

By construction $RV \le 1$, and we reject $H_0$ when $RV$ is small enough.

The region of rejection for $H_0$ will consequently be defined by:

$$RV \le a,$$

where $a$ is determined by imposing the condition that the significance level of the test be $\alpha$. To calculate the value $a$ we first need to know the distribution of $RV$ when $H_0$ is true, which tends to be quite difficult in practice. Nevertheless, when the sample size is large and $H_0$ is true, twice the difference of the supports between the alternative and the null, defined by:

$$\lambda = -2 \ln RV = 2 \left( L(H_1) - L(H_0) \right),$$

where $L(H_i) = \log f(H_i)$, $i = 0, 1$, is asymptotically distributed as a $\chi^2$ with degrees of freedom equal to the difference between the dimensions of the spaces $\Omega$ and $\Omega_0$. Intuitively, we reject $H_0$ when the support of the data for $H_1$ is significantly greater than for $H_0$. The significance of the difference is established, for large samples, with the $\chi^2$ distribution. Using the definition of the deviance, this statistic can be interpreted as the difference between the deviance for $H_0$ and for $H_1$:

$$\lambda = D(H_0) - D(H_1).$$

It frequently happens that the dimension of $\Omega$ is $p$ and the dimension of $\Omega_0$ is $p - r$, where $r$ denotes the number of linear restrictions on the vector of parameters. Thus, the number of degrees of freedom of the difference of supports, $\lambda$, is:

$$g = \operatorname{df}(\lambda) = \dim(\Omega) - \dim(\Omega_0) = p - (p - r) = r,$$

equal to the number of linear restrictions imposed by $H_0$.
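The procedure can be sketched in a few lines of Python. The helper below computes the upper-tail probability of a $\chi^2$ distribution with integer degrees of freedom using the standard finite-sum expressions, so that no external library is needed. The names `chi2_sf` and `lr_test` and the support values in the usage lines are assumptions for illustration, not part of the text.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for a positive integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite correction sum.
    chi = math.sqrt(x)
    z = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = chi if r == 1 else term * x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2)) + 2 * z * total

def lr_test(support_h1, support_h0, df):
    """Return (lambda, asymptotic p-value) with lambda = 2 (L(H1) - L(H0))."""
    lam = 2.0 * (support_h1 - support_h0)
    return lam, chi2_sf(lam, df)

# Hypothetical maximized supports under H1 and H0, with r = 2 restrictions.
lam, p_value = lr_test(-10.0, -13.0, df=2)
print(lam, p_value)
```

A small p-value indicates that the support under $H_1$ is significantly greater than under $H_0$, which is the rejection rule described above.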

10.5 Testing the mean of a normal population

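We take a sample $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ from a population $N_p(\boldsymbol{\mu}, \mathbf{V})$. We want to test the hypothesis:

$$H_0 : \boldsymbol{\mu} = \boldsymbol{\mu}_0,$$

against the alternative:

$$H_1 : \boldsymbol{\mu} \ne \boldsymbol{\mu}_0.$$

In order to construct a likelihood ratio test, we calculate the maximum of the likelihood function under $H_0$ and under $H_1$. The support function is:

$$L(\boldsymbol{\mu}, \mathbf{V} \mid \mathbf{X}) = -\frac{n}{2} \log|\mathbf{V}| - \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})' \mathbf{V}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}).$$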

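We need to obtain the ML estimators of $\boldsymbol{\mu}$ and $\mathbf{V}$ under $H_0$ and under $H_1$. From section 10.3 we know that, under $H_1$, these estimators are $\bar{\mathbf{x}}$ and $\mathbf{S}$, and substituting in (10.4) we have that the support for $H_1$ is:

$$L(H_1) = -\frac{n}{2} \log|\mathbf{S}| - \frac{np}{2}.$$

Under $H_0$ the estimator of $\boldsymbol{\mu}$ is directly $\boldsymbol{\mu}_0$, and operating on the quadratic form as we saw in section 10.3 (taking traces and using the linearity of the trace) we can write the support function as:

$$L(\mathbf{V} \mid \mathbf{X}) = -\frac{n}{2} \log|\mathbf{V}| - \frac{n}{2} \operatorname{tr} \mathbf{V}^{-1} \mathbf{S}_0 \tag{10.9}$$

where

$$\mathbf{S}_0 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}_0)(\mathbf{x}_i - \boldsymbol{\mu}_0)'. \tag{10.10}$$

If in expression (10.9) we add the constant $\frac{n}{2} \log|\mathbf{S}_0|$, we obtain an expression analogous to (10.6). Thus, we conclude that $\mathbf{S}_0$ is the ML estimator of $\mathbf{V}$ under $H_0$. Replacing $\mathbf{V}$ with $\mathbf{S}_0$ in (10.9), the support for $H_0$ is

$$L(H_0) = -\frac{n}{2} \log|\mathbf{S}_0| - \frac{np}{2}$$

and the difference of the supports is

$$\lambda = 2 \left( L(H_1) - L(H_0) \right) = n \log \frac{|\mathbf{S}_0|}{|\mathbf{S}|}. \tag{10.11}$$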

Then, we reject $H_0$ when the support for $H_1$ is significantly greater than for $H_0$. This implies that the generalized variance under $H_0$, $|\mathbf{S}_0|$, is significantly greater than under $H_1$, $|\mathbf{S}|$. The asymptotic distribution of $\lambda$ is a $\chi^2$, whose degrees of freedom are equal to the difference of the dimensions of the spaces in which the parameters move under the two hypotheses. The dimension of the parameter space under $H_0$ is $p + p(p-1)/2 = p(p+1)/2$, the number of different terms in $\mathbf{V}$, and the dimension of the parameter space under $H_1$ is $p + p(p+1)/2$. The difference is $p$, which is the number of degrees of freedom of the $\chi^2$.

In this case, we can obtain the exact distribution of the likelihood ratio without needing an asymptotic distribution. In Appendix 10.2 we prove that:

$$\frac{|\mathbf{S}_0|}{|\mathbf{S}|} = 1 + \frac{T^2}{n-1} \tag{10.12}$$

where the statistic

$$T^2 = (n-1)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)' \mathbf{S}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$$

follows a Hotelling's $T^2$ distribution with $p$ and $n-1$ degrees of freedom. Using the relationship between the $T^2$ and the $F$ distributions, we can calculate the percentiles of $T^2$. Since the difference of the supports is a monotonic function of $T^2$, we can use this statistic directly instead of the likelihood ratio, and we reject $H_0$ when $T^2$ is large enough. Observe that from (10.11) and (10.12) we can write

$$\lambda = n \log\left( 1 + \frac{T^2}{n-1} \right),$$

which is consistent with the asymptotic distribution since, for large $n$, $\log(1 + a/n) \approx a/n$, and thus $\lambda \approx T^2$, which we know has an asymptotic $\chi^2_p$ distribution.

Example: An industrial process manufactures elements whose quality characteristics are measured by a vector of three variables, $\mathbf{x}$. When the process is in the control state, the mean values of the variables must be $(12, 4, 2)$. In order to test whether the process is working properly, a sample of twenty elements is taken and their characteristics are measured. The sample mean is

$$\bar{\mathbf{x}} = (11.5,\ 4.3,\ 1.2)$$

and the covariance matrix of these three variables is

$$\mathbf{S} = \begin{pmatrix} 10 & 4 & -5 \\ 4 & 12 & -3 \\ -5 & -3 & 4 \end{pmatrix}$$

(the numerical values have been simplified for ease of calculation). If we look at each variable separately, the statistic

$$t = (\bar{x} - \mu) \sqrt{n} / \hat{s}$$

follows a Student's $t$ with $n-1$ degrees of freedom, and for the three variables we would obtain $t_1 = (11.5 - 12)\sqrt{20} / \sqrt{20 \times 10/19} = -0.69$; $t_2 = (4.3 - 4)\sqrt{20} / \sqrt{20 \times 12/19} = 0.38$; and $t_3 = (1.2 - 2)\sqrt{20} / \sqrt{20 \times 4/19} = -1.74$. Apparently, looking at each variable separately we find no significant differences between the sample means and those of the process in control, and we would conclude that there is no evidence that the process is out of control. If we now look at the differences by using Hotelling's test,

$$T^2 = 19 (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)' \mathbf{S}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}_0) = 14.52.$$

To judge the size of this statistic we use the $F$ distribution:

$$F_{3,17} = \frac{20 - 3}{3} \cdot \frac{T^2}{19} = 4.33$$

and since the 5% critical value is $F_{3,17}(0.05) = 3.20$, we reject the hypothesis that the process is under control.

In order to understand the reasons for this discrepancy between the multivariate and the univariate tests, we observe that the multivariate test takes into account the correlations between the individual discrepancies. The correlation matrix of the sample data obtained from the covariance matrix is

$$\mathbf{R} = \begin{pmatrix} 1 & 0.37 & -0.79 \\ 0.37 & 1 & -0.43 \\ -0.79 & -0.43 & 1 \end{pmatrix}$$

and shows that the correlation between the first variable and the third is negative. This means that if we observe a value below the mean in the first variable, we expect a value above the mean in the third. In the sample just the opposite happens, and this suggests a displacement of the mean of the process.
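The calculations of this example can be reproduced with the short sketch below. It assumes that the off-diagonal signs of $\mathbf{S}$ are those implied by the correlation matrix $\mathbf{R}$ (negative for the pairs (1,3) and (2,3)); the helper `solve` is an illustrative Gauss-Jordan routine, not something taken from the text.

```python
import math

def solve(a, b):
    """Solve the linear system a y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

n, p = 20, 3
mu0 = [12.0, 4.0, 2.0]
xbar = [11.5, 4.3, 1.2]
# Signs of the off-diagonal terms are an assumption based on R.
S = [[10.0, 4.0, -5.0], [4.0, 12.0, -3.0], [-5.0, -3.0, 4.0]]

d = [xb - m0 for xb, m0 in zip(xbar, mu0)]
# Hotelling's T2 = (n-1) d' S^-1 d, computed by solving S y = d.
T2 = (n - 1) * sum(di * yi for di, yi in zip(d, solve(S, d)))
F = (n - p) / p * T2 / (n - 1)
lam = n * math.log(1 + T2 / (n - 1))        # likelihood ratio statistic (10.11)
# Univariate t statistics, with s-hat^2 = n S_jj / (n - 1).
t = [d[j] * math.sqrt(n) / math.sqrt(n * S[j][j] / (n - 1)) for j in range(p)]
print(round(T2, 2), round(F, 2), round(lam, 2), [round(v, 2) for v in t])
```

None of the three univariate statistics exceeds the two-sided 5% point of a Student's $t$ with 19 degrees of freedom (about 2.09), whereas the joint statistic is comparatively large, which is the contrast the example is meant to show.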

10.6 Testing the covariance matrix of a normal population

The likelihood ratio test can be applied to test hypotheses about the covariance matrix, in a similar way to the method studied for mean vectors. We are going to look at four tests for the covariance matrix of normal variables. In the first, the null hypothesis states that this matrix takes a given fixed value. In the second, that the matrix is diagonal and the variables are uncorrelated. In the third, that the variables are in addition of equal variance, which is the sphericity test, where we assume that the covariance matrix is $\sigma^2 \mathbf{I}$. In the fourth we assume partial sphericity: the covariance matrix can be decomposed as a matrix of rank $m < p$ plus $\sigma^2 \mathbf{I}$. If $m = 0$ this test reduces to that of sphericity.

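10.6.1 Testing a specific value

Suppose that we want to test the hypothesis:

$$H_0 : \mathbf{V} = \mathbf{V}_0,$$

against the alternative:

$$H_1 : \mathbf{V} \ne \mathbf{V}_0.$$

In order to construct a likelihood ratio test, we compute the maximum of the support under $H_0$ and under $H_1$, using the expression:

$$L(\boldsymbol{\mu}, \mathbf{V} \mid \mathbf{X}) = -\frac{n}{2} \log|\mathbf{V}| - \frac{n}{2} \operatorname{tr} \mathbf{V}^{-1} \mathbf{S} - \frac{n}{2} (\bar{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{V}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}).$$

Under $H_0$, the value of $\mathbf{V}$ is $\mathbf{V}_0$, and $\boldsymbol{\mu}$ is estimated by $\bar{\mathbf{x}}$, such that:

$$L(H_0) = -\frac{n}{2} \log|\mathbf{V}_0| - \frac{n}{2} \operatorname{tr} \mathbf{V}_0^{-1} \mathbf{S}.$$

Under $H_1$, the estimators are $\bar{\mathbf{x}}$ and $\mathbf{S}$, such that, as we saw in the previous section:

$$L(H_1) = -\frac{n}{2} \log|\mathbf{S}| - \frac{np}{2}.$$

The difference of supports is

$$\lambda = 2 \left( L(H_1) - L(H_0) \right) = n \log \frac{|\mathbf{V}_0|}{|\mathbf{S}|} + n \operatorname{tr} \mathbf{V}_0^{-1} \mathbf{S} - np. \tag{10.13}$$

We see that the test consists of comparing $\mathbf{V}_0$, a theoretical value, with $\mathbf{S}$, and that the comparison is made through the determinant and the trace. The asymptotic distribution of $\lambda$ is a $\chi^2$, with degrees of freedom equal to the difference of the dimensions of the spaces in which the parameters move under the two hypotheses, which is $p(p+1)/2$, the number of different terms in $\mathbf{V}$.

In particular, this test is useful for testing whether $\mathbf{V} = \mathbf{I}$. Then the statistic (10.13) reduces to

$$\lambda = -n \log|\mathbf{S}| + n \operatorname{tr} \mathbf{S} - np.$$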
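As an illustration of the special case $\mathbf{V}_0 = \mathbf{I}$, the sketch below evaluates the statistic for the covariance matrix of the example in section 10.5, assuming the off-diagonal signs implied by its correlation matrix. The function names `det` and `lr_identity_cov` are hypothetical helpers written for this remark.

```python
import math

def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n, d = len(m), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[piv][c] == 0:
            return 0.0
        if piv != c:
            m[c], m[piv] = m[piv], m[c]
            d = -d                      # a row swap changes the sign
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return d

def lr_identity_cov(S, n):
    """Statistic for H0: V = I, lambda = -n log|S| + n tr S - n p, and its df p(p+1)/2."""
    p = len(S)
    trace = sum(S[i][i] for i in range(p))
    lam = -n * math.log(det(S)) + n * trace - n * p
    return lam, p * (p + 1) // 2

# Covariance matrix of the section 10.5 example (signs are an assumption).
S = [[10.0, 4.0, -5.0], [4.0, 12.0, -3.0], [-5.0, -3.0, 4.0]]
lam, df = lr_identity_cov(S, n=20)
print(lam, df)
```

The statistic is zero when $\mathbf{S} = \mathbf{I}$ and grows as $\mathbf{S}$ departs from the identity in either determinant or trace.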

10.6.2 Test of independence

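Another interesting test is that of independence, where we assume that the matrix $\mathbf{V}$ is diagonal. That is:

$$H_0 : \mathbf{V} = \text{diagonal}$$

compared to the alternative:

$$H_1 : \mathbf{V} \ne \text{diagonal}.$$

Under $H_0$, the maximum likelihood estimate of $\mathbf{V}$ is $\hat{\mathbf{V}}_0 = \operatorname{diag}(\mathbf{S})$, where $\operatorname{diag}(\mathbf{S})$ is a diagonal matrix whose terms $s_{ii}$ are equal to those of $\mathbf{S}$, and the statistic (10.13) reduces to

$$\lambda = n \log \frac{\prod s_{ii}}{|\mathbf{S}|} + n \operatorname{tr} \hat{\mathbf{V}}_0^{-1} \mathbf{S} - np$$

and since $\operatorname{tr} \hat{\mathbf{V}}_0^{-1} \mathbf{S} = \operatorname{tr} \hat{\mathbf{V}}_0^{-1/2} \mathbf{S} \hat{\mathbf{V}}_0^{-1/2} = \operatorname{tr} \mathbf{R} = p$, the test reduces to:

$$\lambda = -n \log|\mathbf{R}| \tag{10.14}$$

which is usually written in terms of the eigenvalues of $\mathbf{R}$. Letting $\lambda_i$ be those eigenvalues, an equivalent form of this test is:

$$\lambda = -n \sum_{i=1}^{p} \log \lambda_i$$

and its asymptotic distribution is a $\chi^2$, with degrees of freedom equal to $p(p+1)/2 - p = p(p-1)/2$.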


10.6.3 Sphericity test

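An important particular case of the above test is when all of the variables have the same variance and are uncorrelated. In this case, we gain nothing from analyzing them jointly since they have no information in common. This is equivalent to assuming that the matrix $\mathbf{V}$ is scalar, in other words $\mathbf{V} = \sigma^2 \mathbf{I}$, and the test is called a sphericity test because the level curves of the distribution of the variables are spheres: there is total symmetry in all directions of the space. The test is

$$H_0 : \mathbf{V} = \sigma^2 \mathbf{I},$$

as opposed to:

$$H_1 : \mathbf{V} \ne \sigma^2 \mathbf{I}.$$

Replacing $\mathbf{V}_0 = \sigma^2 \mathbf{I}$ in the expression for $L(H_0)$ of section 10.6.1, the support under $H_0$ is

$$L(H_0) = -\frac{np}{2} \log \sigma^2 - \frac{n}{2\sigma^2} \operatorname{tr} \mathbf{S}$$

and taking the derivative with respect to $\sigma^2$ it is immediately proven that the ML estimator is $\hat{\sigma}^2 = \operatorname{tr} \mathbf{S} / p$, the average of the variances. The support $L(H_1)$ is the same as in the above test and the difference is

$$\lambda = n \log \frac{\hat{\sigma}^{2p}}{|\mathbf{S}|} + n \operatorname{tr} \mathbf{S} / \hat{\sigma}^2 - np. \tag{10.15}$$

Then, substituting $\hat{\sigma}^2 = \operatorname{tr} \mathbf{S} / p$, the last two terms cancel and the test reduces to:

$$\lambda = np \log \hat{\sigma}^2 - n \log|\mathbf{S}|$$

which is asymptotically distributed as a $\chi^2$ with $p(p+1)/2 - 1 = (p+2)(p-1)/2$ degrees of freedom.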

10.6.4 (*) Test of partial sphericity

The fourth test we will study is called the partial sphericity test because it assumes that the covariance matrix has dependence in an $m$-dimensional space, but that in the complementary space of dimension $p - m$ sphericity holds. Thus the structure of dependence between the variables can be explained as a function of $m$ variables, as we will see in the factor model. Note that there is no need to test whether a square matrix of order $p$ has rank $m < p$ because, in this case, the matrix must have exactly $p - m$ null eigenvalues. If this is the case, we will see it by calculating the eigenvalues, since if this condition is true in the population it must be true as well in all of the samples. Nevertheless, it makes sense to test whether the matrix has $m$ relatively large eigenvalues, which correspond to $m$ informative directions, and $p - m$ small and equal eigenvalues, which correspond to the non-informative directions. This is partial sphericity. The test is:

$$H_0 : \mathbf{V} = \mathbf{B} + \sigma^2 \mathbf{I}, \quad \operatorname{rank}(\mathbf{B}) = m,$$

tested against:

$$H_1 : \mathbf{V} = \text{general}.$$

It can be proved using the same principles (see Anderson, 1963) that, letting $\lambda_i$ be the eigenvalues of $\mathbf{S}$ ordered from largest to smallest, the test statistic is:

$$\lambda = -n \sum_{i=m+1}^{p} \log \lambda_i + n (p - m) \log \frac{\sum_{j=m+1}^{p} \lambda_j}{p - m} \tag{10.16}$$

and it asymptotically follows a $\chi^2$ distribution with $(p - m + 2)(p - m - 1)/2$ degrees of freedom. We see that if $m = 0$ this test reduces to that of sphericity (10.15), and that if in addition the variables are standardized, so that $\sum_{j=1}^{p} \lambda_j = p$, the second term disappears and the test reduces to (10.14).
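Given the eigenvalues of $\mathbf{S}$, statistic (10.16) is straightforward to evaluate. The sketch below is a minimal version; the function name and the eigenvalues in the usage lines are hypothetical, and the eigenvalues are assumed to be positive.

```python
import math

def partial_sphericity_stat(eigvals, n, m):
    """Statistic (10.16) and its degrees of freedom.

    eigvals: eigenvalues of S (any order); the p - m smallest are tested for equality.
    """
    lam = sorted(eigvals, reverse=True)[m:]   # drop the m largest eigenvalues
    q = len(lam)                              # q = p - m
    stat = -n * sum(math.log(v) for v in lam) + n * q * math.log(sum(lam) / q)
    df = (q + 2) * (q - 1) // 2
    return stat, df

# Hypothetical eigenvalues: two informative directions, two equal small ones.
eigvals = [9.0, 4.0, 1.0, 1.0]
print(partial_sphericity_stat(eigvals, n=50, m=2))   # equal tail eigenvalues
print(partial_sphericity_stat(eigvals, n=50, m=0))   # plain sphericity test
```

The statistic compares the arithmetic and geometric means of the last $p - m$ eigenvalues, so it is zero when they are all equal and positive otherwise.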

10.6.5 Improving the approximation to the asymptotic distribution

The approximation of the distribution of the likelihood ratio statistic, $\lambda$, by its asymptotic distribution, $\chi^2$, when the sample size is not large can be improved by introducing correction factors. Box (1949) and Bartlett (1954) have shown that the approximation improves if, in the earlier calculations, we replace $n$ with $n_c$, where $n_c$ is less than $n$ and depends on $p$ and on the test. For example, Box (1949) proved that the test of independence improves if we replace $n$ with $n_c = n - (2p + 11)/6$. These corrections can be important if the sample size is small, $p$ is large, and the obtained statistic is close to the critical value. They will not be important, however, if $p/n$ is small and the resulting statistic is clearly conclusive in either direction. The interested reader can turn to Muirhead (1982) for further reading.

Example: Test whether we can accept that the covariance matrix of the quality measurements in the example of section 10.5 has the form $\sigma^2 \mathbf{I}$. If it does not, test whether the variables, although of different variance, are independent.

The estimate of $\sigma^2$ under the null hypothesis is $\hat{\sigma}^2 = \operatorname{tr} \mathbf{S} / p = (10 + 12 + 4)/3 = 8.67$. In addition, we find that $|\mathbf{S}| = 146$. Thus

$$\lambda = 60 \log 8.67 - 20 \log 146 = 29.92$$

which must be compared with a $\chi^2$ with $(3 + 2)(3 - 1)/2 = 5$ degrees of freedom. The value obtained is clearly significant, and thus we reject the hypothesis that the variables have the same variance and are uncorrelated.

To carry out the test of independence we transform the variables, dividing each one by its standard deviation. That is, we move to new variables $z_1 = x_1 / \sqrt{10}$, $z_2 = x_2 / \sqrt{12}$, $z_3 = x_3 / \sqrt{4}$. In order to find the new covariance matrix, letting $\mathbf{D}$ be the diagonal matrix with elements $(1/\sqrt{10},\ 1/\sqrt{12},\ 1/\sqrt{4})$, we have:

$$\mathbf{V}_z = \mathbf{D} \mathbf{V}_x \mathbf{D}' = \begin{pmatrix} 1 & 0.3651 & -0.7906 \\ 0.3651 & 1 & -0.4330 \\ -0.7906 & -0.4330 & 1 \end{pmatrix} = \mathbf{R}_x$$

and now the test is

$$\lambda = -20 \log 0.304 = 23.8$$

which we must compare with a $\chi^2$ with 3 degrees of freedom, thus clearly rejecting the hypothesis of independence.
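The two statistics of this example can be recomputed with the sketch below, which again assumes the off-diagonal signs of $\mathbf{S}$ implied by the correlation matrix. The helper names are hypothetical. Because the code uses $\hat{\sigma}^2 = 26/3$ without rounding, the sphericity statistic comes out at roughly 29.9 rather than exactly the rounded figure in the text.

```python
import math

def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n, d = len(m), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[piv][c] == 0:
            return 0.0
        if piv != c:
            m[c], m[piv] = m[piv], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return d

def sphericity_stat(S, n):
    """lambda = n p log(sigma2_hat) - n log|S|, with (p+2)(p-1)/2 degrees of freedom."""
    p = len(S)
    sigma2 = sum(S[i][i] for i in range(p)) / p
    return n * p * math.log(sigma2) - n * math.log(det(S)), (p + 2) * (p - 1) // 2

def independence_stat(S, n):
    """lambda = -n log|R|, with p(p-1)/2 degrees of freedom."""
    p = len(S)
    R = [[S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(p)] for i in range(p)]
    return -n * math.log(det(R)), p * (p - 1) // 2

# Covariance matrix of the example (off-diagonal signs are an assumption).
S = [[10.0, 4.0, -5.0], [4.0, 12.0, -3.0], [-5.0, -3.0, 4.0]]
lam_sph, df_sph = sphericity_stat(S, n=20)
lam_ind, df_ind = independence_stat(S, n=20)
print(round(lam_sph, 1), df_sph, round(lam_ind, 1), df_ind)
```

Computing $|\mathbf{R}|$ directly from $\mathbf{S}$ avoids the intermediate rounding of the correlations.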

10.7 Testing equality of several means: the Multivariate Analysis of Variance

Assume that we have observed a sample of size $n$ of a $p$-dimensional variable which can be stratified into $G$ classes or groups, so that there are $n_1$ observations in group 1, ..., $n_G$ in group $G$. An important problem is to test whether the means of the $G$ groups are equal. We will solve this by applying the likelihood ratio test. The hypothesis to be tested is:

$$H_0 : \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \ldots = \boldsymbol{\mu}_G = \boldsymbol{\mu},$$

where, additionally, $\mathbf{V}$ is a positive definite matrix, identical in all the groups. The alternative hypothesis is:

$$H_1 : \text{not all the } \boldsymbol{\mu}_i \text{ are equal},$$

with the same conditions for $\mathbf{V}$.

The likelihood function under $H_0$ for a homogeneous normal sample was calculated in section 10.3, and we know that its maximum is reached when $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ and $\hat{\mathbf{V}} = \mathbf{S}$. Substituting these estimates in the support function we have that

$$L(H_0) = -\frac{n}{2} \log|\mathbf{S}| - \frac{np}{2}. \tag{10.17}$$

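Under $H_1$, the $n$ observed vectors are subdivided into $n_1$ from group 1, ..., $n_G$ from group $G$. The likelihood function under $H_1$ is

$$f(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_G, \mathbf{V} \mid \mathbf{X}) = |\mathbf{V}|^{-n/2} (2\pi)^{-np/2} \exp\left\{ -\frac{1}{2} \sum_{g=1}^{G} \sum_{h=1}^{n_g} (\mathbf{x}_{hg} - \boldsymbol{\mu}_g)' \mathbf{V}^{-1} (\mathbf{x}_{hg} - \boldsymbol{\mu}_g) \right\},$$

where $\mathbf{x}_{hg}$ is the $h$-th vector of variables of group $g$, and $\boldsymbol{\mu}_g$ is its mean. The maximization of this function in the parameter space defined by $H_1$ is carried out using the procedure studied in section 10.3. The estimate of the mean of each group is the sample mean, $\hat{\boldsymbol{\mu}}_g = \bar{\mathbf{x}}_g$, and the estimate of the common covariance matrix is obtained using:

$$\sum_{g=1}^{G} \sum_{h=1}^{n_g} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g)' \mathbf{V}^{-1} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g) = \operatorname{tr}\left( \sum_{g=1}^{G} \sum_{h=1}^{n_g} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g)' \mathbf{V}^{-1} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g) \right) = \sum_{g=1}^{G} \sum_{h=1}^{n_g} \operatorname{tr}\left( \mathbf{V}^{-1} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g)(\mathbf{x}_{hg} - \bar{\mathbf{x}}_g)' \right) = \operatorname{tr}\left( \mathbf{V}^{-1} \mathbf{W} \right)$$

where

$$\mathbf{W} = \sum_{g=1}^{G} \sum_{h=1}^{n_g} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g)(\mathbf{x}_{hg} - \bar{\mathbf{x}}_g)' \tag{10.18}$$

is the matrix of sums of squares within the groups. Substituting in the likelihood function and taking logarithms, we obtain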

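$$L(\mathbf{V} \mid \mathbf{X}) = \frac{n}{2} \log|\mathbf{V}^{-1}| - \frac{n}{2} \operatorname{tr} \mathbf{V}^{-1} \mathbf{W} / n$$

and, according to the results of section 10.3, the common covariance matrix of the groups when they have different means is estimated by:

$$\hat{\mathbf{V}} = \mathbf{S}_w = \frac{1}{n} \mathbf{W} \tag{10.19}$$

where $\mathbf{W}$ is given by (10.18). Substituting these expressions in the support function we have

$$L(H_1) = -\frac{n}{2} \log|\mathbf{S}_w| - \frac{np}{2}. \tag{10.20}$$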

The difference of supports is:

$$\lambda = n \log \frac{|\mathbf{S}|}{|\mathbf{S}_w|} \tag{10.21}$$

and we reject $H_0$ when this difference is large. This implies that the variability when a common mean is assumed, as measured by $|\mathbf{S}|$, is much greater than the variability when we allow the group means to be different, as measured by $|\mathbf{S}_w|$. Its distribution is, asymptotically, a $\chi^2_g$, where the degrees of freedom, $g$, are obtained as the difference between the dimensions of the two parameter spaces. $H_0$ determines a region $\Omega_0$ where we have to estimate the $p$ components of the vector of common means and the covariance matrix, a total of $p + p(p+1)/2$ parameters. Under the hypothesis $H_1$ we have to estimate $G$ vectors of means as well as the covariance matrix, which entails $Gp + p(p+1)/2$ parameters. The difference is $g$:

$$g = \dim(\Omega) - \dim(\Omega_0) = p(G - 1) \tag{10.22}$$

which will be the degrees of freedom of the asymptotic distribution.

The $\chi^2_g$ approximation to the distribution of the likelihood ratio statistic can be improved for small sample sizes. It can be proved that the statistic:

$$\lambda_0 = m \log \frac{|\mathbf{S}|}{|\mathbf{S}_w|}, \tag{10.23}$$

where

$$m = (n - 1) - (p + G)/2,$$

asymptotically follows a $\chi^2_g$ distribution, where $g$ is given by (10.22), and that in small samples the approximation is better than the one obtained by taking $m = n$.

The multivariate analysis of variance

This test is the multivariate generalization of the analysis of variance and can also be arrived at in the following way. Let the total variability of the data be:

$$\mathbf{T} = \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})', \tag{10.24}$$

which measures the deviations with respect to a common mean. We are going to decompose the matrix $\mathbf{T}$ as the sum of two matrices. The first, $\mathbf{W}$, is the within-groups variability, or matrix of deviations with respect to the mean of each group, and is given by (10.18). The second measures the between-groups variability, explained by the differences between the means, and will be denoted by $\mathbf{B}$. This decomposition generalizes the classic decomposition of the analysis of variance to the vector case. In order to obtain it we add and subtract the group means in the expression for $\mathbf{T}$:

$$\mathbf{T} = \sum_{g=1}^{G} \sum_{h=1}^{n_g} (\mathbf{x}_{hg} - \bar{\mathbf{x}}_g + \bar{\mathbf{x}}_g - \bar{\mathbf{x}})(\mathbf{x}_{hg} - \bar{\mathbf{x}}_g + \bar{\mathbf{x}}_g - \bar{\mathbf{x}})'$$

and expanding this expression it is easy to see that the cross-products cancel and the result is:

$$\mathbf{T} = \mathbf{B} + \mathbf{W}, \tag{10.25}$$

where $\mathbf{T}$ is given by (10.24), $\mathbf{W}$ by (10.18) and $\mathbf{B}$ is calculated by:

$$\mathbf{B} = \sum_{g=1}^{G} n_g (\bar{\mathbf{x}}_g - \bar{\mathbf{x}})(\bar{\mathbf{x}}_g - \bar{\mathbf{x}})'.$$

The decomposition (10.25) can be expressed as

Total Variability (T) = Explained Variability (B) + Residual Variability (W)

which is the usual decomposition of the analysis of variance.

In order to test whether the means are equal we can compare the size of the matrices $\mathbf{T}$ and $\mathbf{W}$. The measure of their size is the determinant, and the test is based on the ratio $|\mathbf{T}| / |\mathbf{W}|$. The exact distribution of this ratio was studied by Wilks and can be approximated by an $F$ distribution (see Appendix 10.4). For moderate sample sizes, the test is similar to the likelihood ratio test (10.23), which can also be written as:

$$\lambda_0 = m \log \frac{|\mathbf{T}|}{|\mathbf{W}|} = m \log \frac{|\mathbf{W} + \mathbf{B}|}{|\mathbf{W}|} = m \log|\mathbf{I} + \mathbf{W}^{-1} \mathbf{B}| \tag{10.26}$$

For the calculation of (10.26), since $|\mathbf{I} + \mathbf{A}| = \prod (1 + \lambda_i)$, where $\lambda_i$ are the eigenvalues of $\mathbf{A}$, this statistic reduces to

$$\lambda_0 = m \sum \log(1 + \lambda_i)$$

where $\lambda_i$ are the eigenvalues of the matrix $\mathbf{W}^{-1} \mathbf{B}$.
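The decomposition and the test statistic can be illustrated on a toy data set. Everything in the sketch below (the data, the helper names, the choice $p = 2$, $G = 2$) is a hypothetical example, and the correction factor is taken as $m = (n - 1) - (p + G)/2$, which is an assumption about the intended form of the factor in (10.23).

```python
import math

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def outer_sum(vectors, center, weight=1.0):
    """Sum of weight * (v - center)(v - center)' over the given vectors."""
    p = len(center)
    acc = [[0.0] * p for _ in range(p)]
    for v in vectors:
        d = [v[j] - center[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                acc[i][j] += weight * d[i] * d[j]
    return acc

def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

# Hypothetical data: G = 2 groups observed on p = 2 variables.
groups = [
    [[1.0, 2.0], [2.0, 3.5], [3.0, 3.0], [2.5, 2.5]],
    [[4.0, 5.0], [5.0, 4.5], [6.0, 6.5]],
]
all_x = [x for g in groups for x in g]
n, p, G = len(all_x), 2, len(groups)
xbar = mean(all_x)

T = outer_sum(all_x, xbar)                      # total variability (10.24)
W = [[0.0] * p for _ in range(p)]               # within groups (10.18)
B = [[0.0] * p for _ in range(p)]               # between groups
for g in groups:
    W = add(W, outer_sum(g, mean(g)))
    B = add(B, outer_sum([mean(g)], xbar, weight=len(g)))

m = (n - 1) - (p + G) / 2                       # assumed correction factor
lam0 = m * math.log(det2(T) / det2(W))          # statistic (10.26)
df = p * (G - 1)
print(lam0, df)
```

Since $\mathbf{B}$ is positive semidefinite, $|\mathbf{T}| \ge |\mathbf{W}|$ and the statistic is never negative; it is zero only when all the group means coincide.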

Example: We are going to apply this test to see whether detectable differences can be observed, in small samples, between the body measurements of men and women in the MEDIFIS data of Table A.5. In the sample there are 15 women (sex variable = 0) and 12 men (sex = 1). The first step in the analysis is to calculate the means and covariance matrices of each group separately, and of the whole data set. The following table shows the mean of each variable for the whole sample and for the groups of women and men.

| | ht | wt | ftl | arml | bwth | crd | kn-al |
|---|---|---|---|---|---|---|---|
| total | 168.78 | 63.89 | 38.98 | 73.46 | 45.85 | 57.24 | 43.09 |
| women | 161.73 | 55.60 | 36.83 | 70.03 | 43.33 | 56.63 | 41.06 |
| men | 177.58 | 74.25 | 41.67 | 77.75 | 49.00 | 58.00 | 45.62 |

The covariance matrices, dividing by $n - 1$, for women, for men and for the whole sample are given below (only the lower triangle of each symmetric matrix is shown).

For women:

$$\hat{S}_M = \begin{bmatrix}
37.64 \\
22.10 & 80.40 \\
6.38 & 7.36 & 1.92 \\
15.65 & 12.94 & 3.06 & 7.41 \\
9.49 & 14.39 & 1.49 & 3.99 & 9.42 \\
2.75 & 7.20 & 0.76 & 1.17 & 2.559 & 2.94 \\
9.02 & 9.31 & 1.98 & 4.53 & 1.12 & 0.95 & 3.78
\end{bmatrix}$$


For men:

$$\hat{S}_H = \begin{bmatrix}
45.53 \\
48.84 & 74.20 \\
9.48 & 9.63 & 2.79 \\
14.34 & 19.34 & 2.09 & 12.57 \\
14.86 & 19.77 & 3.23 & 6.18 & 6.77 \\
9.45 & 9.90 & 1.86 & 2.36 & 3.02 & 3.13 \\
8.92 & 5.23 & 2.31 & 1.21 & 1.84 & 2.63 & 6.14
\end{bmatrix}$$



and for the set of men and women together, we calculate it as a weighted mean of these two matrices,

$$\hat{S}_T = (14\hat{S}_M + 11\hat{S}_H)/25,$$

which gives us

$$\hat{S}_T = \begin{bmatrix}
41.11 \\
33.86 & 77.67 \\
7.476 & 8.36 & 2.30 \\
15.07 & 15.76 & 2.63 & 9.68 \\
11.85 & 16.76 & 2.25 & 4.95 & 8.25 \\
5.70 & 8.390 & 1.24 & 1.70 & 2.76 & 3.03 \\
8.98 & 7.52 & 2.13 & 3.07 & 1.44 & 1.70 & 4.82
\end{bmatrix}$$


We are going to calculate the likelihood ratio as a quotient of the average variabilities under the two hypotheses. Under $H_0$, the covariance matrix $S$ estimated assuming a common mean leads to the average variability

$$VP(H_0) = |S|^{1/7} = 5.77,$$

whereas under $H_1$

$$VP(H_1) = |S_w|^{1/7} = 4.67,$$

and thus the test statistic is

$$27\,\big((27-1) - (7+7)/2\big)\log(5.77/4.67) = 108.5,$$

which must be compared with a $\chi^2$ distribution with 7 degrees of freedom; there is no doubt that the differences are significant.
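The weighted mean used in this example to combine the two group covariance matrices can be checked on the diagonal entries. The sketch below copies the variances printed in the example; the helper name `pooled` is hypothetical, and the weights $n_g - 1$ follow the formula $(14\hat{S}_M + 11\hat{S}_H)/25$.

```python
import numpy as np

# Diagonal entries (variances) of the women's, men's and combined matrices
# as printed in the example, in the order ht, wt, ftl, arml, bwth, crd, kn-al.
var_women = np.array([37.64, 80.40, 1.92, 7.41, 9.42, 2.94, 3.78])
var_men = np.array([45.53, 74.20, 2.79, 12.57, 6.77, 3.13, 6.14])
var_pooled_printed = np.array([41.11, 77.67, 2.30, 9.68, 8.25, 3.03, 4.82])

n_women, n_men = 15, 12

def pooled(s_a, n_a, s_b, n_b):
    """Weighted mean of two covariance estimates with weights n - 1."""
    return ((n_a - 1) * s_a + (n_b - 1) * s_b) / (n_a + n_b - 2)

var_pooled = pooled(var_women, n_women, var_men, n_men)

# Largest absolute gap between the recomputed and the printed values.
max_gap = float(np.max(np.abs(var_pooled - var_pooled_printed)))
```

The recomputed variances agree with the printed ones to within the rounding of the two-decimal inputs. The same function applies unchanged to the full matrices.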

10.8 Tests for outliers

The test of equality of means can be applied, as a particular case, to test whether an observation in a sample of normal data is an outlier. The null hypothesis is that all of the data come from the same normal population. The alternative hypothesis is that the suspicious observation has been generated by another, unknown population. In order to characterize this alternative population we can assume (1) that the mean is different, or (2) that the mean is the same and the variance is different. If we assume that both the mean and the covariance matrix are different we have an identification problem, as it is impossible to estimate both the mean and the variability from a single observation. It can be proved that the tests which assume a different mean or a different variance are similar (see Peña and Guttman, 1993), and here we will look at the simplest case of a different mean with the same covariance matrix. In order to apply this test to a suspicious observation, $x_i$, we establish:

$$H_0: E(x_i) = \mu$$

compared to

$$H_1: E(x_i) = \mu_i \neq \mu.$$

The likelihood function under $H_0$ is (10.17). Under $H_1$, since the estimate of $\mu_i$ is $x_i$, the estimate of the covariance matrix is

$$S_{(i)} = \frac{1}{n-1} W_{(i)},$$

where

$$W_{(i)} = \sum_{h=1,\, h \neq i}^{n} (x_h - \bar{x}_{(i)})(x_h - \bar{x}_{(i)})'$$

is the matrix of sums of squares of the residuals and $\bar{x}_{(i)}$ is the mean of the observations. In both cases, the observation $x_i$ is eliminated.

The difference in supports is, particularizing (10.26):

$$\lambda = n \log\frac{|T|}{|W_{(i)}|}$$

and it is shown in Appendix 10.3 that this ratio satisfies

$$\frac{|T|}{|W_{(i)}|} = 1 + \frac{1}{n} D^2(x_i, \bar{x}_{(i)}),$$

where

$$D^2(x_i, \bar{x}_{(i)}) = (x_i - \bar{x}_{(i)})' S_{(i)}^{-1} (x_i - \bar{x}_{(i)}) \tag{10.27}$$

is the Mahalanobis distance between the observation and the mean of the data computed without it. Thus, in order to carry out the test we calculate the Mahalanobis distance (10.27), which for large samples is distributed as a $\chi^2_p$ if $H_0$ is true, since $n \log(1 + D^2/n) \approx D^2$ when $n$ is large.

In practice, when we want to detect outliers we calculate the distances $D^2(x_i, \bar{x}_{(i)})$ and compare their maximum with the 0.95 or 0.99 percentile from the tables of the maximum of a $\chi^2_p$. The problem is that if there is more than one outlier the power of the test can be quite low, as the parameter estimates may have been contaminated. A more advisable procedure whenever we work with samples that might be heterogeneous is to first identify all suspicious observations using the procedures indicated in Chapter 3, and then test them one by one. That is, we order all the suspicious data by $D^2(x_i, \bar{x}_{(i)})$ and test whether the closest observation can be incorporated into the sample. If this incorporation is rejected, the procedure terminates and all the suspicious data are declared outliers. Otherwise, the observation is incorporated into the sample, the parameters and Mahalanobis distances are recalculated, and the procedure is repeated with the remaining excluded observations.
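The leave-one-out distance (10.27) and the determinant identity from Appendix 10.3 can be illustrated with a short numerical sketch. The function name `leave_one_out_distance` and the simulated data are assumptions made for illustration; $S_{(i)}$ uses the divisor $n-1$ given in the text.

```python
import numpy as np

def leave_one_out_distance(X, i):
    """Return (D2, det_ratio) for observation i of the n x p data matrix X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rest = np.delete(X, i, axis=0)            # sample without observation i
    mean_rest = rest.mean(axis=0)
    dev = rest - mean_rest
    W_i = dev.T @ dev                         # W_(i)
    S_i = W_i / (n - 1)                       # S_(i), divisor as in the text
    d = X[i] - mean_rest
    D2 = float(d @ np.linalg.solve(S_i, d))   # Mahalanobis distance (10.27)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered
    log_ratio = np.linalg.slogdet(T)[1] - np.linalg.slogdet(W_i)[1]
    return D2, float(np.exp(log_ratio))       # second value is |T| / |W_(i)|

# Simulated sample (an assumption): 40 normal points, the first one shifted.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
X[0] += 6.0
n = X.shape[0]

D2_all = np.array([leave_one_out_distance(X, i)[0] for i in range(n)])
D2_0, ratio_0 = leave_one_out_distance(X, 0)
# ratio_0 equals 1 + D2_0 / n up to rounding, and D2_all is largest at index 0.
```

In a real analysis the largest distance would be compared with the tabulated percentile of the maximum of a $\chi^2_p$, which this sketch does not compute.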


10.9 Normality Tests

The most often used methods in multivariate analysis assume joint normality in the observations and, when we have enough data, this hypothesis should be tested.

Univariate normality

The normality of univariate distributions can be tested using the $\chi^2$, Kolmogorov-Smirnov or Shapiro-Wilk tests, or with tests based on the coefficients of asymmetry and kurtosis,

$$A = \frac{m_3}{m_2^{3/2}}; \qquad K = \frac{m_4}{m_2^2},$$

where

$$m_h = \frac{1}{n} \sum (x_i - \bar{x})^h.$$

Asymptotically, with normal data, it can be proved that:

$$A \sim N(0; 6/n); \qquad K \sim N(3; 24/n)$$

and thus the variable

$$X^2 = \frac{nA^2}{6} + \frac{n(K-3)^2}{24}$$

will be distributed, if the hypothesis is true, as a $\chi^2$ with 2 degrees of freedom. We reject the hypothesis of normality if $X^2 > \chi^2_2(\alpha)$.
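As an illustration, the statistic $X^2$ can be computed directly from the moments $m_h$. The sketch below is not taken from the text: the function name `skew_kurtosis_statistic` and the two example samples are assumptions. The critical value uses the fact that a $\chi^2$ with 2 degrees of freedom has survival function $e^{-c/2}$, so its 0.95 percentile is $-2\ln 0.05$.

```python
import math
import numpy as np

def skew_kurtosis_statistic(x):
    """Return (A, K, X2) computed from the central moments m_h with divisor n."""
    x = np.asarray(x, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2, m3, m4 = (np.mean(dev**h) for h in (2, 3, 4))
    A = m3 / m2**1.5                  # coefficient of asymmetry
    K = m4 / m2**2                    # coefficient of kurtosis
    X2 = n * A**2 / 6 + n * (K - 3)**2 / 24
    return float(A), float(K), float(X2)

# 0.95 percentile of a chi-square with 2 df: P(X > c) = exp(-c / 2).
CHI2_2DF_95 = -2 * math.log(0.05)

# A small symmetric sample: A is zero by symmetry.
A_sym, K_sym, X2_sym = skew_kurtosis_statistic([-2.0, -1.0, 0.0, 1.0, 2.0])

# A clearly non-normal simulated sample (exponential data, an assumption).
rng = np.random.default_rng(0)
A_exp, K_exp, X2_exp = skew_kurtosis_statistic(rng.exponential(size=500))
reject_normality = X2_exp > CHI2_2DF_95
```

Since the normal approximations for $A$ and $K$ are asymptotic, the result for the five-point sample is only a check of the arithmetic, not a meaningful test.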

Multivariate normality

Multivariate normality implies normality of the univariate marginal distributions, but marginal normality does not guarantee multivariate normality of the data. There are several possible tests of joint normality, and here we will only comment on the multivariate generalization of the asymmetry and kurtosis tests (see Justel, Peña and Zamar (1997) for a generalization of the Kolmogorov-Smirnov test to the multivariate case).

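Defining the multivariate asymmetry and kurtosis coefficients as in Section 3.6:

$$A_p = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^3, \qquad K_p = \frac{1}{n} \sum_{i=1}^{n} d_{ii}^2,$$

where $d_{ij} = (x_i - \bar{x})' S^{-1} (x_j - \bar{x})$, it can be proved that asymptotically:

$$nA_p/6 \sim \chi^2_f \quad \text{with} \quad f = \frac{1}{6}\, p(p+1)(p+2),$$

$$K_p \sim N(p(p+2);\ 8p(p+2)/n).$$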

The power of this test is not high unless we have a very large sample size.
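The multivariate coefficients can be computed from the matrix of the $d_{ij}$. The following sketch uses assumed names (`mardia_coefficients`) and simulated normal data. It also assumes that $S$ is computed with divisor $n$, since the definition in Section 3.6 is not reproduced here; with a divisor of $n-1$ the values change slightly.

```python
import numpy as np

def mardia_coefficients(X):
    """Return (A_p, K_p, skew_stat, dof, kurt_z) for the n x p data matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                   # divisor n: an assumption
    D = centered @ np.linalg.solve(S, centered.T)   # D[i, j] = d_ij
    A_p = float(np.sum(D**3) / n**2)                # multivariate asymmetry
    K_p = float(np.mean(np.diag(D)**2))             # multivariate kurtosis
    skew_stat = n * A_p / 6                         # approx. chi-square, dof df
    dof = p * (p + 1) * (p + 2) / 6
    # Standardized kurtosis, approximately N(0, 1) under normality.
    kurt_z = (K_p - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)
    return A_p, K_p, skew_stat, dof, float(kurt_z)

# Simulated trivariate normal sample (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
A_p, K_p, skew_stat, dof, kurt_z = mardia_coefficients(X)
```

For $p = 3$ the skewness statistic is referred to a $\chi^2$ with 10 degrees of freedom, and $K_p$ should be close to $p(p+2) = 15$ for normal data.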

Two frequent cases in practice in which the hypothesis of joint normality is rejected are:

(1) The marginal distributions are approximately symmetric, and the relationships between the variables are linear, but there are atypical values that cannot be explained using the hypothesis of normality. In this case, if we eliminate (or exclude using a robust estimator) the outliers, the joint normality is not rejected and the methods based on normality can yield good results.

(2) Some of the marginal distributions are asymmetric and there are non-linear relationships between the variables. A simple solution, and one which works well in many cases is to transform the variables in order to obtain symmetry and linear relationships.

10.9.1 Transformations

For scalar variables Box and Cox (1964) suggested the following family of transformations for obtaining normality:

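$$x^{(\lambda)} = \begin{cases} \dfrac{(x+m)^{\lambda} - 1}{\lambda} & (\lambda \neq 0)\ \ (x > -m) \\[2ex] \ln(x+m) & (\lambda = 0)\ \ (m > 0) \end{cases}$$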

where $\lambda$ is the transformation parameter, which is estimated from the data, and the constant $m$ is chosen so that $x + m$ is always positive. Thus $m$ will be zero if the data are all positive; otherwise it must exceed the absolute value of the most negative value observed. Assuming $m = 0$, this family includes the logarithmic, square root and inverse transformations. When $\lambda > 1$, the transformation produces a greater separation or dispersion of the large values of $x$, which is more pronounced the greater the value of $\lambda$, whereas when $\lambda < 1$ the effect is the opposite: the larger values of $x$ tend to concentrate and the smaller values ($x < 1$) to disperse.
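A short sketch makes the special cases of the family concrete. The function name `box_cox` and the sample values are assumptions for illustration. Up to a linear rescaling, $\lambda = 1/2$ gives the square root and $\lambda = -1$ the inverse, while the logarithm is the limit as $\lambda \to 0$.

```python
import numpy as np

def box_cox(x, lam, m=0.0):
    """Box-Cox transformation of x + m with parameter lam."""
    shifted = np.asarray(x, dtype=float) + m
    if np.any(shifted <= 0):
        raise ValueError("x + m must be strictly positive")
    if lam == 0:
        return np.log(shifted)
    return (shifted**lam - 1.0) / lam

x = np.array([0.25, 1.0, 4.0, 9.0])

sqrt_like = box_cox(x, 0.5)       # equals 2 * (sqrt(x) - 1)
inverse_like = box_cox(x, -1.0)   # equals 1 - 1 / x
log_case = box_cox(x, 0.0)        # equals ln(x)
near_log = box_cox(x, 1e-8)       # close to ln(x): the lam -> 0 limit
```

Every member of the family maps $x = 1$ to 0, which is why values below and above 1 behave differently as $\lambda$ changes.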

These transformations are very useful for marginal distributions. In order to study how to determine the value of $\lambda$ for a scalar variable, we assume that $m = 0$ and that there is a value of $\lambda$ which transforms the variable to normality. The relationship between the model for the original data, $x$, and for the transformed data, $x^{(\lambda)}$, is:

$$f(x) = f(x^{(\lambda)}) \left| \frac{dx^{(\lambda)}}{dx} \right|, \tag{10.28}$$

and as:

$$\frac{dx^{(\lambda)}}{dx} = \frac{\lambda x^{\lambda-1}}{\lambda} = x^{\lambda-1}$$


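and assuming that $x^{(\lambda)}$ is $N(\mu, \sigma^2)$ for a certain value of $\lambda$, the density function of the original variable is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(\frac{x^{\lambda}-1}{\lambda} - \mu\right)^2} x^{\lambda-1}.$$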

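Therefore, the joint density function of $X = (x_1, \ldots, x_n)$, due to the independence of the observations, will be:

$$f(X) = \frac{1}{\sigma^n \left(\sqrt{2\pi}\right)^n} \left( \prod_{i=1}^{n} x_i^{\lambda-1} \right) e^{-\frac{1}{2\sigma^2} \sum \left(\frac{x_i^{\lambda}-1}{\lambda} - \mu\right)^2} \tag{10.29}$$

and the support function is: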

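$$L(\lambda; \mu, \sigma^2) = -\frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi + (\lambda-1)\sum \ln x_i - \frac{1}{2\sigma^2}\sum\left(\frac{x_i^{\lambda}-1}{\lambda} - \mu\right)^2.$$

To maximize this function for fixed $\lambda$, we take the derivatives with respect to $\mu$ and $\sigma^2$ and set them to zero. The values which maximize the likelihood (or the support) are:

$$\hat{\sigma}^2(\lambda) = \frac{1}{n}\sum\left(x_i^{(\lambda)} - \hat{\mu}(\lambda)\right)^2, \qquad \hat{\mu}(\lambda) = \bar{x}^{(\lambda)} = \frac{\sum x_i^{(\lambda)}}{n} = \frac{1}{n}\sum\frac{x_i^{\lambda}-1}{\lambda}.$$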

By substituting these values in the likelihood we obtain what is called the concentrated likelihood function in λ. Disregarding the constants, its expression is:

$$L(\lambda) = -\frac{n}{2}\ln\hat{\sigma}(\lambda)^2 + (\lambda-1)\sum \ln x_i \tag{10.30}$$

The procedure for obtaining $\hat{\lambda}$ consists of calculating $L(\lambda)$ for different values of $\lambda$. The value which maximizes this function is the ML estimator of the transformation.
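The grid procedure just described can be sketched as follows, assuming $m = 0$. The function names (`box_cox`, `concentrated_support`, `estimate_lambda`), the grid and the simulated lognormal sample are all assumptions for illustration. For lognormal data the logarithm is the normalizing transformation, so the estimate should fall near $\lambda = 0$.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation with m = 0; values of lam near 0 use the log."""
    if abs(lam) < 1e-8:
        return np.log(x)
    return (x**lam - 1.0) / lam

def concentrated_support(x, lam):
    """Concentrated support L(lam) of (10.30), constants disregarded."""
    x = np.asarray(x, dtype=float)
    z = box_cox(x, lam)
    sigma2_hat = np.mean((z - z.mean())**2)   # ML variance of transformed data
    return -0.5 * x.size * np.log(sigma2_hat) + (lam - 1.0) * np.sum(np.log(x))

def estimate_lambda(x, grid):
    """Evaluate L(lam) on the grid and return (maximizer, all values)."""
    values = np.array([concentrated_support(x, lam) for lam in grid])
    return float(grid[int(np.argmax(values))]), values

# Simulated lognormal sample (an assumption): ln(x) is exactly normal.
rng = np.random.default_rng(2)
x = np.exp(rng.normal(loc=1.0, scale=0.8, size=2000))

grid = np.linspace(-2.0, 2.0, 81)             # step 0.05
lam_hat, values = estimate_lambda(x, grid)
```

A grid search is only one option. The resolution of the estimate is limited by the grid step, so a finer grid or a one-dimensional optimizer could be used around the maximum.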

For multivariate normality we assume the existence of a vector of parameters $\lambda = (\lambda_1, \ldots, \lambda_p)$ which produces multivariate normality, where $\lambda_j$ is the transformation applied to component $j$ of the vector. Applying an analysis similar to that of the univariate case, the concentrated multivariate support function in the vector of transformation parameters is:

$$L(\lambda) = -\frac{n}{2}\ln\left|\hat{\Sigma}\right| + \sum_{j=1}^{p}\left[(\lambda_j - 1)\sum_{i=1}^{n}\ln x_{ij}\right],$$

where the parameters have been estimated by applying the usual formulas to the transformed data:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i^{(\lambda)},$$
