The likelihood function

Statistical Inference

In many problems, we may be interested in estimating the parameters of a distribution, f(x|θ), or in testing a hypothesis about these parameters. Various approaches are available:

• Classical inference

• Fiducial and likelihood based inference

• Bayesian inference

All of these approaches are based on the use of the likelihood function.

Statistics and Probability

The likelihood function

Suppose that x₁, …, x_n is a sample from some density f(x|θ). Then the likelihood function is defined to be

$$l(\theta \mid x) = f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta).$$

Example 25

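Suppose that we take a sample of size n from the normal distribution, X ∼ N(µ, σ²). Then

$$
\begin{aligned}
l(\mu, \sigma \mid x) &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right) \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - \mu)^2\right) \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \left[\sum_{i=1}^{n} (x_i - \bar{x})^2 + 2\sum_{i=1}^{n} (x_i - \bar{x})(\bar{x} - \mu) + n(\bar{x} - \mu)^2\right]\right) \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \left[(n - 1)s^2 + n(\bar{x} - \mu)^2\right]\right),
\end{aligned}
$$

where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the sample variance and the cross term vanishes because $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.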

In classical inference in particular, it is usual to work with the log-likelihood function, S(θ|x) = log l(θ|x).

Here, we have

$$S(\mu, \sigma \mid x) = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\left[(n - 1)s^2 + n(\bar{x} - \mu)^2\right].$$
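As a quick numerical check of the normal log-likelihood, the following sketch evaluates it in two ways: by summing the log density over the observations, and through the summary form that uses only n, the sample mean and the sample variance. The data values and the function names `loglik_direct` and `loglik_sufficient` are made up for illustration.

```python
import math
from statistics import mean, variance


def loglik_direct(xs, mu, sigma):
    """Sum the log of the N(mu, sigma^2) density over the sample."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(sigma)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )


def loglik_sufficient(xs, mu, sigma):
    """Same log-likelihood written in terms of n, the sample mean and s^2."""
    n = len(xs)
    xbar = mean(xs)
    s2 = variance(xs)  # sample variance with divisor n - 1
    return (
        -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma)
        - ((n - 1) * s2 + n * (xbar - mu) ** 2) / (2 * sigma ** 2)
    )


sample = [4.1, 5.3, 3.8, 6.0, 5.2]  # hypothetical data, for illustration only
print(loglik_direct(sample, mu=5.0, sigma=1.0))
print(loglik_sufficient(sample, mu=5.0, sigma=1.0))
```

The two printed values should agree up to floating-point rounding, which reflects the fact that the likelihood depends on the data only through the sample mean and s².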

Classical inference

Classical statistical inference is based on the frequency definition of probability.

Under this approach, the unknown parameters of interest, θ, are assumed to be fixed, and therefore inference is based on functions of the data, T = g(X), which have good properties under repeated sampling.

We shall consider the basic approaches to three problems:

• point estimation

• interval estimation

• hypothesis testing

Classical point estimation

How should we choose an estimate for θ given a sample, x? A reasonable criterion is that the estimator, T = g(X), has good sampling properties, for example that it is unbiased, i.e.

E[T] = θ

This implies that if we carry out an experiment N times under the same conditions, then the sample estimates $t_1, t_2, \ldots, t_N$ will have a mean $\bar{t} \to \theta$ as $N \to \infty$.

Example 26

We have seen that, for a sample from the normal distribution, $\bar{X}$ and $S^2$ are unbiased estimators of µ and σ², respectively.

Unbiasedness on its own is not necessarily a sufficient criterion. If an estimator has a very large variance, then it is quite probable that for any given sample, the sample estimate may be very far away from θ. Thus, a more typical criterion for choosing an estimator is to consider the mean square error

m.s.e. = E[(T − θ)²].

Note that we have

$$\text{m.s.e.} = E\left[(T - E[T] + E[T] - \theta)^2\right] = V[T] + (E[T] - \theta)^2,$$

where the cross term vanishes because E[T − E[T]] = 0,

so we can see that this idea combines the criteria of low variance and low bias.

Example 27

In the case of the normal distribution, $\bar{X}$ is unbiased for µ, so its m.s.e. as an estimator of µ is just its variance, $V[\bar{X}] = \sigma^2/n$.
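The decomposition of the mean square error into variance plus squared bias can be illustrated by simulation. The sketch below repeatedly draws normal samples, computes the sample mean each time, and compares the empirical m.s.e. with variance plus squared bias and with σ²/n. The parameter values, the seed and the function name `mse_decomposition` are arbitrary choices made for this illustration.

```python
import random
from statistics import fmean, pvariance


def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) for a list of estimates of theta."""
    mse = fmean((t - theta) ** 2 for t in estimates)
    var = pvariance(estimates)        # divisor N: spread about the mean estimate
    bias = fmean(estimates) - theta
    return mse, var, bias


rng = random.Random(0)                # fixed seed so the run is repeatable
mu, sigma, n, reps = 1.0, 2.0, 10, 20000   # illustrative values only

# One sample mean per simulated experiment
xbars = [fmean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(reps)]

mse, var, bias = mse_decomposition(xbars, mu)
print("empirical m.s.e.     :", mse)
print("variance + bias^2    :", var + bias ** 2)
print("theoretical sigma^2/n:", sigma ** 2 / n)
```

The first two numbers coincide up to rounding, because the identity also holds for the empirical distribution of the estimates; the third is only approximated, with an error that shrinks as the number of replications grows.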

Maximum likelihood estimation

The most popular way of choosing an estimator is to use the maximum likelihood approach.

The maximum likelihood estimator, $\hat{\theta}$, is defined so that $l(\hat{\theta} \mid X) \ge l(\theta \mid X)$ for all θ.

This can (usually) be derived by differentiating the log-likelihood function.

Example 28

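In the normal case, we have

$$S(\mu, \sigma \mid x) = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\left[(n - 1)s^2 + n(\bar{x} - \mu)^2\right]$$

$$\frac{\partial S}{\partial \mu} = \frac{n}{\sigma^2}(\bar{x} - \mu)$$

$$\frac{\partial S}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left[(n - 1)s^2 + n(\bar{x} - \mu)^2\right]$$

and setting both derivatives to zero to find the maximum gives $\hat{\mu} = \bar{x}$ and $\hat{\sigma} = \sqrt{\frac{(n-1)s^2}{n}}$, so that $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, which is a biased estimate of σ².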

Maximum likelihood estimators have very good properties for large samples: under regularity conditions, as n → ∞, they are asymptotically unbiased and attain the minimum possible variance, and hence the minimum m.s.e.
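For the normal example, the closed-form maximum likelihood estimates can be checked numerically. The sketch below computes them for a small made-up data set and confirms that none of a few nearby parameter values gives a larger log-likelihood. The data and the names `normal_mle` and `loglik` are assumptions made for illustration.

```python
import math
from statistics import fmean


def normal_mle(xs):
    """Closed-form MLEs (mu_hat, sigma_hat) for a normal sample."""
    n = len(xs)
    mu_hat = fmean(xs)
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # divisor n, not n - 1
    return mu_hat, math.sqrt(sigma2_hat)


def loglik(xs, mu, sigma):
    """Normal log-likelihood S(mu, sigma | x)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))


sample = [4.1, 5.3, 3.8, 6.0, 5.2]    # hypothetical data
mu_hat, sigma_hat = normal_mle(sample)
best = loglik(sample, mu_hat, sigma_hat)

# Perturb each parameter by +/- 0.1 and compare log-likelihoods
neighbours = [(mu_hat + dm, sigma_hat + ds)
              for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]
is_local_max = all(loglik(sample, m, s) <= best for m, s in neighbours)
print(mu_hat, sigma_hat, is_local_max)
```

A grid check of this kind is only a sanity test; it does not replace the derivative argument, but it is a cheap way to catch algebra slips.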

Interval estimation

The approach to interval estimation is to choose estimators l(X) and u(X) so that, a priori, P(l(X) < θ < u(X)) is fixed to some preassigned level,

P(l(X) < θ < u(X)) = 1 − α where typical values for α are 0.1, 0.05 or 0.01.

Clearly, the higher the value of α, the narrower the interval but then under repeated sampling, a higher proportion of the intervals generated will not contain the true value θ.

An interval constructed in this manner is called a 100(1 − α)% confidence interval.

Example 29

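Suppose that σ² is known and that we wish to choose an interval estimator for µ. Then we know that

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad\text{and therefore}\quad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

Now, from properties of the normal distribution, P(|Z| < 1.96) = 0.95 so that

$$P\left(\left|\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right| < 1.96\right) = 0.95$$

and therefore

$$P\left(\bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n}\right) = 0.95.$$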

(Mis)interpretation

Assume we take a sample of size 9 from N(µ, 1) and that the sample mean is 2. Then a 95% confidence interval for µ is

(2 − 1.96/3, 2 + 1.96/3) = (1.3467, 2.6533).

How should we interpret this interval?

This does not mean that the probability that µ lies in this interval is 0.95. It just means that 95% of intervals constructed using this approach will contain µ.
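The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below first reproduces the interval for a sample of size 9 with mean 2 and σ = 1, and then generates many samples from a normal distribution with a fixed mean and counts how often the interval contains that mean. The true mean 0.7, the seed and the function names `z_interval` and `coverage` are arbitrary assumptions for the illustration.

```python
import math
import random
from statistics import NormalDist, fmean


def z_interval(xbar, sigma, n, alpha=0.05):
    """100(1 - alpha)% interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def coverage(mu, sigma, n, reps, seed=0):
    """Fraction of simulated intervals that contain the fixed value mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        lo, hi = z_interval(xbar, sigma, n)
        hits += lo < mu < hi
    return hits / reps


print(z_interval(2.0, 1.0, 9))                        # the interval from the example
print(coverage(mu=0.7, sigma=1.0, n=9, reps=10000))   # should be close to 0.95
```

In the simulation µ never changes; what varies from run to run is the interval. The 95% refers to the procedure, not to any single realised interval.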

Hypothesis testing

Example 30

We think that a coin may be biased in favour of heads. Therefore, we decide to throw the coin 12 times and we observe 9 heads. Does this provide evidence that the coin is biased?

Suppose that the true probability of heads is θ. Then the probability of observing exactly x heads is

$$P(x \mid \theta) = \binom{12}{x}\theta^{x}(1 - \theta)^{12 - x}.$$

In particular, the likelihood function is

$$P(x = 9 \mid \theta) = \binom{12}{9}\theta^{9}(1 - \theta)^{3}$$

and the maximum likelihood estimate for θ is $\hat{\theta} = 9/12$.

If the coin were unbiased, then the probability of observing at least 9 heads would be

$$P(x \ge 9 \mid \theta = 0.5) = \sum_{x=9}^{12} \binom{12}{x} 0.5^{x}(1 - 0.5)^{12 - x} = \frac{299}{4096} \approx 0.073.$$

This implies that there would be a 7.3% chance of observing at least as many heads as we have seen even if the coin were really unbiased.
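The tail probability for the coin example is easy to compute exactly. The sketch below sums the binomial probabilities with rational arithmetic; the function name `binomial_upper_tail` is made up for this illustration. For 12 tosses and at least 9 heads under θ = 1/2, the exact value is 299/4096, which is about 0.0730.

```python
from fractions import Fraction
from math import comb


def binomial_upper_tail(n, k, theta):
    """P(X >= k) for X ~ Binomial(n, theta), computed exactly."""
    theta = Fraction(theta)
    return sum(comb(n, x) * theta ** x * (1 - theta) ** (n - x)
               for x in range(k, n + 1))


p_binomial = binomial_upper_tail(12, 9, Fraction(1, 2))
print(p_binomial, float(p_binomial))
```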

Formally, when we wish to test whether there is evidence in favour of some experimental hypothesis of interest (H₁ : θ > 0.5), we assume that this hypothesis is not true, fix the null hypothesis (H₀ : θ = 0.5) and calculate the p-value, i.e. the probability of observing data at least as extreme as the observed values under the null hypothesis. If this probability falls below some pre-specified level, e.g. α = 0.05, then we can decide to reject the null hypothesis in favour of the alternative or experimental hypothesis.

By designing the test procedure in this way, we can say that if the null hypothesis is true and we carry out the test many times, then we will correctly retain the null hypothesis 100(1 − α)% of the time and wrongly reject it 100α% of the time.

Thus at a 5% significance level, we would not reject the hypothesis that the coin was unbiased.

A strange feature

Suppose that we designed a different experiment to test the bias of the coin and decided that we would keep throwing the coin until we observed 3 tails.

Suppose that the third tail was observed on the twelfth toss of the coin.

Then the likelihood function is negative binomial,

$$l(\theta \mid x) = \binom{11}{9}\theta^{9}(1 - \theta)^{3},$$

and just as before, the maximum likelihood estimate of θ is 9/12.

However, if we now calculate the p-value for the test of H₀ : θ = 0.5 vs H₁ : θ > 0.5, we have that

p = P(at least 9 heads before the third tail is observed)

and thus, evaluating at θ = 0.5,

$$p = \sum_{x=9}^{\infty} \binom{x + 3 - 1}{x}\theta^{x}(1 - \theta)^{3} = \frac{67}{2048} \approx 0.0327$$

and now we would reject the hypothesis that the coin is unbiased at a 5% level.
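The p-value under the stopping rule "toss until the third tail" can also be computed exactly. Since the sum is infinite, the sketch below takes one minus the probability of fewer than 9 heads before the third tail. The function name `heads_before_rth_tail_upper_tail` is made up for this illustration. With θ = 1/2 the exact value is 67/2048, about 0.0327.

```python
from fractions import Fraction
from math import comb


def heads_before_rth_tail_upper_tail(k, r, theta):
    """P(at least k heads occur before the r-th tail); theta = P(head)."""
    theta = Fraction(theta)
    # Negative binomial probabilities of exactly x heads, for x = 0, ..., k - 1
    below = sum(comb(x + r - 1, x) * theta ** x * (1 - theta) ** r
                for x in range(k))
    return 1 - below


p_negbin = heads_before_rth_tail_upper_tail(9, 3, Fraction(1, 2))
print(p_negbin, float(p_negbin))
```

The same data (9 heads, 3 tails) therefore give a p-value above 0.05 under the fixed-sample-size design and below 0.05 under this design, even though the two likelihood functions are proportional.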

Fiducial inference and related methods

Fisher

Fiducial inference has the objective of defining a posterior measure of uncertainty for θ without the necessity of defining a prior measure. This approach was introduced by Fisher (1930).

Example 31

Let X ∼ N(µ, σ²). Suppose that we wish to carry out inference for µ assuming σ² is known.

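We know that $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. Then for any z, P(Z > z) = p(z) where p(z) is known. Fisher's idea is to write

$$p(z) = P(Z > z) = P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > z\right) = P\left(\mu < \bar{X} - \frac{\sigma z}{\sqrt{n}}\right)$$

and then define $p(z) = P\left(\mu < \bar{x} - \frac{\sigma z}{\sqrt{n}}\right)$ to be the fiducial probability that µ is less than $\bar{x} - \frac{\sigma z}{\sqrt{n}}$.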

Problems with the fiducial approach

• The probability measure is transferred from the sample space to the parameter space. What is the justification for this?

• What happens if no pivotal statistic exists?

• It is unclear how to apply the fiducial approach in multidimensional problems.

In many cases, fiducial probability intervals coincide with Bayesian credible intervals given specific non-informative prior distributions. In particular, structural inference (see Fraser, 1968) corresponds to Bayesian inference using so-called Haar prior distributions. However, we will see that the Bayesian justification for such intervals is more coherent.

Bayesian inference

de Finetti

Bayesian inference stems originally from the ideas developed by Bayes, and much of the modern theory comes from the work of de Finetti in the 1930s.

Characteristics of Bayesian inference

Firstly, Bayesian inference depends directly on the subjective definition of probability.

We can all have our own probabilities for a given event:

P(head), P(rain tomorrow), P(Mike was born in 1962).

Our probabilities may be different as they are our own measures of the likelihood of given events.

Secondly, given a sample x and a prior distribution p(θ) for θ, we can update our beliefs using Bayes' theorem:

$$p(\theta \mid x) = \frac{f(x \mid \theta)\,p(\theta)}{f(x)} \propto f(x \mid \theta)\,p(\theta) = l(\theta \mid x)\,p(\theta).$$

Estimation and credible intervals

For a Bayesian, estimation is treated as a decision problem. In a given situation, we should choose the estimator that minimizes the loss we expect to incur. Utility theory can be used to choose an optimal estimator.

A 95% credible interval for θ is an interval [a, b] such that our probability that θ lies in [a, b] is 95%.

Prediction is also straightforward. If Y is a new observation, then the predictive distribution of Y is

$$f(y \mid x) = \int f(y \mid x, \theta)\,p(\theta \mid x)\,d\theta.$$

Bayesian analysis of the coin tossing example

Example 32

Suppose that our prior beliefs about θ are represented by a uniform distribution θ ∼ U(0,1).

The uniform distribution is an example of a beta distribution,

$$p(\phi \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\phi^{\alpha - 1}(1 - \phi)^{\beta - 1} \quad\text{for } 0 < \phi < 1,$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the beta function. Setting α = β = 1 gives the uniform distribution. This is not a very realistic representation of typical prior knowledge. It would be more appropriate to use a symmetric beta distribution, e.g. B(5,5).

We can now calculate the posterior distribution via Bayes theorem.

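From Bayes' theorem, the posterior distribution is

$$p(\theta \mid x) \propto 1 \times \binom{12}{9}\theta^{9}(1 - \theta)^{3} \propto \theta^{9}(1 - \theta)^{3} = \theta^{10 - 1}(1 - \theta)^{4 - 1},$$

so that, after normalizing,

$$p(\theta \mid x) = \frac{1}{B(10, 4)}\,\theta^{10 - 1}(1 - \theta)^{4 - 1},$$

which implies that θ|x ∼ B(10,4).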

It can now be demonstrated that P(θ ≤ 1/2|x) ≈ .046 and we might choose to reject the hypothesis that θ ≤ 0.5. Note however that this does not constitute a formal hypothesis test.
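The posterior probability that θ ≤ 1/2 can be checked by integrating the B(10,4) density numerically. The sketch below uses composite Simpson's rule from the standard library only; the function names `beta_pdf` and `beta_cdf` and the number of steps are choices made for this illustration, and the integration assumes α ≥ 1 and β ≥ 1 so that the density is finite at the end points.

```python
import math


def beta_pdf(theta, a, b):
    """Density of the beta distribution with parameters a and b."""
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn


def beta_cdf(x, a, b, steps=2000):
    """P(theta <= x) by composite Simpson's rule (steps must be even)."""
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * beta_pdf(i * h, a, b)
    return total * h / 3


posterior_tail = beta_cdf(0.5, 10, 4)
print(round(posterior_tail, 4))
```

For integer parameters the same probability is available in closed form as a binomial tail, P(θ ≤ 1/2 | x) = 378/8192 ≈ 0.0461, which provides an independent check of the numerical integral.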

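From the properties of the beta distribution, we know that if φ ∼ B(α, β), then $E[\phi] = \frac{\alpha}{\alpha + \beta}$.

Thus, in our case, we have

$$E[\theta \mid x] = \frac{10}{10 + 4} = \frac{5}{7} \quad\text{and moreover,}\quad \frac{5}{7} = \frac{1}{7} \times \frac{1}{2} + \frac{6}{7} \times \frac{9}{12},$$

which implies that

$$E[\theta \mid x] = \frac{1}{7}E[\theta] + \frac{6}{7}\hat{\theta}$$

where E[θ] = 1/(1 + 1) = 1/2 is the prior mean and $\hat{\theta} = 9/12$ is the MLE of θ.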

Thus, the posterior mean is a weighted average of the prior mean and the MLE.
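The weighted-average form of the posterior mean holds for any beta prior with binomial data, not only for the uniform prior. The sketch below, with the made-up function name `posterior_mean_weights`, computes the posterior mean and the weight placed on the prior mean, assuming a B(a, b) prior with positive integer parameters and x heads in n tosses.

```python
from fractions import Fraction


def posterior_mean_weights(a, b, heads, n):
    """Posterior mean under a B(a, b) prior and the weight on the prior mean."""
    prior_mean = Fraction(a, a + b)
    mle = Fraction(heads, n)
    weight = Fraction(a + b, a + b + n)          # weight given to the prior mean
    post_mean = Fraction(a + heads, a + b + n)   # mean of B(a + heads, b + n - heads)
    # The posterior mean is exactly the weighted average of prior mean and MLE
    assert post_mean == weight * prior_mean + (1 - weight) * mle
    return post_mean, weight


print(posterior_mean_weights(1, 1, 9, 12))   # uniform prior
print(posterior_mean_weights(5, 5, 9, 12))   # the symmetric B(5,5) prior
```

The weight on the prior mean is (a + b)/(a + b + n), so a + b acts like a prior sample size: the more data we collect, the closer the posterior mean moves to the MLE.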

Suppose that we wish to predict the number of heads, Y , in ten further tosses of the same coin. Thus, we have Y |θ ∼ BI(10, θ) and therefore,

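$$
\begin{aligned}
f(y \mid x) &= \int f(y \mid x, \theta)\,p(\theta \mid x)\,d\theta = \int f(y \mid \theta)\,p(\theta \mid x)\,d\theta \\
&= \int_{0}^{1} \binom{10}{y}\theta^{y}(1 - \theta)^{10 - y} \times \frac{1}{B(10, 4)}\,\theta^{10 - 1}(1 - \theta)^{4 - 1}\,d\theta \\
&= \binom{10}{y}\frac{1}{B(10, 4)} \int_{0}^{1} \theta^{10 + y - 1}(1 - \theta)^{14 - y - 1}\,d\theta \\
&= \binom{10}{y}\frac{B(10 + y, 14 - y)}{B(10, 4)}
\end{aligned}
$$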

which is the so-called beta-binomial distribution. The following diagram illustrates the Bayesian predictive probability distribution of Y and the binomial predictive distribution (BI(10, 0.75)) derived from substituting the MLE, $\hat{\theta} = 0.75$, for θ.
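The two predictive distributions can be tabulated directly. The sketch below evaluates the beta-binomial probabilities for ten further tosses under the B(10,4) posterior and, for comparison, the binomial probabilities with θ fixed at 0.75. The function names are made up for this illustration, and `beta_fn` uses the factorial form of the beta function, which is valid only for positive integer arguments.

```python
from fractions import Fraction
from math import comb, factorial


def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def beta_binomial_pmf(y, m, a, b):
    """P(Y = y) for m future tosses when theta has a B(a, b) distribution."""
    return comb(m, y) * beta_fn(a + y, b + m - y) / beta_fn(a, b)


def binomial_pmf(y, m, theta):
    """P(Y = y) for Y ~ BI(m, theta)."""
    return comb(m, y) * theta ** y * (1 - theta) ** (m - y)


bayes = [beta_binomial_pmf(y, 10, 10, 4) for y in range(11)]
classical = [binomial_pmf(y, 10, Fraction(3, 4)) for y in range(11)]

for y in range(11):
    print(y, round(float(bayes[y]), 4), round(float(classical[y]), 4))
```

The Bayesian predictive distribution is more spread out than the plug-in binomial, because it also carries the remaining uncertainty about θ.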

The predictive distribution

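[Figure: predictive probabilities p (vertical axis, 0 to 0.3) against Y = 0, 1, …, 10 (horizontal axis), for the Bayes and Classical predictive distributions.]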