Discrete Kernel Estimation and Its Applications

Nicolás Idrobo

May 24, 2013
Abstract
An important question in probability and statistics is the one that deals with the
underlying probability density function (PDF) of a given set of random independent
and identically distributed draws. It is of particular interest to be able to estimate such
a PDF using the available data. This paper addresses this question using nonparametric
techniques, both for the case when the random variable is continuous and for the case
when it is discrete. Simple theoretical results are shown with the purpose of studying
the empirical behavior of such estimations. Finally, the machinery developed for PDF
estimation on discrete random variables serves as the basis for an introduction to
nonparametric regression estimation.
∗This document is presented as a final paper to obtain the bachelor's degree in Mathematics. I am deeply
grateful to Adolfo Quiroz, my advisor, who gave me excellent guidance throughout the whole process. I am
also grateful to Sebastián Martínez, Román David Zárate and David Zarruk for reading preliminary versions
of this paper. Finally, all remaining errors are my own.
1 Introduction
The objective in this paper is to introduce kernel methods, which, in general, serve to
estimate unknown functions. In particular, we will be interested in estimating probability
density functions (PDFs). We will use this type of function to estimate PDFs because, as
we will see, the problem arises naturally in this context, and because the theory developed
in order to estimate PDFs will be of extreme importance later on when we try to estimate
more complex relations.
These methods have emerged recently as an alternative to parametric approaches, in which
strong assumptions about functional forms and distributions of parameters are made. The
advantage of parametric methods is that they are much simpler and that, if the assumptions
are correct, they tend to produce accurate results. On the other hand, if any of the
assumptions fail, these methods may produce biased estimators that may lead to wrong conclusions.
The advantage of nonparametric methods is their flexibility and their lack of assumptions.
But these advantages are not free. Nonparametric methods require some statistical
knowledge, some computing power, and lots of observations when the dimension of the problem is
high. This may be a low price to pay in order to obtain accurate conclusions, and that is
why these methods are becoming more widespread.
This paper is organized as follows. Section 2 presents the most intuitive setup for kernel
methods, which is the case of estimating PDFs when the underlying random variable is
continuous. Section 3 presents kernel estimation of PDFs when the random variable has a finite
support, and this serves as the basis for the nonparametric regression estimation presented in Section 4.
2 Kernel Estimation on Continuous Random Variables
The objective in this section is to approach the problem of finding the probability density
function (PDF) of a given continuous random variable $X$, when the only known information is a set of $n$ independent and identically distributed realizations $\{X_1, X_2, \ldots, X_n\}$ of $X$.
Traditional parametric methods like Maximum Likelihood Estimation (MLE) are meant to discover
the underlying PDF of a given set of realizations using strong assumptions about its
distribution. This sort of circular argument is not always useful, and may lead to false conclusions.
In order to give a practical discussion of the subject, we will first expose the basic theory
of kernel estimation on continuous random variables, and later on we will develop an
example that compares MLE with kernel estimation. The results of such an example will show
how accurate and easy to implement the kernel estimation method is when compared to
traditional parametric approaches.
2.1 Univariate Density Estimation
Throughout this section we will expose the basic theory underlying univariate density
estimation using kernel methods. Some definitions will be required in order to explain the
estimation technique.
Definition 1. Let $\Omega$ be a set such that $\Omega \neq \emptyset$. A collection of subsets of $\Omega$ will be called a $\sigma$-algebra, and will be denoted by $\mathcal{F}$, if:

i.) $\Omega \in \mathcal{F}$

ii.) If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$

iii.) If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$
Definition 2. Let $\Omega$ be a set such that $\Omega \neq \emptyset$ and let $\mathcal{F}$ be a $\sigma$-algebra defined over $\Omega$. The tuple $(\Omega, \mathcal{F})$ is called a measurable space.
Definition 3. Let $(\Omega, \mathcal{F})$ be a measurable space. A real-valued function $P$ defined over $\mathcal{F}$ is called a probability measure if it satisfies:

i.) $P(A) \geq 0$ for all $A \in \mathcal{F}$.

ii.) $P(\Omega) = 1$.

iii.) If $A_1, A_2, \ldots$ belong to $\mathcal{F}$ and are pairwise disjoint ($A_i \cap A_j = \emptyset$ for all $i \neq j$), then:
\[
P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)
\]

The tuple $(\Omega, \mathcal{F}, P)$ is called a probability space.
Definition 4. We define some notation regarding the order of a sequence. A real sequence $\{a_n\}_{n \in \mathbb{N}}$ is of order $O(1)$ if there exists $C \in \mathbb{R}$ such that $|a_n| \leq C$ for all $n$. Similarly, a sequence is said to be of order $O(b_n)$, with $\{b_n\}_{n \in \mathbb{N}}$ another real sequence, if $a_n/b_n = O(1)$.
Definition 5. A real sequence $\{a_n\}_{n \in \mathbb{N}}$ is said to be of order $o(1)$ if $a_n \to 0$ when $n \to \infty$. Similarly, it is said that $a_n = o(b_n)$, with $\{b_n\}_{n \in \mathbb{N}}$ another real sequence, if $a_n/b_n \to 0$ as $n \to \infty$.
Definition 6. Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(\tilde{\Omega}, \tilde{\mathcal{F}})$ be a measurable space. An $\mathcal{F}$-$\tilde{\mathcal{F}}$-random variable is a map $X : \Omega \to \tilde{\Omega}$ such that, for all $A \in \tilde{\mathcal{F}}$, it is true that $X^{-1}(A) \in \mathcal{F}$. If $(\tilde{\Omega}, \tilde{\mathcal{F}}) = (\mathbb{R}, \mathcal{B})$, with $\mathcal{B}$ the Borel¹ $\sigma$-algebra over $\mathbb{R}$, it is said that $X$ is a real-valued random variable.
Theorem 1. Let $X$ be a random variable defined over the probability space $(\Omega, \mathcal{F}, P)$, which takes values in the measurable space $(\mathbb{R}, \mathcal{B})$. The function $P_X$ defined over $\mathcal{B}$ by:
\[
P_X(B) = P(\{X \in B\}), \quad \text{for all } B \in \mathcal{B}
\]
is a probability measure over $(\mathbb{R}, \mathcal{B})$ called the distribution (or law) of the random variable $X$.

Proof: See Blanco (2004) for a complete proof.
For the rest of this section I will assume that $\Omega = \mathbb{R}$ and $\mathcal{F} = \mathcal{B}$. Also, I will use the following notational convention:
\[
f(x) = f(X = x) = P_X(\{x\}) = P(\{X \in \{x\}\})
\]
which means that $f(x)$ will denote the probability that the random variable $X$ takes the value $x \in \mathbb{R}$.
¹The Borel $\sigma$-algebra over $\mathbb{R}$ is the smallest $\sigma$-algebra that contains all the open subsets of $\mathbb{R}$.
Definition 7. Let $X$ be a real-valued random variable. The cumulative distribution function (CDF) of $X$ evaluated at $x$ is called $F(x)$ and is defined as:
\[
F(x) = P_X((-\infty, x]) = P[X \leq x] \tag{1}
\]
where $P$ refers to the probability of the event in brackets.
With the above definitions it is possible to set up the basic estimation problem. Let's assume
that we have $n$ independent and identically distributed (i.i.d.) realizations of a given random variable $X$, given by $\{X_1, X_2, \ldots, X_n\}$. We do not know the PDF nor the CDF related to
$X$. Is it possible to somehow estimate the true PDF and the true CDF using just the realizations on hand? The answer is yes. The next definition provides an intuitive estimator for
the real CDF, which is the first approach to the problem.
Definition 8. With a set of i.i.d. data $X_1, X_2, \ldots, X_n$ it is possible to estimate $F(x)$ by:
\[
F_n(x) = \frac{|\{i : X_i \leq x\}|}{n} \tag{2}
\]

The above function $F_n(x)$ is called the frequency estimator of $F(x)$ and simply takes into
account the proportion of draws that lie below $x$ in order to estimate $F(x)$.
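As an illustration, the frequency estimator in (2) amounts to a single counting operation. The following is a minimal sketch (the function name `ecdf` and the sample values are our own, for illustration only):

```python
import numpy as np

def ecdf(draws, x):
    """Frequency estimator F_n(x): the proportion of draws at or below x."""
    return np.mean(np.asarray(draws) <= x)

sample = np.array([0.2, 0.5, 0.5, 0.9])
print(ecdf(sample, 0.5))  # 3 of the 4 draws are <= 0.5, so this prints 0.75
```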
The frequency estimator described above is intuitive in the sense that it just counts to
estimate. It is fully nonparametric because the only assumption is that the data are independent
and identically distributed. A natural question is whether it is possible to construct an estimator
for $f(x)$ with the same basic idea of counting. The next definition states such an estimator for the PDF.
Definition 9. Let $X_1, X_2, \ldots, X_n$ be i.i.d. realizations of the random variable $X$. It
is possible to estimate the probability that $X$ equals $x$ by:
\[
\hat{f}(x) = \frac{|\{i : X_i = x\}|}{n} \tag{3}
\]
The above estimator is also a frequency estimator, since it just counts the number of
coincidences between the draws and the value $x \in \mathbb{R}$. In the case that $X$ is a continuous random variable, this estimator is not useful, since the event $\{X = x\}$ has probability zero. It
is possible to produce a less naïve estimator for the PDF of $X$ using the estimator $F_n(x)$.
Before being able to define such an estimator it is necessary to express the relation between
$F(x)$ and $f(x)$. If $X$ is a continuous random variable and $F(x)$ is differentiable, it is true that:
\[
f(x) = \frac{d}{dx} F(x)
\]
In the case that $F(x)$ is not differentiable, $f(x)$ can be approximated by:
\[
f(x) = \frac{F(x+h) - F(x-h)}{2h} \tag{4}
\]
for $h \approx 0$. This is precisely what must be used to relate the CDF frequency estimator in (2) to the analogous PDF estimator. Using (2) and (4) we have that:
\[
\hat{f}_n(x) = \frac{F_n(x+h) - F_n(x-h)}{2h} \tag{5}
\]

The above estimator also counts in order to find an estimated value of $f(x)$: it counts the number of draws that fall in a small neighborhood (of radius $h$) of $x$. Using the definition of the frequency estimator contained in (2) and the estimator just defined in (5) we get:
\[
\hat{f}(x) = \frac{1}{2h} \left[ \frac{|\{i : X_i \leq x+h\}|}{n} - \frac{|\{i : X_i \leq x-h\}|}{n} \right]
\implies \hat{f}(x) = \frac{|\{i : X_i \in [x-h, x+h]\}|}{2nh} \tag{6}
\]
The above estimator seems better behaved than (3) in the sense that it takes into account whether
the random draw $X_i$ is near $x$, instead of just looking at whether it is exactly equal to $x$, which is an
unlikely event in $\mathbb{R}$. That is precisely the spirit of the kernel estimator for $f(x)$ that I will define next.
Definition 10. A uniform kernel function $k(z)$, for $z \in \mathbb{R}$, is defined by:
\[
k(z) = \begin{cases} 1/2 & \text{if } |z| \leq 1 \\ 0 & \text{if } |z| > 1 \end{cases}
\]

Using the definition of $k(\cdot)$ and (6) it is possible to define a uniform kernel estimator for $f(x)$ given by:
\[
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \tag{7}
\]
where $h$ is a small, suitably chosen parameter.
Kernel functions do not need to be uniform, but for the sake of simplicity this is usually the
case. The conditions that any kernel function should satisfy in order to have consistency of $\hat{f}(x)$ are the following:

i.) $\int k(v)\,dv = 1$

ii.) $\int v\,k(v)\,dv = 0$

iii.) $\int v^2 k(v)\,dv > 0$
Theorem 2. The estimator $\hat{f}(x)$ defined in (7) integrates up to one.

Proof: We directly integrate $\hat{f}(x)$, taking into account the conditions assumed for $k(\cdot)$:
\[
\int \hat{f}(x)\,dx = \int \left[ \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \right] dx = \frac{1}{nh} \sum_{i=1}^{n} \int k\left(\frac{X_i - x}{h}\right) dx
\]
We define $z = \frac{x - X_i}{h}$, so that $h\,dz = dx$. Therefore:
\[
\int \hat{f}(x)\,dx = \frac{1}{nh} \sum_{i=1}^{n} \int k(-z)\,h\,dz = \frac{1}{nh} \sum_{i=1}^{n} h \int k(-z)\,dz = \frac{1}{nh}\,nh = 1
\]
where $\int k(-z)\,dz = 1$ follows from condition i.) after the change of variable $u = -z$.
We are now interested in studying the bias and the variance of the estimator defined in (7).
These two quantities, and particularly their properties as $n$ gets bigger, are important in order to assess the quality of the estimator in relation to the true PDF.

Theorem 3. Let $\hat{f}(x)$ be the estimator defined in (7). The bias and the variance of this estimator are given by:
\[
\operatorname{bias}(\hat{f}(x)) = \frac{h^2}{2} f^{(2)}(x) \int v^2 k(v)\,dv + O(h^3) \tag{8}
\]
\[
\operatorname{var}(\hat{f}(x)) = \frac{1}{nh}\left[\kappa f(x) + O(h)\right] \tag{9}
\]
where $\kappa = \int k^2(v)\,dv$ and $f^{(n)}$ refers to the $n$-th derivative of $f$.
Proof

• The bias of the estimated PDF is given by $E[\hat{f}(x)] - f(x)$. Therefore:
\begin{align*}
\operatorname{bias}(\hat{f}(x)) &= E\left[ \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \right] - f(x) \\
&= \frac{1}{nh} \sum_{i=1}^{n} E\left[ k\left(\frac{X_i - x}{h}\right) \right] - f(x) \\
&= \frac{1}{h} E\left[ k\left(\frac{X_1 - x}{h}\right) \right] - f(x) \quad \text{(by i.i.d.)} \\
&= \frac{1}{h} \int f(x_1)\,k\left(\frac{x_1 - x}{h}\right) dx_1 - f(x) \\
&= \int f(x + hv)\,k(v)\,dv - f(x) \quad \text{(change of variable)} \\
&= \int \left[ f(x) + f^{(1)}(x)hv + \tfrac{1}{2} f^{(2)}(x)h^2 v^2 + O(h^3) \right] k(v)\,dv - f(x) \quad \text{(Taylor expansion)} \\
&= \frac{h^2}{2} f^{(2)}(x) \int v^2 k(v)\,dv + O(h^3) \quad \text{(properties of } k(\cdot)\text{)}
\end{align*}
• The proof of the variance also follows from the definition:
\begin{align*}
\operatorname{var}(\hat{f}(x)) &= \operatorname{var}\left[ \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \right] \\
&= \frac{1}{n^2 h^2} \sum_{i=1}^{n} \operatorname{var}\left[ k\left(\frac{X_i - x}{h}\right) \right] \quad \text{(by i.i.d.)} \\
&= \frac{1}{n h^2} \operatorname{var}\left[ k\left(\frac{X_1 - x}{h}\right) \right] \\
&= \frac{1}{n h^2} \left\{ E\left[ k^2\left(\frac{X_1 - x}{h}\right) \right] - \left( E\left[ k\left(\frac{X_1 - x}{h}\right) \right] \right)^2 \right\} \\
&= \frac{1}{n h^2} \left\{ \int f(x_1)\,k^2\left(\frac{x_1 - x}{h}\right) dx_1 - \left[ \int f(x_1)\,k\left(\frac{x_1 - x}{h}\right) dx_1 \right]^2 \right\} \\
&= \frac{1}{n h^2} \left\{ \int f(x + hv)\,k^2(v)\,h\,dv - \left[ \int f(x + hv)\,k(v)\,h\,dv \right]^2 \right\} \\
&= \frac{1}{n h^2} \left\{ h \int \left[ f(x) + f^{(1)}(x)hv + O(h^2) \right] k^2(v)\,dv - O(h^2) \right\} \\
&= \frac{1}{nh}\left[ \kappa f(x) + O(h) \right]
\end{align*}
with $\kappa = \int k^2(v)\,dv$.
The above theorem is important because it provides the ingredients for a central limit theorem for the estimator. It is of particular
importance to understand that a small bandwidth $h$ will reduce the bias and increase the variance of the estimator. In order to choose an appropriate value of $h$ one needs to find a value that balances this tradeoff. A large body of literature is oriented to this subject, and we
will not cover it in this case, but only in the discrete case of the next section.
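To fix ideas, the uniform kernel estimator (7) can be sketched in a few lines of code (function names and sample values are our own, for illustration):

```python
import numpy as np

def uniform_kernel(z):
    # k(z) = 1/2 if |z| <= 1, and 0 otherwise (Definition 10)
    return np.where(np.abs(z) <= 1, 0.5, 0.0)

def f_hat(draws, x, h):
    """Uniform kernel density estimate: (1/nh) * sum_i k((X_i - x)/h)."""
    draws = np.asarray(draws, dtype=float)
    return uniform_kernel((draws - x) / h).sum() / (len(draws) * h)
```

By construction this coincides with counting the draws that fall in $[x-h, x+h]$ and dividing by $2nh$, exactly as in (6).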
2.2 Empirical comparison of parametric and nonparametric approaches
The objective in this section is to show in a practical way how a common parametric
technique like Maximum Likelihood Estimation (MLE) works, and to compare its estimates
with those obtained with a nonparametric approach like the kernel estimation method
presented in the previous subsection. The problem with MLE is that, given a set of $n$ random i.i.d. draws $\{X_1, X_2, \ldots, X_n\}$ of the random variable $X$, it is mandatory to know a priori
the PDF of $X$ in order to estimate the PDF of $X$. It is precisely because of this type of circular argument that MLE is not useful unless the real distribution of $X$ is known. On the contrary, if the distribution of $X$ is not known, and one incorrectly specifies it in order to use MLE, serious estimation bias may arise, as I will show later in this subsection.
The way MLE works is as follows. Let $X_1, X_2, \ldots, X_n$ be a set of $n$ independent and identically
distributed (i.i.d.) draws from a normal distribution with mean $\mu$ and variance $\sigma^2$. Notice that the general form of the PDF of a normal distribution is given by:
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
\]
So the problem of finding the PDF related to the mentioned realizations simplifies to the
problem of finding the parameters $\mu$ and $\sigma^2$ that best describe the underlying distribution of these $n$ realizations. This objective may be achieved using the method of Maximum Likelihood Estimation (MLE). Given the fact that the observations are i.i.d., their joint distribution is
given by:
\[
\mathcal{L} := f(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(X_i-\mu)^2}{2\sigma^2} \right\}
\]
where $\mathcal{L}$ is called the likelihood function. Then $\mathcal{L}$ is conditioned on the data and the logarithm is taken in order to construct the log-likelihood function, which is given by:
\[
\ell(\mu, \sigma^2) = \ln \mathcal{L} = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2
\]
First order conditions $\partial\ell(\mu,\sigma^2)/\partial\mu = 0$ and $\partial\ell(\mu,\sigma^2)/\partial\sigma^2 = 0$ guarantee that the function
$\ell(\mu,\sigma^2)$ is maximized, due to the fact that it is concave in its arguments. The parameters
that maximize the log-likelihood function are:
\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2
\]
Finally, the estimated PDF is written as:
\[
\hat{f}(x) = \frac{1}{\hat{\sigma}\sqrt{2\pi}} \exp\left\{ -\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2} \right\}
\]
So, the implementation of this method is simple and powerful when the underlying
distribution of the random draws is known, but it is imprecise when this distribution is unknown.
The objective for the rest of the section is to generate a set of random draws from a given
distribution, and then estimate $f(x)$ using MLE while assuming an incorrect underlying distribution. Then, I will compare the MLE estimation with the kernel estimation for the same set of
data.
The first step is to generate a random sample from a known distribution. We choose a Beta
distribution with shape parameters $\alpha = 3$ and $\beta = 2$. The point of this distribution is that in general it is not symmetric around its mean. We generate $n = 10{,}000$ random draws from this distribution. The next table summarizes the basic descriptive statistics of these draws:
Table 1: Descriptive statistics of the random draws

Observations     10,000
Mean             0.598804
Mode             0.611891
Std. Deviation   0.198426
Minimum          0.012593
Maximum          0.99556
Given that the data come from a Beta distribution with known parameters, the true
density for every $x \in [0,1]$ is given by:
\[
f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du}
\]
The objective now is to show that if MLE is used assuming a wrong distribution, the resulting estimation may be severely biased. A normal distribution is assumed in order to estimate with MLE, which produces the estimated parameters
$\hat{\mu} = 0.598804$ and $\hat{\sigma}^2 = (0.198426)^2$. With these two parameters, the estimated distribution
is given by:
\[
\hat{f}_1(x) = \frac{1}{\hat{\sigma}\sqrt{2\pi}} \exp\left\{ -\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2} \right\}
\]
The natural question is how the true PDF compares to the PDF that resulted from
applying MLE to the data. The next graph shows this:
Figure 1: True PDF vs. Normal MLE (probability on the vertical axis, $X$ on the horizontal axis).
The above figure shows that if the distribution used to estimate with MLE is not the right
one, the estimated PDF may be completely mistaken. Assuming a normal distribution for
the MLE procedure is costly, since it implies that the support is the whole real line (and
that is not the case for the Beta distribution), and it assumes unimodality, symmetry around the
mean, among other conditions. This may lead to inaccurate estimations of the true PDF. In
Economics, for example, it is frequently assumed that the data come from a normal
distribution in order to estimate with MLE when the real distribution of the data is completely
unknown. The above graph illustrates the danger of this approach. Later in this section we will
develop formal methods to compare the quality of two estimators.
The next step is to estimate using the kernel method described in the previous subsection.
The estimated PDF will be given by:
\[
\hat{f}_2(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \tag{10}
\]
where $k(\cdot)$ is the same as in Definition 10, and $h$ will be equal to 0.05216. The theory on how to choose the parameter $h$ is referred to as cross-validation and we will explain it in the next chapter. For each $x \in \mathbb{R}$ it is possible to estimate $f(x)$ using the $n$ random draws and equation (10). The next figure shows the relation between the true PDF and the estimated one
using the kernel method.
Figure 2: True PDF vs. Kernel PDF (probability on the vertical axis, $X$ on the horizontal axis).
It is clear that the kernel method produced a much better estimation than the MLE with
the wrong assumption. This is the case because the kernel method is nonparametric and we
did not need to know a priori the true distribution of the $n$ random draws. Of course, the kernel method is sensitive to the parameter $h$. In fact, cross-validation methods are meant to choose the appropriate value of $h$. For a particular cross-validation method on continuous random variables see Bowman and Foster (1993).
In order to compare the quality of $\hat{f}_1(\cdot)$ and $\hat{f}_2(\cdot)$, we define a quality measure known as the Mean Squared Error (MSE):
\[
\operatorname{MSE}_1(h) = \frac{1}{|G|} \sum_{x \in G} \left[ \hat{f}_1(x) - f(x) \right]^2
\]
\[
\operatorname{MSE}_2(h) = \frac{1}{|G|} \sum_{x \in G} \left[ \hat{f}_2(x, h) - f(x) \right]^2
\]
where $G$ refers to the grid of the domain used to estimate. Since $\operatorname{MSE}_1(h)$ does not depend
on $h$, it is constant for all values of $h$. The next figure shows the relation between these two quantities for different values of $h$:
Figure 3: MSE comparison (Normal MLE MSE vs. Kernel MSE as functions of $h$).
The above method is a naïve way to find a good value of $h$ in order to estimate. The diamond on the dashed line indicates the minimum value of $\operatorname{MSE}_2(h)$. This is precisely the value that
was used in order to estimate in Figure 2.
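The naïve search just described can be reproduced with a short script. This is a sketch under our own assumptions (the random seed, the grids, and the closed-form Beta(3,2) density $f(x) = 12x^2(1-x)$ are our choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.beta(3, 2, size=10_000)          # same design as in the text

def true_pdf(x):
    # Beta(3, 2) density: x^2 (1 - x) / B(3, 2), with B(3, 2) = 1/12
    return 12 * x**2 * (1 - x)

def f_hat(x, h):
    # uniform kernel estimate (7): share of draws in [x - h, x + h] over 2h
    return np.mean(np.abs(draws - x) <= h) / (2 * h)

grid = np.linspace(0.01, 0.99, 99)           # the grid G

def mse(h):
    return np.mean([(f_hat(x, h) - true_pdf(x)) ** 2 for x in grid])

hs = np.linspace(0.01, 0.30, 30)
best_h = hs[int(np.argmin([mse(h) for h in hs]))]
```

Here `best_h` plays the role of the diamond in Figure 3; its exact value depends on the simulated sample.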
So, the conclusion of this empirical exercise is that one may incur serious bias when a
wrong assumption is made while using MLE. In general this is the problem present in parametric
methods. On the contrary, kernel estimation is fully nonparametric, and with an appropriate choice of the bandwidth $h$ it produces accurate estimates without distributional assumptions.
3 Kernel Estimation on Discrete Random Variables
The objective in this section is to describe the most common kernel method used to estimate
in the case when $X$ has a finite support. In the general case, $X$ may be a random multidimensional vector. Section 3.1 will expose the estimator for the general case, while Section
3.2 will treat the same estimator for the case when $X$ has one dimension. The objective in Section 3.2 is to show the estimation technique using a basic setup, and to derive a central
limit theorem for the kernel estimator. Finally, Section 3.3 will show an application of this
method for the particular case of two dimensions.
3.1 General Case
Let $X$ be an $r$-dimensional random vector, for $r \geq 1$. In order to estimate the PDF of $X$, we will need a set of $n$ random i.i.d. draws that will be denoted by $X_i^d$, with $i \in \{1, 2, \ldots, n\}$. Since $X$ is $r$-dimensional, we define an index $s$, an indicator of the component of the vector; therefore $s \in \{1, 2, \ldots, r\}$. This means that $X_{is}^d$ will refer to the $s$-th component of $X_i^d$, and $x_s^d$ to the $s$-th component of $x^d$, where $x^d$ will be a point of evaluation later on.

The support of $X$ is finite and discrete. Therefore we may take $x_s^d, X_{is}^d \in \{0, 1, \ldots, c_s - 1\}$. It is worthwhile to observe that the support of each component may be different in its elements and in its size. Throughout this section $c_s$ will denote the size of the support of $x_s^d$ and $X_{is}^d$.
We should not forget the objective of this section. Our interest is to estimate $f(x^d)$, where
$f(\cdot)$ refers to the PDF of the random vector $X$. To this purpose we will construct an estimator $\hat{f}(\cdot)$ that will be the analogue of the one for continuous random variables found in (7).
Definition 11. A discrete kernel function for component $s$ is defined by:
\[
l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 - \lambda_s & \text{if } X_{is}^d = x_s^d \\[4pt] \dfrac{\lambda_s}{c_s - 1} & \text{if } X_{is}^d \neq x_s^d \end{cases} \tag{11}
\]

Notice that if $\lambda_s = 0$, the function $l(X_{is}^d, x_s^d, \lambda_s)$ becomes an indicator function. Also, if $\lambda_s = (c_s - 1)/c_s$, it assigns the same weight $1/c_s$ to every point of the support.
Theorem 4. For $s$, $x_s^d$, $\lambda_s$ and $c_s$ fixed, the function $l(X_{is}^d, x_s^d, \lambda_s)$ sums up to one over the support.
Proof

Adding over all possible values of $X_{is}^d$, we get that:
\[
\sum_{y=0}^{c_s-1} l(y, x_s^d, \lambda_s) = (1 - \lambda_s) + \sum_{\substack{y=0 \\ y \neq x_s^d}}^{c_s-1} \frac{\lambda_s}{c_s - 1} = (1 - \lambda_s) + \frac{\lambda_s}{c_s - 1}(c_s - 1) = 1
\]
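The component kernel (11) and the check in Theorem 4 translate directly into code (the names `l_kernel` and the parameter values below are ours, for illustration):

```python
def l_kernel(X_is, x_s, lam, c_s):
    """Discrete kernel for one component, equation (11)."""
    return 1 - lam if X_is == x_s else lam / (c_s - 1)

# Theorem 4: summing the kernel over the whole support {0, ..., c_s - 1} gives 1
c_s, lam, x_s = 5, 0.3, 2
total = sum(l_kernel(y, x_s, lam, c_s) for y in range(c_s))
print(total)  # 1.0 up to floating-point rounding
```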
We are now interested in defining a kernel function for the whole vector $x^d$. We shall
remember that we are assuming a total of $r$ components. The kernel function for the whole vector will make use of the kernel function by components that was defined above.
Definition 12. A vector kernel function meant to compare vectors by components is defined as:
\begin{align*}
L(X_i^d, x^d, \lambda) &= \prod_{s=1}^{r} l(X_{is}^d, x_s^d, \lambda_s) \tag{12} \\
&= \prod_{s=1}^{r} \left( \frac{\lambda_s}{c_s - 1} \right)^{N_{is}(x)} (1 - \lambda_s)^{1 - N_{is}(x)}
\end{align*}
where $N_{is}(x)$ is an indicator function defined as:
\[
N_{is}(x) = \begin{cases} 1 & \text{if } X_{is}^d \neq x_s^d \\ 0 & \text{if } X_{is}^d = x_s^d \end{cases}
\]
Having defined the vector kernel function we are ready to introduce the estimator for the PDF of $X$.

Definition 13. A nonparametric kernel estimator for the PDF of the random vector $X$ in the case that $X$ has a finite support is given by:
\[
\hat{p}(x^d) = \frac{1}{n} \sum_{i=1}^{n} L(X_i^d, x^d, \lambda) \tag{13}
\]
At this point we state that, with the purpose of being able to show theoretical results
related to a central limit theorem for the estimator, and with the objective of showing the
behavior of the estimator with an empirical exercise, we will focus on the one-dimensional case.
3.2 Particular Case
Let $X$ be a one-dimensional random vector. Due to the fact that $X$ only has one dimension, the vector kernel function simplifies to the kernel function of the only component:
\[
L(X_i, x, \lambda) = \begin{cases} 1 - \lambda & \text{if } X_i = x \\[4pt] \dfrac{\lambda}{c - 1} & \text{if } X_i \neq x \end{cases}
\]
The PDF estimator defined in (13) for the case when $X$ is a one-dimensional random vector is given by:
\[
\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} L(X_i, x, \lambda) \tag{14}
\]
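A minimal sketch of the one-dimensional estimator (14) (the function name and sample are ours). Counting the matches once makes the sum in (14) explicit:

```python
def p_hat(draws, x, lam, c):
    """One-dimensional discrete kernel estimator, equation (14)."""
    n = len(draws)
    matches = sum(1 for Xi in draws if Xi == x)
    # matching draws get weight (1 - lam); the rest get lam / (c - 1)
    return (matches * (1 - lam) + (n - matches) * lam / (c - 1)) / n

draws = [0, 1, 1, 2]
print(p_hat(draws, 1, 0.0, 3))  # with lam = 0 this is the frequency estimator: 0.5
```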
We are now interested in studying the bias and the variance of the estimator defined above,
in order to show that there exists a central limit theorem for this estimator. The next two
theorems will prove this.
Theorem 5. The bias of the estimator $\hat{p}(x)$ is $O(\lambda)$ and is given by:
\[
\operatorname{bias}(\hat{p}(x)) = \frac{\lambda}{c - 1}\left[1 - c\,p(x)\right]
\]
Proof

The bias of $\hat{p}(x)$ is defined as $\operatorname{bias}(\hat{p}(x)) = E\hat{p}(x) - p(x)$. I will first find $E\hat{p}(x)$:
\begin{align*}
E\hat{p}(x) &= \frac{1}{n} \sum_{i=1}^{n} E\,L(X_i, x, \lambda) \\
&= E\,L(X, x, \lambda) \quad \text{(by i.i.d.)} \\
&= \sum_{y} p(y)\,L(y, x, \lambda) \\
&= p(x)\,L(x, x, \lambda) + \sum_{y \neq x} p(y)\,L(y, x, \lambda) \\
&= p(x)(1 - \lambda) + \frac{\lambda}{c - 1}\left[1 - p(x)\right] \tag{15}
\end{align*}
Therefore:
\begin{align*}
\operatorname{bias}(\hat{p}(x)) &= E\hat{p}(x) - p(x) \\
&= \frac{\lambda}{c - 1}\left[1 - p(x)\right] + (1 - \lambda)p(x) - p(x) \\
&= \frac{\lambda}{c - 1}\left[1 - c\,p(x)\right] \tag{16}
\end{align*}
It is clear from the way $\operatorname{bias}(\hat{p}(x))$ depends on $\lambda$ that this bias is $O(\lambda)$.
It is also worthwhile to note that if $p(x) = 1/c$ we have that $\operatorname{bias}(\hat{p}(x)) = 0$ for any admissible value of $\lambda$. We now focus on the variance of the estimator.
Theorem 6. The variance of $\hat{p}(x)$ is given by:
\[
\operatorname{Var}(\hat{p}(x)) = \frac{p(x)\left[1 - p(x)\right]}{n} \left[1 - \frac{\lambda c}{c - 1}\right]^2
\]
Proof

We will use the following expression for the variance:
\[
\operatorname{Var}(\hat{p}(x)) = E\hat{p}(x)^2 - \left[E\hat{p}(x)\right]^2 \tag{17}
\]
Multiplying (14) by itself we have that:
\[
\hat{p}(x)^2 = \frac{1}{n^2} \left[ \sum_{i} L(X_i, x, \lambda)^2 + \sum_{i \neq j} L(X_i, x, \lambda)\,L(X_j, x, \lambda) \right]
\]
Applying the expected value operator we have that:
\begin{align*}
E\hat{p}(x)^2 &= \frac{1}{n^2} \left[ \sum_{i} E\,L(X_i, x, \lambda)^2 + \sum_{i \neq j} E\,L(X_i, x, \lambda)\,L(X_j, x, \lambda) \right] \\
&= \frac{n\,E\,L(X, x, \lambda)^2}{n^2} + \frac{n(n-1)\left[E\,L(X, x, \lambda)\right]^2}{n^2} \\
&= \frac{1}{n} E\,L(X, x, \lambda)^2 + \frac{n-1}{n} \left[E\,L(X, x, \lambda)\right]^2 \tag{18}
\end{align*}
Using the definition of $L(\cdot)$ we get:
\begin{align*}
E\,L(X, x, \lambda)^2 &= (1-\lambda)^2 p(x) + \sum_{y \neq x} p(y) \left( \frac{\lambda}{c-1} \right)^2 \\
&= (1-\lambda)^2 p(x) + \left[1 - p(x)\right] \left( \frac{\lambda}{c-1} \right)^2 \tag{19}
\end{align*}
Also,
\[
E\,L(X, x, \lambda) = p(x)(1-\lambda) + \left[1 - p(x)\right] \frac{\lambda}{c-1}
\]
\[
\implies \left[E\,L(X, x, \lambda)\right]^2 = p(x)^2(1-\lambda)^2 + 2p(x)(1-\lambda)\left[1-p(x)\right]\frac{\lambda}{c-1} + \left[1-p(x)\right]^2 \frac{\lambda^2}{(c-1)^2} \tag{20}
\]
Therefore, replacing (19) and (20) in (18) and collecting terms, we get that:
\begin{align*}
E\hat{p}(x)^2 &= (1-\lambda)^2 p(x)\,\frac{np(x) + 1 - p(x)}{n} + \left[1-p(x)\right]\left(\frac{\lambda}{c-1}\right)^2 \frac{n - np(x) + p(x)}{n} \\
&\quad + 2p(x)(1-\lambda)\frac{\lambda}{c-1}\left[1-p(x)\right]\frac{n-1}{n} \tag{21}
\end{align*}
Multiplying (15) by itself we get that:
\[
\left[E\hat{p}(x)\right]^2 = (1-\lambda)^2 p(x)^2 + \left(\frac{\lambda}{c-1}\right)^2 \left[1-p(x)\right]^2 + 2(1-\lambda)\frac{\lambda}{c-1}\,p(x)\left[1-p(x)\right] \tag{22}
\]
Finally, inserting (21) and (22) in (17), each term of (21) minus its counterpart in (22) contributes a factor of $1/n$, and the difference simplifies to:
\[
\operatorname{Var}(\hat{p}(x)) = \frac{p(x)\left[1-p(x)\right]}{n} \left[1 - \frac{\lambda c}{c-1}\right]^2 \tag{23}
\]
The importance of these two theorems is that if $\lambda = o(n^{-1/2})$, it is true that:
\[
\sqrt{n}\left(\hat{p}(x^d) - p(x^d)\right) \xrightarrow{d} N\!\left(0,\; p(x^d)\left[1 - p(x^d)\right]\right)
\]
which provides a central limit theorem for the kernel estimator defined in this subsection.
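Theorems 5 and 6 are easy to check by simulation. The following sketch (the true pmf, seed, and sample sizes are our own choices, not from the text) compares Monte Carlo estimates of the bias and variance of $\hat{p}(x)$ with formulas (16) and (23):

```python
import numpy as np

rng = np.random.default_rng(1)
c, lam, n, reps = 4, 0.2, 500, 2000
p = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical true pmf on {0, 1, 2, 3}
x = 2

def p_hat(draws):
    # equation (14): matches weighted by (1 - lam), the rest by lam / (c - 1)
    m = np.mean(draws == x)
    return m * (1 - lam) + (1 - m) * lam / (c - 1)

estimates = np.array([p_hat(rng.choice(c, size=n, p=p)) for _ in range(reps)])

bias_theory = lam / (c - 1) * (1 - c * p[x])                       # equation (16)
var_theory = p[x] * (1 - p[x]) / n * (1 - lam * c / (c - 1)) ** 2  # equation (23)
# estimates.mean() - p[x] should be close to bias_theory,
# and estimates.var() close to var_theory
```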
3.3 Numeric example
The purpose of this subsection is to provide an example of how the exposed theory works.
First, we will generate a random sample of a given random vector for which we know the
joint probability distribution of its components. Second, we will use the estimator described in Section 3.1 to recover this distribution from the sample.
3.3.1 Generating a random sample
Let $X = (X_1\ X_2)$ be a random vector. The support of $X_1$ is discrete and given by $S_1 = \{1,2,3\}$. In the same manner, the support of $X_2$ is given by $S_2 = \{1,2,3,4,5\}$. The joint probability function of $X_1$ and $X_2$ is given by:
Table 2: Joint probability function of $X_1$ and $X_2$

          X1 = 1   X1 = 2   X1 = 3
X2 = 1     1/10     1/11     1/12
X2 = 2     1/13     1/14     1/15
X2 = 3     1/16     1/17     1/18
X2 = 4     1/19     1/20     1/21
X2 = 5     1/22     1/23       q
where $q$ is such that:
\[
\sum_{z \in S_1} \sum_{w \in S_2} P(X_1 = z, X_2 = w) = 1
\]
In order to generate the first component of $X$, we first need to find the marginal distribution of $X_1$, which is given by:
\[
P(X_1 = k) = \sum_{w \in S_2} P(X_1 = k, X_2 = w), \quad \text{for } k \in S_1
\]
For shorter notation let $P_{1,k} = P(X_1 = k)$. Given the fact that $k \in S_1 = \{1,2,3\}$, the marginal distribution of $X_1$ is made up of three probabilities: $P_{1,1}$, $P_{1,2}$ and $P_{1,3}$.
Let's assume that we want to generate a sample of $n = 1{,}000$ random draws of $X$. We will first generate a vector of $n$ positions, which will be called $U$. Let $U_i$ be the $i$-th component of $U$, with $U_i \sim U([0,1])$ for all $i$.

To generate a random sample of $X$, we will define the first component of the $i$-th draw by:
\[
X_{1i} = \begin{cases} 1 & \text{if } U_i < P_{1,1} \\ 2 & \text{if } P_{1,1} \leq U_i < P_{1,1} + P_{1,2} \\ 3 & \text{if } U_i \geq P_{1,1} + P_{1,2} \end{cases}
\]
The next step is to generate the second component, $X_{2i}$. For this purpose we need to calculate the conditional probability distribution of $X_2$,
which is given by:
\[
P(X_2 = w \mid X_1 = z) = \frac{P(X_1 = z, X_2 = w)}{P(X_1 = z)}
\]
Once again, for shorter notation I will define $P_{w|z} = P(X_2 = w \mid X_1 = z)$. Given that $X_{1i} = z$, the second component of $X_i$ will be defined as follows:
\[
X_{2i} = w \quad \text{if} \quad \sum_{j=1}^{w-1} P_{j|z} \leq U_i < \sum_{j=1}^{w} P_{j|z}, \qquad w \in S_2
\]
where the empty sum (for $w = 1$) is taken to be zero.
After following this procedure, we have constructed a random sample of the random vector $X$, whose size is $n$.
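The sampling scheme of this subsection can be sketched as follows. We fill $q$ from the normalization condition and invert the marginal and conditional CDFs; unlike the text, which reuses the same uniform, this sketch draws a fresh uniform for each component (a common variant, chosen here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint pmf from Table 2: rows are X2 = 1..5, columns are X1 = 1..3
joint = np.array([[1/10, 1/11, 1/12],
                  [1/13, 1/14, 1/15],
                  [1/16, 1/17, 1/18],
                  [1/19, 1/20, 1/21],
                  [1/22, 1/23, 0.0]])
joint[4, 2] = 1.0 - joint.sum()      # q, chosen so the table sums to one

def draw_one():
    """One draw of (X1, X2) by inverse-CDF sampling."""
    marg1 = joint.sum(axis=0)                            # marginal of X1
    z = int(np.searchsorted(np.cumsum(marg1), rng.uniform(), side="right"))
    z = min(z, 2)                                        # guard against rounding at 1.0
    cond2 = joint[:, z] / marg1[z]                       # conditional of X2 given X1 = z + 1
    w = int(np.searchsorted(np.cumsum(cond2), rng.uniform(), side="right"))
    w = min(w, 4)
    return z + 1, w + 1                                  # supports are {1,2,3} and {1,...,5}

sample = [draw_one() for _ in range(1000)]
```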
3.3.2 Estimation
At this point, we have a random sample which consists of $n = 1{,}000$ draws of the random vector $X$. The objective is to estimate $\hat{p}(x^d)$ using (13):
\[
\hat{p}(x^d) = \frac{1}{n} \sum_{i=1}^{n} L(X_i^d, x^d, \lambda)
\]
To estimate adequately, we will use different values of $\lambda$, repeating the procedure of generating a set of draws and estimating, in order to obtain a robust estimator. Let the index $v$ represent the $v$-th repetition. Therefore, for $\lambda$ and $x^d$ fixed, we can denote the estimated probability $\hat{p}(x^d)$ of repetition $v$ by $\hat{p}_v(x^d)$. If we consider a total of $b$ repetitions for every possible value of $\lambda$, the average estimator at $x^d$ would be:
\[
\hat{p}(x^d) = \frac{1}{b} \sum_{v=1}^{b} \hat{p}_v(x^d) \tag{24}
\]
At this point we are interested in producing a measure of bias for different values of $\lambda$. This empirical bias will be a function solely of $\lambda$ and will be given by:
\[
\operatorname{bias}(\hat{p}(\lambda)) = \frac{1}{|S_1 \times S_2|} \sum_{x^d \in S_1 \times S_2} \left[ \frac{1}{b} \sum_{v=1}^{b} \hat{p}_v(x^d) - p(x^d) \right]^2 \tag{25}
\]
It is a measure of total bias since, for a fixed $\lambda$, we estimate the bias at every point of the support, square it, and then average over all the points of the support. The next figure shows this measure for different values of $\lambda$:
Figure 4: Empirical bias as a function of $\lambda$.
This figure shows how the bias is increasing in $\lambda$. This is consistent with the expression found for the one-dimensional case in (16). It also shows that $\operatorname{bias}(\hat{p}(0)) = 0$, which is intuitive since the vector kernel function becomes an indicator function when $\lambda = 0$, and this must produce the lowest possible bias.
It is now important to study the variance of the estimator for different values of $\lambda$. Once again, we will be doing $b$ estimations, each time with a different sample, for a fixed $\lambda$. Afterwards, we move $\lambda$ to the next possible value and repeat the procedure. The variance will be a function of $\lambda$, and will be given by:
\[
\operatorname{var}(\hat{p}(\lambda)) = \frac{1}{|S_1 \times S_2|} \sum_{x^d \in S_1 \times S_2} \frac{1}{b} \sum_{v=1}^{b} \left[ \hat{p}_v(x^d) - \frac{1}{b} \sum_{k=1}^{b} \hat{p}_k(x^d) \right]^2 \tag{26}
\]
Figure 5: Empirical variance as a function of $\lambda$.
The most important fact from the figure of the variance is that it is decreasing in $\lambda$. We also knew that from the theoretical result found for the one-dimensional case in (23).
Our brief discussion so far has shown that there is an inverse relation between our measure
of bias and our measure of variance. This is so because for smaller values of $\lambda$ the vector
kernel function works like an indicator function, and this reduces the bias but makes the
variance large. For larger values of $\lambda$ this relation is reversed: the bias is big due to the fact
that we give a similar weight to possibly distinct values of $X_i^d$ and $x^d$, but the variance becomes
smaller. Therefore, we are interested in defining a value of $\lambda$ that balances this tradeoff. For
this purpose, we define the following variable, which is a function of $\lambda$:
\[
\Theta(\lambda) = \operatorname{bias}(\hat{p}(\lambda)) + \operatorname{var}(\hat{p}(\lambda)) \tag{27}
\]
Figure 6: Bias + Variance as a function of $\lambda$.
The diamond in the graph indicates the smallest value of $\Theta(\lambda)$. This means that this function has a minimum, and that this minimum is reached with $\lambda > 0$. This is interesting, since it shows that the frequency estimator may have the lowest bias, but its variance is big and
this may offset the gain from a low bias. Thus, using a bigger $\lambda$ may increase the bias, but this increase is offset by a lower variance.
Finally, no discussion has been made about the bounds that appear on the horizontal axis
of the graphs shown before. Back in Section 3.1, when defining the kernel function by
components, we mentioned that $\lambda_s$ could take values in $\left[0, \frac{c_s - 1}{c_s}\right]$. For this exercise we only
use one $\lambda$ for both components. Therefore, for this case in particular:
\[
\lambda \in \left[0, \min\{(3-1)/3, (5-1)/5\}\right] = [0, 2/3]
\]
3.4 Crossvalidation
The technique used in the previous subsection to find an optimal value of $\lambda$ is useful but
unlikely to be used in practice. This is due to the fact that in order to calculate the bias we
needed to use the true PDF. In general, we will not have the true PDF, but only a set of
random draws. That is why it is important to develop methods that do not make use of the true PDF.
For the general case, crossvalidation seeks to minimize the total mean squared error, which is given by:
\begin{align*}
I_n &= \sum_{x^d} \left[ \hat{p}(x^d) - p(x^d) \right]^2 \\
&= \sum_{x^d} \hat{p}(x^d)^2 - 2\sum_{x^d} \hat{p}(x^d)\,p(x^d) + \sum_{x^d} p(x^d)^2 \\
&= I_{1n} - 2I_{2n} + \sum_{x^d} p(x^d)^2
\end{align*}
where $I_{1n} = \sum_{x^d} \hat{p}(x^d)^2$ and $I_{2n} = \sum_{x^d} \hat{p}(x^d)\,p(x^d)$. Since $\sum_{x^d} p(x^d)^2$ does not depend
on $\lambda$, it suffices to minimize $(I_{1n} - 2I_{2n})$ in order to minimize $I_n$.
But $I_{2n}$ still depends on $p(x^d)$. It is possible to rewrite $I_{2n}$ as $I_{2n} = E[\hat{p}(X^d)]$. Therefore, replacing the population mean by the sample mean, we get:
\[
\hat{I}_{2n} = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_{-i}(X_i^d) \tag{28}
\]
where $\hat{p}_{-i}(X_i^d) = \frac{1}{n-1} \sum_{j=1, j \neq i}^{n} L(X_i^d, X_j^d, \lambda)$ is the leave-one-out estimator. Therefore, we finally have that cross-validation seeks to minimize, by choice of $\lambda$:
\[
I_n = I_{1n} - 2\hat{I}_{2n} \tag{29}
\]
For the example of the previous subsection, the graph of $I_n$ as a function of $\lambda$ is given by:
Figure 7: Crossvalidation criterion $I_n$ as a function of $\lambda$.
The black dot indicates the point where $I_n$ reaches its minimum. It is important to note
that $I_n$ behaves similarly to $\Theta(\lambda)$. The most important check is to see whether both are
minimized at the same value of $\lambda$. Such is the case, as the next graph shows:
Figure 8: Comparison of Bias + Variance and the Crossvalidation criterion as functions of $\lambda$.
So, the conclusion is that the kernel estimator defined in Section 3.1 is a powerful tool when the random vector X has a discrete and finite support. It is easy to implement, and methods have been developed to find the best value of λ. The next section introduces the theory and practice of nonparametric regression using what has been exposed up to this point.
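As an illustration, the least-squares cross-validation criterion $(I_{1n} - 2\hat{I}_{2n})$ can be sketched in code. The snippet below is a minimal sketch for a one-dimensional discrete variable, assuming the kernel by components of Section 3.1 takes the Aitchison–Aitken form (weight $1-\lambda$ on a match and $\lambda/(c-1)$ otherwise, which sums to one over the support and yields the bound $\lambda \in [0, (c-1)/c]$); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def aa_kernel(xi, x, lam, c):
    """Aitchison-Aitken kernel: 1 - lam on a match, lam / (c - 1) otherwise."""
    return np.where(xi == x, 1.0 - lam, lam / (c - 1))

def cv_criterion(data, lam, support):
    """Least-squares CV objective I_1n - 2 * I_2n_hat for the discrete kernel PDF."""
    n, c = len(data), len(support)
    # I_1n: sum over the support of p_hat(x)^2
    p_hat = np.array([aa_kernel(data, x, lam, c).mean() for x in support])
    i1 = np.sum(p_hat ** 2)
    # I_2n_hat: leave-one-out average, (1/n) * sum_i p_hat_{-i}(X_i)
    loo = [aa_kernel(np.delete(data, i), data[i], lam, c).mean() for i in range(n)]
    i2 = np.mean(loo)
    return i1 - 2.0 * i2

rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=200)            # hypothetical draws on {0, 1, 2}
grid = np.linspace(0.0, (3 - 1) / 3, 41)       # lambda in [0, (c - 1) / c]
best_lam = min(grid, key=lambda l: cv_criterion(data, l, support=[0, 1, 2]))
```

A grid search over λ followed by picking the minimizer of the criterion replaces the bias-plus-variance calculation, which required the true PDF.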
4 Nonparametric regression
Two nonparametric estimation techniques have been exposed up to this point. The first one, treated in Section 2, is designed to estimate the underlying PDF of a set of continuous i.i.d. random draws. The second one, exposed in Section 3, had the same objective for the case when the underlying random vector has a finite support. Both proved to be accurate and useful, particularly because of their nonparametric nature: no assumptions about the true distribution were needed at all, and that was the main advantage of these methods. But the objective is not only to estimate PDFs. We are interested in studying relations of the form:
$$Y_i = g(X_i^d) + U_i \quad (30)$$
where Y is called the dependent variable, X the independent variable, g is an unknown function that relates X and Y, and U is an error term associated with the relation. Also, it is assumed that $E[U_i|X_i^d] = 0$ and that $E[U_i^2|X_i^d] = \sigma^2(X_i^d)$, where $\sigma^2(\cdot)$ is of unknown form.
Many sensitive issues can be examined with the relation posed in (30). For example, if we have a sample of i.i.d. observations for n individuals given by $\{Y_i, X_i^d\}_{i=1}^n$, where $Y_i$ is the wage of individual i and $X_i^d$ is a dummy² variable for black race, one could study the relation between wage and race. Public policy on discrimination issues could be informed by an estimation of this type. But caution must be exercised: in order to draw adequate conclusions about this relation, it is important to guarantee that equation (30) is correctly specified. Failing to do so may produce completely inaccurate conclusions, as is explained next.
Traditional linear regression methods serve as a first approach to relating X and Y. In particular, they assume that $g(\cdot)$ is linear and therefore:
$$Y_i = g(X_i^d) + U_i = \beta_0 + \beta_1 X_i^d + U_i \quad (31)$$
where $\beta_0$ and $\beta_1$ are the parameters of interest. Let's assume for a moment that $Y_i$ still refers to the wage of individual i, but $X_i^d$ now refers to his age (and no longer to his race). The next figure shows a possible scenario for this example:
²A dummy variable is a variable that only takes two values: zero and one. For the case of race, the variable takes the value one if the individual is black and zero otherwise.
[Figure 9: Relation between age and wage, comparing a linear and a nonparametric fit of $g(\cdot)$.]
The figure shows that if the true relation is not linear, assuming a linear functional form for $g(\cdot)$ can induce a completely misspecified estimation. With the linear estimation we would be asserting that wage is monotonically increasing in age, and it is common knowledge that this is not true: wage increases and reaches its peak at about 40 years, but then decreases as the individual gets older and is less appreciated by the labor market.
So, the objective in this section is to introduce a nonparametric estimator for the function $g(\cdot)$ in the context of equation (30). A couple of theoretical results will be presented in order to understand the behavior of this estimator. Finally, an empirical exercise shows the estimator in practice.
4.1 Defining the estimator
The relationship treated here is the one found in (30) for the case when $X_i^d$ and $x^d$ are r-dimensional vectors with finite support. All notation and kernel functions for this setup were defined in Section 3.1. This machinery serves as the building block of the nonparametric estimator for $g(\cdot)$. In particular, a variation of (11) will be used as the kernel function by components:
$$l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 & \text{if } X_{is}^d = x_s^d \\ \lambda_s & \text{if } X_{is}^d \neq x_s^d \end{cases} \quad (32)$$
Notice that if $\lambda_s = 0$, the above function becomes an indicator function. On the other hand, the bandwidths now satisfy $\lambda_s \in [0, 1]$. This kernel function by components is not a measure, as was the one defined in (11), since it does not add up to 1. No problem arises, because we are not interested in estimating a PDF here.
A vector kernel function is defined as the product of the kernel functions by components, exactly as in (12):
$$L(X_i^d, x^d, \lambda) = \prod_{s=1}^{r} l(X_{is}^d, x_s^d, \lambda_s) = \prod_{s=1}^{r} \lambda_s^{N_{is}(x)} \quad (33)$$
where $N_{is}(x) = 1$ if $X_{is}^d \neq x_s^d$, and 0 otherwise.
Definition 14. A nonparametric estimator $\hat{g}(\cdot)$ is proposed to estimate the true function $g(\cdot)$ of (30) as:
$$\hat{g}(x^d) = \frac{n^{-1}\sum_{i=1}^{n} Y_i L(X_i^d, x^d, \lambda)}{\hat{p}(x^d)} \quad (34)$$
where $\hat{p}(x^d) = n^{-1}\sum_{i=1}^{n} L(X_i^d, x^d, \lambda)$.
Notice that if λ = 0 this estimator becomes a frequency estimator: it takes the average value of Y at the point $x^d$ as $\hat{g}(x^d)$. We are now interested in a theoretical result about the convergence of $\hat{g}(\cdot)$, which is stated next.
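Before that, the estimator (34) can be sketched in code. This is a minimal sketch for illustration only, assuming the observations are stored as an (n × r) integer array; the names `vector_kernel` and `g_hat` are hypothetical.

```python
import numpy as np

def vector_kernel(X, x, lams):
    """Product kernel (33): row i gets weight prod_s lams[s] ** N_is(x)."""
    mismatch = (X != x)                      # N_is(x): 1 where components differ
    return np.prod(np.where(mismatch, lams, 1.0), axis=1)

def g_hat(X, Y, x, lams):
    """Estimator (34); the n^{-1} factors in numerator and denominator cancel."""
    w = vector_kernel(X, np.asarray(x), np.asarray(lams))
    return np.sum(Y * w) / np.sum(w)

# With lambda = 0, g_hat reduces to the average of Y over the cell X_i = x
X = np.array([[0], [0], [1]])
Y = np.array([1.0, 3.0, 5.0])
print(g_hat(X, Y, [0], [0.0]))   # frequency estimator: (1 + 3) / 2 = 2.0
```

With a positive λ, observations outside the cell also receive weight, smoothing the estimate across cells.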
4.2 Simple theoretical case
Let's consider the case when r = 1. The vector kernel then collapses to the single-component kernel function:
$$L(X_i, x, \lambda) = l(X_i, x, \lambda)$$
This implies that the estimator $\hat{g}(\cdot)$ is simply given by:
$$\hat{g}(x) = \frac{n^{-1}\sum_{i=1}^{n} Y_i\, l(X_i, x, \lambda)}{\hat{p}(x)}, \qquad \hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} l(X_i, x, \lambda)$$
Theorem 7. Having defined the estimator for $g(\cdot)$, it is true that:
$$\hat{g}(x) - g(x) = O_p\left(\lambda + n^{-1/2}\right) \quad (35)$$
Proof:
We will give a sketch of the proof. First of all, we define $\hat{m}(x)$ as:
$$\hat{m}(x) = \left(\hat{g}(x) - g(x)\right)\hat{p}(x)$$
Therefore, it is true that:
$$\hat{g}(x) - g(x) = \frac{\hat{m}(x)}{\hat{p}(x)}$$
The first objective is to prove that $E[\hat{m}(x)] = O(\lambda)$. Let's expand $\hat{m}(x)$ using all the definitions made up to this point:
$$\hat{m}(x) = \left(\frac{n^{-1}\sum_{i=1}^{n} Y_i L(X_i, x, \lambda)}{\hat{p}(x)} - g(x)\right)\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} Y_i L(X_i, x, \lambda) - g(x)\,\frac{1}{n}\sum_{i=1}^{n} L(X_i, x, \lambda) = \frac{1}{n}\sum_{i=1}^{n}\left[Y_i - g(x)\right]L(X_i, x, \lambda)$$
There exists an index $k \in \{1, 2, \ldots, n\}$ such that $X_k = x$. We should also keep in mind that $Y_i = g(X_i) + U_i$ and that $E[U_i|X_i] = 0$. Therefore, applying the expected value operator we have:
$$E[\hat{m}(x)] = \frac{1}{n}\left[g(X_k) - g(x)\right] + \frac{1}{n}\sum_{i\neq k}\left[g(X_i) - g(x)\right]\lambda = \frac{1}{n}\sum_{i\neq k}\left[g(X_i) - g(x)\right]\lambda$$
where the first term vanishes because $X_k = x$.
Let $R = \max\{|g(X_i) - g(x)| : i \neq k\}$. Therefore:
$$\left|E[\hat{m}(x)]\right| = \left|\frac{1}{n}\sum_{i\neq k}\left[g(X_i) - g(x)\right]\lambda\right| \leq \frac{n-1}{n}R\lambda < R\lambda$$
which proves that $E[\hat{m}(x)] = O(\lambda)$. With a similar procedure it is relatively easy to prove that $\mathrm{Var}(\hat{m}(x)) = O(n^{-1})$. Both results imply that³:
$$E\left[\hat{m}(x)^2\right] = O\left(\lambda^2 + n^{-1}\right)$$
which in turn implies that:
$$\hat{m}(x) = O_p\left(\lambda + n^{-1/2}\right)$$
Finally, it is true that $\hat{p}(x) = p(x) + o_p(1)$. So, putting it all together we have:
$$\hat{g}(x) - g(x) = \frac{\hat{m}(x)}{\hat{p}(x)} = \frac{O_p(\lambda + n^{-1/2})}{p(x) + o_p(1)} = O_p\left(\lambda + n^{-1/2}\right)$$
This theorem is important because it states that if $\lambda \to 0$ as $n \to \infty$, the estimator $\hat{g}(\cdot)$ converges in probability to the true function $g(\cdot)$.
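The step from the second-moment bound to the $O_p$ rate is an application of Markov's inequality; a sketch of this link, under the bounds already derived (with C an unspecified constant):

```latex
% For any M > 0, Markov's inequality applied to \hat{m}(x)^2 gives
P\left(|\hat{m}(x)| > M(\lambda + n^{-1/2})\right)
  \le \frac{E\left[\hat{m}(x)^2\right]}{M^2(\lambda + n^{-1/2})^2}
  \le \frac{C(\lambda^2 + n^{-1})}{M^2(\lambda + n^{-1/2})^2}
  \le \frac{C}{M^2}
% since \lambda^2 + n^{-1} \le (\lambda + n^{-1/2})^2.
```

The right-hand side can be made arbitrarily small by taking M large, which is exactly the statement $\hat{m}(x) = O_p(\lambda + n^{-1/2})$.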
4.3 Empirical estimation
In this subsection we are interested in estimating $g(\cdot)$ using a set of n i.i.d. random draws. For practical purposes, we assume that X is one-dimensional and that X and Y are related through (30).
We randomly generate the tuples $\{(X_i, Y_i)\}_{i=1}^n$ in the following way. First, we fix n = 1,000, which will be the size of our random sample. Second, we generate $X_i$ from the normal distribution with mean 0 and variance 100⁴; therefore $X \sim N(0, 100)$. Then we assume the following functional form between X and Y:
$$Y_i = X_i^3 + X_i^2 + 30 + 750{,}000\,V_i \quad (36)$$
where $V \sim U([-1, 1])$. The idea of this equation is to produce a relation between X and Y while introducing some random noise. It is important to notice that, up to this point, we have generated $\{(X_i, Y_i)\}_{i=1}^n$ with X generated as a continuous random variable. In order to use nonparametric regression the way it has been presented here, we need X to have a finite support. That is why we discretize X and Y, rounding them to the nearest integer. The next figure shows this relation graphically:
[Figure 10: Scatterplot of the random draws $(X_i, Y_i)$.]
⁴This is done as follows: a vector of n components taken from the uniform distribution on [0, 1] is created, and the inverse of the normal CDF is applied to each component.
So the objective is to estimate the relation presented in the figure above with the nonparametric estimator defined in this section. For this purpose we use the estimator defined in (34), which is given by:
$$\hat{g}(x^d) = \frac{n^{-1}\sum_{i=1}^{n} Y_i L(X_i^d, x^d, \lambda)}{\hat{p}(x^d)}$$
This gives an estimate of $g(\cdot)$ for every value in the support of X. Notice that a different value of λ produces a different estimation; we may therefore write $\hat{g}(X_i) = \hat{g}(X_i, \lambda)$. That is why we define a mean squared error of the form:
$$MSE(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{g}(X_i, \lambda)\right)^2 \quad (37)$$
The next figure shows this MSE for λ ∈ [0, 1]:
[Figure 11: Total MSE(λ) for λ ∈ [0, 1].]
The lowest MSE is produced by λ = 0. This means that in this case the frequency estimator is the best estimator in the context of (30). This occurs because we are not producing estimates out of sample: no out-of-sample estimation is needed, since every possible value in the support of X has positive empirical mass. Estimating with λ = 0 we have:
[Figure 12: Real vs. predicted data, estimating with λ = 0.]
Estimating with λ = 0 produced excellent results: prediction is obtained with high accuracy. But we should acknowledge that λ = 0 may not always be the best alternative; here it is, because we are not estimating out of sample. Another issue that may call for λ > 0 is a small sample size n, because with a small n it is more likely that some values in the support of X receive no draws at all. It is again possible to use cross-validation methods to choose λ, as explained next.
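The empirical exercise above can be sketched as follows. This is a minimal replication sketch: it assumes the N(0, 100) draws have standard deviation 100 (which matches the range of X in Figure 10) and the functional form of (36); the seed and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
X = np.rint(rng.normal(0.0, 100.0, size=n))        # discretized draws of X
V = rng.uniform(-1.0, 1.0, size=n)
Y = np.rint(X**3 + X**2 + 30 + 750_000 * V)        # equation (36), discretized

def g_hat(x, lam):
    """Estimator (34) with the r = 1 kernel (32): weight 1 on X_i = x, lam otherwise."""
    w = np.where(X == x, 1.0, lam)
    return np.sum(Y * w) / np.sum(w)

def mse(lam):
    """In-sample MSE of equation (37)."""
    return np.mean([(yi - g_hat(xi, lam)) ** 2 for xi, yi in zip(X, Y)])

# In sample, the frequency estimator (lam = 0) minimizes (37): within each cell
# X_i = x, the cell mean minimizes the sum of squared residuals.
assert mse(0.0) <= mse(0.5)
```

This makes concrete why Figure 11 is minimized at λ = 0: the in-sample MSE cannot prefer any smoothing, which is precisely why an out-of-sample criterion is needed.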
4.4 Cross-validation
Let's consider the general r-dimensional case. Having studied theoretical properties of the nonparametric estimator for $g(\cdot)$, and having developed a numerical example to show its use, it is natural to ask for the best way to choose $\lambda_1, \lambda_2, \ldots, \lambda_r$. This can be done by minimizing the cross-validatory sum of squared residuals:
$$CV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{g}_{-i}(X_i^d)\right)^2$$
where:
$$\hat{g}_{-i}(X_i^d) = \frac{(n-1)^{-1}\sum_{j\neq i} Y_j L(X_i^d, X_j^d, \lambda)}{\hat{p}_{-i}(X_i^d)}$$
and:
$$\hat{p}_{-i}(X_i^d) = \frac{1}{n-1}\sum_{j\neq i} L(X_i^d, X_j^d, \lambda)$$
This method is useful because it excludes the case i = j when calculating $\hat{g}(\cdot)$, and this is precisely what made the frequency estimator so powerful when estimating in-sample. When one is interested in estimating out of sample, λ plays a crucial role, and that is exactly what this cross-validation method takes into account.
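The criterion CV(λ) can be sketched in code for the r = 1 case. A minimal sketch, under the assumption that every support point appears at least twice in the sample, so the leave-one-out weights never sum to zero at λ = 0; the data and names are illustrative.

```python
import numpy as np

def loo_cv(X, Y, lam):
    """Cross-validatory sum of squared residuals CV(lambda), r = 1 case."""
    n = len(X)
    resid = np.empty(n)
    for i in range(n):
        x_rest, y_rest = np.delete(X, i), np.delete(Y, i)
        w = np.where(x_rest == X[i], 1.0, lam)               # kernel (32) on the n-1 others
        resid[i] = Y[i] - np.sum(y_rest * w) / np.sum(w)     # (n-1)^{-1} factors cancel
    return np.mean(resid ** 2)

# Choosing lambda by grid search over [0, 1]
X = np.array([0, 0, 0, 1, 1, 1, 2, 2])
Y = np.array([1.0, 1.2, 0.9, 2.0, 2.1, 1.9, 3.0, 2.9])
grid = np.linspace(0.0, 1.0, 21)
best_lam = min(grid, key=lambda l: loo_cv(X, Y, l))
```

Unlike the in-sample MSE of (37), this criterion can select λ > 0 whenever smoothing across cells actually improves leave-one-out prediction.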
5 Conclusions
The objective of this paper was to show how estimation with kernel methods works in theory and in practice. These methods are useful for estimating PDFs of continuous and discrete random variables, as shown in Sections 2 and 3, particularly because they need no assumptions and are able to produce accurate results. We showed that parametric methods with wrong assumptions can produce completely mistaken estimations. Finally, Section 4 introduced a richer subject: estimating nonparametrically complex relations between a univariate variable Y and a multivariate variable X, in the case where X has a finite support.
Kernel estimation methods are being used more frequently because of their precision without strong assumptions. But there are costs to these advantages. First, kernel methods are not widely known, and the theory behind them requires more statistical knowledge than traditional parametric methods. Second, they are more difficult to implement, since most statistical packages still lack routines for nonparametric methods. Third, nonparametric methods may converge more slowly than parametric ones, which is costly in terms of time and computational resources. Even with these problems, these methods are powerful and accurate, and they will become more widely used as common knowledge about them spreads.
6 References
Aitchison, John and Colin G. G. Aitken, "Multivariate Binary Discrimination by the Kernel Method", Biometrika, 1976.
Blanco, Liliana, "Probabilidad", Universidad Nacional de Colombia, 2004.
Bowman, Adrian W. and P. J. Foster, "Adaptive Smoothing and Density-Based Tests of Multivariate Normality", Journal of the American Statistical Association, 1993.
Greene, William H., "Econometric Analysis", Prentice Hall, Seventh edition, 2011.
Horowitz, Joel L., "Semiparametric and Nonparametric Methods in Econometrics", Springer Series in Statistics, 2009.
Li, Qi and Jeffrey Scott Racine, "Nonparametric Econometrics", Princeton University Press, 2007.
Wooldridge, Jeffrey M., "Econometric Analysis of Cross Section and Panel Data", MIT Press, Second edition, 2010.