Discrete Kernel Estimation and Its Applications

Nicolás Idrobo

May 24, 2013
Abstract
An important question in probability and statistics is the one that deals with the
underlying probability density function (PDF) of a given set of random independent
and identically distributed draws. It is of particular interest to be able to estimate such
a PDF using the available data. This paper addresses this question using nonparametric
techniques, both for the case when the random variable is continuous and for the case
when it is discrete. Simple theoretical results are shown with the purpose of studying
the empirical behavior of such estimations. Finally, the machinery developed for PDF
estimation on discrete random variables serves as the basis for an introduction to
nonparametric regression estimation.
∗This document is presented as a final paper to obtain the bachelor's degree in Mathematics. I am deeply
grateful to Adolfo Quiroz, my advisor, who gave me excellent guidance throughout the whole process. I am
also grateful to Sebastián Martínez, Román David Zárate and David Zarruk for reading preliminary versions
of this paper. Finally, all remaining errors are my own.
1 Introduction
The objective in this paper is to introduce kernel methods, which, in general, serve to
estimate unknown functions. In particular, we will be interested in estimating probability
density functions (PDFs). We will use this type of function to estimate PDFs because, as
we will see, the problem arises naturally in this context, and because the theory developed
in order to estimate PDFs will be of extreme importance later on when we try to estimate
more complex relations.
These methods have emerged recently as an alternative to parametric approaches, in which
strong assumptions about functional forms and distributions of parameters are made. The
advantage of parametric methods is that they are much simpler and that, if the assumptions
are correct, they tend to produce accurate results. On the other hand, if any of the
assumptions fail, these methods may produce biased estimators that may lead to wrong conclusions.
The advantage of nonparametric methods is their flexibility and their lack of assumptions.
But these advantages are not free. Nonparametric methods require some statistical
knowledge, some computing power, and lots of observations when the dimension of the problem is
high. This may be a low price to pay in order to obtain accurate conclusions, and that is
why these methods are becoming more widespread.
This paper is organized as follows. Section 2 presents the most intuitive setup for kernel
methods, which is the case of estimating PDFs when the underlying random variable is
continuous. Section 3 presents kernel estimation of PDFs when the random variable has a finite
support, and this serves as the basis for the nonparametric regression estimation presented in Section 4.
2 Kernel Estimation on Continuous Random Variables
The objective in this section is to approach the problem of finding the probability density
function (PDF) of a given continuous random variable $X$, when the only known information is a set of $n$ independent and identically distributed realizations $\{X_1, X_2, \ldots, X_n\}$ of $X$.
Traditional parametric methods like Maximum Likelihood Estimation (MLE) are meant to discover
the underlying PDF of a given set of realizations using strong assumptions about its
distribution. This sort of circular argument is not always useful, and may lead to false conclusions.
In order to give a practical discussion of the subject, we will first expose the basic theory
of kernel estimation on continuous random variables, and later on we will develop an
example that compares MLE with kernel estimation. The results of such an example will show
how accurate and easy to implement the kernel estimation method is when compared to
traditional parametric approaches.
2.1 Univariate Density Estimation
Throughout this section we will expose the basic theory underlying univariate density
estimation using kernel methods. Some definitions will be required in order to explain the
estimation technique.
Definition 1. Let $\Omega$ be a set such that $\Omega \neq \emptyset$. A collection of subsets of $\Omega$ will be called a $\sigma$-algebra, and will be denoted by $\mathcal{F}$, if:

i.) $\Omega \in \mathcal{F}$

ii.) If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$

iii.) If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$
Definition 2. Let $\Omega$ be a set such that $\Omega \neq \emptyset$ and let $\mathcal{F}$ be a $\sigma$-algebra defined over $\Omega$. The tuple $(\Omega, \mathcal{F})$ is called a measurable space.
Definition 3. Let $(\Omega, \mathcal{F})$ be a measurable space. A real-valued function $P$ defined over $\mathcal{F}$ is called a probability measure if it satisfies:

i.) $P(A) \geq 0$ for all $A \in \mathcal{F}$.

ii.) $P(\Omega) = 1$.

iii.) If $A_1, A_2, \ldots$ belong to $\mathcal{F}$ and are pairwise disjoint ($A_i \cap A_j = \emptyset$ for all $i \neq j$), then:
\[
P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)
\]

The tuple $(\Omega, \mathcal{F}, P)$ is called a probability space.
Definition 4. We define some notation regarding the order of a sequence. A real sequence $\{a_n\}_{n \in \mathbb{N}}$ is of order $O(1)$ if there exists $C \in \mathbb{R}$ such that $|a_n| \leq C$ for all $n$. Similarly, a sequence is said to be of order $O(b_n)$, with $\{b_n\}_{n \in \mathbb{N}}$ another real sequence, if $a_n/b_n = O(1)$.
Definition 5. A real sequence $\{a_n\}_{n \in \mathbb{N}}$ is said to be of order $o(1)$ if $a_n \to 0$ when $n \to \infty$. Similarly, it is said that $a_n = o(b_n)$, with $\{b_n\}_{n \in \mathbb{N}}$ another real sequence, if $a_n/b_n \to 0$ as $n \to \infty$.
Definition 6. Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(\tilde{\Omega}, \tilde{\mathcal{F}})$ be a measurable space. An $\mathcal{F}$-$\tilde{\mathcal{F}}$-random variable is a map $X : \Omega \to \tilde{\Omega}$ such that, for all $A \in \tilde{\mathcal{F}}$, it is true that $X^{-1}(A) \in \mathcal{F}$. If $(\tilde{\Omega}, \tilde{\mathcal{F}}) = (\mathbb{R}, \mathcal{B})$, with $\mathcal{B}$ the Borel¹ $\sigma$-algebra over $\mathbb{R}$, it is said that $X$ is a real-valued random variable.
Theorem 1. Let $X$ be a random variable defined over the probability space $(\Omega, \mathcal{F}, P)$, which takes values in the measurable space $(\mathbb{R}, \mathcal{B})$. The function $P_X$ defined over $\mathcal{B}$ by:
\[
P_X(B) = P(\{X \in B\}), \quad \text{for all } B \in \mathcal{B}
\]
is a probability measure over $(\mathbb{R}, \mathcal{B})$ called the distribution (or law) of the random variable $X$.

Proof: See Blanco (2004) for a complete proof.
For the rest of this section I will assume that $\Omega = \mathbb{R}$ and $\mathcal{F} = \mathcal{B}$. Also, I will use the following notational convention:
\[
f(x) = f(X = x) = P_X(\{x\}) = P(\{X \in \{x\}\})
\]
which means that $f(x)$ will denote the probability that the random variable $X$ takes the value $x \in \mathbb{R}$.
¹The Borel $\sigma$-algebra over $\mathbb{R}$ is the smallest $\sigma$-algebra that contains all the open subsets of $\mathbb{R}$.
Definition 7. Let $X$ be a real-valued random variable. The cumulative distribution function (CDF) of $X$ evaluated at $x$ is called $F(x)$ and is defined as:
\[
F(x) = P_X((-\infty, x]) = P[X \leq x] \tag{1}
\]
where $P$ refers to the probability of the event in brackets.
With the above definitions it is possible to set up the basic estimation problem. Let's assume
that we have $n$ independent and identically distributed (i.i.d.) realizations of a given random variable $X$, given by $\{X_1, X_2, \ldots, X_n\}$. We do not know the PDF nor the CDF related to
$X$. Is it possible to somehow estimate the true PDF and the true CDF using just the realizations on hand? The answer is yes. The next definition provides an intuitive estimator for
the real CDF, which is the first approach to the problem.
Definition 8. With a set of i.i.d. data $X_1, X_2, \ldots, X_n$ it is possible to estimate $F(x)$ by:
\[
F_n(x) = \frac{|\{i : X_i \leq x\}|}{n} \tag{2}
\]

The above function $F_n(x)$ is called the frequency estimator of $F(x)$ and simply takes into
account the proportion of draws that lie below $x$ in order to estimate $F(x)$.
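As an illustration, the frequency estimator in (2) amounts to a single counting operation. The following is a minimal sketch (the function name `ecdf` and the sample values are our own, for illustration only):

```python
import numpy as np

def ecdf(draws, x):
    """Frequency estimator F_n(x): the proportion of draws at or below x."""
    return np.mean(np.asarray(draws) <= x)

sample = np.array([0.2, 0.5, 0.5, 0.9])
print(ecdf(sample, 0.5))  # 3 of the 4 draws are <= 0.5, so this prints 0.75
```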
The frequency estimator described above is intuitive in the sense that it just counts to
estimate. It is fully nonparametric because the only assumption is that the data are independent
and identically distributed. A natural question is whether it is possible to construct an estimator
for $f(x)$ with the same basic idea of counting. The next definition states such an estimator for the PDF.
Definition 9. Let $X_1, X_2, \ldots, X_n$ be i.i.d. realizations of the random variable $X$. It
is possible to estimate the probability that $X$ equals $x$ by:
\[
\hat{f}(x) = \frac{|\{i : X_i = x\}|}{n} \tag{3}
\]
The above estimator is also a frequency estimator, since it just counts the number of
coincidences between the draws and the value $x \in \mathbb{R}$. In the case that $X$ is a continuous random variable, this estimator is not useful, since the event $\{X = x\}$ has probability zero. It
is possible to produce a less naïve estimator for the PDF of $X$ using the estimator $F_n(x)$.
Before being able to define such an estimator it is necessary to express the relation between
$F(x)$ and $f(x)$. If $X$ is a continuous random variable and $F(x)$ is differentiable, it is true that:
\[
f(x) = \frac{d}{dx} F(x)
\]
In the case that $F(x)$ is not differentiable, $f(x)$ can be approximated by:
\[
f(x) = \frac{F(x+h) - F(x-h)}{2h} \tag{4}
\]
for $h \approx 0$. This is precisely what must be used to relate the CDF frequency estimator in (2) to the analogous PDF estimator. Using (2) and (4) we have that:
\[
\hat{f}_n(x) = \frac{F_n(x+h) - F_n(x-h)}{2h} \tag{5}
\]

The above estimator also counts in order to find an estimated value of $f(x)$: it counts the number of draws that fall in a small neighborhood (of radius $h$) of $x$. Using the definition of the frequency estimator contained in (2) and the estimator just defined in (5) we get:
\[
\hat{f}(x) = \frac{1}{2h} \left[ \frac{|\{i : X_i \leq x+h\}|}{n} - \frac{|\{i : X_i \leq x-h\}|}{n} \right]
\implies \hat{f}(x) = \frac{|\{i : X_i \in [x-h, x+h]\}|}{2nh} \tag{6}
\]
The above estimator seems better behaved than (3) in the sense that it takes into account whether
the random draw $X_i$ is near $x$, instead of just looking at whether it is exactly equal to $x$, which is an
unlikely event in $\mathbb{R}$. That is precisely the spirit of the kernel estimator for $f(x)$ that I will define next.
Definition 10. A uniform kernel function $k(z)$, for $z \in \mathbb{R}$, is defined by:
\[
k(z) = \begin{cases} 1/2 & \text{if } |z| \leq 1 \\ 0 & \text{if } |z| > 1 \end{cases}
\]

Using the definition of $k(\cdot)$ and (6) it is possible to define a uniform kernel estimator for $f(x)$ given by:
\[
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \tag{7}
\]
where $h$ is a small, suitably chosen parameter.
Kernel functions do not need to be uniform, but for the sake of simplicity this is usually the
case. The conditions that any kernel function should satisfy in order to have consistency of $\hat{f}(x)$ are the following:

i.) $\int k(v)\,dv = 1$

ii.) $\int v\,k(v)\,dv = 0$

iii.) $\int v^2 k(v)\,dv > 0$
Theorem 2. The estimator $\hat{f}(x)$ defined in (7) integrates up to one.

Proof: We directly integrate $\hat{f}(x)$, taking into account the conditions assumed for $k(\cdot)$:
\[
\int \hat{f}(x)\,dx = \int \left[ \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \right] dx = \frac{1}{nh} \sum_{i=1}^{n} \int k\left(\frac{X_i - x}{h}\right) dx
\]
We define $z = \frac{x - X_i}{h}$, so that $h\,dz = dx$. Therefore:
\[
\int \hat{f}(x)\,dx = \frac{1}{nh} \sum_{i=1}^{n} \int k(-z)\,h\,dz = \frac{1}{nh} \sum_{i=1}^{n} h \int k(-z)\,dz = \frac{1}{nh}\,nh = 1
\]
where $\int k(-z)\,dz = 1$ follows from condition i.) after the change of variable $u = -z$.
We are now interested in studying the bias and the variance of the estimator defined in (7).
These two quantities, and particularly their properties as $n$ gets bigger, are important in order to assess the quality of the estimator in relation to the true PDF.

Theorem 3. Let $\hat{f}(x)$ be the estimator defined in (7). The bias and the variance of this estimator are given by:
\[
\operatorname{bias}(\hat{f}(x)) = \frac{h^2}{2} f^{(2)}(x) \int v^2 k(v)\,dv + O(h^3) \tag{8}
\]
\[
\operatorname{var}(\hat{f}(x)) = \frac{1}{nh}\left[\kappa f(x) + O(h)\right] \tag{9}
\]
where $\kappa = \int k^2(v)\,dv$ and $f^{(n)}$ refers to the $n$-th derivative of $f$.
Proof

• The bias of the estimated PDF is given by $E[\hat{f}(x)] - f(x)$. Therefore:
\begin{align*}
\operatorname{bias}(\hat{f}(x)) &= E\left[ \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \right] - f(x) \\
&= \frac{1}{nh} \sum_{i=1}^{n} E\left[ k\left(\frac{X_i - x}{h}\right) \right] - f(x) \\
&= \frac{1}{h} E\left[ k\left(\frac{X_1 - x}{h}\right) \right] - f(x) \quad \text{(by i.i.d.)} \\
&= \frac{1}{h} \int f(x_1)\,k\left(\frac{x_1 - x}{h}\right) dx_1 - f(x) \\
&= \int f(x + hv)\,k(v)\,dv - f(x) \quad \text{(change of variable)} \\
&= \int \left[ f(x) + f^{(1)}(x)hv + \tfrac{1}{2} f^{(2)}(x)h^2 v^2 + O(h^3) \right] k(v)\,dv - f(x) \quad \text{(Taylor expansion)} \\
&= \frac{h^2}{2} f^{(2)}(x) \int v^2 k(v)\,dv + O(h^3) \quad \text{(properties of } k(\cdot)\text{)}
\end{align*}
• The proof of the variance also follows from the definition:
\begin{align*}
\operatorname{var}(\hat{f}(x)) &= \operatorname{var}\left[ \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \right] \\
&= \frac{1}{n^2 h^2} \sum_{i=1}^{n} \operatorname{var}\left[ k\left(\frac{X_i - x}{h}\right) \right] \quad \text{(by i.i.d.)} \\
&= \frac{1}{n h^2} \operatorname{var}\left[ k\left(\frac{X_1 - x}{h}\right) \right] \\
&= \frac{1}{n h^2} \left\{ E\left[ k^2\left(\frac{X_1 - x}{h}\right) \right] - \left( E\left[ k\left(\frac{X_1 - x}{h}\right) \right] \right)^2 \right\} \\
&= \frac{1}{n h^2} \left\{ \int f(x_1)\,k^2\left(\frac{x_1 - x}{h}\right) dx_1 - \left[ \int f(x_1)\,k\left(\frac{x_1 - x}{h}\right) dx_1 \right]^2 \right\} \\
&= \frac{1}{n h^2} \left\{ \int f(x + hv)\,k^2(v)\,h\,dv - \left[ \int f(x + hv)\,k(v)\,h\,dv \right]^2 \right\} \\
&= \frac{1}{n h^2} \left\{ h \int \left[ f(x) + f^{(1)}(x)hv + O(h^2) \right] k^2(v)\,dv - O(h^2) \right\} \\
&= \frac{1}{nh}\left[ \kappa f(x) + O(h) \right]
\end{align*}
with $\kappa = \int k^2(v)\,dv$.
The above theorem is important because it provides the ingredients for a central limit theorem for the estimator. It is of particular
importance to understand that a small bandwidth $h$ will reduce the bias and increase the variance of the estimator. In order to choose an appropriate value of $h$ one needs to find a value that balances this tradeoff. A large body of literature is oriented to this subject, and we
will not cover it in this case, but only in the discrete case of the next section.
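To fix ideas, the uniform kernel estimator (7) can be sketched in a few lines of code (function names and sample values are our own, for illustration):

```python
import numpy as np

def uniform_kernel(z):
    # k(z) = 1/2 if |z| <= 1, and 0 otherwise (Definition 10)
    return np.where(np.abs(z) <= 1, 0.5, 0.0)

def f_hat(draws, x, h):
    """Uniform kernel density estimate: (1/nh) * sum_i k((X_i - x)/h)."""
    draws = np.asarray(draws, dtype=float)
    return uniform_kernel((draws - x) / h).sum() / (len(draws) * h)
```

By construction this coincides with counting the draws that fall in $[x-h, x+h]$ and dividing by $2nh$, exactly as in (6).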
2.2 Empirical comparison of parametric and nonparametric approaches
The objective in this section is to show in a practical way how a common parametric
technique like Maximum Likelihood Estimation (MLE) works, and to compare its estimates
with those obtained with a nonparametric approach like the kernel estimation method
presented in the previous subsection. The problem with MLE is that, given a set of $n$ random i.i.d. draws $\{X_1, X_2, \ldots, X_n\}$ of the random variable $X$, it is mandatory to know a priori
the PDF of $X$ in order to estimate the PDF of $X$. It is precisely because of this type of circular argument that MLE is not useful unless the real distribution of $X$ is known. On the contrary, if the distribution of $X$ is not known, and one incorrectly specifies it in order to use MLE, serious estimation bias may arise, as I will show later in this subsection.
The way MLE works is as follows. Let $X_1, X_2, \ldots, X_n$ be a set of $n$ independent and identically
distributed (i.i.d.) draws from a normal distribution with mean $\mu$ and variance $\sigma^2$. Notice that the general form of the PDF of a normal distribution is given by:
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
\]
So the problem of finding the PDF related to the mentioned realizations simplifies to the
problem of finding the parameters $\mu$ and $\sigma^2$ that best describe the underlying distribution of these $n$ realizations. This objective may be achieved using the method of Maximum Likelihood Estimation (MLE). Given the fact that the observations are i.i.d., their joint distribution is
given by:
\[
\mathcal{L} := f(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(X_i-\mu)^2}{2\sigma^2} \right\}
\]
where $\mathcal{L}$ is called the likelihood function. Then $\mathcal{L}$ is conditioned on the data and the logarithm is taken in order to construct the log-likelihood function, which is given by:
\[
\ell(\mu, \sigma^2) = \ln \mathcal{L} = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2
\]
First order conditions $\partial\ell(\mu,\sigma^2)/\partial\mu = 0$ and $\partial\ell(\mu,\sigma^2)/\partial\sigma^2 = 0$ guarantee that the function
$\ell(\mu,\sigma^2)$ is maximized, due to the fact that it is concave in its arguments. The parameters
that maximize the log-likelihood function are:
\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2
\]
Finally, the estimated PDF is written as:
\[
\hat{f}(x) = \frac{1}{\hat{\sigma}\sqrt{2\pi}} \exp\left\{ -\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2} \right\}
\]
So, the implementation of this method is simple and powerful when the underlying
distribution of the random draws is known, but it is imprecise when this distribution is unknown.
The objective for the rest of the section is to generate a set of random draws from a given
distribution, and then estimate $f(x)$ using MLE while assuming an incorrect underlying distribution. Then, I will compare the MLE estimation with the kernel estimation for the same set of
data.
The first step is to generate a random sample from a known distribution. We choose a Beta
distribution with shape parameters $\alpha = 3$ and $\beta = 2$. The point of this distribution is that in general it is not symmetric around its mean. We generate $n = 10{,}000$ random draws from this distribution. The next table summarizes the basic descriptive statistics of these draws:
Table 1: Descriptive statistics of the random draws

Observations     10,000
Mean             0.598804
Mode             0.611891
Std. Deviation   0.198426
Minimum          0.012593
Maximum          0.99556
Given that the data come from a Beta distribution with known parameters, the true
density for every $x \in [0,1]$ is given by:
\[
f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du}
\]
The objective now is to show that if MLE is used assuming a wrong distribution, the resulting estimation may be severely biased. A normal distribution is assumed in order to estimate with MLE, which produces the estimated parameters
$\hat{\mu} = 0.598804$ and $\hat{\sigma}^2 = (0.198426)^2$. With these two parameters, the estimated distribution
is given by:
\[
\hat{f}_1(x) = \frac{1}{\hat{\sigma}\sqrt{2\pi}} \exp\left\{ -\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2} \right\}
\]
The natural question is how the true PDF compares to the PDF that resulted from
applying MLE to the data. The next graph shows this:
Figure 1: True PDF vs. Normal MLE (probability on the vertical axis, $X$ on the horizontal axis).
The above figure shows that if the distribution used to estimate with MLE is not the right
one, the estimated PDF may be completely mistaken. Assuming a normal distribution for
the MLE procedure is costly, since it implies that the support is the whole real line (and
that is not the case for the Beta distribution), and it assumes unimodality, symmetry around the
mean, among other conditions. This may lead to inaccurate estimations of the true PDF. In
Economics, for example, it is frequently assumed that the data come from a normal
distribution in order to estimate with MLE when the real distribution of the data is completely
unknown. The above graph illustrates the danger of this approach. Later in this section we will
develop formal methods to compare the quality of two estimators.
The next step is to estimate using the kernel method described in the previous subsection.
The estimated PDF will be given by:
\[
\hat{f}_2(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right) \tag{10}
\]
where $k(\cdot)$ is the same as in Definition 10, and $h$ will be equal to 0.05216. The theory on how to choose the parameter $h$ is referred to as cross-validation and we will explain it in the next chapter. For each $x \in \mathbb{R}$ it is possible to estimate $f(x)$ using the $n$ random draws and equation (10). The next figure shows the relation between the true PDF and the estimated one
using the kernel method.
Figure 2: True PDF vs. Kernel PDF (probability on the vertical axis, $X$ on the horizontal axis).
It is clear that the kernel method produced a much better estimation than the MLE with
the wrong assumption. This is the case because the kernel method is nonparametric and we
did not need to know a priori the true distribution of the $n$ random draws. Of course, the kernel method is sensitive to the parameter $h$. In fact, cross-validation methods are meant to choose the appropriate value of $h$. For a particular cross-validation method on continuous random variables see Bowman and Foster (1993).
In order to compare the quality of $\hat{f}_1(\cdot)$ and $\hat{f}_2(\cdot)$, we define a quality measure known as the Mean Squared Error (MSE):
\[
\operatorname{MSE}_1(h) = \frac{1}{|G|} \sum_{x \in G} \left[ \hat{f}_1(x) - f(x) \right]^2
\]
\[
\operatorname{MSE}_2(h) = \frac{1}{|G|} \sum_{x \in G} \left[ \hat{f}_2(x, h) - f(x) \right]^2
\]
where $G$ refers to the grid of the domain used to estimate. Since $\operatorname{MSE}_1(h)$ does not depend
on $h$, it is constant for all values of $h$. The next figure shows the relation between these two quantities for different values of $h$:
Figure 3: MSE comparison (Normal MLE MSE vs. Kernel MSE as functions of $h$).
The above method is a naïve way to find a good value of $h$ in order to estimate. The diamond on the dashed line indicates the minimum value of $\operatorname{MSE}_2(h)$. This is precisely the value that
was used in order to estimate in Figure 2.
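The naïve search just described can be reproduced with a short script. This is a sketch under our own assumptions (the random seed, the grids, and the closed-form Beta(3,2) density $f(x) = 12x^2(1-x)$ are our choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.beta(3, 2, size=10_000)          # same design as in the text

def true_pdf(x):
    # Beta(3, 2) density: x^2 (1 - x) / B(3, 2), with B(3, 2) = 1/12
    return 12 * x**2 * (1 - x)

def f_hat(x, h):
    # uniform kernel estimate (7): share of draws in [x - h, x + h] over 2h
    return np.mean(np.abs(draws - x) <= h) / (2 * h)

grid = np.linspace(0.01, 0.99, 99)           # the grid G

def mse(h):
    return np.mean([(f_hat(x, h) - true_pdf(x)) ** 2 for x in grid])

hs = np.linspace(0.01, 0.30, 30)
best_h = hs[int(np.argmin([mse(h) for h in hs]))]
```

Here `best_h` plays the role of the diamond in Figure 3; its exact value depends on the simulated sample.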
So, the conclusion of this empirical exercise is that one may incur serious bias when a
wrong assumption is made while using MLE. In general this is the problem present in parametric
methods. On the contrary, kernel estimation is fully nonparametric, and with an appropriate choice of the bandwidth $h$ it produces accurate estimates without distributional assumptions.
3 Kernel Estimation on Discrete Random Variables
The objective in this section is to describe the most common kernel method used to estimate
in the case when $X$ has a finite support. In the general case, $X$ may be a random multidimensional vector. Section 3.1 will expose the estimator for the general case, while Section
3.2 will treat the same estimator for the case when $X$ has one dimension. The objective in Section 3.2 is to show the estimation technique using a basic setup, and to derive a central
limit theorem for the kernel estimator. Finally, Section 3.3 will show an application of this
method for the particular case of two dimensions.
3.1 General Case
Let $X$ be an $r$-dimensional random vector, for $r \geq 1$. In order to estimate the PDF of $X$, we will need a set of $n$ random i.i.d. draws that will be denoted by $X_i^d$, with $i \in \{1, 2, \ldots, n\}$. Since $X$ is $r$-dimensional, we define an index $s$, an indicator of the component of the vector; therefore $s \in \{1, 2, \ldots, r\}$. This means that $X_{is}^d$ will refer to the $s$-th component of $X_i^d$, and $x_s^d$ to the $s$-th component of $x^d$, where $x^d$ will be a point of evaluation later on.

The support of $X$ is finite and discrete. Therefore we may take $x_s^d, X_{is}^d \in \{0, 1, \ldots, c_s - 1\}$. It is worthwhile to observe that the support of each component may be different in its elements and in its size. Throughout this section $c_s$ will denote the size of the support of $x_s^d$ and $X_{is}^d$.
We should not forget the objective of this section. Our interest is to estimate $f(x^d)$, where
$f(\cdot)$ refers to the PDF of the random vector $X$. To this purpose we will construct an estimator $\hat{f}(\cdot)$ that will be the analogue of the one for continuous random variables found in (7).
Definition 11. A discrete kernel function for component $s$ is defined by:
\[
l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 - \lambda_s & \text{if } X_{is}^d = x_s^d \\[4pt] \dfrac{\lambda_s}{c_s - 1} & \text{if } X_{is}^d \neq x_s^d \end{cases} \tag{11}
\]

Notice that if $\lambda_s = 0$, the function $l(X_{is}^d, x_s^d, \lambda_s)$ becomes an indicator function. Also, if $\lambda_s = (c_s - 1)/c_s$, it assigns the same weight $1/c_s$ to every point of the support.
Theorem 4. For $s$, $x_s^d$, $\lambda_s$ and $c_s$ fixed, the function $l(X_{is}^d, x_s^d, \lambda_s)$ sums up to one over the support.
Proof

Adding over all possible values of $X_{is}^d$, we get that:
\[
\sum_{y=0}^{c_s-1} l(y, x_s^d, \lambda_s) = (1 - \lambda_s) + \sum_{\substack{y=0 \\ y \neq x_s^d}}^{c_s-1} \frac{\lambda_s}{c_s - 1} = (1 - \lambda_s) + \frac{\lambda_s}{c_s - 1}(c_s - 1) = 1
\]
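The component kernel (11) and the check in Theorem 4 translate directly into code (the names `l_kernel` and the parameter values below are ours, for illustration):

```python
def l_kernel(X_is, x_s, lam, c_s):
    """Discrete kernel for one component, equation (11)."""
    return 1 - lam if X_is == x_s else lam / (c_s - 1)

# Theorem 4: summing the kernel over the whole support {0, ..., c_s - 1} gives 1
c_s, lam, x_s = 5, 0.3, 2
total = sum(l_kernel(y, x_s, lam, c_s) for y in range(c_s))
print(total)  # 1.0 up to floating-point rounding
```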
We are now interested in defining a kernel function for the whole vector $x^d$. We shall
remember that we are assuming a total of $r$ components. The kernel function for the whole vector will make use of the kernel function by components that was defined above.
Definition 12. A vector kernel function meant to compare vectors by components is defined as:
\begin{align*}
L(X_i^d, x^d, \lambda) &= \prod_{s=1}^{r} l(X_{is}^d, x_s^d, \lambda_s) \tag{12} \\
&= \prod_{s=1}^{r} \left( \frac{\lambda_s}{c_s - 1} \right)^{N_{is}(x)} (1 - \lambda_s)^{1 - N_{is}(x)}
\end{align*}
where $N_{is}(x)$ is an indicator function defined as:
\[
N_{is}(x) = \begin{cases} 1 & \text{if } X_{is}^d \neq x_s^d \\ 0 & \text{if } X_{is}^d = x_s^d \end{cases}
\]
Having defined the vector kernel function we are ready to introduce the estimator for the PDF of $X$.

Definition 13. A nonparametric kernel estimator for the PDF of the random vector $X$ in the case that $X$ has a finite support is given by:
\[
\hat{p}(x^d) = \frac{1}{n} \sum_{i=1}^{n} L(X_i^d, x^d, \lambda) \tag{13}
\]
At this point we state that, with the purpose of being able to show theoretical results
related to a central limit theorem for the estimator, and with the objective of showing the
behavior of the estimator with an empirical exercise, we will focus on the one-dimensional case.
3.2 Particular Case
Let $X$ be a one-dimensional random vector. Due to the fact that $X$ only has one dimension, the vector kernel function simplifies to the kernel function of the only component:
\[
L(X_i, x, \lambda) = \begin{cases} 1 - \lambda & \text{if } X_i = x \\[4pt] \dfrac{\lambda}{c - 1} & \text{if } X_i \neq x \end{cases}
\]
The PDF estimator defined in (13) for the case when $X$ is a one-dimensional random vector is given by:
\[
\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} L(X_i, x, \lambda) \tag{14}
\]
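A minimal sketch of the one-dimensional estimator (14) (the function name and sample are ours). Counting the matches once makes the sum in (14) explicit:

```python
def p_hat(draws, x, lam, c):
    """One-dimensional discrete kernel estimator, equation (14)."""
    n = len(draws)
    matches = sum(1 for Xi in draws if Xi == x)
    # matching draws get weight (1 - lam); the rest get lam / (c - 1)
    return (matches * (1 - lam) + (n - matches) * lam / (c - 1)) / n

draws = [0, 1, 1, 2]
print(p_hat(draws, 1, 0.0, 3))  # with lam = 0 this is the frequency estimator: 0.5
```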
We are now interested in studying the bias and the variance of the estimator defined above,
in order to show that there exists a central limit theorem for this estimator. The next two
theorems will prove this.
Theorem 5. The bias of the estimator $\hat{p}(x)$ is $O(\lambda)$ and is given by:
\[
\operatorname{bias}(\hat{p}(x)) = \frac{\lambda}{c - 1}\left[1 - c\,p(x)\right]
\]
Proof

The bias of $\hat{p}(x)$ is defined as $\operatorname{bias}(\hat{p}(x)) = E\hat{p}(x) - p(x)$. I will first find $E\hat{p}(x)$:
\begin{align*}
E\hat{p}(x) &= \frac{1}{n} \sum_{i=1}^{n} E\,L(X_i, x, \lambda) \\
&= E\,L(X, x, \lambda) \quad \text{(by i.i.d.)} \\
&= \sum_{y} p(y)\,L(y, x, \lambda) \\
&= p(x)\,L(x, x, \lambda) + \sum_{y \neq x} p(y)\,L(y, x, \lambda) \\
&= p(x)(1 - \lambda) + \frac{\lambda}{c - 1}\left[1 - p(x)\right] \tag{15}
\end{align*}
Therefore:
\begin{align*}
\operatorname{bias}(\hat{p}(x)) &= E\hat{p}(x) - p(x) \\
&= \frac{\lambda}{c - 1}\left[1 - p(x)\right] + (1 - \lambda)p(x) - p(x) \\
&= \frac{\lambda}{c - 1}\left[1 - c\,p(x)\right] \tag{16}
\end{align*}
It is clear from the way $\operatorname{bias}(\hat{p}(x))$ depends on $\lambda$ that this bias is $O(\lambda)$.
It is also worthwhile to note that if $p(x) = 1/c$ we have that $\operatorname{bias}(\hat{p}(x)) = 0$ for any admissible value of $\lambda$. We now focus on the variance of the estimator.
Theorem 6. The variance of $\hat{p}(x)$ is given by:
\[
\operatorname{Var}(\hat{p}(x)) = \frac{p(x)\left[1 - p(x)\right]}{n} \left[1 - \frac{\lambda c}{c - 1}\right]^2
\]
Proof

We will use the following expression for the variance:
\[
\operatorname{Var}(\hat{p}(x)) = E\hat{p}(x)^2 - \left[E\hat{p}(x)\right]^2 \tag{17}
\]
Multiplying (14) by itself we have that:
\[
\hat{p}(x)^2 = \frac{1}{n^2} \left[ \sum_{i} L(X_i, x, \lambda)^2 + \sum_{i \neq j} L(X_i, x, \lambda)\,L(X_j, x, \lambda) \right]
\]
Applying the expected value operator we have that:
\begin{align*}
E\hat{p}(x)^2 &= \frac{1}{n^2} \left[ \sum_{i} E\,L(X_i, x, \lambda)^2 + \sum_{i \neq j} E\,L(X_i, x, \lambda)\,L(X_j, x, \lambda) \right] \\
&= \frac{n\,E\,L(X, x, \lambda)^2}{n^2} + \frac{n(n-1)\left[E\,L(X, x, \lambda)\right]^2}{n^2} \\
&= \frac{1}{n} E\,L(X, x, \lambda)^2 + \frac{n-1}{n} \left[E\,L(X, x, \lambda)\right]^2 \tag{18}
\end{align*}
Using the definition of $L(\cdot)$ we get:
\begin{align*}
E\,L(X, x, \lambda)^2 &= (1-\lambda)^2 p(x) + \sum_{y \neq x} p(y) \left( \frac{\lambda}{c-1} \right)^2 \\
&= (1-\lambda)^2 p(x) + \left[1 - p(x)\right] \left( \frac{\lambda}{c-1} \right)^2 \tag{19}
\end{align*}
Also,
\[
E\,L(X, x, \lambda) = p(x)(1-\lambda) + \left[1 - p(x)\right] \frac{\lambda}{c-1}
\]
\[
\implies \left[E\,L(X, x, \lambda)\right]^2 = p(x)^2(1-\lambda)^2 + 2p(x)(1-\lambda)\left[1-p(x)\right]\frac{\lambda}{c-1} + \left[1-p(x)\right]^2 \frac{\lambda^2}{(c-1)^2} \tag{20}
\]
Therefore, replacing (19) and (20) in (18) and collecting terms, we get that:
\begin{align*}
E\hat{p}(x)^2 &= (1-\lambda)^2 p(x)\,\frac{np(x) + 1 - p(x)}{n} + \left[1-p(x)\right]\left(\frac{\lambda}{c-1}\right)^2 \frac{n - np(x) + p(x)}{n} \\
&\quad + 2p(x)(1-\lambda)\frac{\lambda}{c-1}\left[1-p(x)\right]\frac{n-1}{n} \tag{21}
\end{align*}
Multiplying (15) by itself we get that:
\[
\left[E\hat{p}(x)\right]^2 = (1-\lambda)^2 p(x)^2 + \left(\frac{\lambda}{c-1}\right)^2 \left[1-p(x)\right]^2 + 2(1-\lambda)\frac{\lambda}{c-1}\,p(x)\left[1-p(x)\right] \tag{22}
\]
Finally, inserting (21) and (22) in (17), each term of (21) minus its counterpart in (22) contributes a factor of $1/n$, and the difference simplifies to:
\[
\operatorname{Var}(\hat{p}(x)) = \frac{p(x)\left[1-p(x)\right]}{n} \left[1 - \frac{\lambda c}{c-1}\right]^2 \tag{23}
\]
The importance of these two theorems is that if $\lambda = o(n^{-1/2})$, it is true that:
\[
\sqrt{n}\left(\hat{p}(x^d) - p(x^d)\right) \xrightarrow{d} N\!\left(0,\; p(x^d)\left[1 - p(x^d)\right]\right)
\]
which provides a central limit theorem for the kernel estimator defined in this subsection.
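Theorems 5 and 6 are easy to check by simulation. The following sketch (the true pmf, seed, and sample sizes are our own choices, not from the text) compares Monte Carlo estimates of the bias and variance of $\hat{p}(x)$ with formulas (16) and (23):

```python
import numpy as np

rng = np.random.default_rng(1)
c, lam, n, reps = 4, 0.2, 500, 2000
p = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical true pmf on {0, 1, 2, 3}
x = 2

def p_hat(draws):
    # equation (14): matches weighted by (1 - lam), the rest by lam / (c - 1)
    m = np.mean(draws == x)
    return m * (1 - lam) + (1 - m) * lam / (c - 1)

estimates = np.array([p_hat(rng.choice(c, size=n, p=p)) for _ in range(reps)])

bias_theory = lam / (c - 1) * (1 - c * p[x])                       # equation (16)
var_theory = p[x] * (1 - p[x]) / n * (1 - lam * c / (c - 1)) ** 2  # equation (23)
# estimates.mean() - p[x] should be close to bias_theory,
# and estimates.var() close to var_theory
```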
3.3 Numeric example
The purpose of this subsection is to provide an example of how the exposed theory works.
First, we will generate a random sample of a given random vector for which we know the
joint probability distribution of its components. Second, we will use the estimator described in Section 3.1 to recover this distribution from the sample.
3.3.1 Generating a random sample
Let $X = (X_1\ X_2)$ be a random vector. The support of $X_1$ is discrete and given by $S_1 = \{1,2,3\}$. In the same manner, the support of $X_2$ is given by $S_2 = \{1,2,3,4,5\}$. The joint probability function of $X_1$ and $X_2$ is given by:
Table 2: Joint probability function of $X_1$ and $X_2$

          X1 = 1   X1 = 2   X1 = 3
X2 = 1     1/10     1/11     1/12
X2 = 2     1/13     1/14     1/15
X2 = 3     1/16     1/17     1/18
X2 = 4     1/19     1/20     1/21
X2 = 5     1/22     1/23       q
where $q$ is such that:
\[
\sum_{z \in S_1} \sum_{w \in S_2} P(X_1 = z, X_2 = w) = 1
\]
In order to generate the first component of $X$, we first need to find the marginal distribution of $X_1$, which is given by:
\[
P(X_1 = k) = \sum_{w \in S_2} P(X_1 = k, X_2 = w), \quad \text{for } k \in S_1
\]
For shorter notation let $P_{1,k} = P(X_1 = k)$. Given the fact that $k \in S_1 = \{1,2,3\}$, the marginal distribution of $X_1$ is made up of three probabilities: $P_{1,1}$, $P_{1,2}$ and $P_{1,3}$.
Let's assume that we want to generate a sample of $n = 1{,}000$ random draws of $X$. We will first generate a vector of $n$ positions, which will be called $U$. Let $U_i$ be the $i$-th component of $U$, with $U_i \sim U([0,1])$ for all $i$.

To generate a random sample of $X$, we will define the first component of the $i$-th draw by:
\[
X_{1i} = \begin{cases} 1 & \text{if } U_i < P_{1,1} \\ 2 & \text{if } P_{1,1} \leq U_i < P_{1,1} + P_{1,2} \\ 3 & \text{if } U_i \geq P_{1,1} + P_{1,2} \end{cases}
\]
The next step is to generate the second component, $X_{2i}$. For this purpose we need to calculate the conditional probability distribution of $X_2$,
which is given by:
\[
P(X_2 = w \mid X_1 = z) = \frac{P(X_1 = z, X_2 = w)}{P(X_1 = z)}
\]
Once again, for shorter notation I will define $P_{w|z} = P(X_2 = w \mid X_1 = z)$. Given that $X_{1i} = z$, the second component of $X_i$ will be defined as follows:
\[
X_{2i} = w \quad \text{if} \quad \sum_{j=1}^{w-1} P_{j|z} \leq U_i < \sum_{j=1}^{w} P_{j|z}, \qquad w \in S_2
\]
where the empty sum (for $w = 1$) is taken to be zero.
After following this procedure, we have constructed a random sample of the random vector $X$, whose size is $n$.
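The sampling scheme of this subsection can be sketched as follows. We fill $q$ from the normalization condition and invert the marginal and conditional CDFs; unlike the text, which reuses the same uniform, this sketch draws a fresh uniform for each component (a common variant, chosen here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint pmf from Table 2: rows are X2 = 1..5, columns are X1 = 1..3
joint = np.array([[1/10, 1/11, 1/12],
                  [1/13, 1/14, 1/15],
                  [1/16, 1/17, 1/18],
                  [1/19, 1/20, 1/21],
                  [1/22, 1/23, 0.0]])
joint[4, 2] = 1.0 - joint.sum()      # q, chosen so the table sums to one

def draw_one():
    """One draw of (X1, X2) by inverse-CDF sampling."""
    marg1 = joint.sum(axis=0)                            # marginal of X1
    z = int(np.searchsorted(np.cumsum(marg1), rng.uniform(), side="right"))
    z = min(z, 2)                                        # guard against rounding at 1.0
    cond2 = joint[:, z] / marg1[z]                       # conditional of X2 given X1 = z + 1
    w = int(np.searchsorted(np.cumsum(cond2), rng.uniform(), side="right"))
    w = min(w, 4)
    return z + 1, w + 1                                  # supports are {1,2,3} and {1,...,5}

sample = [draw_one() for _ in range(1000)]
```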
3.3.2 Estimation
At this point, we have a random sample which consists of $n = 1{,}000$ draws of the random vector $X$. The objective is to estimate $\hat{p}(x^d)$ using (13):
\[
\hat{p}(x^d) = \frac{1}{n} \sum_{i=1}^{n} L(X_i^d, x^d, \lambda)
\]
To estimate adequately, we will use different values of $\lambda$, repeating the procedure of generating a set of draws and estimating, in order to obtain a robust estimator. Let the index $v$ represent the $v$-th repetition. Therefore, for $\lambda$ and $x^d$ fixed, we can denote the estimated probability $\hat{p}(x^d)$ of repetition $v$ by $\hat{p}_v(x^d)$. If we consider a total of $b$ repetitions for every possible value of $\lambda$, the average estimator at $x^d$ would be:
\[
\hat{p}(x^d) = \frac{1}{b} \sum_{v=1}^{b} \hat{p}_v(x^d) \tag{24}
\]
At this point we are interested in producing a measure of bias for different values of $\lambda$. This empirical bias will be a function solely of $\lambda$ and will be given by:
\[
\operatorname{bias}(\hat{p}(\lambda)) = \frac{1}{|S_1 \times S_2|} \sum_{x^d \in S_1 \times S_2} \left[ \frac{1}{b} \sum_{v=1}^{b} \hat{p}_v(x^d) - p(x^d) \right]^2 \tag{25}
\]
It is a measure of total bias since, for a fixed $\lambda$, we estimate the bias at every point of the support, square it, and then average over all the points of the support. The next figure shows this measure for different values of $\lambda$:
Figure 4: Empirical bias as a function of $\lambda$.
This figure shows how the bias is increasing in $\lambda$. This is consistent with the expression found for the one-dimensional case in (16). It also shows that $\operatorname{bias}(\hat{p}(0)) = 0$, which is intuitive since the vector kernel function becomes an indicator function when $\lambda = 0$, and this must produce the lowest possible bias.
It is now important to study the variance of the estimator for different values of $\lambda$. Once again, we will be doing $b$ estimations, each time with a different sample, for a fixed $\lambda$. Afterwards, we move $\lambda$ to the next possible value and repeat the procedure. The variance will be a function of $\lambda$, and will be given by:
\[
\operatorname{var}(\hat{p}(\lambda)) = \frac{1}{|S_1 \times S_2|} \sum_{x^d \in S_1 \times S_2} \frac{1}{b} \sum_{v=1}^{b} \left[ \hat{p}_v(x^d) - \frac{1}{b} \sum_{k=1}^{b} \hat{p}_k(x^d) \right]^2 \tag{26}
\]
Figure 5: Empirical variance as a function of $\lambda$.
The most important fact from the figure of the variance is that it is decreasing in $\lambda$. We also knew that from the theoretical result found for the one-dimensional case in (23).
Our brief discussion so far has shown that there is an inverse relation between our measure
of bias and our measure of variance. This is so because for smaller values of $\lambda$ the vector
kernel function works like an indicator function, and this reduces the bias but makes the
variance large. For larger values of $\lambda$ this relation is reversed: the bias is big due to the fact
that we give a similar weight to possibly distinct values of $X_i^d$ and $x^d$, but the variance becomes
smaller. Therefore, we are interested in defining a value of $\lambda$ that balances this tradeoff. For
this purpose, we define the following variable, which is a function of $\lambda$:
\[
\Theta(\lambda) = \operatorname{bias}(\hat{p}(\lambda)) + \operatorname{var}(\hat{p}(\lambda)) \tag{27}
\]
Figure 6: Bias + Variance as a function of $\lambda$.
The diamond in the graph indicates the smallest value of $\Theta(\lambda)$. This means that this function has a minimum, and that this minimum is reached with $\lambda > 0$. This is interesting, since it shows that the frequency estimator may have the lowest bias, but its variance is big and
this may offset the gain from a low bias. Thus, using a bigger $\lambda$ may increase the bias, but this increase is offset by a lower variance.
Finally, no discussion has been made about the bounds that appear on the horizontal axis
of the graphs shown before. Back in Section 3.1, when defining the kernel function by
components, we mentioned that $\lambda_s$ could take values in $\left[0, \frac{c_s - 1}{c_s}\right]$. For this exercise we only
use one $\lambda$ for both components. Therefore, for this case in particular:
\[
\lambda \in \left[0, \min\{(3-1)/3, (5-1)/5\}\right] = [0, 2/3]
\]
3.4 Crossvalidation
The technique used in the previous subsection to find an optimal value of $\lambda$ is useful but
unlikely to be used in practice. This is due to the fact that in order to calculate the bias we
needed to use the true PDF. In general, we will not have the true PDF, but only a set of
random draws. That is why it is important to develop methods that do not make use of the true PDF.
For the general case, crossvalidation seeks to minimize the total mean squared error, which is given by:
\begin{align*}
I_n &= \sum_{x^d} \left[ \hat{p}(x^d) - p(x^d) \right]^2 \\
&= \sum_{x^d} \hat{p}(x^d)^2 - 2\sum_{x^d} \hat{p}(x^d)\,p(x^d) + \sum_{x^d} p(x^d)^2 \\
&= I_{1n} - 2I_{2n} + \sum_{x^d} p(x^d)^2
\end{align*}
where $I_{1n} = \sum_{x^d} \hat{p}(x^d)^2$ and $I_{2n} = \sum_{x^d} \hat{p}(x^d)\,p(x^d)$. Since $\sum_{x^d} p(x^d)^2$ does not depend
on $\lambda$, it suffices to minimize $(I_{1n} - 2I_{2n})$ in order to minimize $I_n$.
But $I_{2n}$ still depends on $p(x^d)$. It is possible to rewrite $I_{2n}$ as $I_{2n} = E[\hat{p}(X^d)]$. Therefore, replacing the population mean by the sample mean, we get:
\[
\hat{I}_{2n} = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_{-i}(X_i^d) \tag{28}
\]
where $\hat{p}_{-i}(X_i^d) = \frac{1}{n-1} \sum_{j=1, j \neq i}^{n} L(X_i^d, X_j^d, \lambda)$ is the leave-one-out estimator. Therefore, we finally have that cross-validation seeks to minimize, by choice of $\lambda$:
\[
I_n = I_{1n} - 2\hat{I}_{2n} \tag{29}
\]
For the example of the previous subsection, the graph of $I_n$ as a function of $\lambda$ is given by:
Figure 7: Crossvalidation criterion $I_n$ as a function of $\lambda$.
The black dot indicates the point where $I_n$ reaches its minimum. It is important to note
that $I_n$ behaves similarly to $\Theta(\lambda)$. The most important check is to see whether both are
minimized at the same value of $\lambda$. Such is the case, as the next graph shows:
Figure 8: Comparison of Bias + Variance and the Crossvalidation criterion as functions of $\lambda$.
So, the conclusion is that the kernel estimator defined in Section 3.1 is a powerful tool when the random vector X has a discrete and finite support. It is easy to implement, and methods have been developed to find the best value of λ. The next section introduces the theory and practice of nonparametric regression using what has been exposed up to this point.
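As an illustration, the least-squares cross-validation criterion $(I_{1n} - 2\hat{I}_{2n})$ can be sketched in code. The snippet below is a minimal sketch for a one-dimensional discrete variable, assuming the kernel by components of Section 3.1 takes the Aitchison–Aitken form (weight $1-\lambda$ on a match and $\lambda/(c-1)$ otherwise, which sums to one over the support and yields the bound $\lambda \in [0, (c-1)/c]$); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def aa_kernel(xi, x, lam, c):
    """Aitchison-Aitken kernel: 1 - lam on a match, lam / (c - 1) otherwise."""
    return np.where(xi == x, 1.0 - lam, lam / (c - 1))

def cv_criterion(data, lam, support):
    """Least-squares CV objective I_1n - 2 * I_2n_hat for the discrete kernel PDF."""
    n, c = len(data), len(support)
    # I_1n: sum over the support of p_hat(x)^2
    p_hat = np.array([aa_kernel(data, x, lam, c).mean() for x in support])
    i1 = np.sum(p_hat ** 2)
    # I_2n_hat: leave-one-out average, (1/n) * sum_i p_hat_{-i}(X_i)
    loo = [aa_kernel(np.delete(data, i), data[i], lam, c).mean() for i in range(n)]
    i2 = np.mean(loo)
    return i1 - 2.0 * i2

rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=200)            # hypothetical draws on {0, 1, 2}
grid = np.linspace(0.0, (3 - 1) / 3, 41)       # lambda in [0, (c - 1) / c]
best_lam = min(grid, key=lambda l: cv_criterion(data, l, support=[0, 1, 2]))
```

A grid search over λ followed by picking the minimizer of the criterion replaces the bias-plus-variance calculation, which required the true PDF.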
4 Nonparametric regression
Two nonparametric estimation techniques have been exposed up to this point. The first one, treated in Section 2, is designed to estimate the underlying PDF of a set of continuous i.i.d. random draws. The second one, exposed in Section 3, had the same objective for the case when the underlying random vector has a finite support. Both proved to be accurate and useful, particularly because of their nonparametric nature: no assumptions about the true distribution were needed at all, and that was the main advantage of these methods. But the objective is not only to estimate PDFs. We are interested in studying relations of the form:
$$Y_i = g(X_i^d) + U_i \quad (30)$$
where Y is called the dependent variable, X the independent variable, g is an unknown function that relates X and Y, and U is an error term associated with the relation. Also, it is assumed that $E[U_i|X_i^d] = 0$ and that $E[U_i^2|X_i^d] = \sigma^2(X_i^d)$, where $\sigma^2(\cdot)$ is of unknown form.
Many sensitive issues can be examined with the relation posed in (30). For example, if we have a sample of i.i.d. observations for n individuals given by $\{Y_i, X_i^d\}_{i=1}^n$, where $Y_i$ is the wage of individual i and $X_i^d$ is a dummy² variable for black race, one could study the relation between wage and race. Public policy on discrimination issues could be informed by an estimation of this type. But caution must be exercised: in order to draw adequate conclusions about this relation, it is important to guarantee that equation (30) is correctly specified. Failing to do so may produce completely inaccurate conclusions, as is explained next.
Traditional linear regression methods serve as a first approach to relating X and Y. In particular, they assume that $g(\cdot)$ is linear and therefore:
$$Y_i = g(X_i^d) + U_i = \beta_0 + \beta_1 X_i^d + U_i \quad (31)$$
where $\beta_0$ and $\beta_1$ are the parameters of interest. Let's assume for a moment that $Y_i$ still refers to the wage of individual i, but $X_i^d$ now refers to his age (and no longer to his race). The next figure shows a possible scenario for this example:
²A dummy variable is a variable that only takes two values: zero and one. For the case of race, the variable takes the value one if the individual is black and zero otherwise.
[Figure 9: Relation between age and wage, comparing a linear and a nonparametric fit of $g(\cdot)$.]
The figure shows that if the true relation is not linear, assuming a linear functional form for $g(\cdot)$ can induce a completely misspecified estimation. With the linear estimation we would be asserting that wage is monotonically increasing in age, and it is common knowledge that this is not true: wage increases and reaches its peak at about 40 years, but then decreases as the individual gets older and is less appreciated by the labor market.
So, the objective in this section is to introduce a nonparametric estimator for the function $g(\cdot)$ in the context of equation (30). A couple of theoretical results will be presented in order to understand the behavior of this estimator. Finally, an empirical exercise shows the estimator in practice.
4.1 Defining the estimator
The relationship treated here is the one found in (30) for the case when $X_i^d$ and $x^d$ are r-dimensional vectors with finite support. All notation and kernel functions for this setup were defined in Section 3.1. This machinery serves as the building block of the nonparametric estimator for $g(\cdot)$. In particular, a variation of (11) will be used as the kernel function by components:
$$l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 & \text{if } X_{is}^d = x_s^d \\ \lambda_s & \text{if } X_{is}^d \neq x_s^d \end{cases} \quad (32)$$
Notice that if $\lambda_s = 0$, the above function becomes an indicator function. On the other hand, the bandwidths now satisfy $\lambda_s \in [0, 1]$. This kernel function by components is not a measure, as was the one defined in (11), since it does not add up to 1. No problem arises, because we are not interested in estimating a PDF here.
A vector kernel function is defined as the product of the kernel functions by components, exactly as in (12):
$$L(X_i^d, x^d, \lambda) = \prod_{s=1}^{r} l(X_{is}^d, x_s^d, \lambda_s) = \prod_{s=1}^{r} \lambda_s^{N_{is}(x)} \quad (33)$$
where $N_{is}(x) = 1$ if $X_{is}^d \neq x_s^d$, and 0 otherwise.
Definition 14. A nonparametric estimator $\hat{g}(\cdot)$ is proposed to estimate the true function $g(\cdot)$ of (30) as:
$$\hat{g}(x^d) = \frac{n^{-1}\sum_{i=1}^{n} Y_i L(X_i^d, x^d, \lambda)}{\hat{p}(x^d)} \quad (34)$$
where $\hat{p}(x^d) = n^{-1}\sum_{i=1}^{n} L(X_i^d, x^d, \lambda)$.
Notice that if λ = 0 this estimator becomes a frequency estimator: it takes the average value of Y at the point $x^d$ as $\hat{g}(x^d)$. We are now interested in a theoretical result about the convergence of $\hat{g}(\cdot)$, which is stated next.
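Before that, the estimator (34) can be sketched in code. This is a minimal sketch for illustration only, assuming the observations are stored as an (n × r) integer array; the names `vector_kernel` and `g_hat` are hypothetical.

```python
import numpy as np

def vector_kernel(X, x, lams):
    """Product kernel (33): row i gets weight prod_s lams[s] ** N_is(x)."""
    mismatch = (X != x)                      # N_is(x): 1 where components differ
    return np.prod(np.where(mismatch, lams, 1.0), axis=1)

def g_hat(X, Y, x, lams):
    """Estimator (34); the n^{-1} factors in numerator and denominator cancel."""
    w = vector_kernel(X, np.asarray(x), np.asarray(lams))
    return np.sum(Y * w) / np.sum(w)

# With lambda = 0, g_hat reduces to the average of Y over the cell X_i = x
X = np.array([[0], [0], [1]])
Y = np.array([1.0, 3.0, 5.0])
print(g_hat(X, Y, [0], [0.0]))   # frequency estimator: (1 + 3) / 2 = 2.0
```

With a positive λ, observations outside the cell also receive weight, smoothing the estimate across cells.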
4.2 Simple theoretical case
Let's consider the case when r = 1. The vector kernel then collapses to the single-component kernel function:
$$L(X_i, x, \lambda) = l(X_i, x, \lambda)$$
This implies that the estimator $\hat{g}(\cdot)$ is simply given by:
$$\hat{g}(x) = \frac{n^{-1}\sum_{i=1}^{n} Y_i\, l(X_i, x, \lambda)}{\hat{p}(x)}, \qquad \hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} l(X_i, x, \lambda)$$
Theorem 7. Having defined the estimator for $g(\cdot)$, it is true that:
$$\hat{g}(x) - g(x) = O_p\left(\lambda + n^{-1/2}\right) \quad (35)$$
Proof:
We will give a sketch of the proof. First of all, we define $\hat{m}(x)$ as:
$$\hat{m}(x) = \left(\hat{g}(x) - g(x)\right)\hat{p}(x)$$
Therefore, it is true that:
$$\hat{g}(x) - g(x) = \frac{\hat{m}(x)}{\hat{p}(x)}$$
The first objective is to prove that $E[\hat{m}(x)] = O(\lambda)$. Let's expand $\hat{m}(x)$ using all the definitions made up to this point:
$$\hat{m}(x) = \left(\frac{n^{-1}\sum_{i=1}^{n} Y_i L(X_i, x, \lambda)}{\hat{p}(x)} - g(x)\right)\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} Y_i L(X_i, x, \lambda) - g(x)\,\frac{1}{n}\sum_{i=1}^{n} L(X_i, x, \lambda) = \frac{1}{n}\sum_{i=1}^{n}\left[Y_i - g(x)\right]L(X_i, x, \lambda)$$
There exists an index $k \in \{1, 2, \ldots, n\}$ such that $X_k = x$. We should also keep in mind that $Y_i = g(X_i) + U_i$ and that $E[U_i|X_i] = 0$. Therefore, applying the expected value operator we have:
$$E[\hat{m}(x)] = \frac{1}{n}\left[g(X_k) - g(x)\right] + \frac{1}{n}\sum_{i\neq k}\left[g(X_i) - g(x)\right]\lambda = \frac{1}{n}\sum_{i\neq k}\left[g(X_i) - g(x)\right]\lambda$$
where the first term vanishes because $X_k = x$.
Let $R = \max\{|g(X_i) - g(x)| : i \neq k\}$. Therefore:
$$\left|E[\hat{m}(x)]\right| = \left|\frac{1}{n}\sum_{i\neq k}\left[g(X_i) - g(x)\right]\lambda\right| \leq \frac{n-1}{n}R\lambda < R\lambda$$
which proves that $E[\hat{m}(x)] = O(\lambda)$. With a similar procedure it is relatively easy to prove that $\mathrm{Var}(\hat{m}(x)) = O(n^{-1})$. Both results imply that³:
$$E\left[\hat{m}(x)^2\right] = O\left(\lambda^2 + n^{-1}\right)$$
which in turn implies that:
$$\hat{m}(x) = O_p\left(\lambda + n^{-1/2}\right)$$
Finally, it is true that $\hat{p}(x) = p(x) + o_p(1)$. So, putting it all together we have:
$$\hat{g}(x) - g(x) = \frac{\hat{m}(x)}{\hat{p}(x)} = \frac{O_p(\lambda + n^{-1/2})}{p(x) + o_p(1)} = O_p\left(\lambda + n^{-1/2}\right)$$
This theorem is important because it states that if $\lambda \to 0$ as $n \to \infty$, the estimator $\hat{g}(\cdot)$ converges in probability to the true function $g(\cdot)$.
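The step from the second-moment bound to the $O_p$ rate is an application of Markov's inequality; a sketch of this link, under the bounds already derived (with C an unspecified constant):

```latex
% For any M > 0, Markov's inequality applied to \hat{m}(x)^2 gives
P\left(|\hat{m}(x)| > M(\lambda + n^{-1/2})\right)
  \le \frac{E\left[\hat{m}(x)^2\right]}{M^2(\lambda + n^{-1/2})^2}
  \le \frac{C(\lambda^2 + n^{-1})}{M^2(\lambda + n^{-1/2})^2}
  \le \frac{C}{M^2}
% since \lambda^2 + n^{-1} \le (\lambda + n^{-1/2})^2.
```

The right-hand side can be made arbitrarily small by taking M large, which is exactly the statement $\hat{m}(x) = O_p(\lambda + n^{-1/2})$.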
4.3 Empirical estimation
In this subsection we are interested in estimating $g(\cdot)$ using a set of n i.i.d. random draws. For practical purposes, we assume that X is one-dimensional and that X and Y are related through (30).
We randomly generate the tuples $\{(X_i, Y_i)\}_{i=1}^n$ in the following way. First, we fix n = 1,000, which will be the size of our random sample. Second, we generate $X_i$ from the normal distribution with mean 0 and variance 100⁴; therefore $X \sim N(0, 100)$. Then we assume the following functional form between X and Y:
$$Y_i = X_i^3 + X_i^2 + 30 + 750{,}000\,V_i \quad (36)$$
where $V \sim U([-1, 1])$. The idea of this equation is to produce a relation between X and Y while introducing some random noise. It is important to notice that, up to this point, we have generated $\{(X_i, Y_i)\}_{i=1}^n$ with X generated as a continuous random variable. In order to use nonparametric regression the way it has been presented here, we need X to have a finite support. That is why we discretize X and Y, rounding them to the nearest integer. The next figure shows this relation graphically:
[Figure 10: Scatterplot of the random draws $(X_i, Y_i)$.]
⁴This is done as follows: a vector of n components taken from the uniform distribution on [0, 1] is created, and the inverse of the normal CDF is applied to each component.
So the objective is to estimate the relation presented in the figure above with the nonparametric estimator defined in this section. For this purpose we use the estimator defined in (34), which is given by:
$$\hat{g}(x^d) = \frac{n^{-1}\sum_{i=1}^{n} Y_i L(X_i^d, x^d, \lambda)}{\hat{p}(x^d)}$$
This gives an estimate of $g(\cdot)$ for every value in the support of X. Notice that a different value of λ produces a different estimation; we may therefore write $\hat{g}(X_i) = \hat{g}(X_i, \lambda)$. That is why we define a mean squared error of the form:
$$MSE(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{g}(X_i, \lambda)\right)^2 \quad (37)$$
The next figure shows this MSE for λ ∈ [0, 1]:
[Figure 11: Total MSE(λ) for λ ∈ [0, 1].]
The lowest MSE is produced by λ = 0. This means that in this case the frequency estimator is the best estimator in the context of (30). This occurs because we are not producing estimates out of sample: no out-of-sample estimation is needed, since every possible value in the support of X has positive empirical mass. Estimating with λ = 0 we have:
[Figure 12: Real vs. predicted data, estimating with λ = 0.]
Estimating with λ = 0 produced excellent results: prediction is obtained with high accuracy. But we should acknowledge that λ = 0 may not always be the best alternative; here it is, because we are not estimating out of sample. Another issue that may call for λ > 0 is a small sample size n, because with a small n it is more likely that some values in the support of X receive no draws at all. It is again possible to use cross-validation methods to choose λ, as explained next.
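The empirical exercise above can be sketched as follows. This is a minimal replication sketch: it assumes the N(0, 100) draws have standard deviation 100 (which matches the range of X in Figure 10) and the functional form of (36); the seed and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
X = np.rint(rng.normal(0.0, 100.0, size=n))        # discretized draws of X
V = rng.uniform(-1.0, 1.0, size=n)
Y = np.rint(X**3 + X**2 + 30 + 750_000 * V)        # equation (36), discretized

def g_hat(x, lam):
    """Estimator (34) with the r = 1 kernel (32): weight 1 on X_i = x, lam otherwise."""
    w = np.where(X == x, 1.0, lam)
    return np.sum(Y * w) / np.sum(w)

def mse(lam):
    """In-sample MSE of equation (37)."""
    return np.mean([(yi - g_hat(xi, lam)) ** 2 for xi, yi in zip(X, Y)])

# In sample, the frequency estimator (lam = 0) minimizes (37): within each cell
# X_i = x, the cell mean minimizes the sum of squared residuals.
assert mse(0.0) <= mse(0.5)
```

This makes concrete why Figure 11 is minimized at λ = 0: the in-sample MSE cannot prefer any smoothing, which is precisely why an out-of-sample criterion is needed.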
4.4 Cross-validation
Let's consider the general r-dimensional case. Having studied theoretical properties of the nonparametric estimator for $g(\cdot)$, and having developed a numerical example to show its use, it is natural to ask for the best way to choose $\lambda_1, \lambda_2, \ldots, \lambda_r$. This can be done by minimizing the cross-validatory sum of squared residuals:
$$CV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{g}_{-i}(X_i^d)\right)^2$$
where:
$$\hat{g}_{-i}(X_i^d) = \frac{(n-1)^{-1}\sum_{j\neq i} Y_j L(X_i^d, X_j^d, \lambda)}{\hat{p}_{-i}(X_i^d)}$$
and:
$$\hat{p}_{-i}(X_i^d) = \frac{1}{n-1}\sum_{j\neq i} L(X_i^d, X_j^d, \lambda)$$
This method is useful because it excludes the case i = j when calculating $\hat{g}(\cdot)$, and this is precisely what made the frequency estimator so powerful when estimating in-sample. When one is interested in estimating out of sample, λ plays a crucial role, and that is exactly what this cross-validation method takes into account.
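The criterion CV(λ) can be sketched in code for the r = 1 case. A minimal sketch, under the assumption that every support point appears at least twice in the sample, so the leave-one-out weights never sum to zero at λ = 0; the data and names are illustrative.

```python
import numpy as np

def loo_cv(X, Y, lam):
    """Cross-validatory sum of squared residuals CV(lambda), r = 1 case."""
    n = len(X)
    resid = np.empty(n)
    for i in range(n):
        x_rest, y_rest = np.delete(X, i), np.delete(Y, i)
        w = np.where(x_rest == X[i], 1.0, lam)               # kernel (32) on the n-1 others
        resid[i] = Y[i] - np.sum(y_rest * w) / np.sum(w)     # (n-1)^{-1} factors cancel
    return np.mean(resid ** 2)

# Choosing lambda by grid search over [0, 1]
X = np.array([0, 0, 0, 1, 1, 1, 2, 2])
Y = np.array([1.0, 1.2, 0.9, 2.0, 2.1, 1.9, 3.0, 2.9])
grid = np.linspace(0.0, 1.0, 21)
best_lam = min(grid, key=lambda l: loo_cv(X, Y, l))
```

Unlike the in-sample MSE of (37), this criterion can select λ > 0 whenever smoothing across cells actually improves leave-one-out prediction.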
5 Conclusions
The objective of this paper was to show how estimation with kernel methods works in theory and in practice. These methods are useful for estimating PDFs of continuous and discrete random variables, as shown in Sections 2 and 3, particularly because they need no assumptions and are able to produce accurate results. We showed that parametric methods with wrong assumptions can produce completely mistaken estimations. Finally, Section 4 introduced a richer subject: estimating nonparametrically complex relations between a univariate variable Y and a multivariate variable X, in the case where X has a finite support.
Kernel estimation methods are being used more frequently because of their precision without strong assumptions. But there are costs to these advantages. First, kernel methods are not widely known, and the theory behind them requires more statistical knowledge than traditional parametric methods. Second, they are more difficult to implement, since most statistical packages still lack routines for nonparametric methods. Third, nonparametric methods may converge more slowly than parametric ones, which is costly in terms of time and computational resources. Even with these problems, these methods are powerful and accurate, and they will become more widely used as common knowledge about them spreads.
6 References
Aitchison, John and Colin G. G. Aitken, "Multivariate Binary Discrimination by the Kernel Method", Biometrika, 1976.
Blanco, Liliana, "Probabilidad", Universidad Nacional de Colombia, 2004.
Bowman, Adrian W. and P. J. Foster, "Adaptive Smoothing and Density-Based Tests of Multivariate Normality", Journal of the American Statistical Association, 1993.
Greene, William H., "Econometric Analysis", Prentice Hall, Seventh edition, 2011.
Horowitz, Joel L., "Semiparametric and Nonparametric Methods in Econometrics", Springer Series in Statistics, 2009.
Li, Qi and Jeffrey Scott Racine, "Nonparametric Econometrics", Princeton University Press, 2007.
Wooldridge, Jeffrey M., "Econometric Analysis of Cross Section and Panel Data", MIT Press, Second edition, 2010.