
We introduce the useful concept of mutual information.

Definition 2.10 The mutual information is given by

$$I\left(p^{X,Y}\right) = H^{Y}\left(p^{Y}\right) - H^{Y|X}\left(p^{X,Y}\right).$$

Mathematical Preliminaries 71

Here we have denoted the (unconditional) entropy by $H^{Y}$, as opposed to just $H$, in order to make the distinction between this entropy and the conditional entropy clearer. It follows from Definitions 2.10, 2.6, and 2.8 that

$$I\left(p^{X,Y}\right) = \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p^{X,Y}_{x,y} \log\left(\frac{p^{X,Y}_{x,y}}{p^{X}_{x}\, p^{Y}_{y}}\right). \tag{2.159}$$
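As a small numerical illustration, the sketch below evaluates the double sum in (2.159) for a joint distribution and compares it with the difference of entropies from Definition 2.10. The 2x2 joint distribution is made up for the example, natural logarithms are assumed, and the conditional entropy is taken in its standard form, $-\sum_{x,y} p^{X,Y}_{x,y} \log (p^{X,Y}_{x,y} / p^{X}_{x})$, which is assumed to agree with Definition 2.8.

```python
import math


def entropy(p):
    """Entropy -sum_y p_y log p_y (natural log) of a probability vector."""
    return -sum(v * math.log(v) for v in p if v > 0)


def conditional_entropy(joint):
    """H^{Y|X}: -sum_{x,y} p_{x,y} log(p_{x,y} / p_x), with joint[x][y] = p_{x,y}."""
    p_x = [sum(row) for row in joint]
    return -sum(p_xy * math.log(p_xy / p_x[x])
                for x, row in enumerate(joint)
                for p_xy in row if p_xy > 0)


def mutual_information(joint):
    """Mutual information via the double sum in (2.159)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(p_xy * math.log(p_xy / (p_x[x] * p_y[y]))
               for x, row in enumerate(joint)
               for y, p_xy in enumerate(row) if p_xy > 0)


# Hypothetical joint distribution p_{x,y}; rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.15, 0.45]]
p_y = [sum(col) for col in zip(*joint)]

mi_direct = mutual_information(joint)
mi_from_entropies = entropy(p_y) - conditional_entropy(joint)
print(mi_direct, mi_from_entropies)  # the two values should agree
```

For an independent pair (a joint distribution equal to the product of its marginals) the same function returns zero, as every logarithm in the sum vanishes.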

The concept of mutual information, and therefore the concept of entropy, plays an important role in the context of transmission of data through a noisy channel; this role derives from Shannon's channel capacity theorem. In order to discuss this theorem, let us consider a data transmission channel that takes the random variable $X$ with probability distribution $p^{X}$ as input and provides the random variable $Y$ with conditional probability distribution $p^{Y|X}$ as output. We define the channel capacity, $C$, as

$$C = \frac{1}{\log 2} \max_{p^{X}} I\left(p^{X,Y}\right), \tag{2.160}$$

where $I\left(p^{X,Y}\right)$ is the mutual information from Definition 2.10. The channel capacity theorem, which is a central result of information theory, states that the above channel capacity, $C$, is the highest rate in bits per channel use at which information can be sent with arbitrarily low error probability (see Shannon (1948), or Cover and Thomas (1991), Chapter 8).
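To make the maximization over input distributions concrete, here is a minimal sketch for a binary symmetric channel, which is an illustrative choice and not an example taken from the text. It assumes natural logarithms in the mutual information, so dividing by $\log 2$ converts nats to bits, and it compares a simple grid search with the standard closed form $1 - H_2(\varepsilon)$ for crossover probability $\varepsilon$, where $H_2$ is the binary entropy in bits.

```python
import math


def mutual_information_bsc(q, eps):
    """Mutual information (in nats) of a binary symmetric channel.

    q   : prob{X = 1}, the input distribution (assumed parametrization)
    eps : crossover probability of the channel
    """
    joint = [[(1 - q) * (1 - eps), (1 - q) * eps],
             [q * eps, q * (1 - eps)]]
    p_x = [1 - q, q]
    p_y = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for x, row in enumerate(joint)
               for y, p in enumerate(row) if p > 0)


def capacity_bits(eps, grid=1001):
    """Grid search over input distributions, converted from nats to bits."""
    best = max(mutual_information_bsc(k / (grid - 1), eps) for k in range(grid))
    return best / math.log(2)


def binary_entropy_bits(eps):
    """Binary entropy H_2(eps) in bits."""
    return -sum(v * math.log2(v) for v in (eps, 1 - eps) if v > 0)


eps = 0.1
c_numeric = capacity_bits(eps)
c_closed_form = 1 - binary_entropy_bits(eps)
print(c_numeric, c_closed_form)  # the two values should agree
```

The grid search is crude but sufficient here, because the maximum is attained at the uniform input distribution, which lies on the grid.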

2.3.5 Entropy and Relative Entropy for Probability Densities

So far, we have discussed entropy and relative entropy for random variables with a discrete state space. However, one can generalize these concepts to continuous (one-dimensional) random variables. In this section, we will explicitly discuss these generalizations for the case of unconditional probability measures; the extension to conditional measures and to higher dimensional random variables is straightforward.

Definition 2.11 The entropy of the probability density $p$ for the continuous random variable $Y$ with support $\mathcal{Y}$ is

$$H(p) = -E_{p}[\log p] = -\int_{\mathcal{Y}} p(y) \log p(y)\, dy. \tag{2.161}$$

This entropy is sometimes called the differential entropy.

Definition 2.12 The relative entropy between the probability densities $p$ and $q$ for the continuous random variable $Y$ with support $\mathcal{Y}$ is

$$D(p\|q) = E_{p}\left[\log\frac{p}{q}\right] = \int_{\mathcal{Y}} p(y) \log\frac{p(y)}{q(y)}\, dy. \tag{2.162}$$
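As an illustration of (2.162), the following sketch approximates the integral by a midpoint rule for two normal densities and compares it with the well-known closed form for the relative entropy between $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, namely $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$. The parameter values, the truncation of the support to $[-12, 12]$, and the function names are assumptions made for the example.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def relative_entropy_numeric(p, q, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of p log(p/q) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        py, qy = p(y), q(y)
        if py > 0:
            total += py * math.log(py / qy) * h
    return total


mu1, sigma1 = 0.0, 1.0   # parameters of p (hypothetical)
mu2, sigma2 = 1.0, 2.0   # parameters of q (hypothetical)

d_numeric = relative_entropy_numeric(
    lambda y: normal_pdf(y, mu1, sigma1),
    lambda y: normal_pdf(y, mu2, sigma2),
    lo=-12.0, hi=12.0)
d_closed_form = (math.log(sigma2 / sigma1)
                 + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5)
print(d_numeric, d_closed_form)  # the two values should agree closely
```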

72 Utility-Based Learning from Data

For probability densities that are Riemann integrable, we can identify the above entropy and relative entropy with the continuum limits of the corresponding quantities for a discrete random variable.

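Lemma 2.13 Let us assume that the probability densities $p$ and $q$ of the random variable $Y$ are Riemann integrable, and let

$$p^{\Delta}_{k} = p(y_k)\Delta, \quad\text{and} \tag{2.163}$$

$$q^{\Delta}_{k} = q(y_k)\Delta, \tag{2.164}$$

where the $y_k$ are the midpoints of size-$\Delta$ bins that partition $\mathcal{Y}$. Then

$$H(p) = \lim_{\Delta\to 0}\left[H\left(p^{\Delta}\right) + \log\Delta\right], \quad\text{and} \tag{2.165}$$

$$D(p\|q) = \lim_{\Delta\to 0} D\left(p^{\Delta}\|q^{\Delta}\right). \tag{2.166}$$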

Proof: In order to prove (2.165), we use Definition 2.6 to write

$$H\left(p^{\Delta}\right) = -\sum_{k} p^{\Delta}_{k} \log p^{\Delta}_{k} = -\Delta \sum_{k} p(y_k) \log p(y_k) - \log\Delta \quad \text{(from (2.163))},$$

where the last term uses the fact that the $p^{\Delta}_{k}$ sum to one in the limit $\Delta \to 0$. (2.165) then follows directly from Definition 2.11 and the definition of the Riemann integral. Equation (2.166) follows in the same manner from Definitions 2.7 and 2.12 and (2.164); the term $\log\Delta$ drops out, since we have the logarithm of the ratio of two probabilities here. □
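The proof shows that $H(p^{\Delta})$ behaves like $H(p) - \log\Delta$ for small $\Delta$, so $H(p^{\Delta}) + \log\Delta$ should approach the entropy of the density. The sketch below checks this numerically for a normal density, whose entropy is known in closed form, $\frac{1}{2}\log(2\pi e \sigma^2)$. The choice of density, the truncation of the support to ten standard deviations, and the bin sizes are assumptions made for the example.

```python
import math


def normal_pdf(y, sigma):
    """Density of N(0, sigma^2) at y."""
    return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def discretized_entropy(pdf, lo, hi, delta):
    """H(p^Delta), where p^Delta_k = p(y_k) * Delta and y_k are bin midpoints."""
    n_bins = round((hi - lo) / delta)
    h = 0.0
    for k in range(n_bins):
        p_k = pdf(lo + (k + 0.5) * delta) * delta
        if p_k > 0:
            h -= p_k * math.log(p_k)
    return h


sigma = 1.5
exact = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

results = []
for delta in (0.5, 0.1, 0.01):
    h_delta = discretized_entropy(lambda y: normal_pdf(y, sigma), -15.0, 15.0, delta)
    results.append((delta, h_delta + math.log(delta)))

for delta, approx in results:
    print(delta, approx, exact)
```

Note that $H(p^{\Delta})$ on its own grows without bound as $\Delta \to 0$; only after the $\log\Delta$ correction does it converge.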

Some mathematical properties of the entropy and relative entropy for continuous random variables

The following lemma lists some properties.

Lemma 2.14 The entropy, $H(p)$, of the probability distribution $p$ of a continuous random variable

(i) is a concave function of $p$, and

(ii) is nonnegative.

The relative entropy, $D(p\|q)$, of the probability distributions $p$ and $q$ of a continuous random variable has the following properties.

(iii) $D(p\|q)$ is convex in $(p, q)$.

(iv) If the probability densities $p$ and $q$ are Riemann integrable, then $D(p\|q)$ is strictly convex in $p$.


Proof: For statements (i) and (iii), the proof of Theorem 2.7.2 in Cover and Thomas (1991) applies. Statement (iv) follows from the strict convexity of the discrete relative entropy in its first argument (see Lemma 2.10) and Lemma 2.13. For statement (v), see Cover and Thomas (1991), Theorem 9.6.1. For statement (ii), the proof of Theorem 9.6.1 in Cover and Thomas (1991) can be easily modified.

2.4 Exercises

1. Suppose that $Z \sim N(0,1)$. Show that the density function for $X = Z^2$ is given by

$$\frac{1}{\sqrt{2\pi x}}\, e^{-x/2} \tag{2.167}$$

(chi-squared with 1 degree of freedom). Verify your result with a numerical simulation.

FIGURE 2.10: $\chi^2$ distribution, plotted on the interval [.01, 5].

2. (a) If $X$ is distributed uniformly on $(a, b)$, show that $E[X] = \frac{b+a}{2}$ and $\mathrm{var}(X) = \frac{(b-a)^2}{12}$.

(b) Show directly that $Z \sim N(0,1)$ indeed has mean 0 and variance 1.

(c) Show that if $E[X] = \mu$, and $X_i$, $i = 1, \ldots, N$ denote repeated realizations of $X$, then $E[\bar{X}] = \mu$, where the sample average

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}. \tag{2.168}$$


3. The Dirac delta function can be understood as the limit of a sequence of smooth functions

$$\delta(x) = \lim_{\epsilon\to 0} \delta_{\epsilon}(x), \tag{2.169}$$

where $\delta_{\epsilon}(x)$ has the property that

$$\lim_{\epsilon\to 0} \int_{-\infty}^{\infty} \delta_{\epsilon}(x) f(x)\, dx = f(0) \tag{2.170}$$

for all continuous functions, $f$. Functions $\delta_{\epsilon}(x)$ with this property are referred to as nascent delta functions. Show that the pdf of a random variable that is $N(0, \epsilon)$ is a nascent delta function.

4. Prove (2.40) and (2.41), i.e., prove that if the random vector $X = (X_1, \ldots, X_n)^T$ has expectation vector $\mu$ (with $i$th element $\mu_i$), and covariance matrix $\mathrm{cov}(X, X) = \Sigma$ (with $ij$th element $\mathrm{cov}(X_i, X_j)$), and if $A$ is a matrix with $n$ columns, then

$$E[AX] = A\mu, \tag{2.171}$$

and

$$\mathrm{cov}(AX, AX) = A \Sigma A^T. \tag{2.172}$$

5. Show that for the nonnegative discrete-valued random variable $X$ that takes values $0, 1, 2, \ldots$,

$$E[X] = \sum_{n\ge 0} \mathrm{prob}\{X > n\} = \sum_{n\ge 1} \mathrm{prob}\{X \ge n\}. \tag{2.173}$$

Hint: apply the definition of expectation, using the identity

$$n\, \mathrm{prob}\{X = n\} = \sum_{k=1}^{n} \mathrm{prob}\{X = n\}. \tag{2.174}$$

Verify your result for the distribution

$$\mathrm{prob}\{X = j\} = (e - 1)\, e^{-j-1}, \quad \text{for } j \ge 0, \tag{2.175}$$

with a numerical simulation.

6. (a) Use the Markov inequality to prove the Chebyshev inequality.

(b) Using the Schwarz inequality, show that $-1 \le \rho[X, Y] \le 1$.

7. Suppose that K is running against B in an election and $p$ is the percentage of eligible voters who will vote for B. Using the Chebyshev inequality, estimate the number of people who should be polled to ensure that the probability is .95 that the sample average differs from $p$


8. If $X$ and $Y$ are independent random variables, then $\mathrm{cov}[X, Y] = \rho[X, Y] = 0$.

9. The coefficients, $a, b, c$, of $ax^2 + bx + c$ are independent random variables and each is distributed uniformly on the interval $(0,1)$. Give a closed-form formula for the probability that the solutions of the equation $ax^2 + bx + c = 0$ are real. Verify your result with a numerical simulation.

10. $X$ and $Y$ have a constant joint density, $p(x, y)$, on the region $x \ge 0$, $y \ge 0$, $x + y \le 1$. Find $p(x, y)$, $p(y)$, $p(x|y)$, $E[X|y]$, and $E[X]$. Verify your results with numerical simulations.

11. A boss leaves work at time $X$, which is distributed uniformly on $(0, T)$. Someone who works for the boss leaves at time $Y$, which is distributed uniformly on $(X, T)$. Calculate $E[Y|X]$, $E[Y]$, $\mathrm{var}(E[Y|X])$, $E[\mathrm{var}(Y|X)]$, and $\mathrm{var}(Y)$. Verify your results with numerical simulations.

12. Suppose that $X$ is a standard normal random variable. Then $U = h(X) = \mu + \sigma X$ has mean $\mu$ and variance $\sigma^2$. Use Theorem 2.3 to show that (2.79) holds, i.e., that the density for $U$ is given by

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(u-\mu)^2}{2\sigma^2}}. \tag{2.176}$$

13. We visit a random number of stores, $N$, and spend $X_i$ in store $i \in \{1, \ldots, N\}$, where the $X_i$ are i.i.d. (independent and identically distributed) and independent of $N$, with $E[X_i] = \mu$ and $\mathrm{var}(X_i) = \sigma^2$. Let $Y = X_1 + \cdots + X_N$. Show that

$$E[Y] = E[N]\mu,$$

$$\mathrm{var}[Y|N = n] = n\sigma^2, \quad\text{and}$$

$$\mathrm{var}[Y] = E[N]\sigma^2 + \mu^2\, \mathrm{var}[N].$$

14. Derive the convex conjugate of the function $\frac{1}{p}\ell_p^p(x)$, where

$$\ell_p(x) = \left(\sum_{j=1}^{n} |x_j|^p\right)^{\frac{1}{p}}$$

is the $\ell_p$-norm, and of the function $\ell_\infty$, where

$$\ell_\infty(x) = \max_{j=1\ldots n} |x_j|. \tag{2.177}$$


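16. Let

$$x_1(\alpha) = \arg\min_{\Psi(x)\le\alpha} \Phi(x),$$

and

$$x_2(\gamma) = \arg\min_{x\in\mathbf{R}^n} \{\Phi(x) + \gamma\Psi(x)\},$$

where $\Phi: \mathbf{R}^n \to \mathbf{R}$, $\Psi: \mathbf{R}^n \to \mathbf{R}$ are strictly convex, and $\alpha, \gamma \in \mathbf{R}$. Show that, for a given $\alpha$, if the Slater condition holds for the first of these optimization problems, there exists a $\gamma^*(\alpha)$ such that $x_2(\gamma^*(\alpha)) = x_1(\alpha)$.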

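17. Consider the following problem.

Problem 2.6

$$\text{Find } F = \inf_{x\in\mathbf{R}^n,\, c\in\mathbf{R}^l} \{\Phi(x) + \gamma\Psi(c)\} \tag{2.178}$$

$$\text{s.t. } f_i(x) \le 0, \quad i = 1, \ldots, m \tag{2.179}$$

$$\text{and } h_j(x) = c_j, \quad j = 1, \ldots, l, \tag{2.180}$$

where $c = (c_1, \ldots, c_l)^T$, $\Phi: \mathbf{R}^n \to \mathbf{R}$, $f_i: \mathbf{R}^n \to \mathbf{R}$ are convex and differentiable, $\Psi: \mathbf{R}^l \to \mathbf{R}$ is a convex function of the vector $c = (c_1, \ldots, c_l)^T$ that attains its minimum, 0, at $c = (0, \ldots, 0)^T$, the $h_j: \mathbf{R}^n \to \mathbf{R}$ are affine, and $\gamma > 0$. Let

$$g_\gamma(\lambda, \nu) = \min_{x\in\mathbf{R}^n,\, c\in\mathbf{R}^l} \left\{ \Phi(x) + \gamma\Psi(c) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{l} \nu_j \left[h_j(x) - c_j\right] \right\} \tag{2.181}$$

be the Lagrange dual function corresponding to this problem. Show that

$$g_\gamma(\lambda, \nu) = \hat{g}(\lambda, \nu) - \gamma\Psi^*(\gamma^{-1}\nu), \tag{2.182}$$

where $\Psi^*$ is the convex conjugate of $\Psi$ and $\hat{g}$ is the Lagrange dual function of Problem 2.1.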

18. Prove Lemma 2.8, statement (ii).

19. Provide an example which shows that the relative entropy is generally not symmetric in its arguments.

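20. Prove the chain rule for the entropy:

$$H^{X,Y}\left(p^{X,Y}\right) = H^{Y|X}\left(p^{X,Y}\right) + H^{X}\left(p^{X}\right),$$

where $H^{Y|X}$ is the conditional entropy from Definition 2.8, $H^{X}$ is the entropy from Definition 2.6, and

$$H^{X,Y}\left(p^{X,Y}\right) = -E_{p^{X,Y}}\left[\log p^{X,Y}\right] = -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p^{X,Y}_{x,y} \log p^{X,Y}_{x,y}. \tag{2.184}$$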
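Several of the exercises above ask for verification by numerical simulation. As one possible way to set up such a check, the sketch below treats Exercise 1: it squares standard normal draws and compares the fraction of samples falling in an interval with the integral of the density (2.167) over that interval. The interval $[0.5, 2]$, the sample size, and the random seed are arbitrary choices for the example, and the sketch checks the stated density rather than deriving it.

```python
import math
import random


def chi2_1_density(x):
    """Density (2.167): exp(-x/2) / sqrt(2 pi x), for x > 0."""
    return math.exp(-x / 2) / math.sqrt(2 * math.pi * x)


def density_mass(lo, hi, n=10_000):
    """Midpoint-rule integral of the density over [lo, hi]."""
    h = (hi - lo) / n
    return sum(chi2_1_density(lo + (k + 0.5) * h) for k in range(n)) * h


random.seed(0)  # fixed seed so the run is reproducible
n_samples = 200_000
samples = [random.gauss(0.0, 1.0) ** 2 for _ in range(n_samples)]

lo, hi = 0.5, 2.0
empirical = sum(lo <= x <= hi for x in samples) / n_samples
predicted = density_mass(lo, hi)
print(empirical, predicted)  # should agree up to sampling error
```

The same pattern (simulate, then compare an empirical frequency or moment with the analytic expression) carries over to the other exercises that request a simulation.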

Chapter 3

The Horse Race

Probabilistic models are often used by decision makers in uncertain environments. An idealization of such a decision maker, on which we heavily rely in this book, is a gambler, or investor (we use the terms interchangeably), in a horse race. In this chapter, we introduce the notions of the horse race and the conditional horse race and discuss some simple relationships between probability measures and betting strategies, while leaving a more thorough decision-theoretic treatment for later chapters. Most of the concepts and results in this chapter can be found in the textbook by Cover and Thomas (1991) or in the original papers by Kelly (1956) and Breiman (1961).

We shall first discuss the (unconditional) horse race as a setting in which we explore unconditional probabilities, and then generalize it to the conditional horse race, which is a useful picture when we are interested in conditional probabilities. The basic ideas that we shall apply later in this book can be most easily understood in the unconditional probability context and don't have to be substantially modified in the conditional probability context.

A horse race investor who invests so as to maximize his expected wealth growth rate — a so-called Kelly-investor — allocates money to each horse in proportion to the horse’s winning probability. The expected wealth growth rate for such an investor is the difference between the expected wealth growth rate for a clairvoyant investor and the entropy of the winning-probabilities. Expected wealth growth rates are also related to the relative entropy: the latter is a difference between two expected wealth growth rates. These two relationships, which hold for the conditional and the unconditional horse race, are very important, as they provide a simple decision-theoretic interpretation for information-theoretic quantities. In Chapter 8, when we discuss decision makers with arbitrary risk preferences, we shall use these relationships as a starting point for a generalization of entropy and relative entropy.
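The two relationships described in this paragraph can be checked numerically. The sketch below assumes the standard form of the expected wealth growth rate from Cover and Thomas (1991), $\sum_y p_y \log(b_y O_y)$, for an investor who allocates the fraction $b_y$ of wealth to horse $y$ with odds ratio $O_y$; this formula, the clairvoyant growth rate $\sum_y p_y \log O_y$, and all numbers are assumptions introduced for the illustration, since the precise definitions appear only later.

```python
import math


def growth_rate(p, b, odds):
    """Expected wealth growth rate sum_y p_y log(b_y O_y) under probabilities p."""
    return sum(py * math.log(by * oy) for py, by, oy in zip(p, b, odds) if py > 0)


p = [0.5, 0.3, 0.2]      # hypothetical winning probabilities
q = [0.2, 0.3, 0.5]      # hypothetical misspecified probabilities
odds = [2.5, 3.0, 6.0]   # hypothetical odds ratios

entropy = -sum(v * math.log(v) for v in p)
# A clairvoyant investor puts everything on the winner and earns log O_y in state y.
clairvoyant = sum(py * math.log(oy) for py, oy in zip(p, odds))
# A Kelly investor who knows p allocates b = p.
kelly = growth_rate(p, p, odds)
# Relative entropy D(p || q).
rel_entropy = sum(py * math.log(py / qy) for py, qy in zip(p, q))

print(kelly, clairvoyant - entropy)                # first relationship
print(kelly - growth_rate(p, q, odds), rel_entropy)  # second relationship
```

In this example the second line compares the growth rate of an investor who bets according to $p$ with that of an investor who bets according to $q$, both evaluated under $p$; the gap equals the relative entropy and does not depend on the odds.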


3.1 The Basic Idea of an Investor in a Horse Race

Horse race

Definition 3.1 (Horse race) A horse race is characterized by the discrete random variable $Y$ with possible states in the finite set $\mathcal{Y}$; we identify each element of $\mathcal{Y}$ with a horse. An investor can place a bet that $Y = y \in \mathcal{Y}$, which pays the odds ratio (payoff) $O_y > 0$ for each dollar wagered if $Y = y$, and 0 otherwise.

Apart from an actual horse race, the following settings are examples that meet either exactly or approximately the above definition:

• betting on a coin toss,

• investing in defaultable bonds,

• playing roulette or blackjack, and

• bringing a new product to the market.

We note that an investor who allocates \$1 of capital, investing $\frac{B}{O_y}$ to state $y$, where

$$B = \frac{1}{\sum_{y\in\mathcal{Y}} \frac{1}{O_y}},$$

receives the payoff $B$ with certainty. This motivates the following definition:

Definition 3.2 (Bank account) The riskless bank account payoff, $B$, is given by

$$B = \frac{1}{\sum_{y\in\mathcal{Y}} \frac{1}{O_y}}. \tag{3.1}$$
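A short numerical sketch may help to see why this payoff is riskless. For a set of made-up odds ratios, it computes $B$ from (3.1), allocates $B/O_y$ to each horse, and confirms that the allocation uses exactly one dollar and pays $B$ whichever horse wins. The horse labels and odds are hypothetical.

```python
def bank_account(odds):
    """Riskless bank account payoff B = 1 / sum_y (1 / O_y), as in (3.1)."""
    return 1.0 / sum(1.0 / o for o in odds.values())


# Hypothetical odds ratios O_y (payoff per dollar wagered on horse y).
odds = {"horse_1": 2.0, "horse_2": 3.0, "horse_3": 5.0}

B = bank_account(odds)
allocation = {y: B / o for y, o in odds.items()}           # amount bet on each horse
payoffs = {y: allocation[y] * odds[y] for y in odds}       # payoff if horse y wins

print(B)                         # here B = 30/31, below 1 for these odds
print(sum(allocation.values()))  # the allocation sums to one dollar
print(payoffs)                   # the same payoff B in every state
```

With these particular odds $B$ is slightly below 1, so the riskless strategy loses a little money; odds for which $B$ exceeds 1 would give a riskless gain.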

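We also note that $\frac{B}{O_y} > 0$ and

$$\sum_{y\in\mathcal{Y}} \frac{B}{O_y} = 1,$$

so $\frac{B}{O} = \left\{\frac{B}{O_y},\, y \in \mathcal{Y}\right\}$ is a probability measure on $\mathcal{Y}$. Under this measure, the expected payoff for a bet placed on a single horse, $y$, is always $B$, independent of $y$. So we make the following definition.

Definition 3.3 The homogeneous expected return measure is given by

$$p^{(h)} = \left\{ p^{(h)}_{y} = \frac{B}{O_y},\; y \in \mathcal{Y} \right\}. \tag{3.2}$$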


Let us suppose the bookie were risk-neutral, i.e., demanded the same return on each horse, no matter what the associated risk is, and that there were no track take. Then, if the bookie believed in the homogeneous expected return measure, $p^{(h)}$, he would set the odds ratios $O$. This provides an interpretation, albeit a somewhat unrealistic one, of $p^{(h)}$ as the measure that an idealized bookie believes.

Investor

The following definition makes precise what the term ‘investor’ shall mean throughout this book, unless indicated otherwise.

Definition 3.4 (Investor) An investor is a gambler who invests \$1 in a horse race, i.e., who allocates $b_y$ to the event $Y = y$, where

$$\sum_{y\in\mathcal{Y}} b_y = 1. \tag{3.3}$$

We denote the investor's allocation by

$$b = \{b_y,\; y \in \mathcal{Y}\}. \tag{3.4}$$

We have made the assumption of $1 total investment for convenience, but without loss of generality; we may view this $1 as the investor’s total wealth in some appropriate currency. In particular, we can choose the investor’s initial wealth as currency.
