• No se han encontrado resultados

• the intuitive concept of resources;

• (transformations of) probability distributions;

• computational resources such as computation time and computation space.

As of today, the formal relationship between probability distributions and computa-tional resources is not understood, and in fact relates to many deep open questions in complexity theory. Fortunately, some progress into this direction can be made by re-stricting ourselves to understanding the relationship between resources and probability distributions by drawing ideas from thermodynamics and coding theory.

Consequently, the aim of this chapter is to introduce an information-theoretic (or more accurately, a thermodynamic) formalization of resources. While this formalization is admittedly non-standard from a control-theoretic point of view, it will prove to be fruitful to reinterpret the construction and the behavior of an autonomous system. In particular, it will allow to state a natural link between resources and probabilities, in a way such that behavior can be thought of as arising only from resource costs.

Furthermore, we will sketch an intuitive but non-rigorous connection to computational complexity.

receiver by encoding it as a string x≤n∈ X and then transmitted symbol by symbol2. These symbols are recovered as a (possibly corrupted) string y≤n ∈ X which is then decoded as the object v∈ U.

Sender Encoder Channel Decoder Receiver

u x≤n y≤n v

Figure 6.1: The communication problem consists in communicating a (noisy) choice of a target objectv∈ V via the selection of a source object u∈ U using a communication channel.

This is done by encoding uas a string x≤n ∈ X that is then transmitted over the channel.

The transmitted string is recovered as a string y≤n ∈ X that is then decoded as the target objectv. The blocks highlighted in bold correspond to the processes that have to be designed by the engineer.

Most of the time, we will assume a binary alphabet for the channel, that is X = {0,1}, because the choice of an alphabet changes the length of the codes by at most a constant factor that depends only on the size of the alphabet. We can thus think of the communication problem as the design of binary codes that efficiently transmit a choice using a given binary communication channel. Due to the central role that codes play in information theory, the next subsection investigates their basic properties.

6.1.2 Codes

A code is a scheme to represent objects as binary strings. A badly chosen code can lead to inefficient or ambiguous representations. An inefficient code is a code that does not minimize the length of its codewords (measured in bits), and an ambiguous code is a code where the original object cannot be recovered uniquely anymore. In this section we investigate the property that a code has to possess in order to be both efficient and unambiguous. The exposition in this subsection follows closely the one presented in Gr¨unwald (2007). We first require a definition of suitable binary codes.

Definition 20 (Prefix Free) Given a finite alphabet X, a subset P ⊂ X is called prefix free iff there are no distinct membersu, v∈ P such thatu is a prefix of v. 2

For instance, for the binary alphabet{0,1}, the sets

P1 ={001,01,101,11}, P2={00,01,10,11} and P3 ={0,10,110,111} are prefix free, but

P4 ={01,011,1}

2Recall thatXis the set of all finite strings over the alphabetX.

is not because 01 is a prefix of 011. In fact,P4is even ambiguous: if we want to decode the string 011 we would not be able to tell whether it corresponds to the codeword 011 alone or to the concatenation of the codewords 01 and 1.

Definition 21 (Prefix Code) Given a finite set of objects U, a prefix code for U is an injective function c : U → P, where P ⊂ {0,1} is a prefix free set of binary strings. For any object u∈ U,c(u) is called thecodeword of u, and l(u) denotes the

correspondingcodeword length. 2

Prefix codes are important because of two reasons. First, they can be uniquely decoded; and second, when a given codeword is scanned from left to right, the end of the codeword is detected instantaneously. Prefix codes can be represented with a binary tree, where codewords correspond to the paths starting from the root until a leave is reached (Figure 6.2).

P1 P2 P3

0

1

c(a) = 001

c(b) = 01

c(c) = 101

c(d) = 11

c(a) = 00

c(b) = 01

c(c) = 10

c(d) = 11

c(a) = 0

c(b) = 10 c(c) = 110

c(d) = 111

Figure 6.2: Three prefix codes P1,P2 and P3 over U = {a, b, c, d}. Note that P1 could allocate codewords for at least two additional objects (namely, codewords with prefixes 000 or 100), while the latter ones cannot (i.e. they are complete). Furthermore, if one wants to shorten a codeword in a complete prefix code, then other codewords must necessarily grow.

From the examples in Figure 6.2, it is apparent that one cannot compress the codewords indefinitely. Codewords seem to allocate some “room”, of which there is only a limited amount available. The shorter a codeword, the more “room” it takes away for allocating other codewords. For instance, compare the codes P1 and P2. In P1, the allocation of codewords is not optimal, in the sense that the codewords for ‘a’

and ‘c’ can still be shortened, as has been done inP2. Then, compare the codesP2 and P3. There, we decided to shorten the codeword for ‘a’ even more from 00 to 0. This however exceeds the limit, and hence in exchange we had to allocate longer codewords for the objects ‘c’ and ‘d’ to preserve the unique decodeability. This intuition is made precise by the following theorem.

Theorem 4 (Kraft-McMillan Inequality) There is a prefix code over a finite set of objects U with codeword lengths l1, l2, . . . , l|U | iff they satisfy the Kraft-McMillan

inequality

X|U |

i=1

2−li ≤1.

A prefix code whose codeword lengths satisfy this bound with equality is said to be

complete. 2

Proof (⇒) Interpret 0.u as the binary expansion of a real number in [0,1). For each u∈ U, let Γu:= [0.u,0.u+ 2l(u)), i.e. the right-open interval on the unit line starting at 0.u and ending at 0.u+ 2l(u). Observe that the measure of this interval is 2l(u). Since the Γuare disjoint and contained within [0,1), the total measure cannot exceed 1, thus proving that the inequality holds for prefix codes.

(⇐) Assume w.l.g. that thel1, . . . , l|U | are non-decreasing. Choose adjacent disjoint intervalsI1, . . . , I|U | of lengths 2−l1, . . . ,2−l|U | starting from the left end of the interval [0,1). The total measure of the union of theIi is≤1 by construction and the assump-tion about the li. Note that since the li are non-decreasing, every interval Ii ends up aligned with an interval Γu for some binary string u. Takeu as thei-th codeword.

Thus, long codewords add little, while short codewords add much to the sum. In-formally, one can say that long codewords are “cheap”, while short codewords are

“expensive”. Accordingly, a prefix code that is incomplete can always be either com-pressed more or extended with extra codewords until it is complete.

But the Kraft-McMillan inequality allows us to say more than that. Consider a complete prefix code for U. This code fulfills the equality:

X

u∈U

2l(u)= 1.

Now, define

Pr(u) := 2l(u) for all u∈ U.

We have obtained a probability distribution overU! This is an important construction.

Essentially, the Kraft-McMillan inequality allows establishing a one-to-one correspon-dence between codeword lengths and probabilities. This works even when we allow codeword lengths to take on any positive real values as long as they satisfy the Kraft-McMillan inequality. This enables us to talk about a codeword length as being a direct specification of a probability that is invariant to the particular choice of the encoding alphabet, and even to the actual codewords.

6.1.3 Information

We now return to the problem of communication, i.e. the problem of communicating a choice using a communication channel. In the previous subsection, we have seen that the Kraft-McMillan inequality characterizes the limits of compressibility for unambiguous

codes. We have also seen that the inequality allows us establishing a bijection between codeword lengths and probabilities given by

l(u) =−logPr(u), (6.1)

where the quantity on the right-hand side is called the information content of u.

The function f(p) = −logp is depicted in Figure 6.3a, where we use the convention

−log 0 =∞. What is the operational meaning of this relation? Suppose that both the sender and the receiver agree on a code for U, and that the communication channel is noiseless. When the sender chooses object u∈ U, he has to transmit l(u) bits through the channel. Assuming this choice is made following a probability distribution Prover U, the expected amount of transmitted bits is given by

X

u

Pr(u)l(u).

If we want to make this communication as efficient as possible, then we have to choose a code that minimizes this expectation. One can show (MacKay, 2003, Section 5.3) that the optimal choice of the codeword lengths is precisely given by (6.1). Using (6.1), one sees that the minimum expected number of bits is given by

−X

u

Pr(u) logPr(u)

which is called the entropy of the probability distribution Pr. The function f(p) =

−plogpis shown in Figure 6.3b, where we use the convention 0·log 0 = 0. The entropy can be shown to fulfill a series of properties that make it a suitable measure of the

“uncertainty”, “randomness” and the “average amount of information” contained in the choice of u (Shannon, 1948; MacKay, 2003). However, if we choose a code with codeword lengths l(u)6=−logPr(u), then the expected codeword length is the cross-entropy

X

u

Pr(u)l(u) =−X

u

Pr(u) logPr(u),

where Pr(u) = 2l(u) are the probabilities “implicitly assumed” by the code. Obvi-ously, the cross-entropy is lower bounded by the entropy,

−X

u

Pr(u) logPr(u)≥ −X

u

Pr(u) logPr(u).

Notice how the efficiency of the communication does not only depend on the channel, but on its usage as well, i.e. the statistics of the information that is transmitted.

Intuitively speaking, if we knew beforehand what choice was going to be made, then it would not be necessary to transmit any bit through the communication channel. But because we are uncertain about the realization, we have to place bets, bets which are implicitly captured by the code.

a) b)

logp plogp

p p

Figure 6.3: Information functions.

Thus, the information content corresponds to the amount of bits that have to be transmitted in order to communicate a choice. However, what happensduring trans-mission, i.e. when not all the bits have been transmitted yet? The few bits that got sent did indeed communicate part of the choice. Communicating only part of the choice means that some of the possible choices were ruled out by the transmitted informa-tion. Consider the codeP3 depicted in Figure 6.4a and assume that the first bit of the choice, say ‘1’, is sent. This transmission changes the codeword functionc to another codeword function c depicted in Figure 6.4b, where the codewords starting with ‘1’

are shortened by this bit and the codewords starting with ‘0’ are discarded.

0 0 0

0 0

1

1 1 1

1 1

a) b)

c(a) = 0

c(b) = 10 c(c) = 110

c(d) = 111

c(b) = 0 c(c) = 10

c(d) = 11 Figure 6.4: Probability versus Codeword Length (in bits).

The change in the expected codeword length can be calculated as

final expected codeword length

initial expected codeword length

=−X

u

Pr(u) logPr(u) +X

u

Pr(u) logPr(u) =X

u

Pr(u) log Pr(u) Pr(u). In the previous calculation, the discarded codeword lengths are assumed to be equal to ∞ by convention, and the actual distribution is Pr(u) = 2l(u), where l is the

codeword length function associated to c. In the example of Figure 6.4, this change is X

u

Pr(u) log Pr(u)

Pr(u) = 0·(∞ −0) + 1

2·(1−2) +1

4 ·(1−3) +1

4 ·(1−3) =−1, as expected. Hence, the negative of the previous quantity, that is

X

u

Pr(u) logPr(u) Pr(u),

called therelative entropyorKullback-Leibler divergence, measures the amount of bits that have to be transmitted in order to change the knowledge about the choice from Pr(u) to Pr(u). Note that the relative entropy is always positive and zero iff Pr =Pr. More generally, the relative entropy corresponds to the amount of bits that have to be paid in order to transform an initial probability distribution Pr into a final probability distributionPr. Because of this, the relative entropy will play a fundamental role in our formalization of resources.

Documento similar