Mathematical Theory of Information (Shannon)
Olimpia Lombardi.
CONICET – University of Buenos Aires
1.- Introduction
Information is everywhere, shaping our discourses and our thoughts. In everyday life, we know that the information spread by the media may trigger deep social, economic and political changes. In science, the concept of information has pervaded almost all scientific disciplines, from physics and chemistry to biology and psychology. For this reason, understanding the concept of information is nowadays particularly relevant.
In general, information has content or meaning. Nevertheless, it can be quantified independently of its content. The mathematical theory of information, as developed by Claude Shannon in his classic article of 1948, supplies the formal tools to do that: “Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.” (Shannon 1948: 379). That article was immediately followed by many works applying the theory to communication fields such as radio, television and telephony. At present, Shannon’s theory is a basic ingredient of the training of communication engineers.
2.- Games and information
Everybody knows the Twenty Questions Game: one player thinks of an object and the other player has twenty chances to guess it by asking yes-no questions. In order to simplify the game, let us suppose that the aim is to guess a number in a range between 1 and 8, say, 4. The best strategy is to ask if the number is in the lower half of the range (or in the upper half), until the range is just a single number. For example:
“Is it less than 5?” “Yes” “Is it less than 3?” “No”
“Is it less than 4?” “No” So, it is 4!!
But how many questions do we need to guess a number? As we see, if we have 8 alternatives, the number N of questions is 3. If the range doubles, say from 8 to 16, one needs one more question to find the answer: N is 4. It is easy to see that the number of questions N is such that $2^N$ is at least the number of alternatives n. Thus, guessing a number between 1 and n requires $N = \log_2 n$ questions (rounded up to the next integer when n is not a power of 2). Therefore, $N = \log_2 n$ measures the amount of information contained in the fact that a particular number is selected in the range 1 to n and, in a completely generic case, in the fact that a particular case is selected among n alternatives.
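The halving strategy can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `questions_needed` and `guess_number` are not part of the original text, and the questions are assumed to have the form "Is it less than k?", as in the dialogue above.

```python
def questions_needed(n: int) -> int:
    """Smallest N such that 2**N is at least n (n >= 1)."""
    # For n >= 1, (n - 1).bit_length() equals log2(n) rounded up.
    return (n - 1).bit_length()


def guess_number(secret: int, low: int, high: int) -> tuple:
    """Find `secret` in [low, high] by halving the range.

    Returns the guessed number and the number of yes-no questions asked.
    """
    questions = 0
    while low < high:
        mid = (low + high) // 2   # last number of the lower half
        questions += 1            # ask: "Is it less than mid + 1?"
        if secret <= mid:         # answer "Yes": keep the lower half
            high = mid
        else:                     # answer "No": keep the upper half
            low = mid + 1
    return low, questions


print(guess_number(4, 1, 8))      # (4, 3)
print(questions_needed(16))       # 4
```

With the range 1 to 8 and the number 4, the sketch asks the same three questions as the dialogue: less than 5, less than 3, less than 4.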
Ralph Hartley, in his 1928 paper “Transmission of Information”, was the first to use the word ‘information’ in a technical sense: for Hartley, it refers to a measurable quantity that expresses the resources needed to identify one among many equally likely alternatives, regardless of subjective or semantic aspects. But it was necessary to wait until 1948 for the theory to be extended to non-equiprobable situations.
3.- Probabilities and information
In the previous section we have considered cases in which all the alternatives are equally likely. However, situations in which some cases are more likely than others are very common, and uneven distributions are particularly relevant to the amount of information.
Let us suppose that somebody tells us that the sun will rise tomorrow: we do not consider ourselves to have received much information, since the fact that the sun rises every day is almost certain. However, if one is told that one has won the lottery, the amount of information received is very high, because winning is much less likely than losing. In both cases we are considering two alternatives (sun rising-sun not rising; winning-losing), but the alternatives are not equally likely. It seems clear that the more improbable an event is, the more information one gets from knowing that it happened. Then, the amount of information that an event contains is related to how easy or hard it is to guess.
In the equiprobable case, the probability of any $a_i$ is given by $1/n$; therefore,
$$H(a_i) = \log_2 \frac{1}{p(a_i)} = \log_2 n.$$
Although other units of measurement can be defined, the standard unit for information is called ‘bit’ (a contraction of ‘binary digit’): one bit measures the amount of information obtained when one of two equally likely alternatives is specified. One toss of a fair coin provides one bit of information, since before tossing the coin heads and tails are equally likely to be the result.
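The amount of information of a single event, measured in bits, can be sketched as follows. The function name `information_bits` is an illustrative choice, and the probabilities used for the sunrise and the lottery are hypothetical values picked only to show the contrast between a nearly certain and a very unlikely event.

```python
import math


def information_bits(p: float) -> float:
    """Amount of information, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return math.log2(1 / p)


print(information_bits(1 / 2))   # fair coin toss: 1.0
print(information_bits(1 / 8))   # one of 8 equally likely numbers: 3.0

# Hypothetical probabilities, for illustration only.
p_sunrise = 0.999999             # almost certain event
p_lottery = 1e-7                 # very unlikely event
print(information_bits(p_sunrise) < information_bits(p_lottery))   # True
```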
Up to now we have considered the information provided by single events. But communication engineering is not concerned with the occurrence of specific events, but with the communication process as a whole. Hence, Shannon focused on average amounts of information and on the way in which such information can be reliably transmitted.
4.- Source, destination and average amounts of information
According to Shannon (1948, see also Shannon and Weaver 1949), a general communication system consists of five parts (see Figure 1):
− A source S, which generates the message to be received at the destination.
− A transmitter T, which turns the message generated at the source into a signal to be transmitted. When information is encoded, coding is also implemented by this system.
− A channel CH, that is, the medium used to transmit the signal from the transmitter to the receiver.
− A receiver R, which reconstructs the message from the signal.
− A destination D, which receives the message.
Figure 1: General communication system
The source S is a system of n states $s_i$, each with its own probability $p(s_i)$; the sequences of states are called messages. Since the amount of information generated at S by the occurrence of $s_i$ is $I(s_i) = -\log p(s_i)$ (from now on, $\log$ means $\log_2$), the entropy of the source S is defined as an average, that is, as the sum of the individual amounts of information weighted by the corresponding probability:
$$H(S) = -\sum_{i=1}^{n} p(s_i) \log p(s_i)$$
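The entropy of a source can be computed directly from its probability distribution. The following is a minimal sketch; the function name `entropy` and the example distributions are illustrative, and states with zero probability are assumed to contribute nothing to the sum.

```python
import math
from typing import Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Entropy in bits: minus the sum of p * log2(p) over all states."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with p == 0 are skipped (they contribute nothing to the average).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))       # fair coin: 1.0
print(entropy([1 / 8] * 8))      # 8 equally likely states: 3.0
print(entropy([0.9, 0.1]))       # biased coin: less than 1 bit
```

The biased coin generates less information on average than the fair one, in line with the idea that uneven distributions matter for the amount of information.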
Analogously, the destination D is a system of m states $d_j$, each with its own probability $p(d_j)$. The amount of information received at D by the occurrence of $d_j$ is $I(d_j) = -\log p(d_j)$, and the entropy of the destination D is the average amount of information received at D:
$$H(D) = -\sum_{j=1}^{m} p(d_j) \log p(d_j)$$
The relationship between $H(S)$ and $H(D)$ can be represented as in Figure 2, where:
• $H(S;D)$ is the mutual information: the average amount of information generated at S and received at D.
• E is the equivocation: the average amount of information generated at S but not received at D.
• N is the noise: the average amount of information received at D but not generated at S.
Equivocation E and noise N are measures of the dependence between source S and destination D:
• If S and D are completely independent, the values of E and N are maximum ($E = H(S)$ and $N = H(D)$), and the value of $H(S;D)$ is minimum ($H(S;D) = 0$).
• If the dependence between S and D is maximum, the values of E and N are minimum ($E = N = 0$), and the value of $H(S;D)$ is maximum ($H(S;D) = H(S) = H(D)$).
The values of E and N are functions not only of the source and the destination, but also of the communication channel CH, which introduces the possibility of errors during the transmission: CH is defined by the matrix $[p(d_j \mid s_i)]$, where $p(d_j \mid s_i)$ is the conditional probability of the occurrence of $d_j$ in the destination D given that $s_i$ occurred in the source S, and the elements in any row sum to 1 (see Figure 3).
Mutual information, equivocation and noise are related by:
$$H(S;D) = H(S) - E = H(D) - N$$
Figure 2: Relationship between the entropies of the source and the destination (diagram labels: $H(S)$, $H(D)$, $H(S;D)$)
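The quantities involved in this relationship can be computed from the source probabilities and the channel matrix. The sketch below is an illustration under one explicit assumption that the text does not spell out: the noise N is taken to be the conditional entropy of D given S, that is, the average of the entropies of the rows of the channel matrix. The equivocation E then follows from the relation H(S;D) = H(S) - E = H(D) - N. The names `entropy` and `channel_quantities` are illustrative.

```python
import math
from typing import Dict, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Entropy in bits; zero-probability states are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def channel_quantities(p_s: Sequence[float],
                       channel: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Compute H(S), H(D), N, H(S;D) and E, with channel[i][j] = p(d_j | s_i)."""
    n, m = len(p_s), len(channel[0])
    # Probability of each destination state: p(d_j) = sum_i p(s_i) p(d_j | s_i)
    p_d = [sum(p_s[i] * channel[i][j] for i in range(n)) for j in range(m)]
    h_s = entropy(p_s)
    h_d = entropy(p_d)
    # Assumption: N is the average entropy of the rows of the channel matrix.
    noise = sum(p_s[i] * entropy(channel[i]) for i in range(n))
    mutual = h_d - noise            # H(S;D) = H(D) - N
    equivocation = h_s - mutual     # E = H(S) - H(S;D)
    return {"H(S)": h_s, "H(D)": h_d, "N": noise,
            "H(S;D)": mutual, "E": equivocation}


noiseless = [[1.0, 0.0], [0.0, 1.0]]      # maximum dependence
independent = [[0.5, 0.5], [0.5, 0.5]]    # D does not depend on S
print(channel_quantities([0.5, 0.5], noiseless))
print(channel_quantities([0.5, 0.5], independent))
```

For the noiseless channel the sketch gives E = N = 0 and H(S;D) = H(S) = H(D), while for the independent case it gives H(S;D) = 0 with E = H(S) and N = H(D), matching the two limiting cases listed above.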
As Shannon stresses, in communication “[t]he significant aspect is that the actual message is one selected from a set of possible messages.” (1948: 379). Therefore, it is not necessary that the source S and the destination D be systems of the same kind: for instance, S may be a die and D a panel of lights; or S may be a device that produces words in English and D a device that operates a machine. In Shannon’s theory, the success criterion for communication is given by a mapping from the set of the states $s_i$ of the source S to the set of the states $d_j$ of the destination D. This mapping should be one-to-one (deterministic channel, Figure 4) or at least one-to-many (noisy channel, Figure 5), since in these cases the occurrence of a given state at D makes it possible to identify which state occurred at S. If the mapping is many-to-one, the occurrence of a state at D fails to identify the state that occurred at S (channel with equivocation, Figure 6).
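The three kinds of mapping can be told apart by inspecting which entries of the channel matrix are nonzero. The sketch below is only an illustration: the function name `classify_channel`, the returned labels and the example matrices are assumptions made here, with rows standing for source states and columns for destination states.

```python
from typing import Sequence


def classify_channel(channel: Sequence[Sequence[float]]) -> str:
    """Classify a channel matrix, with channel[i][j] = p(d_j | s_i)."""
    rows, cols = len(channel), len(channel[0])
    # How many source states can produce each destination state.
    sources_per_dest = [sum(1 for i in range(rows) if channel[i][j] > 0)
                        for j in range(cols)]
    # How many destination states each source state can produce.
    dests_per_source = [sum(1 for p in row if p > 0) for row in channel]
    if any(count > 1 for count in sources_per_dest):
        # Some state at D is compatible with several states at S.
        return "channel with equivocation"
    if all(count == 1 for count in dests_per_source):
        return "deterministic channel"   # one-to-one mapping
    return "noisy channel"               # one-to-many mapping


print(classify_channel([[1.0, 0.0], [0.0, 1.0]]))            # deterministic channel
print(classify_channel([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]))  # noisy channel
print(classify_channel([[1.0, 0.0], [1.0, 0.0]]))            # channel with equivocation
```

In the first two cases every state at D points back to exactly one state at S, which is the success criterion described above.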
5.- Information and coding
Once the message is produced by the source S, the transmitter T turns it into a signal that can be transmitted through the channel. In certain cases, the transmitter is a mere transducer; for instance, in classical telephony it changes sound pressure into an electrical current. However, [...] source into a code-word, that is, a string of symbols, which usually are the binary digits 0 and 1.
Figure 3: Channel $[p(d_j \mid s_i)]$
Figure 4: Example of
Let us suppose that our source produces letters and that the messages are words in English. One strategy is to encode the letters with code-words of the same length. However, not all the letters are equally probable; for instance, the letter E occurs more frequently than X. Therefore, a better strategy is to code the most frequent letters with shorter code-words and the less frequent letters with longer code-words. Precisely this idea inspired Shannon to formulate the so-called Noiseless-Channel Coding Theorem, also known as First Shannon Theorem. According to this theorem, in the case of an ideal code, the minimum average length of the code-words is given precisely by the entropy of the source $H(S)$.
Let us consider an example given by Shannon himself: a source that produces a sequence of letters chosen from among A, B, C, D, with probabilities 1/2, 1/4, 1/8 and 1/8, respectively. If the four letters are encoded with code-words of the same length, for instance,
A 00
B 01
C 10
D 11
then the average length of the code-words is 2. But on the basis of the First Shannon theorem we know that there is a better coding. In fact, since the entropy of the source is:
$$H(S) = -\sum_{i=1}^{n} p(s_i) \log p(s_i) = -\left( \frac{1}{2}\log\frac{1}{2} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{8}\log\frac{1}{8} + \frac{1}{8}\log\frac{1}{8} \right) = \frac{7}{4} \text{ bits}$$
there exists a coding for which the average length of the code-words is 7/4, for instance, this one:
A 0
B 10
C 110
D 111
where clearly the most probable letter is encoded with a shorter code-word and the less probable letters are encoded with longer code-words. The average length of these code-words is computed as:
$$\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = \frac{7}{4}$$
This example shows that, although defined as a statistical property of the source, the entropy $H(S)$ also measures how much the messages produced by the source can be compressed. This result is highly relevant from a technical viewpoint, since it shows that the resources needed to transmit information reliably are fewer than was pre-theoretically supposed.
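The numbers in this example can be checked with a short Python sketch. The letters A, B, C, D are assigned, in that order, the probabilities 1/2, 1/4, 1/8, 1/8 and the variable-length code-words 0, 10, 110, 111; the fixed-length assignment 00, 01, 10, 11 is an assumed illustration of a code with code-words of the same length, and all function names are illustrative.

```python
import math

PROBS = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}
FIXED = {"A": "00", "B": "01", "C": "10", "D": "11"}      # assumed fixed-length code
VARIABLE = {"A": "0", "B": "10", "C": "110", "D": "111"}  # variable-length code


def entropy(probs: dict) -> float:
    """Entropy of the source in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def average_length(code: dict, probs: dict) -> float:
    """Average code-word length, weighted by the letter probabilities."""
    return sum(probs[letter] * len(code[letter]) for letter in probs)


def encode(message: str, code: dict) -> str:
    return "".join(code[letter] for letter in message)


def decode(bits: str, code: dict) -> str:
    """Decode a bit string; valid when no code-word is a prefix of another."""
    inverse = {word: letter for letter, word in code.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:        # a complete code-word has been read
            letters.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code-word")
    return "".join(letters)


print(entropy(PROBS))                   # 1.75
print(average_length(FIXED, PROBS))     # 2.0
print(average_length(VARIABLE, PROBS))  # 1.75
print(encode("ABAD", VARIABLE))         # 0100111
print(decode("0100111", VARIABLE))      # ABAD
```

Decoding works bit by bit because no code-word of the variable-length code is the beginning of another one, so the end of each code-word can be recognized without separators.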
Finally, it is worth mentioning the Noisy-Channel Coding Theorem or Second Shannon Theorem. This theorem shows that the probability of error in the transmission can be kept close to zero to the extent that the rate of information transmission over a channel is maintained below a property named channel capacity, which can be computed in terms of the mutual information $H(S;D)$.
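As a rough numerical illustration, the capacity of a channel with two source states can be estimated by searching for the source distribution that maximizes the mutual information. The sketch assumes the standard definition of capacity as that maximum, which the text does not spell out, and it also assumes that N is the average entropy of the rows of the channel matrix. The function names, the grid search and the example channel (each state is transmitted correctly with probability 0.9) are illustrative choices.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Entropy in bits; zero-probability states are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(p_s: Sequence[float],
                       channel: Sequence[Sequence[float]]) -> float:
    """H(S;D) = H(D) - N, with channel[i][j] = p(d_j | s_i)."""
    n, m = len(p_s), len(channel[0])
    p_d = [sum(p_s[i] * channel[i][j] for i in range(n)) for j in range(m)]
    noise = sum(p_s[i] * entropy(channel[i]) for i in range(n))
    return entropy(p_d) - noise


def binary_capacity(channel: Sequence[Sequence[float]], steps: int = 1000) -> float:
    """Estimate the capacity of a two-input channel by a grid search over p(s_1)."""
    best = 0.0
    for k in range(steps + 1):
        p = k / steps
        best = max(best, mutual_information([p, 1 - p], channel))
    return best


symmetric = [[0.9, 0.1], [0.1, 0.9]]    # hypothetical noisy channel
print(binary_capacity(symmetric))       # close to 1 - entropy([0.9, 0.1])
print(binary_capacity([[1.0, 0.0], [0.0, 1.0]]))   # noiseless channel: 1.0
```

For the symmetric example the maximum is reached when the two source states are equally likely, and the estimate is well below the 1 bit per use of the noiseless channel.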
6.- Interpreting the concept of information
Despite its formal precision and its many applications, Shannon’s theory still offers an active terrain of debate about its interpretation.
The concept most usually connected with the notion of information is that of knowledge: information provides knowledge, modifies the state of knowledge of those who receive it (e.g. Dretske 1981, Dunn 2001). In general, this epistemic reading of information is adopted by authors who embrace a subjective interpretation of probabilities (Caves, Fuchs and Schack 2002). A different view is that which considers information as a physical magnitude. This is the position of many physicists and most engineers, for whom the essential feature of information consists in its capacity to be generated at one place and transmitted to another, to be accumulated, stored and converted from one form to another (Landauer 1991, 1996). So, the link with knowledge is not a central issue, since the transmission of information can be used only for control purposes, such as operating a device at the destination end by controlling the source. In general, this view appears strongly linked with the dictum ‘no information without representation’: the transmission of information necessarily requires an information-bearing signal, that is, a physical process propagating from one point of space to another. Therefore, information tends to be conceived as a physical entity with the same ontological status as energy (Stonier 1990, 1996).
has been observed by noting the absence of another event. In this context, observation without direct physical interaction between the observed event and an appropriate destination is only admissible from an epistemic interpretation of information. According to a physical interpretation, by contrast, without interaction there is no observation: the event is only inferred (Lombardi 2004, Lombardi, Holik and Vanni 2014).
According to a third position, information is a formal item. There are no sources, destinations or signals, but only random variables and probability distributions over their possible values: the word ‘information’ does not belong to the language of factual sciences; Shannon’s theory is a new chapter of the mathematical theory of probability (Khinchin 1957, Reza 1961). It is not only that messages have no semantic content, but that the concept of information itself is purely mathematical. This syntactic nature is precisely what makes the concept a powerful tool for science. The relationship between the word ‘information’ and the different views about its meaning is the logical relationship between a formal term and its interpretations, each one of which endows it with a specific referential content (Lombardi, Fortin and Vanni 2014). The physical view is appropriate in communication, where information is transmitted by physical means. But this is not the only physical interpretation:
$H(S)$ may also represent the Boltzmann entropy of S. There are also non-traditional applications, such as those based on the relation between Shannon entropy and gambling or investment in the stock market (Cover and Thomas 1991). In turn, the epistemic view may be applied in cognitive sciences, where the concept of information has been used to conceptualize the human abilities of acquiring knowledge (Hoel, Albantakis and Tononi 2013). This formal view is in resonance not only with the wide presence of the concept of information in all contemporary sciences, but also with Shannon’s position when claiming: “The word ‘information’ has been given different meanings by various writers in the general field of information theory. [...] It is hardly to be expected that a single concept of information would satisfactorily account for the numerous possible applications of this general field.” (Shannon 1993: 180).
7.- References
Caves, C. M., Fuchs, C. A., and Schack, R. (2002). “Unknown Quantum States: The Quantum de Finetti Representation.” Journal of Mathematical Physics, 43: 4537-4559.
Dretske, F. (1981). Knowledge & the Flow of Information. Cambridge MA: MIT Press.
Dunn, J. M. (2001). “The Concept of Information and the Development of Modern Logic.” Pp. 423-427, in W. Stelzner (ed.), Non-classical Approaches in the Transition from Traditional to Modern Logic. Berlin: de Gruyter.
Hartley, R. V. L. (1928). “Transmission of Information.” Bell System Technical Journal, 7: 535-563.
Hoel, E., Albantakis, L. and Tononi, G. (2013). “Quantifying Causal Emergence Shows that Macro Can Beat Micro.” Proceedings of the National Academy of Sciences, 110: 19790-19795.
Khinchin, A. (1957). Mathematical Foundations of Information Theory. New York: Dover.
Landauer, R. (1991). “Information is Physical.” Physics Today, 44: 23-29.
Landauer, R. (1996). “The Physical Nature of Information.” Physics Letters A, 217: 188-193.
Lombardi, O. (2004). “What is Information?” Foundations of Science, 9: 105-134.
Lombardi, O., Fortin, S. and Vanni, L. (2014). “A Pluralist View About Information.” Philosophy of Science, forthcoming.
Lombardi, O., Holik, F. and Vanni, L. (2014). “What is Shannon Information?” PhilSci Archive #10910.
Reza, F. (1961). Introduction to Information Theory. New York: McGraw-Hill.
Shannon, C. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27: 379-423.
Shannon, C. (1993). Collected Papers, N. Sloane and A. Wyner (eds.). New York: IEEE Press.
Shannon, C. and Weaver, W. (1949). The Mathematical Theory of Communication. Urbana and Chicago: University of Illinois Press.
Shapere, D. (1982). “The Concept of Observation in Science and Philosophy.” Philosophy of Science, 49: 485-525.
Stonier, T. (1990). Information and the Internal Structure of the Universe: An Exploration into Information Physics. New York-London: Springer.