5.1 – Replacement via encoding scheme

In a nutshell, the game is to choose binary words s1,...,sm into which the original file can be parsed (divided up), and then to replace each occurrence of each si in the parsed file with another binary word wi; the wi are to be chosen so that the new file is shorter than the original, but the original is recoverable from the new. This kind of game is sometimes called zeroth-order replacement. You will see how “zero” gets into it later, when we consider higher order replacement. The assignment of the wi to the si is, as in Chapter 4, called an encoding scheme.

For example, suppose we take

$$s_1 = 0, \quad s_2 = 10, \quad s_3 = 110, \quad s_4 = 1110, \quad s_5 = 1111, \tag{$\ast$}$$

and the original “file” is 111110111111101110111101110110. (Of course this is an unrealistic example!) The file can be parsed into the string s5 s2 s5 s4 s4 s5 s1 s4 s3. (Notice that there was no choice in the matter of parsing; the original binary word is uniquely parsable into a string of the si. Notice also that we are avoiding a certain difficulty in this example. If we had one, two, or three more 1’s at the end of the original file on the right, there is no way that we could have incorporated them into the source word, the string of si’s.)
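The parsing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SOURCE` and `parse` are hypothetical, and the greedy left-to-right strategy relies on the fact that no word in (∗) is a prefix of another, so at most one source word can match at any position.

```python
# Source words from (*); the names SOURCE and parse are illustrative only.
SOURCE = {"s1": "0", "s2": "10", "s3": "110", "s4": "1110", "s5": "1111"}

def parse(word, source=SOURCE):
    """Greedy left-to-right parse; returns (source-word names, leftover bits)."""
    names = []
    pos = 0
    while True:
        for name, s in source.items():
            if word.startswith(s, pos):
                names.append(name)
                pos += len(s)
                break
        else:
            # No source word starts at pos: the rest cannot be parsed further.
            return names, word[pos:]

original = "111110111111101110111101110110"
names, leftover = parse(original)
print(" ".join(names))   # s5 s2 s5 s4 s4 s5 s1 s4 s3
print(repr(leftover))    # ''
```

Calling `parse(original + "11")` returns the same nine names together with the leftover `"11"`, which is the difficulty mentioned in the parenthetical remark above.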

Now, suppose we encode the si according to the encoding scheme

$$s_1 \to 1111, \quad s_2 \to 1110, \quad s_3 \to 110, \quad s_4 \to 10, \quad s_5 \to 0. \tag{$\ast\ast$}$$

The resulting new file is 01110010100111110110. Notice that this file is 20 bits long, while the original is 30 bits long, so we have a compression ratio of 3/2.
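The replacement step and the resulting ratio can be checked mechanically. The sketch below assumes the parsed string from the example is already known; the names `SOURCE`, `CODE`, and `parsed` are illustrative, not part of the text.

```python
from fractions import Fraction

SOURCE = {"s1": "0", "s2": "10", "s3": "110", "s4": "1110", "s5": "1111"}  # (*)
CODE = {"s1": "1111", "s2": "1110", "s3": "110", "s4": "10", "s5": "0"}    # (**)

# The parsed form of the example file.
parsed = ["s5", "s2", "s5", "s4", "s4", "s5", "s1", "s4", "s3"]

old_file = "".join(SOURCE[name] for name in parsed)  # rebuild the original bits
new_file = "".join(CODE[name] for name in parsed)    # replace each s_i by w_i

ratio = Fraction(len(old_file), len(new_file))
print(new_file)                                # 01110010100111110110
print(len(old_file), len(new_file), ratio)     # 30 20 3/2
```

The scheme pays off here because the longest source word, s5, occurs most often and receives the shortest replacement.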

(Not bad for an unrealistically small example! But, of course, for any positive number R it is possible to make up an example like the preceding in which the compression ratio is greater than R. See Exercise 5.1.3.)

But is the old file recoverable from the new file? Ask a friend to translate the new file into a string of the symbols si, according to (∗∗). Be sure to hint that scanning left to right is a good idea. Your friend should have no trouble in translating the new file into s5 s2 s5 s4 s4 s5 s1 s4 s3, from which you (or your friend!) can recover the old file by replacing the sj with binary words according to (∗).
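The friend's left-to-right translation can be sketched as follows. The function name `decode` and the dictionaries are illustrative assumptions. The loop emits a symbol as soon as the bits read so far match a code word, which is safe here only because no code word in (∗∗) is a prefix of another.

```python
SOURCE = {"s1": "0", "s2": "10", "s3": "110", "s4": "1110", "s5": "1111"}  # (*)
CODE = {"s1": "1111", "s2": "1110", "s3": "110", "s4": "10", "s5": "0"}    # (**)

def decode(bits, code=CODE):
    """Translate a coded bit string back into a list of source-word names."""
    inverse = {w: name for name, w in code.items()}
    names, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:          # a complete code word has been read
            names.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("input ends in the middle of a code word")
    return names

new_file = "01110010100111110110"
names = decode(new_file)
old_file = "".join(SOURCE[name] for name in names)
print(" ".join(names))   # s5 s2 s5 s4 s4 s5 s1 s4 s3
print(old_file)          # 111110111111101110111101110110
```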

Those who have perused Chapter 4, or at least the first two sections, will not be surprised about the first stage of the recovery process, because the encoding scheme (∗∗) satisfies the prefix condition. You might notice, as well, that the definition of the si in (∗), regarded as an encoding scheme, satisfies the prefix condition, as well. This is no accident!

We shall review what we need to know about the prefix condition in the next section. For now, we single out a property of s1,...,s5 in the preceding example that may not be so obvious, but which played an important role in making things “work” in the example.

Definition A list s1,...,sm of binary words has the strong parsing property (SPP) if and only if every binary word W is uniquely a concatenation,

$$W = s_{i_1} \cdots s_{i_t} v,$$

of some of the si and a (possibly empty) word v, with no si as a prefix (see Section 5.2), satisfying $\mathrm{lgth}(v) < \max_{1 \le i \le m} \mathrm{lgth}(s_i)$.

The word v is called the leave of the parsing of W into a sequence of the si. The uniqueness requirement says that if $W = s_{i_1} \cdots s_{i_t} v = s_{j_1} \cdots s_{j_r} u$, with neither u nor v having any of the si as a prefix and $\mathrm{lgth}(v), \mathrm{lgth}(u) < \max_i \mathrm{lgth}(s_i)$, then t = r, and $i_1 = j_1, \ldots, i_t = j_r$, and v = u.

Notice that in any list with the SPP the si must be distinct (why?), and any rearrangement of the list will have the SPP as well. Therefore, we will allow ourselves the convenience of sometimes attributing the SPP to finite sets of binary words; such a set has the SPP if and only if any list of its distinct elements has the SPP.

To see that s1,...,s5 in (∗) in the preceding example have the SPP, think about trying to parse a binary word W into a string of the si, scanning left to right. Because of the form of s1,...,s5, the parsing procedure can be described thus: scan until you come to the first zero, or until you have scanned four ones. Jot down which sj you have identified and resume scanning. Pretty clearly this procedure will parse any W into a string of the si with leave v = λ, 1, 11, or 111 (with λ denoting the empty string). It becomes clear that this parsing is always possible, and most would agree that the parsing is unique, on the grounds that there is never any choice or uncertainty about what happens next during the parsing. We will indicate a logically rigorous proof of uniqueness, in a general setting, in the exercises at the end of the next section.
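As an experiment, not a proof, the scanning procedure can be run on every binary word up to some small length. The sketch below uses hypothetical names (`scan_parse`, `count_parsings`) and an arbitrary cutoff of length 10. It confirms that the procedure always succeeds, that only the four leaves listed above occur, and that a brute-force count finds exactly one parsing per word.

```python
from itertools import product

S = ["0", "10", "110", "1110", "1111"]   # the source words of (*)
MAX_LEN = max(len(s) for s in S)

def scan_parse(word):
    """Scan left to right; cut after the first zero or after four ones."""
    pieces, current = [], ""
    for bit in word:
        current += bit
        if bit == "0" or current == "1111":
            pieces.append(current)
            current = ""
    return pieces, current   # the unfinished tail is the leave

def count_parsings(word):
    """Count the ways to write word as source words followed by a valid leave."""
    total = 0
    if len(word) < MAX_LEN and not any(word.startswith(s) for s in S):
        total += 1   # stop here: the remainder qualifies as a leave
    for s in S:
        if word.startswith(s):
            total += count_parsings(word[len(s):])
    return total

leaves = set()
for n in range(11):                      # all words of length 0 through 10
    for bits in product("01", repeat=n):
        word = "".join(bits)
        pieces, leave = scan_parse(word)
        assert all(p in S for p in pieces) and "".join(pieces) + leave == word
        assert count_parsings(word) == 1
        leaves.add(leave)

print(sorted(leaves, key=len))   # ['', '1', '11', '111']
```

A finite check of this kind says nothing about longer words; the general argument is the one deferred to the exercises.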

Which sets of binary words have the SPP? We shall answer this question fully in the next section. But there is a large class of sets with the SPP that ought to be kept in mind, not least because these are the source alphabets that are commonly used in current data compression programs, not only with the zeroth-order replacement strategy under discussion here, but also with all combinations of higher order, adaptive, and/or arithmetic methods, to be looked at later. All of these methods start by parsing the original file W into a source string, a long word over the source alphabet S. The methods differ in what is then done with the source string.

The most common sort of choice for source alphabet is: S = {0,1}^L, the set of binary words of some fixed length L. Since computer files are commonly organized into bytes, binary words of length 8, the choice L = 8 is very common. Also, L = 12, a byte-and-a-half, seems popular.

If S = {0,1}^L, the process of parsing a binary word W into a source string amounts to chopping W into segments of length L. If L = 8 and the original file is already organized into bytes, that’s that; the parsing is immediate. There is another good reason for the choice L = 8. When information is stored byte by byte, it very often happens that not all 8 bits in the byte are needed to record the datum, whatever it is. For instance, suppose there are only 55 different basic data messages to store, presumably in some significant order. You need only 6 bits (for a total of 2^6 = 64 possibilities) to accommodate the storage task, yet it is customary to store 1 datum per byte. Thus one can expect a compression ratio of at least 8/6 = 4/3 in this situation, just by deleting the 2 unused bits per byte.
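A small sketch of the chopping step and of the 8/6 saving follows. Everything specific in it is an assumption made for illustration: the helper name `chop`, the sample values in `data`, and the convention that each datum is stored as an unsigned integer below 55, so that the two unused bits are the leading bits of each byte.

```python
from fractions import Fraction

def chop(word, L):
    """Parse word over S = {0,1}^L: blocks of length L plus a shorter leave."""
    usable = len(word) - len(word) % L
    blocks = [word[i:i + L] for i in range(0, usable, L)]
    return blocks, word[usable:]

# Hypothetical data: four messages, each one of 55 possible values (0..54).
data = [0, 54, 17, 3]
byte_file = "".join(format(x, "08b") for x in data)   # one datum per byte

blocks, leave = chop(byte_file, 8)
packed = "".join(block[2:] for block in blocks)       # drop 2 leading zero bits

# The 6-bit file still determines the data.
restored = [int(packed[i:i + 6], 2) for i in range(0, len(packed), 6)]

print(len(byte_file), len(packed))                    # 32 24
print(Fraction(len(byte_file), len(packed)))          # 4/3
print(restored == data)                               # True
```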

Thus the historical accident that files are, sometimes inefficiently, organized into bytes makes the choice L = 8 rather shrewd. The best zeroth-order replacement method, Huffman encoding, takes advantage of this inefficiency, and more. In the hypothetical situation mentioned above, we might expect something more like $8/\log_2 55 \approx 1.38$ as a compression ratio, using S = {0,1}^8 and zeroth-order simple Huffman encoding. Details to follow!
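The two ratios quoted for the 55-message situation are easy to compare numerically. One reading, offered here as an assumption and not stated in the text, is that the figure based on log₂ 55 corresponds to the 55 messages being about equally likely.

```python
import math

messages = 55

# Whole bits needed to give each of the 55 messages its own fixed-length label.
fixed_bits = math.ceil(math.log2(messages))
print(fixed_bits)                                  # 6

# Ratio from simply dropping the unused bits of each byte.
print(round(8 / fixed_bits, 3))                    # 1.333

# The larger figure quoted for Huffman encoding.
print(round(8 / math.log2(messages), 3))           # 1.384
```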

Even though S = {0,1}^L is the most common sort of choice of source alphabet, we do not want to limit our options, so we will continue to allow all S with the SPP; these S will be completely characterized in the next section.

A problem that may have occurred to the attentive anxious reader is: what effect should the leave have in the calculation of the compression ratio? In real life, the original binary word W to be parsed and compressed is quite long; saving and pointing out the leave may require many more bits than the leave itself, but the added length to the compressed file will generally be negligible, compared to the total length of the file. For this reason, we shall ignore the leave in the calculation of the compression ratio. Therefore, if S = {s1,...,sm} and a file W parses into $W = s_{i_1} \cdots s_{i_t} v$, $\mathrm{lgth}(v) < \max_i \mathrm{lgth}(s_i)$, and if the si are replaced according to the encoding scheme si → wi, i = 1,...,m, the compression ratio will be

$$\frac{\sum_{j=1}^{t} \mathrm{lgth}(s_{i_j})}{\sum_{j=1}^{t} \mathrm{lgth}(w_{i_j})},$$

regardless of v, by convention.
