
template <typename Type>
void inverse_gray_permute(Type *f, ulong n)
{
    [--snip--]
        // --- do cycle: ---
        ulong i = z | b.next();  // start of cycle
        Type t = f[i];  // save start value
        ulong g = gray_code(i);  // next in cycle
        for (ulong k=cl-1; k!=0; --k)
        {
            f[i] = f[g];
            i = g;
            g = gray_code(i);
        }
        f[i] = t;
        // --- end (do cycle) ---
    [--snip--]
}

The Gray permutation is used with certain Walsh transforms; see section 23.7 on page 474.

2.12.3 Performance of the routines

We use the convention that the time for an array reversal is 1.0. The operation is completely cache-friendly and therefore fast. A simple benchmark with 16 MB arrays gives:

arg 1: 21 == ldn  [Using 2**ldn elements]  default=21
arg 2: 10 == rep  [Number of repetitions]  default=10
Memsize = 16384 kiloByte == 2097152 doubles

reverse(f,n);               dt= 0.0103524   MB/s= 1546   rel= 1
revbin_permute(f,n);        dt= 0.0674235   MB/s= 237    rel= 6.51282
revbin_permute0(f,n);       dt= 0.061507    MB/s= 260    rel= 5.94131
gray_permute(f,n);          dt= 0.0155019   MB/s= 1032   rel= 1.49742
inverse_gray_permute(f,n);  dt= 0.0150641   MB/s= 1062   rel= 1.45512

The revbin permutation takes about 6.5 units because its memory access pattern makes very poor use of the cache. The Gray permutation needs only about 1.5 units. The gap widens on machines whose memory is slow relative to the CPU.

The relative speeds are quite different for small arrays. With 16 kB (2048 doubles) we obtain:

arg 1: 11 == ldn  [Using 2**ldn elements]  default=21
arg 2: 100000 == rep  [Number of repetitions]  default=512
Memsize = 16 kiloByte == 2048 doubles

reverse(f,n);               dt=1.88726e-06  MB/s= 8279  rel= 1
revbin_permute(f,n);        dt=3.22166e-06  MB/s= 4850  rel= 1.70706
revbin_permute0(f,n);       dt=2.69212e-06  MB/s= 5804  rel= 1.42647
gray_permute(f,n);          dt=4.75155e-06  MB/s= 3288  rel= 2.51769
inverse_gray_permute(f,n);  dt=3.69237e-06  MB/s= 4232  rel= 1.95647

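Because the array is small, the cache problems are gone. Now both revbin routines (about 1.4 and 1.7 units) are faster than both Gray permutation routines (about 2.0 and 2.5 units).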

2.13 The reversed Gray permutation

The reversed Gray permutation of a length-n array permutes the elements in the same way as the Gray permutation would permute the upper half of an array of length 2n. The array size n must be a power of 2. An implementation is [FXT: perm/grayrevpermute.h]:

template <typename Type>
inline void gray_rev_permute(const Type *f, Type * restrict g, ulong n)
// gray_rev_permute() =^=
// { reverse();  gray_permute(); }
{
    for (ulong k=0, m=n-1; k<n; ++k, --m)  g[gray_code(m)] = f[k];
}
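The claim that the reversed Gray permutation acts like the Gray permutation on the upper half of a length-2n array can be checked numerically. The following is a minimal sketch, not FXT code: it assumes std::size_t in place of ulong and std::vector in place of raw arrays, and the names gray_permute_copy(), gray_rev_permute_copy() and upper_half_matches() are hypothetical. The loop in gray_rev_permute_copy() is the one of gray_rev_permute() above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Binary reflected Gray code of x.
inline std::size_t gray_code(std::size_t x) { return x ^ (x >> 1); }

// Gray permutation, out of place: the element f[k] moves to position gray_code(k).
template <typename T>
std::vector<T> gray_permute_copy(const std::vector<T>& f)
{
    const std::size_t n = f.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of 2
    std::vector<T> g(n);
    for (std::size_t k = 0; k < n; ++k) g[gray_code(k)] = f[k];
    return g;
}

// Reversed Gray permutation, out of place (same loop as gray_rev_permute()).
template <typename T>
std::vector<T> gray_rev_permute_copy(const std::vector<T>& f)
{
    const std::size_t n = f.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of 2
    std::vector<T> g(n);
    for (std::size_t k = 0, m = n - 1; k < n; ++k, --m) g[gray_code(m)] = f[k];
    return g;
}

// True if the reversed Gray permutation of length n moves elements exactly
// as the Gray permutation of length 2n moves the upper half of its array.
bool upper_half_matches(std::size_t n)
{
    std::vector<std::size_t> a(n), b(2 * n, 0);
    for (std::size_t k = 0; k < n; ++k) { a[k] = k; b[n + k] = k; }  // label the elements
    const std::vector<std::size_t> r = gray_rev_permute_copy(a);
    const std::vector<std::size_t> g = gray_permute_copy(b);
    for (std::size_t k = 0; k < n; ++k)
        if (r[k] != g[n + k]) return false;
    return true;
}
```

The check works because gray_code(n+j) = n + (gray_code(j) XOR n/2) for 0 <= j < n, so the Gray permutation of length 2n never moves an element from one half to the other.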

[Figure: two permutation matrices for n = 16 (rows 0 to 15); the matrix entries are not reproduced here.]

Figure 2.13-A: Permutation matrices of the reversed Gray permutation (left) and its inverse (right).

Cycles of the reversed Gray permutation for n = 64:

0: ( 0, 63, 21, 38, 4, 56, 16, 32)  #=8
1: ( 1, 62, 20, 39, 5, 57, 17, 33)  #=8
2: ( 2, 60, 23, 37, 6, 59, 18, 35)  #=8
3: ( 3, 61, 22, 36, 7, 58, 19, 34)  #=8
4: ( 8, 48, 31, 42, 12, 55, 26, 44)  #=8
5: ( 9, 49, 30, 43, 13, 54, 27, 45)  #=8
6: ( 10, 51, 29, 41, 14, 52, 24, 47)  #=8
7: ( 11, 50, 28, 40, 15, 53, 25, 46)  #=8
64 elements in 8 nontrivial cycles.
cycle length is == 8

No fixed points.
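The cycle structure listed above (eight cycles of length 8 for n = 64, no fixed points) can be confirmed by following each cycle. A small sketch, assuming that the element at position k moves to gray_code(n-1-k) as in the out-of-place routine; gray_rev_cycle_lengths() is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

inline std::size_t gray_code(std::size_t x) { return x ^ (x >> 1); }

// Lengths of all cycles of the reversed Gray permutation of length n,
// where the element at position k moves to position gray_code(n-1-k).
std::vector<std::size_t> gray_rev_cycle_lengths(std::size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of 2
    std::vector<bool> seen(n, false);
    std::vector<std::size_t> lengths;
    for (std::size_t s = 0; s < n; ++s)
    {
        if (seen[s]) continue;             // s already belongs to a known cycle
        std::size_t len = 0;
        std::size_t k = s;
        do { seen[k] = true; k = gray_code(n - 1 - k); ++len; } while (k != s);
        lengths.push_back(len);
    }
    return lengths;
}
```

A fixed point would show up as a cycle of length 1.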

Adding 64 to the indices reproduces the cycles in the upper half of the array as in gray_permute(f, 128). The in-place version of the permutation routine is:

template <typename Type>
void gray_rev_permute(Type *f, ulong n)
// n must be a power of 2, n<=2**(BITS_PER_LONG-2)
{
    f -= n;  // note!

    ulong z = 1;  // mask for cycle maxima
    ulong v = 0;  // ~z
    ulong cl = 1;  // cycle length
    ulong ldm, m;
    for (ldm=1, m=2; m<=n; ++ldm, m<<=1)
    {
        z <<= 1;  v <<= 1;
        if ( is_pow_of_2(ldm) )  { ++z;  cl<<=1; }
        else  ++v;
    }

    ulong tv = v, tu = 0;  // cf. bitsubset.h
    do
    {
        tu = (tu-tv) & tv;
        ulong i = z | tu;  // start of cycle

        // --- do cycle: ---
        ulong g = gray_code(i);
        Type t = f[i];
        for (ulong k=cl-1; k!=0; --k)
        {
            Type tt = f[g];
            f[g] = t;
            t = tt;
            g = gray_code(g);
        }
        f[g] = t;
        // --- end (do cycle) ---
    }
    while ( tu );
}

The routine for the inverse permutation again differs only in the way the cycles are processed:

template <typename Type>
void inverse_gray_rev_permute(Type *f, ulong n)
{
    [--snip--]
        // --- do cycle: ---
        Type t = f[i];  // save start value
        ulong g = gray_code(i);  // next in cycle
        for (ulong k=cl-1; k!=0; --k)
        {
            f[i] = f[g];
            i = g;
            g = gray_code(i);
        }
        f[i] = t;
        // --- end (do cycle) ---
    [--snip--]
}

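Let $G$ denote the Gray permutation, $\bar{G}$ the reversed Gray permutation, $r$ the reversal, $h$ the swap of the upper and lower halves, and $X_a$ the XOR permutation (with parameter $a$) from section 2.11 on page 127. Products are applied from right to left, so $G\,r$ means "first reverse, then Gray permute", as in the comment of gray_rev_permute(). We have

$$\bar{G} = G\,r = h\,G \tag{2.13-1a}$$

$$\bar{G}^{-1} = r\,G^{-1} \tag{2.13-1b}$$

$$\bar{G}^{-1} G = G^{-1} \bar{G} = r = X_{n-1} \tag{2.13-1c}$$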
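The relations (2.13-1a) to (2.13-1c) can be checked by brute force. In this sketch (not from FXT) a permutation is given as an index map p, move_by() moves the element at position k to p(k), and fetch_by() applies the inverse. Products are evaluated right to left, so G r means "first reverse, then Gray permute", matching the comment in gray_rev_permute(). All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<std::size_t>;
using IndexMap = std::function<std::size_t(std::size_t)>;

inline std::size_t gray_code(std::size_t x) { return x ^ (x >> 1); }

// w[p(k)] = v[k]: the element at position k moves to position p(k).
Vec move_by(const Vec& v, const IndexMap& p)
{
    Vec w(v.size());
    for (std::size_t k = 0; k < v.size(); ++k) w[p(k)] = v[k];
    return w;
}

// w[k] = v[p(k)]: undoes move_by(v, p).
Vec fetch_by(const Vec& v, const IndexMap& p)
{
    Vec w(v.size());
    for (std::size_t k = 0; k < v.size(); ++k) w[k] = v[p(k)];
    return w;
}

// Checks relations (2.13-1a) to (2.13-1c) for a power of 2 n >= 2.
bool check_gray_rev_identities(std::size_t n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    const IndexMap G  = [](std::size_t k) { return gray_code(k); };
    const IndexMap Gb = [n](std::size_t k) { return gray_code(n - 1 - k); };  // G-bar
    const IndexMap r  = [n](std::size_t k) { return n - 1 - k; };
    const IndexMap h  = [n](std::size_t k) { return k ^ (n / 2); };
    const IndexMap X  = [n](std::size_t k) { return k ^ (n - 1); };            // X_{n-1}

    Vec v(n);
    for (std::size_t k = 0; k < n; ++k) v[k] = k;
    const Vec rv = move_by(v, r);
    const Vec gbv = move_by(v, Gb);

    const bool a = gbv == move_by(move_by(v, r), G)           // G-bar = G r
                && gbv == move_by(move_by(v, G), h);          // G-bar = h G
    const bool b = fetch_by(v, Gb) == move_by(fetch_by(v, G), r);  // G-bar^-1 = r G^-1
    const bool c = fetch_by(move_by(v, G), Gb) == rv          // G-bar^-1 G = r
                && fetch_by(move_by(v, Gb), G) == rv          // G^-1 G-bar = r
                && move_by(v, X) == rv;                       // r = X_{n-1}
    return a && b && c;
}
```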

Chapter 3

Sorting and searching

We give various sorting algorithms and some practical variants of them, like sorting index arrays and pointer sorting. Searching methods both for sorted and for unsorted arrays are described. Finally we give methods for the determination of equivalence classes.

3.1 Sorting algorithms

We describe selection sort, quicksort, merge sort, counting sort, and radix sort. The literature on sorting is vast, so we do not go into the details. Very readable texts are [115] and [306], while in-depth information can be found in [214].

3.1.1 Selection sort

[ n o w s o r t m e ]
[ e o w s o r t m n ]
  [ m w s o r t o n ]
    [ n s o r t o w ]
      [ o o r t s w ]
        [ o r t s w ]
          [ r t s w ]
            [ s t w ]
              [ t w ]
                [ w ]
[ e m n o o r s t w ]

Figure 3.1-A: Sorting the string ‘nowsortme’ with the selection sort algorithm.

There are several algorithms for sorting that have complexity O(n^2), where n is the size of the array to be sorted. Here we use selection sort: find the minimum of the array, swap it with the first element, and repeat for all elements but the first. A demonstration of the algorithm is shown in figure 3.1-A; it is the output of [FXT: sort/selection-sort-demo.cc]. The implementation is straightforward [FXT: sort/sort.h]:

template <typename Type>
void selection_sort(Type *f, ulong n)
// Sort f[] (ascending order).
// Algorithm is O(n*n), use for short arrays only.
{
    for (ulong i=0; i<n; ++i)
    {
        Type v = f[i];
        ulong m = i;  // position of minimum
        ulong j = n;
        while ( --j > i )  // search (index of) minimum
        {
            if ( f[j]<v )
            {
                m = j;
                v = f[m];
            }
        }

        swap2(f[i], f[m]);
    }
}
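To reproduce figure 3.1-A, the following sketch runs the same selection sort loop on a std::string and records the unsorted tail after each swap. It is a reconstruction for illustration only: the actual demo program [FXT: sort/selection-sort-demo.cc] may differ, std::swap replaces swap2(), and selection_sort_trace() is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Selection sort as in selection_sort(), recording the rows of figure 3.1-A:
// the input, then f[i..n-1] after step i (indented so that every element keeps
// its column), and finally the sorted array.
std::vector<std::string> selection_sort_trace(std::string f)
{
    const std::size_t n = f.size();
    std::vector<std::string> trace;
    auto show = [&](std::size_t from) {          // format f[from..n-1]
        std::string row(2 * from, ' ');
        row += '[';
        for (std::size_t k = from; k < n; ++k) { row += ' '; row += f[k]; }
        row += " ]";
        trace.push_back(row);
    };

    show(0);                                     // input
    for (std::size_t i = 0; i < n; ++i)
    {
        char v = f[i];
        std::size_t m = i;                       // position of minimum
        std::size_t j = n;
        while (--j > i)                          // search (index of) minimum
            if (f[j] < v) { m = j; v = f[m]; }
        std::swap(f[i], f[m]);
        show(i);                                 // unsorted tail after step i
    }
    show(0);                                     // sorted result
    assert(trace.size() == n + 2);               // input, n steps, result
    return trace;
}
```

Calling selection_sort_trace("nowsortme") yields the eleven rows of the figure, one string per row.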

A verification routine is always handy:

template <typename Type>
bool is_sorted(const Type *f, ulong n)
// Return whether the sequence f[0], f[1], ..., f[n-1] is ascending.
{
    for (ulong k=1; k<n; ++k)  if ( f[k-1] > f[k] )  return false;
    return true;
}

A test for descending order is

template <typename Type>
bool is_falling(const Type *f, ulong n)
// Return whether the sequence f[0], f[1], ..., f[n-1] is descending.
{
    for (ulong k=1; k<n; ++k)  if ( f[k-1] < f[k] )  return false;
    return true;
}

3.1.2 Quicksort

The quicksort algorithm is given in [183]; its complexity is O(n log(n)) in the average case. It does not make the simpler schemes obsolete: for small arrays the simpler algorithms are usually faster because of their minimal bookkeeping overhead.

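The main activity of quicksort is partitioning the array. The corresponding routine reorders the array and returns a pivot index p such that $\max(f_0, \dots, f_p) \le \min(f_{p+1}, \dots, f_{n-1})$, so the left part f[0], ..., f[p] and the right part f[p+1], ..., f[n-1] can be sorted independently [FXT: sort/sort.h]: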

template <typename Type>
ulong partition(Type *f, ulong n)
{
    // Avoid worst case with already sorted input:
    const Type v = median3(f[0], f[n/2], f[n-1]);

    ulong i = 0UL - 1;
    ulong j = n;
    while ( 1 )
    {
        do  { ++i; }  while ( f[i]<v );
        do  { --j; }  while ( f[j]>v );

        if ( i<j )  swap2(f[i], f[j]);
        else  return j;
    }
}

The function median3() is defined in [FXT: sort/minmaxmed23.h]:

template <typename Type>
static inline Type median3(const Type &x, const Type &y, const Type &z)
// Return median of the input values
{  return  x<y ? (y<z ? y : (x<z ? z : x)) : (z<y ? y : (z<x ? z : x));  }

The function does 2 or 3 comparisons, depending on the input. One could simply use the element f[0] as the pivot. However, the algorithm would then need O(n^2) operations when the array is already sorted.
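The property of partition() that quicksort relies on, namely that every element of f[0], ..., f[p] is less than or equal to every element of f[p+1], ..., f[n-1], can be tested on random data. The sketch below copies median3() and the partition loop, renamed qs_partition() to avoid confusion with std::partition, and uses std::size_t and std::swap instead of ulong and swap2(). The test helpers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Median of three values (same expression as median3()).
template <typename T>
T median3(const T& x, const T& y, const T& z)
{ return x<y ? (y<z ? y : (x<z ? z : x)) : (z<y ? y : (z<x ? z : x)); }

// The partition loop from the text.
template <typename T>
std::size_t qs_partition(T* f, std::size_t n)
{
    assert(n >= 1);
    const T v = median3(f[0], f[n/2], f[n-1]);
    std::size_t i = static_cast<std::size_t>(-1);  // the first ++i wraps it to 0
    std::size_t j = n;
    while (true)
    {
        do { ++i; } while (f[i] < v);
        do { --j; } while (f[j] > v);
        if (i < j) std::swap(f[i], f[j]);
        else return j;
    }
}

// Partitions a copy of f and checks max(f[0..p]) <= min(f[p+1..n-1]).
bool partition_invariant_holds(std::vector<int> f)
{
    const std::size_t n = f.size();
    const std::size_t p = qs_partition(f.data(), n);
    for (std::size_t a = 0; a <= p; ++a)
        for (std::size_t b = p + 1; b < n; ++b)
            if (f[a] > f[b]) return false;
    return true;
}

// Runs the check on `trials` random arrays with many repeated values.
bool random_partition_check(unsigned seed, int trials)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> len(1, 50), val(0, 9);
    for (int t = 0; t < trials; ++t)
    {
        std::vector<int> f(static_cast<std::size_t>(len(rng)));
        for (int& x : f) x = val(rng);
        if (!partition_invariant_holds(f)) return false;
    }
    return true;
}
```

The small value range forces many ties, which is where the strict comparisons in the two inner loops matter.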

Quicksort calls partition() on the whole array, then on the two parts to the left and right of the partition index, then on the four, eight, etc. resulting parts, until the parts are short enough to be trivially sorted. Note that the sub-arrays are usually of different lengths.

template <typename Type>
void quick_sort(Type *f, ulong n)
{
    if ( n<=2 )  // partition() returns p=n-1 for two sorted elements, so handle n<=2 here
    {
        if ( (n==2) && (f[1]<f[0]) )  swap2(f[0], f[1]);
        return;
    }

    ulong p = partition(f, n);
    ulong ln = p + 1;
    ulong rn = n - ln;
    quick_sort(f, ln);  // f[0] ... f[ln-1]  left
    quick_sort(f+ln, rn);  // f[ln] ... f[n-1]  right
}

The actual implementation uses two optimizations. Firstly, if the number of elements to be sorted is less than a certain threshold, selection sort is used. Secondly, the recursive call is made only for the shorter of the two sub-arrays, while the longer one is handled by jumping back to the start of the routine; thereby the stack depth is bounded by $\lceil \log_2(n) \rceil$.

template <typename Type>
void quick_sort(Type *f, ulong n)
{
 start:
    if ( n<8 )  // parameter: threshold for nonrecursive algorithm
    {
        selection_sort(f, n);
        return;
    }

    ulong p = partition(f, n);
    ulong ln = p + 1;
    ulong rn = n - ln;

    if ( ln>rn )  // recursion for shorter sub-array
    {
        quick_sort(f+ln, rn);  // f[ln] ... f[n-1]  right
        n = ln;
    }
    else
    {
        quick_sort(f, ln);  // f[0] ... f[ln-1]  left
        n = rn;
        f += ln;
    }

    goto start;
}
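The depth bound can be observed by instrumenting the recursion. This sketch is a rewrite under assumptions rather than the FXT routine: the goto is replaced by a while loop, ulong by std::size_t, swap2() by std::swap, a plain forward-scanning selection sort is used for short parts, and a counter records the deepest recursion level. The helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

template <typename T>
T median3(const T& x, const T& y, const T& z)
{ return x<y ? (y<z ? y : (x<z ? z : x)) : (z<y ? y : (z<x ? z : x)); }

template <typename T>
std::size_t qs_partition(T* f, std::size_t n)   // partition() from the text
{
    assert(n >= 1);
    const T v = median3(f[0], f[n/2], f[n-1]);
    std::size_t i = static_cast<std::size_t>(-1), j = n;
    while (true)
    {
        do { ++i; } while (f[i] < v);
        do { --j; } while (f[j] > v);
        if (i < j) std::swap(f[i], f[j]); else return j;
    }
}

template <typename T>
void selection_sort(T* f, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        std::size_t m = i;                              // position of minimum
        for (std::size_t j = i + 1; j < n; ++j) if (f[j] < f[m]) m = j;
        std::swap(f[i], f[m]);
    }
}

// Optimized quick_sort() with the goto written as a loop; `level` is the
// current recursion level and `depth` records the deepest level reached.
template <typename T>
void quick_sort(T* f, std::size_t n, std::size_t level, std::size_t& depth)
{
    depth = std::max(depth, level);
    while (n >= 8)                                      // threshold as in the text
    {
        const std::size_t ln = qs_partition(f, n) + 1;
        const std::size_t rn = n - ln;
        if (ln > rn) { quick_sort(f + ln, rn, level + 1, depth); n = ln; }      // recurse on right
        else         { quick_sort(f, ln, level + 1, depth); f += ln; n = rn; }  // recurse on left
    }
    selection_sort(f, n);
}

std::size_t ceil_log2(std::size_t n)                    // smallest r with 2^r >= n
{
    std::size_t r = 0;
    while ((std::size_t(1) << r) < n) ++r;
    return r;
}

// Sorts f; true if the result is sorted and the depth stayed within ceil(log2(n)).
bool sorts_within_bound(std::vector<int> f)
{
    std::size_t depth = 0;
    quick_sort(f.data(), f.size(), 0, depth);
    return std::is_sorted(f.begin(), f.end()) && depth <= ceil_log2(f.size());
}

std::vector<int> random_ints(std::size_t n, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> f(n);
    for (int& x : f) x = dist(rng);
    return f;
}
```

Each recursive call receives the shorter part, at most half of the current length, so the depth stays logarithmic however unbalanced the partitions are; the longer part is processed by the loop without growing the stack.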

The quicksort algorithm takes quadratic time with certain inputs. A clever method to construct such inputs is described in [247]. The heapsort algorithm is in-place and O(n log(n)), also in the worst case; it is described in section 3.1.5 on page 141. Inputs that lead to quadratic time for quicksort with median-of-3 partitioning are described in [257]. The paper suggests using quicksort, but detecting problematic behavior at runtime and switching to heapsort if needed. The resulting algorithm is called introsort (for introspective sorting).
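The introsort idea can be sketched on top of the same partition scheme. This is an illustration under assumptions, not the algorithm of [257] verbatim: the budget of 2·floor(log2(n)) partitioning levels is a common choice but an assumption here, short parts are finished with insertion sort instead of selection sort, and the heapsort fallback uses the standard library's std::make_heap() and std::sort_heap(). The names intro_sort() and intro_sort_rec() are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
T median3(const T& x, const T& y, const T& z)
{ return x<y ? (y<z ? y : (x<z ? z : x)) : (z<y ? y : (z<x ? z : x)); }

template <typename T>
std::size_t qs_partition(T* f, std::size_t n)   // partition() from the text
{
    assert(n >= 1);
    const T v = median3(f[0], f[n/2], f[n-1]);
    std::size_t i = static_cast<std::size_t>(-1), j = n;
    while (true)
    {
        do { ++i; } while (f[i] < v);
        do { --j; } while (f[j] > v);
        if (i < j) std::swap(f[i], f[j]); else return j;
    }
}

// Quicksort with a budget of partitioning levels; once the budget is used up,
// the remaining sub-array is heapsorted, keeping the worst case O(n log(n)).
template <typename T>
void intro_sort_rec(T* f, std::size_t n, std::size_t budget)
{
    while (n >= 8)
    {
        if (budget == 0)                    // quicksort misbehaves: switch to heapsort
        {
            std::make_heap(f, f + n);
            std::sort_heap(f, f + n);
            return;
        }
        --budget;
        const std::size_t ln = qs_partition(f, n) + 1;
        const std::size_t rn = n - ln;
        if (ln > rn) { intro_sort_rec(f + ln, rn, budget); n = ln; }   // shorter part recursively
        else         { intro_sort_rec(f, ln, budget); f += ln; n = rn; }
    }
    for (std::size_t i = 1; i < n; ++i)     // insertion sort for short parts
        for (std::size_t j = i; j > 0 && f[j] < f[j-1]; --j) std::swap(f[j], f[j-1]);
}

template <typename T>
void intro_sort(std::vector<T>& v)
{
    std::size_t budget = 0;                 // 2*floor(log2(n)), an assumed choice
    for (std::size_t m = v.size(); m > 1; m >>= 1) budget += 2;
    intro_sort_rec(v.data(), v.size(), budget);
}
```

Calling intro_sort_rec() directly with a budget of zero forces the heapsort branch, which is a convenient way to test it.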

3.1.3 Counting sort and radix sort

We want to sort an n-element array F of (unsigned) 8-bit values. A sorting algorithm which involves only 2 passes through the data proceeds as follows:

1. Allocate an array C of 256 integers and set all its elements to zero.
2. Count: for k = 0, 1, ..., n−1 increment C[F[k]]. Now C[x] contains the number of bytes in F with the value x.
3. Set r = 0. For j = 0, 1, ..., 255 set k = C[j], then set the elements F[r], F[r+1], ..., F[r+k−1] to j, and add k to r.
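A direct transcription of the three steps for std::uint8_t data, as a sketch; counting_sort_bytes() is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counting sort of bytes, following steps 1 to 3.
void counting_sort_bytes(std::vector<std::uint8_t>& F)
{
    std::size_t C[256] = {};                 // step 1: 256 counters, all zero
    for (std::uint8_t x : F) ++C[x];         // step 2: C[x] = number of bytes equal to x
    std::size_t r = 0;                       // step 3: write each value j exactly C[j] times
    for (std::size_t j = 0; j < 256; ++j)
        for (std::size_t k = C[j]; k != 0; --k) F[r++] = static_cast<std::uint8_t>(j);
    assert(r == F.size());                   // the counts add up to n
}
```

Both passes are linear in n, plus a constant 256 steps for the counter array.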

For large values of n this method is significantly faster than any comparison-based sorting algorithm. Note that no comparisons are made between the elements of F; instead, the values are counted, hence the name counting sort.

It might seem that the idea applies only to very special cases, but with a little care it can be used in more general situations. We modify the method so that it can also sort (unsigned) integers whose range of values would make the direct method impractical: we sort with respect to a subrange of the bits in each word. We need an array G that has as many elements as F:

1. Choose any consecutive run of b bits; these are represented by a bit mask m. Allocate an array C of 2^b integers and set all its elements to zero.
2. Let M be a function that maps the 2^b values of interest (the bits masked out by m) to the range 0, 1, ..., 2^b − 1.
3. Count: for k = 0, 1, ..., n−1 increment C[M(F[k])]. Now C[x] contains how many values of M(F[.]) equal x.
4. Cumulate: for j = 1, 2, ..., 2^b − 1 (from the second entry to the last) add C[j−1] to C[j]. Now C[x] contains the number of values M(F[.]) less than or equal to x.
5. Copy: for k = n−1, ..., 2, 1, 0 (last to first), set x := M(F[k]), decrement C[x], set i := C[x], and set G[i] := F[k].

A crucial property of the algorithm is that it is stable: if two (or more) elements compare equal (with respect to a certain bit mask m), then their relative order is preserved.

   Input          Counting sort wrt. two lowest bits, m = ...11
0: 11111.11 <     0: ....1...
1: ....1...       1: ..1111..
2: ...1.1.1       2: .111....
3: ..1...1.       3: ...1.1.1
4: ..1.1111 <     4: .1..1..1
5: ..1111..       5: ..1...1.
6: .1..1..1       6: .1.1.11.
7: .1.1.11.       7: 11111.11 <
8: .11...11 <     8: ..1.1111 <
9: .111....       9: .11...11 <

The relative order of the three words ending with two set bits (marked with ‘<’) is preserved.

A routine that verifies whether an array is sorted with respect to a bit range specified by the variables b0 and m is [FXT: sort/radixsort.cc]:

bool
is_counting_sorted(const ulong *f, ulong n, ulong b0, ulong m)
// Whether f[] is sorted wrt. bits b0,...,b0+z-1
// where z is the number of bits set in m.
// m must contain a single run of bits starting at bit zero.
{
    m <<= b0;
    for (ulong k=1; k<n; ++k)
    {
        ulong xm = (f[k-1] & m ) >> b0;
        ulong xp = (f[k] & m ) >> b0;
        if ( xm>xp )  return false;
    }
    return true;
}

The function M is the combination of a mask-out and a shift operation, M(x) = (x & (m << b0)) >> b0. A routine that sorts according to b0 and m is:

void
counting_sort_core(const ulong * restrict f, ulong n, ulong * restrict g, ulong b0, ulong m)
// Write to g[] the array f[] sorted wrt. bits b0,...,b0+z-1
// where z is the number of bits set in m.
// m must contain a single run of bits starting at bit zero.
{
    ulong nb = m + 1;
    m <<= b0;
    ALLOCA(ulong, cv, nb);
    for (ulong k=0; k<nb; ++k)  cv[k] = 0;

    // --- count:
    for (ulong k=0; k<n; ++k)
    {
        ulong x = (f[k] & m ) >> b0;
        ++cv[ x ];
    }

    // --- cumulative sums:
    for (ulong k=1; k<nb; ++k)  cv[k] += cv[k-1];

    // --- reorder:
    ulong k = n;
    while ( k-- )  // backwards ==> stable sort
    {
        ulong fk = f[k];
        ulong x = (fk & m) >> b0;
        --cv[x];
        ulong i = cv[x];
        g[i] = fk;
    }
}
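To see the stability property on the ten-word example above, here is a sketch of counting_sort_core() with substitutions that are assumptions of this illustration: std::vector replaces ALLOCA and the output pointer, and std::uint64_t replaces ulong. The mask is applied as (x >> b0) & m, which gives the same value as (x & (m << b0)) >> b0. The name counting_sort_bits() is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns f sorted (stably) with respect to the bits b0, ..., b0+z-1,
// where m = 2^z - 1 is a single run of z ones starting at bit zero.
std::vector<std::uint64_t> counting_sort_bits(const std::vector<std::uint64_t>& f,
                                              unsigned b0, std::uint64_t m)
{
    // sketch limits: m must be a run of ones from bit 0, at most 20 bits wide
    assert(b0 < 64 && ((m + 1) & m) == 0 && m < (std::uint64_t(1) << 20));
    std::vector<std::size_t> cv(static_cast<std::size_t>(m) + 1, 0);
    for (std::uint64_t x : f) ++cv[(x >> b0) & m];                  // count
    for (std::size_t k = 1; k < cv.size(); ++k) cv[k] += cv[k-1];   // cumulative sums
    std::vector<std::uint64_t> g(f.size());
    for (std::size_t k = f.size(); k-- > 0; )                       // backwards ==> stable
    {
        const std::uint64_t x = (f[k] >> b0) & m;
        g[--cv[x]] = f[k];
    }
    return g;
}
```

Sorting the input column of the table with b0 = 0 and m = 3 gives the output column, with the three words marked ‘<’ in their original relative order.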

Input      Stage 1    Stage 2    Stage 3
           m = ....11 m = ..11.. m = 11....
               vv       vv       vv
111.11     ..1...     11....     ..1...
..1...     1111..     1...1.     ..1..1
.1.1.1     11....     1...11     .1.1.1
1...1.     .1.1.1     .1.1.1     .1.11.
1.1111     ..1..1     .1.11.     1...1.
1111..     1...1.     ..1...     1...11
..1..1     .1.11.     ..1..1     1.1111
.1.11.     111.11     111.11     11....
1...11     1.1111     1111..     111.11
11....     1...11     1.1111     1111..

Figure 3.1-B: Radix sort of 10 six-bit values when using two-bit masks.

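Now we can apply counting sort with a set of bit masks that cover the whole range, starting from the least significant bits. Because each pass is stable, elements that agree in the bits of the current pass keep the order established by the earlier (less significant) passes, so after the last pass the array is sorted with respect to all bits. Figure 3.1-B shows an example with 10 six-bit values and 3 two-bit masks. It is the output of the program [FXT: sort/radixsort-demo.cc].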

The following routine uses 8-bit masks to sort unsigned integers [FXT: sort/radixsort.cc]:

void
radix_sort(ulong *f, ulong n)
{
    ulong nb = 8;  // Number of bits sorted with each step
    ulong tnb = BITS_PER_LONG;  // Total number of bits

    ulong *fi = f;
    ulong *g = new ulong[n];

    ulong m = (1UL<<nb) - 1;
    for (ulong k=1, b0=0; b0<tnb; ++k, b0+=nb)
    {
        counting_sort_core(f, n, g, b0, m);
        swap2(f, g);
    }

    if ( f!=fi )  // result is actually in g[]
    {
        swap2(f, g);
        for (ulong k=0; k<n; ++k)  f[k] = g[k];
    }

    delete [] g;
}
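Repeating the stable pass over successive bit fields gives the least-significant-first radix sort. The following sketch uses hypothetical names, std::uint64_t instead of ulong, and a configurable field width nb and total width tnb; it returns the array after each pass, so the three stages of figure 3.1-B can be compared with its output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Words = std::vector<std::uint64_t>;

// One stable counting-sort pass on the bit field (x >> b0) & m, as in counting_sort_core().
Words sort_pass(const Words& f, unsigned b0, std::uint64_t m)
{
    std::vector<std::size_t> cv(static_cast<std::size_t>(m) + 1, 0);
    for (std::uint64_t x : f) ++cv[(x >> b0) & m];                   // count
    for (std::size_t k = 1; k < cv.size(); ++k) cv[k] += cv[k-1];    // cumulative sums
    Words g(f.size());
    for (std::size_t k = f.size(); k-- > 0; ) g[--cv[(f[k] >> b0) & m]] = f[k];  // stable
    return g;
}

// Radix sort of the lowest tnb bits, nb bits per pass, least significant
// field first; returns the array after every pass (the last entry is sorted).
std::vector<Words> radix_sort_stages(Words f, unsigned nb, unsigned tnb)
{
    assert(nb >= 1 && nb <= 16 && tnb <= 64);
    const std::uint64_t m = (std::uint64_t(1) << nb) - 1;
    std::vector<Words> stages;
    for (unsigned b0 = 0; b0 < tnb; b0 += nb)
    {
        f = sort_pass(f, b0, m);
        stages.push_back(f);
    }
    return stages;
}
```

In decimal, the input column of figure 3.1-B is 59, 8, 21, 34, 47, 60, 9, 22, 35, 48 (111.11 = 59, ..1... = 8, and so on); radix_sort_stages() with nb = 2 and tnb = 6 reproduces the three stage columns.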

There is room for optimization. Combining copying with counting for the next pass (where possible) would reduce the number of passes almost by a factor of 2.

A version of radix sort that starts from the most significant bits is given in [306].

3.1.4 Merge sort

The merge sort algorithm sorts with complexity O(n log(n)). We need a routine that merges two sorted arrays A and B into an array T such that T is sorted. The following implementation requires that A and B are adjacent in memory [FXT: sort/merge-sort.h]:

template <typename Type>
void merge(Type * const restrict f, ulong na, ulong nb, Type * const restrict t)
// Merge the (sorted) arrays
// A[] := f[0], f[1], ..., f[na-1]  and  B[] := f[na], f[na+1], ..., f[na+nb-1]
// into t[] := t[0], t[1], ..., t[na+nb-1]  such that t[] is sorted.
// Requires na>0 and nb>0 (the first elements are read unconditionally).
{
    const Type * const A = f;
    const Type * const B = f + na;
    ulong nt = na + nb;
    Type ta = A[--na],  tb = B[--nb];

    while ( true )
    {
        if ( ta > tb )  // copy ta
        {
            t[--nt] = ta;
            if ( na==0 )  // A[] empty?
            {
                for (ulong j=0; j<=nb; ++j)  t[j] = B[j];  // copy rest of B[]
                return;
            }

            ta = A[--na];  // read next element of A[]
        }
        else  // copy tb
        {
            t[--nt] = tb;
            if ( nb==0 )  // B[] empty?
            {
                for (ulong j=0; j<=na; ++j)  t[j] = A[j];  // copy rest of A[]
                return;
            }

            tb = B[--nb];  // read next element of B[]
        }
    }
}

[ n o w s o r t m e A D B A C D 5 4 3 2 1 ]
[ n o o s w ]
          [ A e m r t ]
[ A e m n o o r s t w ]
                    [ A B C D D ]
                              [ 1 2 3 4 5 ]
                    [ 1 2 3 4 5 A B C D D ]
[ A e m n o o r s t w ]
                    [ 1 2 3 4 5 A B C D D ]
[ 1 2 3 4 5 A A B C D D e m n o o r s t w ]

Figure 3.1-C: Sorting with the merge sort algorithm.

Two branches are involved: the unavoidable branch for the comparison of the elements, and the test whether the array from which an element has just been removed is empty.

We could sort by merging adjacent blocks of growing size as follows:

[ h g f e d c b a ]  // input
[ g h e f c d a b ]  // merge pairs
[ e f g h a b c d ]  // merge adjacent runs of two
[ a b c d e f g h ]  // merge adjacent runs of four
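The scheme of merging adjacent runs of growing size can be written without recursion. A minimal sketch, assuming std::vector storage (with a default-constructible element type) and using the standard std::merge(), which is stable. merge_sort_bottom_up() is a hypothetical name; this is not the FXT routine, which uses the depth-first recursion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bottom-up merge sort: merge adjacent runs of length 1, 2, 4, ...,
// alternating between f and a scratch buffer of the same size.
template <typename T>
void merge_sort_bottom_up(std::vector<T>& f)
{
    const std::size_t n = f.size();
    std::vector<T> buf(n);
    T* src = f.data();
    T* dst = buf.data();
    for (std::size_t w = 1; w < n; w *= 2)           // current run length
    {
        for (std::size_t lo = 0; lo < n; lo += 2 * w)
        {
            const std::size_t mid = std::min(lo + w, n);
            const std::size_t hi  = std::min(lo + 2 * w, n);
            std::merge(src + lo, src + mid, src + mid, src + hi, dst + lo);  // stable
        }
        std::swap(src, dst);                          // merged runs are now in src
    }
    if (src != f.data()) std::copy(src, src + n, f.data());  // result ended in buf
    assert(std::is_sorted(f.begin(), f.end()));       // debug check
}
```

The memory access is sequential within each pass, but every pass sweeps the whole array, which is what the depth-first recursion avoids.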

For more localized memory access we use a depth-first recursion (compare with the binsplit recursion in section 34.1.1.1 on page 651):

template <typename Type>
void merge_sort_rec(Type *f, ulong n, Type *t)
{
    if ( n<8 )
    {
        selection_sort(f, n);
        return;
    }

    const ulong na = n>>1;
    const ulong nb = n - na;

    // PRINT f[0], f[1], ..., f[na-1]
    merge_sort_rec(f, na, t);
    // PRINT f[na], f[na+1], ..., f[na+nb-1]
    merge_sort_rec(f+na, nb, t);

    merge(f, na, nb, t);
    for (ulong j=0; j<n; ++j)  f[j] = t[j];  // copy back
    // PRINT f[0], f[1], ..., f[na+nb-1]
}

The PRINT comments indicate the print statements in the program [FXT: sort/merge-sort-demo.cc] that was used to generate figure 3.1-C. The method is not in-place: it needs a scratch array t[] of n elements. The routine called by the user is:

template <typename Type>
void merge_sort(Type *f, ulong n, Type *tmp=0)
{
    Type *t = tmp;
    if ( tmp==0 )  t = new Type[n];
    merge_sort_rec(f, n, t);
    if ( tmp==0 )  delete [] t;
}

Optimized algorithm
F: [ n o w s o r t m e A D B A C D 5 4 3 2 1 ]
F: [ n o o s w ]
F:           [ A e m r t ]
T: [ A e m n o o r s t w ]
F:                     [ A B C D D ]
F:                               [ 1 2 3 4 5 ]
T:                     [ 1 2 3 4 5 A B C D D ]
F: [ 1 2 3 4 5 A A B C D D e m n o o r s t w ]

Figure 3.1-D: Sorting with the 4-way merge sort algorithm.

The copying from T to F in the recursive routine can be avoided by a 4-way splitting scheme. We sort the left two quarters and merge them into T, then we sort the right two quarters and merge them into T+na. Then we merge T and T+na into F. Figure 3.1-D shows an example where only one recursive step is involved. It was generated with the program [FXT: sort/merge-sort4-demo.cc]. The recursive routine is [FXT: sort/merge-sort.h]:

template <typename Type>
void merge_sort_rec4(Type *f, ulong n, Type *t)
{
    if ( n<8 )  // threshold must be at least 8
    {
        selection_sort(f, n);
        return;
    }

    // left and right half:
    const ulong na = n>>1;
    const ulong nb = n - na;

    // left quarters:
    const ulong na1 = na>>1;
    const ulong na2 = na - na1;
    merge_sort_rec4(f, na1, t);
    merge_sort_rec4(f+na1, na2, t);

    // right quarters:
    const ulong nb1 = nb>>1;
    const ulong nb2 = nb - nb1;
    merge_sort_rec4(f+na, nb1, t);
    merge_sort_rec4(f+na+nb1, nb2, t);

    // merge quarters (F-->T):
    merge(f, na1, na2, t);
    merge(f+na, nb1, nb2, t+na);
