A closed hash table keeps the members of the dictionary in the bucket table itself, rather than using that table to store list headers. As a consequence, it appears that we can put only one element in any bucket. However, associated with closed hashing is a
rehash strategy. If we try to place x in bucket h(x) and find it already holds an
element, a situation called a collision occurs, and the rehash strategy chooses a sequence of alternative locations, h1(x), h2(x), . . . , within the bucket table in which we could place x. We try each of these locations, in order, until we find an empty one. If none is empty, then the table is full, and we cannot insert x.
Example 4.3. Suppose B = 8 and keys a, b, c, and d have hash values h(a) = 3, h(b) =
0, h(c) = 4, h(d) = 3. We shall use the simplest rehash strategy, called linear hashing, where hi(x) = (h(x) + i) mod B. Thus, for example, if we wished to insert a and found bucket 3 already filled, we would try buckets 4, 5, 6, 7, 0, 1, and 2, in that order. Initially, we assume the table is empty, that is, each bucket holds a special value
empty, which is not equal to any value we might try to insert.† If we insert a, b, c, and
d, in that order, into an initially empty table, we find a goes into bucket 3, b into 0
try h1(d) = 4 and find it filled as well. Finally, when we try h2(d) = 5, we find an empty space and put d there. The resulting positions are shown in Fig. 4.13.
Fig. 4.13. Partially filled hash table.
The membership test for an element x requires that we examine h(x), hi(x), h2(x), . . . , until we either find x or find an empty bucket. To see why we can stop looking if we reach an empty bucket, suppose first that there are no deletions permitted. If, say,
h3(x) is the first empty bucket found in the series, it is not possible that x is in bucket
h4(x), h5(x), or further along the sequence, because x could not have been placed there unless h3(x) were filled at the time we inserted x.
Note, however, that if we allow deletions, then we can never be sure, if we reach an empty bucket without finding x, that x is not somewhere else, and the now empty bucket was occupied when x was inserted. When we must do deletions, the most effective approach is to place a constant called deleted into a bucket that holds an element that we wish to delete. It is important that we can distinguish between deleted and empty, the constant found in all buckets that have never been filled. In this way it is possible to permit deletions without having to search the entire table during a MEMBER test. When inserting we can treat deleted as an available space, so with luck the space of a deleted element will eventually be reused. However, as the space of a deleted element is not immediately reclaimable as it is with open hashing, we may prefer the open to the closed scheme.
Example 4.4. Suppose we wish to test if e is in the set represented by Fig. 4.13. If
h(e) = 4, we try buckets 4, 5, and then 6. Bucket 6 is empty and since we have not
found e, we conclude e is not in the set.
If we delete c, we must put the constant deleted in bucket 4. In that way, when we look for d, and we begin at h(d) = 3, we shall scan 4 and 5 to find d, and not stop at 4 as we would if we put empty in that bucket.
In Fig 4.14 we see the type declarations and operations for the DICTIONARY ADT with set members of type nametype and the closed hash table as the underlying implementation. We use an arbitrary hash function h, of which Fig. 4.11 is one
convenience, we have identified empty with a string of ten blanks and deleted with a string of 10 *'s, assuming that neither of these strings could be real keys. (In the SPIT database these strings are unlikely to be names of legislators.) The procedure
INSERT(x, A ) first uses locate to determine if x is already in A, and, if not, it uses a special routine locate 1 to find a location in which to insert x. Note that locate 1 searches for deleted as well as empty locations.
const
empty = ' '; { 10 blanks }
deleted = '**********'{ 10 *'s } type
DICTIONARY = array[0..B -1] of nametype; procedure MAKENULL ( var A: DICTIONARY ); var i: integer; begin for i := 0 to B - 1 do A [i] := empty end; {MAKENULL}
function locate ( x: nametype; A: DICTIONARY ) : integer; { locate scans DICTIONARY from the bucket for h(x) until either x is found, or an empty bucket is found, or it has scanned completely around the table, thereby determining that the table does not contain x. locate returns the index of the bucket at which it stops for any of these reasons } var
initial, i: integer;
{ initial holds h(x); i counts the number of
buckets thus far scanned when looking for x } begin
initial := h(x); i := 0;
while (i < B) and (A [(initial +
i) mod B ] <> x) and
(A [(initial + i) mod B] <>
empty) do
i := i + 1;
return ((initial + i) mod B) end; { locate }
function locate 1 ( x: nametype; A: DICTIONARY ) : integer; { like locate, but it will also stop at and return a deleted entry } function MEMBER ( x: nametype; var A: DICTIONARY ) :
boolean; begin if A [locate(x)] = x then return (true) else return (false) end; { MEMBER }
procedure INSERT (x: nametype; var A: DICTIONARY ); var
bucket: integer; begin
if A[locate (x)] = x then return; { x is already in A }
bucket := locate 1(x);
if (A [bucket] = empty) or (A [bucket] = deleted) then
A [bucket] := x else
error('INSERT failed: table is full') end; { INSERT }
procedure DELETE ( x: nametype; var A: DICTIONARY ); var bucket: integer; begin bucket := locate(x); if A [bucket] = x then A [bucket] := deleted end; { DELETE }
Fig. 4.14. Dictionary implementation by a closed hash table.
4.8 Estimating the Efficiency of Hash
Functions
As we have mentioned, hashing is an efficient way to represent dictionaries and some other abstract data types based on sets. In this section we shall examine the average time per dictionary operation in an open hash table. If there are B buckets and N elements stored in the hash table, then the average bucket has N/B members, and we expect that an average INSERT, DELETE, or MEMBER operation will take O(1 +
N/B) time. The constant 1 represents the time to find the bucket, and N/B the time to
operation. Thus the average time to insert, delete, or test an element for membership, assuming the element is equally likely to be hashed to any bucket, is a constant independent of N.
Suppose we are given a program written in some language like Pascal, and we wish to insert all identifiers appearing in the program into a hash table. Whenever a declaration of a new identifier is encountered, the identifier is inserted into the hash table after checking that it is not already there. During this phase it is reasonable to assume that an identifier is equally likely to be hashed into any given bucket. Thus we can construct a hash table with N elements in time O(N(1 + N/B)). By choosing B equal to N this becomes O(N).
In the next phase identifiers are encountered in the body of the program. We must locate the identifiers in the hash table to retrieve information associated with them. But what is the expected time to locate an identifier? If the element searched for is equally likely to be any element in the table, then the expected time to search for it is just the average time spent inserting an element. To see this, observe that in searching once for each element in the table, the time spent is exactly the same as the time spent inserting each element, assuming that elements are always appended to the end of the list for the appropriate bucket. Thus the expected time for a search is also 0(1 + N/B). The above analysis assumes that a hash function distributes elements uniformly over the buckets. Do such functions exist? We can regard a function such as that of Fig. 4.11 (convert characters to integers, sum and take the remainder modulo B) as a typical hash function. The following example examines the performance of this hash function.
Example 4.5. Suppose we use the hash function of Fig. 4.11 to hash the 100 keys
consisting of the character strings A0, A1, . . . , A99 into a table of 100 buckets. On the assumption that ord(O), ord(1), . . . , ord(9) form an arithmetic progression, as they do in the common character codes ASCII and EBCDIC, it is easy to check that the keys hash into at most 29 of the 100 buckets,† and the largest bucket contains A18, A27, A36, . . . , A90, or 9 of the 100 elements. If we compute the average number of steps for insertion, using the fact that the insertion of the ith element into a
bucket takes i + 1 steps, we get 395 steps for these 100 keys. In comparison, our estimate N(1 + N/B) would suggest 200 steps.
The simple hashing function of Fig. 4.11 may treat certain sets of inputs, such as the consecutive strings in Example 4.5, in a nonrandom manner. "More random" hash functions are available. As an example, we can use the idea of squaring and taking middle digits. Thus, if we have a 5-digit number n as a key and square it, we get a 9 or 10 digit number. The "middle digits," such as the 4th through 7th places from the
right, have values that depend on almost all the digits of n. For example, digit 4 depends on all but the leftmost digit of n, and digit 5 depends on all digits of n. Thus, if B = 100, we might choose to take digits 6 and 5 to form the bucket number.
This idea can be generalized to situations where B is not a power of 10. Suppose keys are integers in the range 0, 1, . . . , K. If we pick an integer C such that BC2 is about equal to K2, then the function
h(n) = [n2/C] mod B
effectively extracts a base-B digit from the middle of n2.
Example 4.6. If K = 1000 and B = 8, we might choose C = 354. Then
To use the "square and take the middle" strategy when keys are character strings, first group the characters in the string from the right into blocks of a fixed number of characters, say 4, padding the left with blanks, if necessary. Treat each block as a single integer, formed by concatenating the binary codes for the characters. For example, ASCII uses a 7-bit character code, so characters may be viewed as "digits" in base 27, or 128. Thus we can regard the character string abcd as the integer (128)3a
+ (128)2b + (128)c + d. After converting all blocks to integers, add them† and
proceed as we have suggested previously for integers.