
Data Structures and Algorithms: CHAPTER 4: Basic Operations on Sets

The first, called open or external hashing, allows the set to be stored in a potentially unlimited space and therefore places no limit on the size of the set. The second, called closed or internal hashing, uses a fixed amount of space for storage and thus limits the size of sets.

Open Hashing

In Fig. 4.10 we see the basic data structure for open hashing. The essential idea is that the (possibly infinite) set of potential set members is partitioned into a finite number of classes. If we wish to have B classes, numbered 0, 1, . . . , B-1, then we use a hash function h such that for each object x of the data type for members of the set being represented, h(x) is one of the integers 0 through B-1. The value of h(x), naturally, is the class to which x belongs. We often call x the key and h(x) the hash value of x. The "classes" we shall refer to as buckets, and we say that x belongs to bucket h(x).

In an array called the bucket table, indexed by the bucket numbers 0, 1, . . . , B-1, we have headers for B lists. The elements on the ith list are the members of the set being represented that belong to class i, that is, the elements x in the set such that h(x) = i.

It is our hope that the buckets will be roughly equal in size, so the list for each bucket will be short. If there are N elements in the set, then the average bucket will have N/B members. If we can estimate N and choose B to be roughly as large, then the average bucket will have only one or two members, and the dictionary operations take, on the average, some small constant number of steps, independent of what N (or equivalently B) is.
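For example, if we expect about N = 1000 set members and choose B = 1000 buckets, the average bucket list holds a single element, and a typical INSERT, DELETE, or MEMBER operation inspects only a cell or two; with B = 100 the average list holds ten elements, and each operation does roughly ten times as much list traversal.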

Fig. 4.10. The open hashing data organization.

It is not always clear that we can select h so that a typical set will have its members distributed fairly evenly among the buckets. We shall later have more to say about selecting h so that it truly "hashes" its argument, that is, so that h(x) is a "random" value that is not dependent on x in any trivial way.

function h ( x: nametype ) : 0..B - 1;
    var
        i, sum: integer;
    begin
        sum := 0;
        for i := 1 to 10 do
            sum := sum + ord(x[i]);
        h := sum mod B
    end; { h }

Fig. 4.11. A simple hash function h.

Let us here, for specificity, introduce one hash function on character strings that is fairly good, although not perfect. The idea is to treat the characters as integers, using the character code of the machine to define the correspondence. Pascal provides the ord built-in function, where ord(c) is the integer code for character c. Thus, if x is a key and the type of keys is array[1..10] of char (what we called nametype in Example 4.2), we might declare the hash function h as in Fig. 4.11. In that function we sum the integers for each character and divide the result by B, taking the remainder, which is an integer from 0 to B - 1.

In Fig. 4.12 we see the declarations of the data structures for an open hash table and the procedures implementing the operations for a dictionary. The member type for the dictionary is assumed to be nametype (a character array), so these declarations can be used in Example 4.2 directly. We should note that in Fig. 4.12 we have made the headers of the bucket lists be pointers to cells, rather than complete cells. We did so because the bucket table, where the headers are, could take as much space as the lists themselves if we made each bucket a complete cell rather than a pointer to a cell. Notice, however, that we pay a price for saving this space. The code for the DELETE procedure now must distinguish the first cell from the remaining ones.

const
    B = { suitable constant };
type
    celltype = record
        element: nametype;
        next: ↑ celltype
    end;
    DICTIONARY = array[0..B-1] of ↑ celltype;

procedure MAKENULL ( var A: DICTIONARY );
    var
        i: integer;
    begin
        for i := 0 to B-1 do
            A[i] := nil
    end; { MAKENULL }

function MEMBER ( x: nametype; var A: DICTIONARY ) : boolean;
    var
        current: ↑ celltype;
    begin
        current := A[h(x)];
        { initially current is the header of x's bucket }
        while current <> nil do
            if current↑.element = x then
                return (true)
            else
                current := current↑.next;
        return (false) { if x is not found }
    end; { MEMBER }

procedure INSERT ( x: nametype; var A: DICTIONARY );
    var
        bucket: integer;
        oldheader: ↑ celltype;
    begin
        if not MEMBER(x, A) then begin
            bucket := h(x);
            oldheader := A[bucket];
            new(A[bucket]);
            A[bucket]↑.element := x;
            A[bucket]↑.next := oldheader
        end
    end; { INSERT }

procedure DELETE ( x: nametype; var A: DICTIONARY );
    var
        current: ↑ celltype;
        bucket: integer;
    begin
        bucket := h(x);
        if A[bucket] <> nil then begin
            if A[bucket]↑.element = x then { x is in first cell }
                A[bucket] := A[bucket]↑.next { remove x from list }
            else begin { x, if present at all, is not in first cell of bucket }
                current := A[bucket]; { current points to the previous cell }
                while current↑.next <> nil do
                    if current↑.next↑.element = x then begin
                        current↑.next := current↑.next↑.next; { remove x from list }
                        return { break }
                    end
                    else { x not yet found }
                        current := current↑.next
            end
        end
    end; { DELETE }

Fig. 4.12. Dictionary implementation by an open hash table.

Closed Hashing

A closed hash table keeps the members of the dictionary in the bucket table itself, rather than using that table to store list headers. As a consequence, it appears that we can put only one element in any bucket. However, associated with closed hashing is a rehash strategy. If we try to place x in bucket h(x) and find it already holds an element, a situation called a collision occurs, and the rehash strategy chooses a sequence of alternative locations, h1(x), h2(x), . . . , within the bucket table in which we could place x. We try each of these locations, in order, until we find an empty one. If none is empty, then the table is full, and we cannot insert x.

Example 4.3. Suppose B = 8 and keys a, b, c, and d have hash values h(a) = 3, h(b) = 0, h(c) = 4, h(d) = 3. We shall use the simplest rehash strategy, called linear hashing, where hi(x) = (h(x) + i) mod B. Thus, for example, if we wished to insert a and found bucket 3 already filled, we would try buckets 4, 5, 6, 7, 0, 1, and 2, in that order.

Initially, we assume the table is empty, that is, each bucket holds a special value empty, which is not equal to any value we might try to insert. If we insert a, b, c, and d, in that order, into an initially empty table, we find that a goes into bucket 3, b into 0, and c into 4. When we try to insert d, we first try h(d) = 3 and find it filled. Then we try h1(d) = 4 and find it filled as well. Finally, when we try h2(d) = 5, we find an empty space and put d there. The resulting positions are shown in Fig. 4.13.

Fig. 4.13. Partially filled hash table.
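To make the probe sequence of Example 4.3 concrete, here is a small, self-contained sketch, added for illustration and not part of the original text. The hash values of a, b, c, and d are simply tabulated rather than computed, and the linear rehash hi(x) = (h(x) + i) mod B is applied directly; running it prints the bucket contents of Fig. 4.13.

program linearprobe;
{ reproduces Example 4.3: B = 8, h(a) = 3, h(b) = 0, h(c) = 4, h(d) = 3 }
const
    B = 8;
    empty = ' ';
var
    table: array[0..B-1] of char;
    keys: array[1..4] of char;
    hash: array[1..4] of integer;
    i, j, bucket: integer;
begin
    for i := 0 to B-1 do
        table[i] := empty;
    keys[1] := 'a'; hash[1] := 3;
    keys[2] := 'b'; hash[2] := 0;
    keys[3] := 'c'; hash[3] := 4;
    keys[4] := 'd'; hash[4] := 3;
    for i := 1 to 4 do begin
        j := 0;                               { j counts rehash attempts }
        bucket := hash[i];
        while (j < B) and (table[bucket] <> empty) do begin
            j := j + 1;
            bucket := (hash[i] + j) mod B     { linear rehash hj(x) }
        end;
        if table[bucket] = empty then
            table[bucket] := keys[i]
        { otherwise all B buckets are full and the key cannot be inserted }
    end;
    for i := 0 to B-1 do
        writeln('bucket ', i, ': ', table[i])
end.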

The membership test for an element x requires that we examine h(x), h1(x), h2(x), . . . , until we either find x or find an empty bucket. To see why we can stop looking if we reach an empty bucket, suppose first that there are no deletions permitted. If, say, h3(x) is the first empty bucket found in the series, it is not possible that x is in bucket h4(x), h5(x), or further along the sequence, because x could not have been placed there unless h3(x) were filled at the time we inserted x.

Note, however, that if we allow deletions, then we can never be sure, if we reach an empty bucket without finding x, that x is not somewhere else, and the now empty bucket was occupied when x was inserted. When we must do deletions, the most effective approach is to place a constant called deleted into a bucket that holds an element that we wish to delete. It is important that we can distinguish between deleted and empty, the constant found in all buckets that have never been filled. In this way it is possible to permit deletions without having to search the entire table during a MEMBER test. When inserting we can treat deleted as an available space, so with luck the space of a deleted element will eventually be reused. However, as the space of a deleted element is not immediately reclaimable as it is with open hashing, we may prefer the open to the closed scheme.

Example 4.4. Suppose we wish to test if e is in the set represented by Fig. 4.13. If h(e) = 4, we try buckets 4, 5, and then 6. Bucket 6 is empty and since we have not found e, we conclude e is not in the set.

If we delete c, we must put the constant deleted in bucket 4. In that way, when we look for d, and we begin at h(d) = 3, we shall scan 4 and 5 to find d, and not stop at 4 as we would if we put empty in that bucket.

In Fig. 4.14 we see the type declarations and operations for the DICTIONARY ADT with set members of type nametype and the closed hash table as the underlying implementation. We use an arbitrary hash function h, of which Fig. 4.11 is one possibility, and we use the linear hashing strategy to rehash in case of collisions. For convenience, we have identified empty with a string of ten blanks and deleted with a string of ten *'s, assuming that neither of these strings could be real keys. (In the SPIT database these strings are unlikely to be names of legislators.) The procedure INSERT(x, A) first uses locate to determine whether x is already in A, and, if not, it uses a special routine locate1 to find a location in which to insert x. Note that locate1 searches for deleted as well as empty locations.

const
    empty = '          '; { 10 blanks }
    deleted = '**********'; { 10 *'s }
type
    DICTIONARY = array[0..B-1] of nametype;

procedure MAKENULL ( var A: DICTIONARY );
    var
        i: integer;
    begin
        for i := 0 to B-1 do
            A[i] := empty
    end; { MAKENULL }

function locate ( x: nametype; A: DICTIONARY ) : integer;
    { locate scans DICTIONARY from the bucket for h(x) until either x is
      found, or an empty bucket is found, or it has scanned completely
      around the table, thereby determining that the table does not
      contain x. locate returns the index of the bucket at which it stops
      for any of these reasons }
    var
        initial, i: integer;
        { initial holds h(x); i counts the number of buckets
          thus far scanned when looking for x }
    begin
        initial := h(x);
        i := 0;
        while (i < B) and (A[(initial + i) mod B] <> x) and
                (A[(initial + i) mod B] <> empty) do
            i := i + 1;
        return ((initial + i) mod B)
    end; { locate }

function locate1 ( x: nametype; A: DICTIONARY ) : integer;
    { like locate, but it will also stop at and return a deleted entry }

function MEMBER ( x: nametype; var A: DICTIONARY ) : boolean;
    begin
        if A[locate(x, A)] = x then
            return (true)
        else
            return (false)
    end; { MEMBER }

procedure INSERT ( x: nametype; var A: DICTIONARY );
    var
        bucket: integer;
    begin
        if A[locate(x, A)] = x then
            return; { x is already in A }
        bucket := locate1(x, A);
        if (A[bucket] = empty) or (A[bucket] = deleted) then
            A[bucket] := x
        else
            error('INSERT failed: table is full')
    end; { INSERT }

procedure DELETE ( x: nametype; var A: DICTIONARY );
    var
        bucket: integer;
    begin
        bucket := locate(x, A);
        if A[bucket] = x then
            A[bucket] := deleted
    end; { DELETE }

Fig. 4.14. Dictionary implementation by a closed hash table.

4.8 Estimating the Efficiency of Hash Functions

As we have mentioned, hashing is an efficient way to represent dictionaries and some other abstract data types based on sets. In this section we shall examine the average time per dictionary operation in an open hash table. If there are B buckets and N elements stored in the hash table, then the average bucket has N/B members, and we expect that an average INSERT, DELETE, or MEMBER operation will take O(1 + N/B) time. The constant 1 represents the time to find the bucket, and N/B the time to search the bucket. If we can choose B to be about N, this time becomes a constant per operation. Thus the average time to insert, delete, or test an element for membership, assuming the element is equally likely to be hashed to any bucket, is a constant independent of N.

Suppose we are given a program written in some language like Pascal, and we wish to insert all identifiers appearing in the program into a hash table. Whenever a declaration of a new identifier is encountered, the identifier is inserted into the hash table after checking that it is not already there. During this phase it is reasonable to assume that an identifier is equally likely to be hashed into any given bucket. Thus we can construct a hash table with N elements in time O(N(1 + N/B)). By choosing B equal to N this becomes O(N).

In the next phase identifiers are encountered in the body of the program. We must locate the identifiers in the hash table to retrieve information associated with them. But what is the expected time to locate an identifier? If the element searched for is equally likely to be any element in the table, then the expected time to search for it is just the average time spent inserting an element. To see this, observe that in searching once for each element in the table, the time spent is exactly the same as the time spent inserting each element, assuming that elements are always appended to the end of the list for the appropriate bucket. Thus the expected time for a search is also O(1 + N/B).

The above analysis assumes that a hash function distributes elements uniformly over the buckets. Do such functions exist? We can regard a function such as that of Fig. 4.11 (convert characters to integers, sum and take the remainder modulo B) as a typical hash function. The following example examines the performance of this hash function.

Example 4.5. Suppose we use the hash function of Fig. 4.11 to hash the 100 keys consisting of the character strings A0, A1, . . . , A99 into a table of 100 buckets. On the assumption that ord('0'), ord('1'), . . . , ord('9') form an arithmetic progression, as they do in the common character codes ASCII and EBCDIC, it is easy to check that the keys hash into at most 29 of the 100 buckets, and the largest bucket contains A18, A27, A36, . . . , A90, or 9 of the 100 elements. If we compute the average number of steps for insertion, using the fact that the insertion of the ith element into a bucket takes i + 1 steps, we get 395 steps for these 100 keys. In comparison, our estimate N(1 + N/B) would suggest 200 steps.
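The uneven behavior described in Example 4.5 is easy to check empirically. The program below is an illustrative sketch, not from the text; it assumes the ASCII character code and B = 100, builds the ten-character keys A0, A1, . . . , A99 padded on the right with blanks, hashes each with the function of Fig. 4.11, and reports how many distinct buckets receive keys and how large the biggest bucket is, which should agree with the at most 29 buckets and the nine-element bucket cited above.

program checkexample45;
{ hashes the keys A0 .. A99 with the function of Fig. 4.11 and reports
  how unevenly they spread over B = 100 buckets }
const
    B = 100;
type
    nametype = packed array[1..10] of char;
var
    count: array[0..B-1] of integer;     { count[i] = size of bucket i }
    key: nametype;
    i, n, used, largest: integer;

function h ( x: nametype ) : integer;
    var
        i, sum: integer;
    begin
        sum := 0;
        for i := 1 to 10 do
            sum := sum + ord(x[i]);
        h := sum mod B
    end; { h }

begin
    for i := 0 to B-1 do
        count[i] := 0;
    for n := 0 to 99 do begin
        for i := 1 to 10 do
            key[i] := ' ';                            { pad with blanks }
        key[1] := 'A';
        if n < 10 then
            key[2] := chr(ord('0') + n)
        else begin
            key[2] := chr(ord('0') + n div 10);
            key[3] := chr(ord('0') + n mod 10)
        end;
        count[h(key)] := count[h(key)] + 1
    end;
    used := 0;
    largest := 0;
    for i := 0 to B-1 do begin
        if count[i] > 0 then used := used + 1;
        if count[i] > largest then largest := count[i]
    end;
    writeln('buckets used: ', used, '   largest bucket: ', largest)
end.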

The simple hashing function of Fig. 4.11 may treat certain sets of inputs, such as the consecutive strings in Example 4.5, in a nonrandom manner. "More random" hash functions are available. As an example, we can use the idea of squaring and taking middle digits. Thus, if we have a 5-digit number n as a key and square it, we get a 9 or 10 digit number. The "middle digits," such as the 4th through 7th places from the right, have values that depend on almost all the digits of n. For example, digit 4 depends on all but the leftmost digit of n, and digit 5 depends on all digits of n. Thus, if B = 100, we might choose to take digits 6 and 5 to form the bucket number.

This idea can be generalized to situations where B is not a power of 10. Suppose keys are integers in the range 0, 1, . . . , K. If we pick an integer C such that BC² is about equal to K², then the function

h(n) = [n²/C] mod B

effectively extracts a base-B digit from the middle of n².

Example 4.6. If K = 1000 and B = 8, we might choose C = 354. Then BC² = 1,002,528, which is close to K² = 1,000,000, and, for instance, h(1000) = [1000000/354] mod 8 = 2824 mod 8 = 0.

To use the "square and take the middle" strategy when keys are character strings, first group the characters in the string from the right into blocks of a fixed number of characters, say 4, padding the left with blanks, if necessary. Treat each block as a single integer, formed by concatenating the binary codes for the characters. For example, ASCII uses a 7-bit character code, so characters may be viewed as "digits" in base 2⁷, or 128. Thus we can regard the character string abcd as the integer (128)³a + (128)²b + (128)c + d. After converting all blocks to integers, add them and proceed as we have suggested previously for integers.
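The text gives no code for the "square and take the middle" method, so the sketch below is only one way it might look; it is an illustration added here, not part of the original. The function hmid uses the constants of Example 4.6 (K = 1000, B = 8, C = 354); hmidstring groups a ten-character key into blocks of four characters from the right, treats each block as a base-128 integer as described above, adds the blocks, and then squares and extracts a middle base-B digit. The 64-bit integer type int64 (as in Free Pascal) is an assumption needed so the squares do not overflow, and the short leftmost block is simply left short rather than padded with blanks.

program midsquare;
{ "square and take the middle" hashing, for integers in 0..K and for
  ten-character strings, following the description in the text }
const
    B = 8;               { number of buckets, as in Example 4.6 }
    C = 354;             { chosen so that B*C*C is about K*K for K = 1000 }
type
    nametype = packed array[1..10] of char;
var
    name: nametype;

function hmid ( n: integer ) : integer;
    { h(n) = [n*n / C] mod B extracts a middle base-B digit of n squared }
    begin
        hmid := ((n * n) div C) mod B
    end; { hmid }

function hmidstring ( x: nametype ) : integer;
    var
        i: integer;
        block, sum: int64;
    begin
        sum := 0;
        block := 0;
        for i := 1 to 10 do begin
            block := block * 128 + ord(x[i]);   { build a base-128 block }
            if i mod 4 = 2 then begin           { blocks end at positions 2, 6, 10 }
                sum := sum + block;
                block := 0
            end
        end;
        hmidstring := ((sum * sum) div C) mod B
    end; { hmidstring }

begin
    writeln('hmid(1000) = ', hmid(1000));
    name := 'abcd      ';
    writeln('hmidstring(abcd) = ', hmidstring(name))
end.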

Analysis of Closed Hashing

In a closed hashing scheme the speed of insertion and other operations depends not only on how randomly the hash function distributes the elements into buckets, but also on how well the rehashing strategy avoids additional collisions when a bucket is already filled. For example, the linear strategy for resolving collisions is not as good as possible. While the analysis is beyond the scope of this book, we can observe the following. As soon as a few consecutive buckets are filled, any key that hashes to one of them will be sent, by the rehash strategy, to the end of the group, thereby increasing the length of that consecutive group. We are thus likely to find more long runs of consecutive buckets that are filled than if elements filled buckets at random. Moreover, runs of filled blocks cause some long sequences of tries before a newly inserted element finds an empty bucket, so having unusually large runs of filled buckets slows down insertion and other operations.
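The effect of clustering is easy to see experimentally. The following program is only an illustrative sketch, not part of the text: it fills a table of an arbitrarily chosen prime size B = 1009 to roughly 90% of capacity with random keys, once using linear rehashing and once using a key-dependent step ("double hashing") as a stand-in for random rehashing, and it reports the average number of probes per insertion; the random and randomize routines are those of Free Pascal or Turbo Pascal. Linear rehashing should come out noticeably worse.

program probingdemo;
{ compares linear rehashing with a key-dependent ("double hashing")
  rehash strategy by filling a closed hash table to about 90% }
const
    B = 1009;                      { a prime table size }
    FILLTO = 908;                  { about 90% of B }
var
    table: array[0..B-1] of integer;
    i, key: integer;
    probes: longint;

procedure clear;
    var
        i: integer;
    begin
        for i := 0 to B-1 do
            table[i] := -1         { -1 marks an empty bucket }
    end;

function insert ( k, step: integer ) : integer;
    { probes k mod B, (k + step) mod B, (k + 2*step) mod B, ... until an
      empty bucket is found; returns the number of probes used. step must
      not be a multiple of B, so every bucket is eventually tried }
    var
        bucket, tries: integer;
    begin
        bucket := k mod B;
        tries := 1;
        while table[bucket] <> -1 do begin
            bucket := (bucket + step) mod B;
            tries := tries + 1
        end;
        table[bucket] := k;
        insert := tries
    end;

begin
    randomize;
    clear;                          { linear rehashing: step is always 1 }
    probes := 0;
    for i := 1 to FILLTO do begin
        key := random(1000000);
        probes := probes + insert(key, 1)
    end;
    writeln('linear rehashing: ', (probes / FILLTO):6:2, ' probes per insertion');
    clear;                          { key-dependent step, coprime with the prime B }
    probes := 0;
    for i := 1 to FILLTO do begin
        key := random(1000000);
        probes := probes + insert(key, 1 + key mod (B - 1))
    end;
    writeln('double hashing:   ', (probes / FILLTO):6:2, ' probes per insertion')
end.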

We might wish to know how many tries (or probes) are necessary on the average to insert an element when N out of B buckets are filled, assuming all combinations of N out of B buckets are equally likely to be filled. It is generally assumed, although not proved, that no closed hashing strategy can give a better average time performance than this for the dictionary operations. We shall then derive a formula for the cost of insertion if the alternative locations used by the rehashing strategy are chosen at random. Finally, we shall consider some strategies for rehashing that approximate random behavior.

The probability of a collision on our initial probe is N/B. Assuming a collision, our first rehash will try one of B - 1 buckets, of which N - 1 are filled, so the probability of at least two collisions is N(N-1)/B(B-1). Similarly, the probability of at least i collisions is

    N(N-1) · · · (N-i+1) / B(B-1) · · · (B-i+1)                (4.3)

If B and N are large, this probability approximates (N/B)^i. The average number of probes is one (for the successful insertion) plus the sum over all i ≥ 1 of the probability of at least i collisions, that is, approximately

    1 + (N/B) + (N/B)² + · · · = B/(B-N)

It can be shown that the exact value of the summation when formula (4.3) is substituted for (N/B)^i is (B+1)/(B-N+1), so our approximation is a good one except when N is very close to B.
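As a small numerical illustration (added here; the particular B is arbitrary), the program below compares the approximation B/(B - N) with the exact value (B + 1)/(B - N + 1) for B = 1000 at a few values of N, showing close agreement until N gets very near B.

program probeformulas;
{ compares the approximate and the exact expected number of probes for an
  insertion into a closed hash table with N of B = 1000 buckets filled }
const
    B = 1000;

procedure report ( n: integer );
    begin
        writeln('N = ', n:4, ':   approx B/(B-N) = ', (B / (B - n)):8:2,
                '   exact (B+1)/(B-N+1) = ', ((B + 1) / (B - n + 1)):8:2)
    end;

begin
    report(100);
    report(500);
    report(900);
    report(990);
    report(999)
end.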

Observe that B/(B-N) grows very slowly as N begins to grow from 0 to B - 1, the largest value of N for which another insertion is possible. For example, if N is half B, then about two probes are needed for the next insertion, on the average. The average insertion cost per bucket to fill M of the B buckets is

    (1/M)(B/B + B/(B-1) + · · · + B/(B-M+1))

which is approximately (B/M) log_e(B/(B-M)). Thus, to fill the table completely (M = B) requires an average of log_e B probes per bucket, or B log_e B probes in total. However, to fill the table to 90% of capacity (M = .9B) only requires B((10/9) log_e 10), or approximately 2.56B, probes.

The average cost of the membership test for a nonexistent element is the same as
