
Appendix B

Data Structures

B.1 Introduction

Crafting a successful compiler requires attention to many details. This appendix explores minor design details that have an impact on a compiler’s implementation. In most cases, the details would burden the discussion in the body of the book. Thus, we have gathered them together into this appendix.

B.2 Representing Sets

B.3 Implementing Intermediate Forms

B.4 Implementing Hash-tables

The two central problems in hash-table implementation are ensuring that the hash function produces a good distribution of integers (at any of the table sizes that will be used) and resolving collisions in an efficient way.

Finding good hash functions is difficult. Fortunately, hashing has been in use long enough that many good functions have been described in the literature. Section B.4.1 describes two related approaches that, in practice, produce good results.

The problem of collision handling drives the layout of hash tables. Many collision-resolution schemes have been proposed. We will discuss the two most widely used schemes. Section B.4.2 describes open hashing, sometimes called a bucket hash, while Section B.4.3 presents an alternative scheme called open addressing, or rehashing. Section B.4.5 shows how to incorporate the mechanisms for lexical scoping into these schemes. The final section deals with a practical issue that arises in a compiler development environment—frequent changes to the hash-table definition.


Digression: Organizing a Search Table

In designing a symbol table, the first decision that the compiler writer faces concerns the organization of the table and its searching algorithm. As in many other applications, the compiler writer has several choices.

Linear List: A linear list can expand to arbitrary size. The search algorithm is a single, small, tight loop. Unfortunately, the search algorithm requires O(n) probes per lookup, on average. This single disadvantage probably outweighs simplicity of implementation and expansion. To justify using a linear list, the compiler writer needs strong evidence that the procedures being compiled have very few names.

Binary Search: To retain the easy expansion of the linear list while improving search time, the compiler writer might use a balanced binary tree. Ideally, a binary tree should allow O(log2 n) probes per lookup; this is a considerable improvement over the linear list. Many algorithms have been published for balancing search trees [27, 12]. (Similar effects can be achieved by using a binary search of an ordered table, but the table makes insertion and expansion more difficult.)

Hash Table: A hash table may minimize access costs. The implementation computes a table index directly from the name. As long as that computation produces a good distribution of indices, the average access cost should be O(1). The worst case, however, can devolve into linear search. The compiler writer can take steps to decrease the likelihood of this happening, but pathological cases may still occur. Many hash-table implementations have inexpensive schemes for expansion.

Multiset Discrimination: To avoid worst-case behavior, the compiler writer can use an off-line technique called multiset discrimination. It creates a unique index for each identifier, at the cost of an extra pass over the source text. This technique avoids the possibility of pathological behavior that always exists with hashing. (See the digression “An Alternative to Hashing” on page 156 for more details.)

Of these organizations, the most common choice appears to be the hash table. It provides better compile-time behavior than the linear list or binary tree, and the implementation techniques have been widely studied and taught.


B.4.1 Choosing a Hash Function

The importance of a good hash function cannot be overemphasized. A hash function that produces a bad distribution of index values directly increases the average cost of inserting items into the table and of finding such items later. Fortunately, many good hash functions have been documented in the literature. The multiplicative hash function, described by Knuth [38] and by Cormen et al. [27], works well and has efficient implementations. The notion of a universal hash function, described by Cormen et al. [27], leads to a related hash function that incorporates randomization to avoid pathological behavior.

Multiplicative Hash Function The multiplicative hash function is deceptively simple. The programmer chooses a single constant, C, and uses it in the following formula:

h(key) = ⌊TableSize × ((C · key) mod 1)⌋

where C is the constant, key is the integer being used as a key into the table, and TableSize is, rather obviously, the current size of the hash table. Knuth suggests computing C as an integer constant A divided by the word size of the computer, so that 0 < C < 1. The effect of the function is to compute C · key, take its fractional part with the mod 1 operation, and multiply the result by the size of the table.
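To make the formula concrete, the following C sketch shows one common way to render the multiplicative scheme in integer arithmetic; the function name and the assumption of a 32-bit word are ours, and the constant is the golden-ratio multiplier that Knuth discusses. With C = A / 2^32, the fractional part of C · key occupies the high bits of the 32-bit product A · key, so for TableSize = 2^k the index is simply the top k bits of that product.

    #include <stdint.h>

    /* Multiplicative hash: h(key) = floor(TableSize * ((C * key) mod 1)),
       with C = A / 2^32 and TableSize = 2^k.                            */
    #define A 2654435769u        /* ~ 2^32 / golden ratio, per Knuth */

    unsigned mult_hash(uint32_t key, unsigned k) {   /* requires 0 < k < 32 */
        return (uint32_t)(A * key) >> (32 - k);
    }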

Universal Hash Functions To implement a universal hash function, the programmer designs a family of functions that can be parameterized by a small set of constants. At execution time, a set of values for the constants is chosen at random—either by using random numbers for the constants or by selecting a random index into a set of previously tested constants. (The same constants are used throughout a single execution of the program that uses the hash function, but the constants vary from execution to execution.) By varying the hash function on each execution of the program, a universal hash function produces different distributions on each run of the program. In a compiler, if the input program produced pathological behavior in some particular compile, it should not produce the same behavior in subsequent compiles.

To implement a universal version of the multiplicative hash function, the programmer has two choices. The simplest strategy is to generate a random integer A at the beginning of execution, form A/w, and use that as C in the computation. The other alternative is to pick a set of ten to twenty integer constants A, encode them in a table, and, at run-time, select one constant at random for use in computing C (as A/w). The advantage of the latter method is that the programmer can screen the constants for good behavior. In the former approach, the random number generator directly changes the distribution of keys across the table. In the latter approach, the random number generator only needs to be good enough to pick an integer index into the table—the constants in the table have been pre-selected for reasonable behavior.
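A minimal sketch of the second alternative follows. The specific constants here are illustrative stand-ins rather than pre-screened values; a real implementation would substitute multipliers it has tested for good distributions, and might prefer a better seed than the wall clock.

    #include <stdint.h>
    #include <stdlib.h>
    #include <time.h>

    /* Pre-tabulated multipliers; one is chosen at random once per run. */
    static const uint32_t A_table[] = {
        2654435769u, 2246822519u, 3266489917u, 668265263u, 374761393u
    };
    static uint32_t A_run;        /* the multiplier for this execution */

    void hash_setup(void) {
        srand((unsigned)time(NULL));
        A_run = A_table[rand() % (sizeof A_table / sizeof A_table[0])];
    }

    unsigned univ_hash(uint32_t key, unsigned k) {   /* TableSize = 2^k */
        return (A_run * key) >> (32 - k);
    }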


Digression: The Perils of Poor Hash Functions

The choice of a hash function has a critical impact on the cost of table insertions and lookups. This is a case where a small amount of attention can make a large difference.

Many years ago, we saw a student implement the following hash function for character strings: (1) break the key into four-byte chunks, (2) exclusive-or them together, and (3) take the resulting number, modulo the table size, as the index. The function is relatively fast and has a straightforward implementation. With some table sizes, it produces adequate distributions.

When the student inserted this implementation into a system that performed source-to-source translation on Fortran programs, several independent facts combined to create an algorithmic disaster. First, the implementation language padded character strings with blanks to reach a four-byte boundary. Second, the student chose an initial table size of 2048. Finally, Fortran programmers use many one- and two-character variable names, such as i, j, k, x, y, and z.

All the short variable names fit in a single four-byte word, so no exclusive-or is needed. However, taking x mod 2048 simply keeps the final eleven bits of x. For a short, blank-padded name, those final bits come entirely from the trailing blanks, which are identical for every such name; thus, all short variable names produce the same index, and the table devolved rapidly into a linear search. While this particular hash function is far from ideal, simply changing the table size to 2047 eliminated the most noticeable effects.
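For concreteness, here is a hedged reconstruction of that function in C; the name bad_hash is ours, and the sketch assumes a big-endian word layout, as on the machines of that era, so that the trailing blanks of a padded name land in the low-order bits that x mod 2048 keeps.

    #include <stdint.h>
    #include <string.h>

    /* XOR-fold the blank-padded key in four-byte chunks, then take the
       result modulo the table size. With size = 2048, a power of two,
       the mod keeps only the low eleven bits -- which, for one- and
       two-character names on a big-endian machine, come entirely from
       the blank padding.                                               */
    unsigned bad_hash(const char *key, size_t len, unsigned size) {
        uint32_t h = 0;
        for (size_t i = 0; i < len; i += 4) {  /* len is a multiple of 4 */
            uint32_t chunk;
            memcpy(&chunk, key + i, 4);
            h ^= chunk;
        }
        return h % size;
    }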

B.4.2 Open Hashing

Open hashing, also called bucket hashing, assumes that the hash function h will produce collisions. It relies on h to partition the set of input keys into a fixed number of sets, or buckets. Each bucket contains a linear list of records, one record per name. LookUp(n) walks the linear list stored in the bucket indexed by h(n) to find n. Thus, LookUp requires one evaluation of h(n) and the traversal of a linear list. Evaluating h(n) should be fast; the list traversal will take time proportional to the length of the list. For a table of size S with N names, the cost per lookup should be roughly O(N/S). As long as h distributes names fairly uniformly and the ratio of names to buckets is kept small, this cost approximates our goal: O(1) time for each access.
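As a sketch of these two operations in C: the declarations below assume a string hash function h (any of the functions from Section B.4.1, adapted to fold a name's characters into an integer), and the names Table, Node, lookup, and insert are ours.

    #include <stdlib.h>
    #include <string.h>

    extern unsigned h(const char *name);   /* assumed string hash */

    typedef struct Node {
        const char  *name;
        struct Node *next;                 /* next record in this bucket */
    } Node;

    typedef struct {
        Node   **bucket;                   /* S chain heads */
        unsigned S;
    } Table;

    Node *lookup(Table *t, const char *n) {
        for (Node *p = t->bucket[h(n) % t->S]; p != NULL; p = p->next)
            if (strcmp(p->name, n) == 0)
                return p;
        return NULL;                       /* not in the table */
    }

    void insert(Table *t, const char *n) {
        unsigned i = h(n) % t->S;
        Node *p = malloc(sizeof *p);
        p->name = n;
        p->next = t->bucket[i];            /* add at the front of the chain */
        t->bucket[i] = p;
    }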

Figure B.1 shows a small hash table implemented with this scheme. It assumes that h(a) = h(d) = 3 to create a collision. Thus, a and d occupy the same bucket in the table. The list structure links them together and reveals that a was entered before d; after all, Insert should add to the front of the list rather than the end.

Open hashing has several advantages. Because it creates a new node in one of the linked lists for every inserted name, it can handle an arbitrarily large number of names without running out of space. An excessive number of entries in one bucket does not affect the cost of access in other buckets. Because the concrete representation for the buckets is usually an array of pointers, the overhead for increasing S is small—one pointer for each added bucket. (This makes it less expensive to keep N/S small; the cost per name is constant.) Choosing S as a power of two reduces the cost of the inevitable mod operation required to implement h.


 

[Figure B.1: Open hashing table — buckets 0 through 9; b hangs from bucket 1, c from bucket 9, and d and a are chained together in bucket 3, the slot selected by h(d) = h(a).]

The primary drawbacks for open hashing relate directly to these advantages. Both can be managed.

1. Open hashing can be allocation intensive. Each insertion allocates a new record. When implemented on a system with heavyweight memory allocation, this cost may be noticeable. Using a lighter-weight mechanism, such as arena-based allocation (see the digression on page 195), can alleviate the problem.

2. If any particular set gets large, LookUp degrades into linear search. With a reasonably behaved hash function, this occurs only when N is much larger than S. The implementation should detect this problem and enlarge the array of buckets. Typically, this involves allocating a new array of buckets and re-inserting each entry from the old table into the new table, as in the sketch below.
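Using the Table and Node declarations from the earlier sketch, that expansion might look as follows; the name expand and the use of calloc are our assumptions.

    /* Allocate a larger bucket array and re-insert every node. The order
       within each chain is reversed in the process, which open hashing
       tolerates.                                                        */
    void expand(Table *t, unsigned newS) {
        Node **nb = calloc(newS, sizeof *nb);
        for (unsigned i = 0; i < t->S; i++) {
            Node *p = t->bucket[i];
            while (p != NULL) {
                Node    *q = p->next;
                unsigned j = h(p->name) % newS;
                p->next = nb[j];           /* push onto the new chain */
                nb[j]   = p;
                p       = q;
            }
        }
        free(t->bucket);
        t->bucket = nb;
        t->S      = newS;
    }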

A well-implemented open hash table provides efficient access with low overhead in both space and time.

To improve the behavior of the linear search performed inside a single bucket, the compiler can dynamically reorder the chain. Rivest and Tarjan showed that the appropriate strategy is to move a node up the chain by one position on each lookup.
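A sketch of that reordering, under the same assumed declarations: on a successful lookup, the matching node swaps places with its predecessor in the chain, so frequently sought names drift toward the front.

    /* LookUp with the move-up-one reordering described above. */
    Node *lookup_reorder(Table *t, const char *n) {
        unsigned i = h(n) % t->S;
        Node *pp = NULL, *prev = NULL;     /* pp trails prev in the chain */
        for (Node *p = t->bucket[i]; p != NULL;
             pp = prev, prev = p, p = p->next)
            if (strcmp(p->name, n) == 0) {
                if (prev != NULL) {        /* move p up by one position */
                    prev->next = p->next;
                    p->next    = prev;
                    if (pp != NULL) pp->next = p;
                    else            t->bucket[i] = p;
                }
                return p;
            }
        return NULL;
    }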

B.4.3 Open Addressing

Open addressing, also called rehashing, handles collisions by computing an alternative index for the names whose normal slot, at h(n), is already occupied. In this scheme, LookUp(n) computes h(n) and examines that slot. If it is empty, LookUp fails. If it finds n, LookUp succeeds. If it finds a name other than n, it uses a second function g(n) to compute an increment for the search and probes the table at (h(n) + g(n)) mod S, (h(n) + 2 × g(n)) mod S, (h(n) + 3 × g(n)) mod S, and so on, until it either finds n, finds an empty slot, or returns to h(n) a second time. If it finds an empty slot, or if it returns to h(n) a second time, LookUp fails.
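The following C sketch renders this search; the slot array, the secondary hash g, and the function names are our assumptions.

    #include <string.h>

    extern unsigned h(const char *name);   /* primary hash, as before */
    extern unsigned g(const char *name);   /* assumed secondary hash */

    /* Probe h(n), h(n)+g(n), h(n)+2g(n), ... (mod S). Returns the slot
       index holding n, or -1 on failure. Assumes S and g(n) are chosen
       so that the sequence eventually revisits h(n).                   */
    int lookup_oa(const char **slot, unsigned S, const char *n) {
        unsigned i    = h(n) % S;
        unsigned step = g(n) % S;
        for (unsigned probes = 0; probes < S; probes++) {
            if (slot[i] == NULL)
                return -1;                 /* empty slot: n is absent */
            if (strcmp(slot[i], n) == 0)
                return (int)i;             /* found n */
            i = (i + step) % S;
        }
        return -1;                         /* wrapped around to h(n) */
    }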


 

[Figure B.2: Open Addressing Table — slots 0 through 9; b occupies slot 1, a occupies slot 3 (the slot h(d) would select), d was rehashed into slot 5, and c occupies slot 9.]

Figure B.2 shows a small hash table implemented with this scheme. It uses the same data as the table shown in Figure B.1. As before, h(a) = h(d) = 3, while h(b) = 1 and h(c) = 9. When d was inserted, it produced a collision with a. The secondary hash function g(d) produced 2, so Insert placed d at index 5 in the table. In effect, open addressing builds chains of items similar to those used in open hashing. In open addressing, however, the chains are stored directly in the table, and a single table location can serve as the starting point for multiple chains, each with a different increment produced by g.

This scheme makes a subtle tradeoff of space against speed. Since each key is stored in the table, S must be larger than N. If collisions are infrequent, because h and g produce good distributions, then the rehash chains stay short and access costs stay low. Because the implementation can recompute g inexpensively, it need not store pointers to form the rehash chains—a savings of N pointers. This saved space goes into making the table larger; the larger table improves behavior by lowering the collision frequency. The primary advantage of open addressing is simple: lower access costs through shorter rehash chains.

Open addressing has two primary drawbacks. Both arise as N approaches S and the table becomes full.

1. Because rehash chains thread through the index table, a collision between n and m can interfere with a subsequent insertion of p. If h(n) = h(m) and (h(m) + g(m)) mod S = h(p), then inserting n followed by m fills p's slot in the table. When the scheme behaves well, this problem has a minor impact. As N approaches S, it can become pronounced.

2. Because S must be at least as large as N, the table must be expanded if N grows too large. (Similarly, the implementation may expand S when some chain becomes too long.) With open addressing, expansion is needed for correctness; with open hashing, it is merely a matter of efficiency.

Some implementations use a constant function for g. This simplifies the implementation and reduces the cost of computing secondary indices. However, it creates a single rehash chain for each value of h, and it has the effect of merging rehash chains whenever a secondary index encounters an already-occupied table slot. These two disadvantages outweigh the cost of evaluating a second hash function. A more reasonable choice is to use two multiplicative hash functions with different constants—selected randomly at startup from a table of constants, if possible.

[Figure B.3: Stack allocation for records — an index set of slots 0 through 9 whose non-empty entries point into a separately allocated stack of symbol records; a next-slot pointer marks the top of the stack.]

The table size, S, plays an important role in open addressing. LookUp must recognize when it reaches a table slot that it has already visited; otherwise, it will not halt on failure. To make this efficient, the implementation should ensure that the probe sequence eventually returns to h(n). If S is a prime number, then any choice of 0 < g(n) < S generates a series of probes p1, p2, ..., pS with the property that p1 = h(n), pi ≠ h(n) for 1 < i ≤ S, and pS+1 = p1. That is, LookUp will examine every slot in the table before it returns to h(n). Since the implementation may need to expand the table, it should include a table of appropriately sized prime numbers. A small set of primes suffices, due to the realistic limits on both program size and the memory available to the compiler.
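A minimal sketch of such a size table follows; the particular primes are illustrative, and the half-full threshold is one reasonable policy rather than a prescription from the text.

    /* Primes for successive table sizes; return the first one that
       gives the table comfortable slack over the expected name count. */
    static const unsigned prime_size[] = {
        127, 509, 1021, 2053, 4093, 8191, 16381, 32749, 65521
    };

    unsigned next_size(unsigned names) {
        for (unsigned i = 0;
             i < sizeof prime_size / sizeof prime_size[0]; i++)
            if (prime_size[i] > 2 * names)   /* keep the table half empty */
                return prime_size[i];
        return 0;                            /* beyond supported sizes */
    }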

B.4.4 Storing Symbol Records

Neither open hashing nor open addressing directly addresses the issue of how to allocate space for the recorded information associated with each hash table key. With open hashing, the temptation is to allocate the records directly in the nodes that implement the chains. With open addressing, the temptation is to avoid the pointers and make each entry in the index table be a symbol record. Both these approaches have drawbacks. We may achieve better results by using a separately allocated stack to hold the records.

Figure B.3 depicts this implementation. In an open hashing implementation, the chain lists themselves can be implemented on the stack. This lowers the cost of allocating individual records—particularly if allocation is a heavyweight operation. In an open addressing implementation, the rehash chains are still implicit in the index set, preserving the space saving that motivated the idea.

In this design, the actual records form a dense table, which is better for external i/o. With heavyweight allocation, this scheme amortizes the cost of a large allocation over many records. With a garbage collector, it decreases the number of objects that must be marked and collected. In either case, having a dense table makes it more efficient to iterate over all of the symbols in the table—an operation that the compiler will use to perform tasks such as assigning storage locations.


 

 

 

 

[Figure B.4: Lexical Scoping in an Open Hashing Table — each bucket chain holds ⟨name, level⟩ records, with inner-scope records such as ⟨c,2⟩, ⟨x,2⟩, and ⟨a,1⟩ linked ahead of outer-scope records such as ⟨c,0⟩, ⟨x,1⟩, and ⟨a,0⟩.]

As a final advantage, this scheme drastically simplifies the task of expanding the index set. To expand the index set, the compiler discards the old index set, allocates a larger one, and reinserts the records into the new table, working from the bottom of the stack to the top. This eliminates the need to have both the old and the new table in memory at the same time. Iterating over the dense table takes less work, in general, than chasing pointers to traverse the lists in open hashing. It also avoids iterating over empty table slots, as can happen when open addressing expands the index set to keep chain lengths short.

The compiler need not allocate the entire stack as a single object. Instead, the stack can be implemented as a chain of nodes that each hold k records. When a node is full, the implementation allocates a new node, adds it to the end of the chain, and continues. This provides the compiler writer with fine-grain control over the tradeoff between allocation cost and wasted space.
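A sketch of that chunked stack, with K records per node; all of the names here are our assumptions.

    #include <stdlib.h>

    enum { K = 256 };                     /* records per chunk */

    typedef struct Record { const char *name; int level; } Record;

    typedef struct Chunk {
        Record        recs[K];
        struct Chunk *prev;               /* toward the bottom of the stack */
    } Chunk;

    typedef struct { Chunk *top; unsigned used; } Stack;

    /* Return a pointer to the next free record, growing by one chunk
       when the current one fills.                                      */
    Record *push(Stack *s) {
        if (s->top == NULL || s->used == K) {
            Chunk *c = malloc(sizeof *c);
            c->prev  = s->top;
            s->top   = c;
            s->used  = 0;
        }
        return &s->top->recs[s->used++];
    }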

B.4.5 Adding Nested Lexical Scopes

Section 6.7.3 described the issues that arise in creating a symbol table to handle nested lexical scopes. It described a simple implementation that created a sheaf of symbol tables, one per level. While that implementation is conceptually clean, it pushes the overhead of scoping into LookUp, rather than into InitializeScope, FinalizeScope, and Insert. Since the compiler invokes LookUp many more times than it invokes these other routines, other implementations deserve consideration.

Consider again the code in Figure 6.9. It generates the following actions:

↑ ⟨a,0⟩ ⟨b,0⟩ ⟨c,0⟩ ↑ ⟨b,1⟩ ⟨z,1⟩ ↓ ↑ ⟨a,1⟩ ⟨x,1⟩ ↑ ⟨c,2⟩ ⟨x,2⟩ ↓ ↓ ↓

where ↑ shows a call to InitializeScope(), ↓ is a call to FinalizeScope(), and a pair ⟨name,n⟩ is a call to Insert that adds name at level n.

Adding Lexical Scopes to Open Hashing Consider what would happen in an open-hashing table if we simply added a lexical-level field to the record for each name and inserted each new name at the front of its chain. Insert could check for duplicates by comparing both names and lexical levels. LookUp would return the first record that it discovered for a given name. InitializeScope simply bumps an internal counter for the current lexical level.


[Figure B.5: Lexical Scoping in a Stack Allocated Open Hashing Table — the index set entries for a, b, c, x, and z point into a record stack that holds, from bottom to top, ⟨a,0⟩, ⟨b,0⟩, ⟨c,0⟩, ⟨a,1⟩, ⟨x,1⟩, ⟨c,2⟩, and ⟨x,2⟩; tos marks the top of the stack.]

This scheme pushes the complications into FinalizeScope, which must not only decrement the current lexical level but also remove the records for any names inserted in this scope.

If open hashing is implemented with individually allocated nodes for its chains, as shown in Figure B.1, then FinalizeScope must find the records for the scope being discarded and remove them from their respective chains. If they will not be used later in the compiler, it must deallocate them; otherwise, it must chain them together to preserve them. Figure B.4 shows the table that this approach would produce, at the assignment statement in Figure 6.9.

With stack-allocated records, FinalizeScope can iterate from the top of the stack downward until it reaches a record for some level below the level being discarded. For each record, it updates the index-set entry with the record's pointer to the next item on the chain. If the records are being discarded, FinalizeScope resets the pointer to the next available slot; otherwise, the records are preserved, together, on the stack. Figure B.5 shows the symbol table for our example, at the assignment statement.
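In outline, with a dense record stack and an index set of chain heads, FinalizeScope might look like this; the fixed array sizes and all names are our assumptions, and the sketch assumes the chains have not been reordered, so each record popped from the stack is the head of its chain.

    extern unsigned h(const char *name);  /* assumed string hash */

    enum { S = 1021, MAXRECS = 4096 };

    typedef struct {
        const char *name;
        int         level;
        int         chain;   /* stack index of next record in bucket; -1 ends */
    } Rec;

    Rec rec[MAXRECS];        /* the record stack */
    int top = 0;             /* next free stack slot */
    int bucket[S];           /* index set; initialized elsewhere to -1 */
    int current_level = 0;

    /* Pop every record of the dying scope. Each popped record is the
       newest insertion, so it sits at the front of its chain; restoring
       bucket[...] from its chain field unhooks it in O(1).             */
    void finalize_scope(void) {
        while (top > 0 && rec[top - 1].level == current_level) {
            top--;
            bucket[h(rec[top].name) % S] = rec[top].chain;
        }
        current_level--;
    }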

With a little care, dynamic reordering of the chain can be added to this scheme. Since FinalizeScope() uses the stack ordering, rather than the chain ordering, it will still find all the top-level names at the top of the stack. With reordered chains, the compiler either needs to walk the chain to remove each deleted name record, or to doubly-link the chains to allow for quicker deletion.

Adding Lexical Scopes to Open Addressing With an open addressing table, the situation is slightly more complex. Slots in the table are a critical resource; when all the slots are filled, the table must be expanded before further insertion can occur. Deletion from a table that uses rehashing is difficult; the implementation cannot easily tell whether the deleted record falls in the middle of some rehash chain. Thus, marking the slot empty breaks any chain that passes through that location (rather than ending there). This argues against storing discrete records for each variant of a name in the table. Instead, the compiler should link only one record per name into the table; it can create a chain of superseded records for older variants. Figure B.6 depicts this situation for the continuing example.

This scheme pushes most of the complexity into Insert and FinalizeScope.


[Figure B.6: Lexical Scoping in an Open Addressing Table — the index set holds one reference per name, pointing at the newest record on the stack (⟨x,2⟩, ⟨c,2⟩, ⟨a,1⟩, ⟨b,0⟩); each record links down to its superseded variant (⟨x,1⟩, ⟨c,0⟩, ⟨a,0⟩).]

Insert creates a new record on the top of the stack. If it finds an older declaration of the same name in the index set, it replaces that reference with a reference to the new record and links the older reference into the new record. FinalizeScope iterates over the top of the stack, as in open hashing. To remove a record that has an older variant, it simply relinks the index set to point at the older variant. To remove the final variant of a name, it must insert a reference to a specially designated record that denotes a deleted reference. LookUp must recognize the deleted reference as occupying a slot in the current chain. Insert must know that it can replace a deleted reference with any newly inserted symbol.

This scheme, in essence, creates separate chains for collisions and for redeclarations. Collisions are threaded through the index set. Redeclarations are threaded through the stack. This should reduce the cost of LookUp slightly, since it avoids examining more than one record for any single name.
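In skeleton form, the two chains look like this; SRec, the tombstone, and the function names are ours, and locating idx by rehashing is assumed to happen elsewhere.

    typedef struct SRec {
        const char  *name;
        int          level;
        struct SRec *older;    /* superseded declaration, or NULL */
    } SRec;

    static SRec deleted;       /* the specially designated record */

    /* idx is the slot that rehashing found for this name. */
    void insert_scoped(SRec *slot[], int idx, SRec *fresh) {
        fresh->older = (slot[idx] == &deleted) ? NULL : slot[idx];
        slot[idx] = fresh;     /* collisions stay threaded in the index set */
    }

    void remove_scoped(SRec *slot[], int idx) {
        SRec *dying = slot[idx];
        /* Relink to the older variant; if none, leave a tombstone so
           rehash chains that pass through this slot stay intact.       */
        slot[idx] = (dying->older != NULL) ? dying->older : &deleted;
    }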

Consider a bucket in open hashing that contains seven declarations for x and a single declaration for y at level zero. LookUp might encounter all seven records for x before finding y. With the open addressing scheme, LookUp encounters one record for x and one record for y.

B.5 Symbol Tables for Development Environments

Most compilers use a symbol table as a central repository for information about the various names that arise in the source code, in the ir, and in the generated code. During compiler development, the set of fields in the symbol table seems to grow monotonically. Fields are added to support new passes. Fields are added to communicate information between passes. When the need for a field disappears, it may or may not be removed from the symbol table definition. As each field is added, the symbol table swells in size and any parts of the compiler with direct access to the symbol table must be recompiled.

We encountered this problem in the implementation of the Rn and ParaScope Programming Environments. The experimental nature of these systems led to a situation where additions and deletions of symbol-table fields were common. To address the problem, we implemented a more complex but more flexible structure for the symbol table—a two-dimensional hash table. This eliminated almost all changes to the symbol-table definition and its implementation.
