
3.3. TOP-DOWN PARSING


arrange the non-terminals in some order A1, A2, . . . , An

for i ← 1 to n
    for j ← 1 to i-1
        replace each production of the form Ai → Aj γ with the productions
            Ai → δ1γ | δ2γ | . . . | δkγ,
            where Aj → δ1 | δ2 | . . . | δk are all the current Aj productions
    eliminate any immediate left recursion on Ai using the direct transformation

Figure 3.2: Removal of left recursion

The transformation shown above eliminates immediate left recursion. Left recursion can also occur indirectly, when a chain of rules such as α → β, β → γ, and γ → αδ combines to create the situation that α →+ αδ. This indirect left recursion is not always obvious; it can be hidden through an arbitrarily long chain of productions.

To convert these indirect left recursions into right recursion, we need a more systematic approach than inspection followed by application of our transformation. Figure 3.2 shows an algorithm that achieves this goal. It assumes that the grammar has no cycles (A →+ A) and no ε-productions (A → ε).

The algorithm works by imposing an arbitrary order on the non-terminals. The outer loop cycles through the non-terminals in this order, while the inner loop ensures that a production expanding Ai has no non-terminal Aj with j < i. When it encounters such a non-terminal, it forward substitutes the non-terminal away. This eventually converts each indirect left recursion into a direct left recursion. The final step in the outer loop converts any direct recursion on Ai to right recursion using the simple transformation shown earlier. Because new non-terminals are added at the end of the order and only involve right recursion, the loop can ignore them—they do not need to be checked and converted.

Considering the loop invariant for the outer loop may make this clearer. At the start of the ith outer-loop iteration,

∀ k < i, no production expanding Ak has Al in its rhs, for l < k.

At the end of this process (i = n), all indirect left recursion has been eliminated through the repeated application of the inner loop, and all immediate left recursion has been eliminated in the final step of each iteration.
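The algorithm of Figure 3.2 is short enough to render directly in executable form. The following Python sketch is one possible rendering, not the book's code: it assumes a grammar represented as a dict mapping each non-terminal to a list of right-hand sides (tuples of symbols), and it names the fresh non-terminal introduced by the direct transformation by appending a prime.

```python
def eliminate_left_recursion(grammar, order):
    """Figure 3.2: remove direct and indirect left recursion.

    `grammar` maps each non-terminal to a list of right-hand sides
    (tuples of symbols); `order` lists the non-terminals A1..An.
    Assumes the input grammar has no cycles and no epsilon-productions.
    """
    for i, Ai in enumerate(order):
        # Inner loop: forward-substitute away any Aj, j < i, that
        # appears first on a right-hand side of Ai.
        for Aj in order[:i]:
            new_rhss = []
            for rhs in grammar[Ai]:
                if rhs and rhs[0] == Aj:
                    # Ai -> Aj gamma becomes Ai -> delta gamma for
                    # every current production Aj -> delta.
                    new_rhss.extend(delta + rhs[1:] for delta in grammar[Aj])
                else:
                    new_rhss.append(rhs)
            grammar[Ai] = new_rhss
        # Final step: the direct transformation on any immediate left
        # recursion that remains on Ai, introducing Ai'.
        recursive = [rhs[1:] for rhs in grammar[Ai] if rhs and rhs[0] == Ai]
        if recursive:
            rest = [rhs for rhs in grammar[Ai] if not (rhs and rhs[0] == Ai)]
            fresh = Ai + "'"
            grammar[Ai] = [rhs + (fresh,) for rhs in rest]
            grammar[fresh] = [rhs + (fresh,) for rhs in recursive] + [()]
    return grammar
```

On the indirectly left-recursive grammar A → B a, B → A b | c (where A →+ A b a), the inner loop first rewrites B → A b into B → B a b, and the final step then converts that direct recursion into right recursion on a new non-terminal B′.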

3.3.3 Predictive Parsing

When we parsed x − 2 × y with the right-recursive expression grammar, we did not need to backtrack. In fact, we can devise a parser for the right-recursive


expression grammar that never needs to backtrack. To see this, consider how the parser makes a decision that it must retract through backtracking.

The critical choice occurs when the parser selects a production with which to expand the lower fringe of the partially constructed parse tree. When it tries to expand some non-terminal α, it must pick a rule α → β. The algorithm, as shown in Figure 3.1, picks that rule arbitrarily. If, however, the parser could always pick the appropriate rule, it could avoid backtracking.

In the right-recursive variant of the expression grammar, the parser can make the correct choice by comparing the next word in the input stream against the right-hand sides. Look at the situation that arose in the derivation of x − 2 × y in the previous section. When the parser was in the state

 

    Rule or Action    Sentential Form    Input
    8                 Id Expr′           x ↑− 2 × y

it needed to choose an expansion for Expr′. The possible right-hand sides were: + Term Expr′, − Term Expr′, and ε. Since the next word in the input stream was −, the second choice is the only one that can succeed. The first choice generates a leading +, so it can never match and will lead directly to backtracking. Choosing ε can only match the end of string, since Expr′ can only occur at the right end of a sentential form.

Fortuitously, this grammar has a form where the parser can predict the correct expansion by comparing the possible right-hand sides against the next word in the input stream. We say that such a grammar is predictive; parsers built on this property are called predictive parsers.

Before going further, we should define the property that makes a grammar predictively parsable. For any two productions A → α | β, the set of initial terminal symbols derivable from α must be distinct from those derivable from β. If we define First(α) as the set of tokens that can appear as the first symbol in some string derived from α, then we want

First(α) ∩ First(β) = ∅

For an entire grammar G, the desired property is

∀ rules A → α1 | α2 | α3 | · · · | αn in G,

First(α1) ∩ First(α2) ∩ First(α3) ∩ · · · ∩ First(αn) = ∅

If this property holds true for each non-terminal in the grammar, then the grammar can be parsed predictively. Unfortunately, not all grammars are predictive.1 Consider, for example, a slightly more realistic version of our ongoing example. A natural extension to our right-recursive expression grammar would replace the production Factor → Identifier with a set of productions that describe the syntax for scalar variable references, array variable references, and

function calls.

1 This condition is also called the LL(1) condition. Any grammar that meets this condition can be used to construct a table-driven, LL(1) parser.
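First sets and the disjointness test above are easy to compute mechanically. The sketch below is a hypothetical Python rendering, not code from the text: grammars are dicts from non-terminals to lists of right-hand-side tuples, any symbol that is not a key counts as a terminal, and the empty string "" stands in for ε. The predictiveness check tests only pairwise disjointness of First sets, ignoring the Follow-set refinement that a full LL(1) test applies to ε-productions.

```python
def first_of_string(symbols, first_of):
    """First set of a symbol string: union the symbols' First sets left
    to right, stopping at the first symbol that cannot derive epsilon."""
    out = set()
    for sym in symbols:
        fs = first_of(sym)
        out |= fs - {""}
        if "" not in fs:
            return out
    out.add("")              # every symbol was nullable (or string empty)
    return out

def first_sets(grammar):
    """Compute First for every symbol by fixed-point iteration.
    Symbols that are not keys of `grammar` are terminals."""
    first = {}
    def f(sym):
        return first.setdefault(sym, set() if sym in grammar else {sym})
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_string(rhs, f) - f(nt)
                if new:
                    f(nt).update(new)
                    changed = True
    return first

def is_predictive(grammar, first):
    """For each non-terminal, the First sets of its alternative
    right-hand sides must be pairwise disjoint."""
    for rhss in grammar.values():
        sets = [first_of_string(rhs, lambda s: first.get(s, {s}))
                for rhs in rhss]
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:
                    return False
    return True
```

A grammar with two alternatives that both begin with Id fails the check, while its left-factored counterpart passes it.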


for each non-terminal A ∈ NT of G
    find the longest prefix α common to two or more right-hand sides
    if α ≠ ε then
        replace the rules expanding A,
            A → αβ1 | αβ2 | · · · | αβn | γ
        with
            A → α fie | γ
            fie → β1 | β2 | · · · | βn
        and add fie to NT

Repeat until no common prefixes remain.

Figure 3.3: Left Factoring a Grammar

11. Factor → Identifier
12.        | Identifier [ Exprlist ]
13.        | Identifier ( Exprlist )

20. Exprlist → Expr , Exprlist
21.          | Expr

Here, we show the C syntactic convention of using parentheses for function calls and square brackets for array references. This grammar fragment is not predictive, because the initial terminal symbol generated in rules 11, 12, and 13 is the same. Thus, the parser, in trying to expand a Factor on the parse tree’s lower fringe, cannot decide between 11, 12, and 13 on the basis of a single word lookahead. Of course, looking ahead two tokens would allow it to predict the correct expansion.

We can rewrite rules 11, 12, and 13 to make them predictive.

11. Factor → Identifier arguments

12. arguments → [ Exprlist ]
13.           | ( Exprlist )
14.           | ε

In this case, we were able to transform the grammar into a predictive grammar. In essence, we introduced a new non-terminal to represent the common prefix of the three rules that were not predictive. We can apply this transformation, systematically and mechanically, to an arbitrary grammar. Figure 3.3 shows a simple algorithm that does this. However, left factoring the grammar will not always produce a predictive grammar.
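Figure 3.3's transformation can likewise be sketched in code. The Python below is an illustrative version, not the book's, using the same grammar representation as before (a dict of tuples); where the figure introduces fie, this sketch names the new non-terminal by appending a prime.

```python
def common_prefix(a, b):
    """Longest common prefix of two right-hand sides (tuples)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def left_factor(grammar):
    """Left-factor `grammar`, Figure 3.3 style, repeating until no two
    alternatives of any non-terminal share a common prefix."""
    changed = True
    while changed:
        changed = False
        for nt in list(grammar):          # snapshot: we add keys below
            rhss = grammar[nt]
            # Find the longest prefix shared by two or more alternatives.
            best = ()
            for i in range(len(rhss)):
                for j in range(i + 1, len(rhss)):
                    p = common_prefix(rhss[i], rhss[j])
                    if len(p) > len(best):
                        best = p
            if best:                      # best != epsilon
                fresh = nt + "'"
                while fresh in grammar:   # ensure a unique name
                    fresh += "'"
                shared = [r for r in rhss if r[:len(best)] == best]
                others = [r for r in rhss if r[:len(best)] != best]
                grammar[nt] = [best + (fresh,)] + others
                grammar[fresh] = [r[len(best):] for r in shared]
                changed = True
    return grammar
```

Applied to the three Factor rules above, one pass factors out the shared Identifier prefix and produces exactly the shape of the rewritten rules 11 through 14 (with an ε-alternative represented as the empty tuple).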

Not all languages can be expressed in a predictive grammar. Using left-recursion elimination and left factoring, we may be able to transform a grammar to the point where it can be predictively parsed. In general, however, it is undecidable whether or not a predictive grammar exists for an arbitrary context-free language. For example, the language


 

{aⁿ0bⁿ | n ≥ 1} ∪ {aⁿ1b²ⁿ | n ≥ 1}

has no predictive grammar.

3.3.4 Top-Down Recursive Descent Parsers

Given a predictive grammar G, we can construct a hand-coded parser for G that operates by recursive descent. A recursive descent parser is structured as a set of mutually recursive routines, one for each non-terminal in the grammar. The routine for non-terminal A recognizes an instance of A in the input stream. To accomplish this, the routine invokes other routines to recognize the various non-terminals on A’s right-hand side.

Consider a set of productions A → β1 | β2 | β3. Since G is predictive, the parser can select the appropriate right hand side (one of β1, β2, or β3) using only the input token and the First sets. Thus, the code in the routine for A should have the form:

/* find an A */
if (current token ∈ First(β1))
    find a β1 & return true
else if (current token ∈ First(β2))
    find a β2 & return true
else if (current token ∈ First(β3))
    find a β3 & return true
else {
    report an error based on A and current token
    return false
}

For each right hand side A → βi, the routine needs to recognize each term in βi . The code must check for each term, in order.

For a terminal symbol, the code compares the input symbol against the specified terminal. If they match, it advances the input token and checks the next symbol on the right hand side. If a mismatch occurs, the parser should report the syntax error in a suitably informative message.

For a non-terminal symbol, the code invokes the routine that recognizes that non-terminal. That routine either returns true as an indication of success, or it reports an error to the user and returns false. Success allows the parser to continue, recognizing the next symbol on the right hand side, if any more exist. A return value of false forces the routine to return false to its calling context.

For a right hand side β1 = γδρ, with γ, ρ ∈ NT and δ ∈ T, the code needs to recognize a γ, a δ, and a ρ (abstracted away as “find a β1 & return true” in the previous code fragment). This code might look like:


if (current token ∈ First(β1)) {
    if (Parse γ() = false)
        return false
    else if (current token ≠ δ) {
        report an error finding δ in A → γδρ
        return false
    }
    current token ← next token()
    if (Parse ρ() = false)
        return false
    else
        return true
}

The routine Parse A will contain a code fragment like this for each alternative right-hand side for A.

To construct a complete recursive descent parser, then, the strategy is clear. For each non-terminal, we construct a routine that recognizes that non-terminal. Each routine relies on the other routines to recognize non-terminals, and directly tests the terminal symbols that arise in its own right-hand sides. Figure 3.4 shows a top-down recursive descent parser for the predictive grammar that we derived in Section 3.3.2. Notice that it repeats code for similar right-hand sides. For example, the code in EPrime() under the tests for ‘+’ and for ‘-’ could be combined to produce a smaller parser.
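A hand-coded parser in this style is compact in a real language as well. The following Python class is a sketch of such a parser for the right-recursive expression grammar, not a transcription of Figure 3.4: it folds the repeated ‘+’/‘-’ and ×/÷ code paths together (using * and / for the multiplicative operators), classifies words with str methods instead of a scanner, and signals errors by returning false rather than printing messages.

```python
class Parser:
    """Recursive-descent parser for the right-recursive expression
    grammar. Tokens are pre-split strings; numbers and identifiers
    are classified with str methods for simplicity."""

    def __init__(self, tokens):
        self.tokens = tokens + ["<eof>"]
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def parse(self):
        # A sentence is an Expr followed by end of input.
        return self.expr() and self.token == "<eof>"

    def expr(self):          # Expr -> Term EPrime
        return self.term() and self.eprime()

    def eprime(self):        # EPrime -> + Term EPrime | - Term EPrime | eps
        if self.token in ("+", "-"):
            self.advance()
            return self.term() and self.eprime()
        return True          # the epsilon alternative always succeeds

    def term(self):          # Term -> Factor TPrime
        return self.factor() and self.tprime()

    def tprime(self):        # TPrime -> * Factor TPrime | / Factor TPrime | eps
        if self.token in ("*", "/"):
            self.advance()
            return self.factor() and self.tprime()
        return True

    def factor(self):        # Factor -> ( Expr ) | Number | Identifier
        if self.token == "(":
            self.advance()
            if not self.expr() or self.token != ")":
                return False          # a real parser reports an error here
            self.advance()
            return True
        if self.token.isdigit() or self.token.isidentifier():
            self.advance()
            return True
        return False                  # a real parser reports an error here
```

Because the grammar is predictive, each routine commits to an alternative after inspecting a single token, and the parser never backtracks on the token stream for x − 2 × y.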

Automating the Process Top-down recursive-descent parsing is usually considered a technique for hand coding a parser. Of course, we could build a parser generator that automatically emitted top-down recursive-descent parsers for suitable grammars. The parser generator would first construct the necessary First sets for each grammar symbol, check each non-terminal to ensure that the First sets of its alternative right-hand sides are disjoint, and emit a suitable parsing routine for each non-terminal symbol in the grammar. The resulting parser would have the advantages of top-down recursive-descent parsers, such as speed, code-space locality, and good error detection. It would also have the advantages of a grammar-generated system, such as a concise, high-level specification and a reduced implementation effort.

3.4 Bottom-up Parsing

To parse a stream of words bottom-up, the parser begins by creating leaves in the parse tree for each word. This creates the base of the parse tree. To construct a derivation, it adds layers of non-terminals on top of the leaves, in a structure dictated by both the grammar and the input stream. The upper edge of this partially-constructed parse tree is called its upper frontier. Each layer extends the frontier upward, toward the tree’s root.

To do this, the parser repeatedly looks for a part of that upper frontier that matches the right-hand side of a production in the grammar. When it finds a


Main()
    token ← next token()
    if (Expr() = false)
        then return false
        else next compilation step

Expr()
    result ← true
    if (Term() = false)
        then result ← false
    else if (EPrime() = false)
        then result ← false
    return result

EPrime()
    result ← true
    if (token = ’+’) then
        token ← next token()
        if (Term() = false)
            then result ← false
        else if (EPrime() = false)
            then result ← false
    else if (token = ’-’) then
        token ← next token()
        if (Term() = false)
            then result ← false
        else if (EPrime() = false)
            then result ← false
    else result ← true    /* ε */
    return result

Term()
    result ← true
    if (Factor() = false)
        then result ← false
    else if (TPrime() = false)
        then result ← false
    return result

TPrime()
    result ← true
    if (token = ×) then
        token ← next token()
        if (Factor() = false)
            then result ← false
        else if (TPrime() = false)
            then result ← false
    else if (token = ÷) then
        token ← next token()
        if (Factor() = false)
            then result ← false
        else if (TPrime() = false)
            then result ← false
    else result ← true    /* ε */
    return result

Factor()
    result ← true
    if (token = ’(’) then
        token ← next token()
        if (Expr() = false)
            then result ← false
        else if (token ≠ ’)’) then
            report syntax error
            result ← false
        else token ← next token()
    else if (token = Number)
        then token ← next token()
    else if (token = Identifier)
        then token ← next token()
    else
        report syntax error
        result ← false
    return result

Figure 3.4: Recursive descent parser for expressions


Digression: Predictive parsers versus DFAs

Predictive parsing is the natural extension of dfa-style reasoning to parsers. A dfa makes its transition based on the next input character. A predictive parser requires that the expansions be uniquely determined by the next word in the input stream. Thus, at each non-terminal in the grammar, there must be a unique mapping from the first token in any acceptable input string to a specific production that leads to a derivation for that string. The real difference in power between a dfa and a predictively-parsable, or ll(1), grammar derives from the fact that one prediction may lead to a right-hand side with many symbols, whereas, in a regular grammar, it predicts only a single symbol. This lets predictive grammars include productions like

p → ( p )

which is beyond the power of a regular expression to describe. (Recall that a regular expression can recognize (+ Σ∗ )+, that is, one or more ‘(’, then any string, then one or more ‘)’, but this does not specify that the opening and closing parentheses must match.)

Of course, a hand-constructed top-down, recursive-descent parser can use arbitrary tricks to disambiguate production choices. For example, if a particular left-hand side cannot be predicted with a single symbol lookahead, the parser could use two symbols. Done judiciously, this should not cause problems.
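As a concrete, hypothetical illustration of that trick, the unfactored Factor rules from the previous section can be disambiguated by peeking one token past the identifier:

```python
def classify_factor(tokens, pos):
    """Resolve the unfactored rules
         Factor -> Identifier | Identifier [ Exprlist ] | Identifier ( Exprlist )
    by inspecting the token *after* the identifier at `pos` (two tokens
    of lookahead). Returns which alternative applies; a real parser
    would then go on to parse the bracketed expression list."""
    nxt = tokens[pos + 1] if pos + 1 < len(tokens) else "<eof>"
    if nxt == "[":
        return "array-reference"
    if nxt == "(":
        return "function-call"
    return "scalar-reference"
```

The helper name and return values here are invented for the example; the point is only that one extra token of lookahead, applied at this single decision point, removes the ambiguity without left factoring.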

match, it builds a node to represent the non-terminal symbol on the production’s left-hand side, and adds edges to the nodes representing elements of the right-hand side. This extends the upper frontier. This process terminates under one of two conditions.

1. The parser reduces the upper frontier to a single node that represents the grammar’s start symbol. If all the input has been consumed, the input stream is a valid sentence in the language.

2. No match can be found. Since the parser has been unable to build a derivation for the input stream, the input is not a valid sentence. The parser should issue an appropriate error message.

A successful parse runs through every step of the derivation. A failed parse halts when it can find no further steps, at which point it can use the context accumulated in the tree to produce a meaningful error message.

Derivations begin with the goal symbol and work towards a sentence. Because a bottom-up parser proceeds bottom-up in the parse tree, it discovers derivation steps in reverse order. Consider a production α → β, where β ∈ T. A bottom-up parser will “discover” the derivation step α → β before it discovers the step that derives α. Thus, if a derivation consists of a series of steps that produces a series of sentential forms

S0 = γ0 → γ1 → γ2 → · · · → γn−1 → γn = sentence,


the bottom-up parser will discover γn−1 → γn before it discovers γn−2 → γn−1. (It must add the nodes implied by γn−1 → γn to the frontier before it can discover any matches that involve those nodes. Thus, it cannot discover the nodes in an order inconsistent with the reverse derivation.) At each point, the parser will operate on the frontier of the partially constructed parse tree; the current frontier is a prefix of the corresponding sentential form in the derivation. Because the sentential form occurs in a rightmost derivation, the missing suffix consists entirely of terminal symbols.

Because the scanner considers the words in the input stream in a left-to-right order, the parser should look at the leaves from left to right. This suggests a derivation order that produces terminals from right to left, so that its reverse order matches the scanner’s behavior. This leads, rather naturally, to bottom-up parsers that construct, in reverse, a rightmost derivation.

In this section, we consider a specific class of bottom-up parsers called lr(1) parsers. Lr(1) parsers scan the input from left-to-right, the order in which scanners return classified words. Lr(1) parsers build a rightmost derivation, in reverse. Lr(1) parsers make decisions, at each step in the parse, based on the history of the parse so far and, at most, a lookahead of one symbol. The name lr(1) derives from these three properties: left-to-right scan, reverse-rightmost derivation, and 1 symbol of lookahead.2 Informally, we will say that a language has the lr(1) property if it can be parsed in a single left-to-right scan, to build a reverse rightmost derivation, using only one symbol of lookahead to determine parsing actions.

3.4.1 Using handles

The key to bottom-up parsing lies in using an efficient mechanism to discover matches along the tree’s current upper frontier. Formally, the parser must find some substring βγδ of the upper frontier where

1. βγδ is the right-hand side of some production α → βγδ, and

2. α → βγδ is one step in the rightmost derivation of the input stream.

It must accomplish this while looking no more than one word beyond the right end of βγδ.

We can represent each potential match as a pair ⟨α → βγδ, k⟩, where α → βγδ is a production in G and k is the position on the tree’s current frontier of the right end of δ. If replacing the occurrence of βγδ that ends at k with α is the next step in the reverse rightmost derivation of the input string, then ⟨α → βγδ, k⟩ is a handle of the bottom-up parse. A handle concisely specifies the next step in building the reverse rightmost derivation.

A bottom-up parser operates by repeatedly locating a handle on the frontier of the current partial parse tree, and performing the reduction that it specifies.

2 The theory of lr parsing defines a family of parsing techniques, the lr(k) parsers, for arbitrary k ≥ 0. Here, k denotes the amount of lookahead that the parser needs for decision making. Lr(1) parsers accept the same set of languages as lr(k) parsers.


 

 

 

 

 

 

 

 

 

      Token   Frontier                Handle                       Action
  1.  Id                              — none —                     extend
  2.  −       Id                      ⟨Factor → Id, 1⟩             reduce
  3.  −       Factor                  ⟨Term → Factor, 1⟩           reduce
  4.  −       Term                    ⟨Expr → Term, 1⟩             reduce
  5.  −       Expr                    — none —                     extend
  6.  Num     Expr −                  — none —                     extend
  7.  ×       Expr − Num              ⟨Factor → Num, 3⟩            reduce
  8.  ×       Expr − Factor           ⟨Term → Factor, 3⟩           reduce
  9.  ×       Expr − Term             — none —                     extend
 10.  Id      Expr − Term ×           — none —                     extend
 11.  eof     Expr − Term × Id        ⟨Factor → Id, 5⟩             reduce
 12.  eof     Expr − Term × Factor    ⟨Term → Term × Factor, 5⟩    reduce
 13.  eof     Expr − Term             ⟨Expr → Expr − Term, 3⟩      reduce
 14.  eof     Expr                    ⟨Goal → Expr, 1⟩             reduce
 15.  eof     Goal                    — none —                     accept

Figure 3.5: States of the bottom-up parser on x − 2 × y

When the frontier does not contain a handle, the parser extends the frontier by adding the next word to its right end. To see how this works, consider parsing the string x − 2 × y using the classic expression grammar. At each step, the parser either finds a handle on the frontier, or it adds to the frontier. The state of the parser, at each step, is summarized in Figure 3.5, while Figure 3.6 shows the corresponding partial parse tree for each step in the process; the trees are drawn with their frontier elements justified along the top of each drawing.

As the example shows, the parser only needs to examine the upper frontier of the partially constructed parse tree. Using this fact, we can build a particularly clean form of bottom-up parser, called a shift-reduce parser. These parsers use a stack to hold the frontier; this simplifies the algorithm in two ways. First, the stack trivializes the problem of managing space for the frontier. To extend the frontier, the parser simply shifts the current input symbol onto the stack. Second, it ensures that all handles occur with their right end at the top of the stack; this eliminates any overhead from tracking handle positions.

Figure 3.7 shows a simple shift-reduce parser. To begin, it shifts an invalid symbol onto the stack and gets the first input symbol from the scanner. Then, it follows the handle-pruning discipline: it shifts symbols from the input onto the stack until it discovers a handle, and it reduces handles as soon as they are found. It halts when it has reduced the stack to Goal on top of invalid and consumed all the input.
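The mechanism of Figure 3.7 can be sketched as follows. This Python version is illustrative only: in place of the efficient handle-finding machinery developed in the rest of this section, it uses a small hand-written oracle that tries each right-hand side against the top of the stack, refusing to reduce a Term up to Expr while the lookahead (* or / here, standing in for × and ÷) could still extend the handle.

```python
# The classic expression grammar, with * and / standing in for x and ÷.
RULES = [
    ("Goal", ("Expr",)),
    ("Expr", ("Expr", "-", "Term")),
    ("Expr", ("Expr", "+", "Term")),
    ("Expr", ("Term",)),
    ("Term", ("Term", "*", "Factor")),
    ("Term", ("Term", "/", "Factor")),
    ("Term", ("Factor",)),
    ("Factor", ("Id",)),
    ("Factor", ("Num",)),
]

def find_handle(stack, lookahead):
    """Hand-written handle oracle (an LR(1) table automates exactly this
    decision): match each rhs against the top of the stack, but do not
    reduce up to Expr or Goal while the lookahead could extend the handle."""
    for lhs, rhs in RULES:
        if tuple(stack[-len(rhs):]) == rhs:
            if lhs in ("Expr", "Goal") and lookahead in ("*", "/"):
                continue          # a Term * Factor handle is still coming
            if lhs == "Goal" and lookahead != "<eof>":
                continue          # accept only at end of input
            return lhs, rhs
    return None

def shift_reduce(words):
    """Skeleton of the shift-reduce parser: push an invalid symbol, then
    shift words until a handle appears, and reduce handles as found."""
    stack = ["<invalid>"]
    words = words + ["<eof>"]
    i = 0
    while True:
        handle = find_handle(stack, words[i])
        if handle is not None:
            lhs, rhs = handle
            del stack[len(stack) - len(rhs):]   # pop the handle ...
            stack.append(lhs)                   # ... push its left-hand side
        elif words[i] != "<eof>":
            stack.append(words[i])              # shift the next word
            i += 1
        else:
            break                               # no handle, no input left
    return stack == ["<invalid>", "Goal"]
```

On the token stream Id − Num * Id (the words of x − 2 × y), this sketch performs the same sequence of extends and reduces as the trace in Figure 3.5, ending with Goal atop the invalid symbol.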

[The drawings of Figure 3.6 show the partial parse tree after each step, 2 through 15, of the parse traced in Figure 3.5; the tree drawings themselves do not survive in this text-only copy.]

Figure 3.6: Bottom-up parse of x − 2 × y
