Cooper K., Engineering a Compiler
3.3. TOP-DOWN PARSING 69
arrange the non-terminals in some order A1, A2, . . . , An
for i ← 1 to n
    for j ← 1 to i−1
        replace each production of the form Ai → Aj γ with the productions
            Ai → δ1 γ | δ2 γ | . . . | δk γ,
            where Aj → δ1 | δ2 | . . . | δk are all the current Aj productions
    eliminate any immediate left recursion on Ai using the direct transformation

Figure 3.2: Removal of left recursion
The transformation shown above eliminates immediate left recursion. Left recursion can occur indirectly, when a chain of rules such as α→β, β→γ, and γ→αδ combines to create the situation that α ⇒+ αδ. This indirect left recursion is not always obvious; it can be hidden through an arbitrarily long chain of productions.
To convert these indirect left recursions into right recursion, we need a more systematic approach than inspection followed by application of our transformation. Figure 3.2 shows an algorithm that achieves this goal. It assumes that the grammar has no cycles (A ⇒+ A) and no ε-productions (A → ε).
The algorithm works by imposing an arbitrary order on the non-terminals. The outer loop cycles through the non-terminals in this order, while the inner loop ensures that a production expanding Ai has no non-terminal Aj with j < i. When it encounters such a non-terminal, it forward substitutes the non-terminal away. This eventually converts each indirect left recursion into a direct left recursion. The final step in the outer loop converts any direct recursion on Ai to right recursion using the simple transformation shown earlier. Because new non-terminals are added at the end of the order and only involve right recursion, the loop can ignore them—they do not need to be checked and converted.
Considering the loop invariant for the outer loop may make this more clear. At the start of the ith outer loop iteration,

∀ k < i, no production expanding Ak has Al in its rhs, for l < k.

At the end of this process (i = n), all indirect left recursion has been eliminated through the repeated application of the inner loop, and all immediate left recursion has been eliminated by the final step of each iteration.
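The algorithm of Figure 3.2 is short enough to sketch directly. The Python below is our own rendition, not the book's code: the names (`eliminate_left_recursion`, `remove_immediate_left_recursion`) and the dict-of-lists grammar representation are assumptions, and the empty list plays the role of ε.

```python
# A sketch of the algorithm in Figure 3.2 (our own names and representation,
# not the book's): a grammar is a dict mapping each non-terminal to a list of
# right-hand sides, each a list of symbols, with [] standing for epsilon.

def remove_immediate_left_recursion(grammar, a):
    """The direct transformation: A -> A alpha | beta becomes
    A -> beta A' and A' -> alpha A' | epsilon."""
    recursive = [rhs[1:] for rhs in grammar[a] if rhs and rhs[0] == a]
    if not recursive:
        return
    base = [rhs for rhs in grammar[a] if not (rhs and rhs[0] == a)]
    a_new = a + "'"          # fresh non-terminal, logically added at the end of the order
    grammar[a] = [rhs + [a_new] for rhs in base]
    grammar[a_new] = [rhs + [a_new] for rhs in recursive] + [[]]

def eliminate_left_recursion(grammar, order):
    for i, a_i in enumerate(order):
        # Inner loop: forward-substitute any leading A_j with j < i.
        for a_j in order[:i]:
            new_rhss = []
            for rhs in grammar[a_i]:
                if rhs and rhs[0] == a_j:
                    # A_i -> A_j gamma becomes A_i -> delta_1 gamma | ... | delta_k gamma
                    new_rhss.extend(delta + rhs[1:] for delta in grammar[a_j])
                else:
                    new_rhss.append(rhs)
            grammar[a_i] = new_rhss
        # Final step of the outer loop: the direct transformation on A_i.
        remove_immediate_left_recursion(grammar, a_i)
    return grammar
```

On the indirectly left-recursive grammar S → A a | b, A → S d | c, with the order (S, A), the inner loop rewrites A's rules to A → A a d | b d | c, and the direct transformation then yields A → b d A′ | c A′ and A′ → a d A′ | ε.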
3.3.3 Predictive Parsing
When we parsed x − 2 × y with the right-recursive expression grammar, we did not need to backtrack. In fact, we can devise a parser for the right-recursive
expression grammar that never needs to backtrack. To see this, consider how the parser reaches a decision that it must later retract through backtracking.
The critical choice occurs when the parser selects a production with which to expand the lower fringe of the partially constructed parse tree. When it tries to expand some non-terminal α, it must pick a rule α→β. The algorithm, as shown in Figure 3.1, picks that rule arbitrarily. If, however, the parser could always pick the appropriate rule, it could avoid backtracking.
In the right-recursive variant of the expression grammar, the parser can make the correct choice by comparing the next word in the input stream against the right-hand sides. Look at the situation that arose in the derivation of x − 2 × y in the previous section. When the parser was in the state

Rule or Action | Sentential Form | Input
8 | Id Expr′ | x ↑ − 2 × y

it needed to choose an expansion for Expr′. The possible right-hand sides were: + Term Expr′, − Term Expr′, and ε. Since the next word in the input stream was −, the second choice is the only one that can succeed. The first choice generates a leading +, so it can never match and will lead directly to backtracking. Choosing ε can only match the end of string, since Expr′ can only occur at the right end of a sentential form.
Fortuitously, this grammar has a form where the parser can predict the correct expansion by comparing the possible right-hand sides against the next word in the input stream. We say that such a grammar is predictive; parsers built on this property are called predictive parsers.
Before going further, we should define the property that makes a grammar predictively parsable. For any two productions A → α | β, the set of initial terminal symbols derivable from α must be distinct from those derivable from β. If we define First(α) as the set of tokens that can appear as the first symbol in some string derived from α, then we want

First(α) ∩ First(β) = ∅

For an entire grammar G, the desired property is that, for every set of rules A → α1 | α2 | · · · | αn in G, the First sets of the alternatives are pairwise disjoint:

First(αi) ∩ First(αj) = ∅, for all i ≠ j
If this property holds for each non-terminal in the grammar, then the grammar can be parsed predictively. Unfortunately, not all grammars are predictive.1 Consider, for example, a slightly more realistic version of our ongoing example. A natural extension to our right-recursive expression grammar would replace the production Factor→Identifier with a set of productions that describe the syntax for scalar variable references, array variable references, and function calls.
1This condition is also called the LL(1) condition. Any grammar that meets this condition can be used to construct a table-driven, LL(1) parser.
∀ A ∈ NT,
    find the longest prefix α common to two or more right-hand sides
    if α ≠ ε then
        replace the rules expanding A,
            A → αβ1 | αβ2 | · · · | αβn | γ
        with
            A → α fie | γ
            fie → β1 | β2 | · · · | βn
        and add fie to NT
Repeat until no common prefixes remain.
Figure 3.3: Left Factoring a Grammar
11. Factor → Identifier
12.        | Identifier [ Exprlist ]
13.        | Identifier ( Exprlist )
20. Exprlist → Expr , Exprlist
21.        | Expr
Here, we show the C syntactic convention of using parentheses for function calls and square brackets for array references. This grammar fragment is not predictive, because the initial terminal symbol generated in rules 11, 12, and 13 is the same. Thus, the parser, in trying to expand a Factor on the parse tree’s lower fringe, cannot decide between 11, 12, and 13 on the basis of a single word lookahead. Of course, looking ahead two tokens would allow it to predict the correct expansion.
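The First sets make this failure concrete. Below is a sketch in Python (our own names and representation, not the book's) that computes First sets and tests the pairwise-disjointness condition defined above; on a simplified rendering of rules 11–13 it reports the conflict, and on a left-factored version it reports none. When a right-hand side can derive ε, the full LL(1) test also consults Follow sets, which the text has not yet introduced; this sketch checks only First-set disjointness.

```python
# A sketch, not the book's code: compute First sets for a grammar stored as a
# dict mapping non-terminals to lists of right-hand sides (lists of symbols),
# and test the pairwise-disjointness condition. "" stands for epsilon.

def first_sets(grammar, terminals):
    first = {t: {t} for t in terminals}
    for nt in grammar:
        first[nt] = set()
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[nt])
                for sym in rhs:
                    first[nt] |= first[sym] - {""}
                    if "" not in first[sym]:
                        break
                else:                    # every symbol could derive epsilon
                    first[nt].add("")
                if len(first[nt]) != before:
                    changed = True
    return first

def first_of(rhs, first):
    """First set of an entire right-hand side."""
    out = set()
    for sym in rhs:
        out |= first[sym] - {""}
        if "" not in first[sym]:
            return out
    out.add("")
    return out

def is_predictive(grammar, first):
    for rhss in grammar.values():
        sets = [first_of(rhs, first) for rhs in rhss]
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:    # overlapping alternatives
                    return False
    return True
```
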
We can rewrite rules 11, 12, and 13 to make them predictive.
11. Factor → Identifier arguments
12. arguments → [ Exprlist ]
13.        | ( Exprlist )
14.        | ε
In this case, we were able to transform the grammar into a predictive grammar. In essence, we introduced a new non-terminal to represent the common prefix of the three rules that were not predictive. We can apply this transformation, systematically and mechanically, to an arbitrary grammar. Figure 3.3 shows a simple algorithm that does this. However, left factoring the grammar will not always produce a predictive grammar.
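The algorithm of Figure 3.3 can be sketched as follows (a sketch, not the book's code: the names `left_factor` and `longest_common_prefix` and the tuple-based representation are ours; the empty tuple stands for ε, and each fresh non-terminal plays the role of the book's fie).

```python
# A sketch of the left-factoring algorithm of Figure 3.3. Right-hand sides
# are tuples of symbols; the empty tuple () stands for epsilon.

def longest_common_prefix(rhss):
    """Longest prefix shared by two or more right-hand sides (maybe empty)."""
    best = ()
    for i, r in enumerate(rhss):
        for s in rhss[i + 1:]:
            k = 0
            while k < min(len(r), len(s)) and r[k] == s[k]:
                k += 1
            if k > len(best):
                best = r[:k]
    return best

def left_factor(grammar):
    counter = 0
    changed = True
    while changed:                       # repeat until no common prefixes remain
        changed = False
        for a in list(grammar):
            rhss = grammar[a]
            prefix = longest_common_prefix(rhss)
            if prefix:
                counter += 1
                fie = f"{a}_f{counter}"  # fresh non-terminal
                shared = [r[len(prefix):] for r in rhss if r[:len(prefix)] == prefix]
                rest = [r for r in rhss if r[:len(prefix)] != prefix]
                grammar[a] = [prefix + (fie,)] + rest
                grammar[fie] = shared    # may contain (), i.e. epsilon
                changed = True
    return grammar
```

Applied to rules 11–13 above, one pass factors out the common prefix Identifier and introduces a new non-terminal whose alternatives are ε, [ Exprlist ], and ( Exprlist ), mirroring the rewritten rules 11–14.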
Not all languages can be expressed in a predictive grammar. Using left-recursion elimination and left factoring, we may be able to transform a grammar to the point where it can be predictively parsed. In general, however, it is undecidable whether or not a predictive grammar exists for an arbitrary context-free language. For example, the language
{aⁿ 0 bⁿ | n ≥ 1} ∪ {aⁿ 1 b²ⁿ | n ≥ 1}
has no predictive grammar.
3.3.4 Top-Down Recursive Descent Parsers
Given a predictive grammar G, we can construct a hand-coded parser for G that operates by recursive descent. A recursive descent parser is structured as a set of mutually recursive routines, one for each non-terminal in the grammar. The routine for non-terminal A recognizes an instance of A in the input stream. To accomplish this, the routine invokes other routines to recognize the various non-terminals on A’s right-hand side.
Consider a set of productions A → β1 | β2 | β3. Since G is predictive, the parser can select the appropriate right hand side (one of β1, β2, or β3) using only the input token and the First sets. Thus, the code in the routine for A should have the form:
/* find an A */
if (current token ∈ First(β1))
    then find a β1 and return true
else if (current token ∈ First(β2))
    then find a β2 and return true
else if (current token ∈ First(β3))
    then find a β3 and return true
else {
    report an error based on A and current token
    return false
}
For each right-hand side A → βi, the routine needs to recognize each term in βi. The code must check for each term, in order.
• For a terminal symbol, the code compares the input symbol against the specified terminal. If they match, it advances the input token and checks the next symbol on the right-hand side. If a mismatch occurs, the parser should report the syntax error in a suitably informative message.
• For a non-terminal symbol, the code invokes the routine that recognizes that non-terminal. That routine either returns true as an indication of success, or it reports an error to the user and returns false. Success allows the parser to continue, recognizing the next symbol on the right-hand side, if any more exist. A return value of false forces the routine to return false to its calling context.
For a right-hand side β1 = γδρ, with γ, ρ ∈ NT and δ ∈ T, the code needs to recognize a γ, a δ, and a ρ (abstracted away as “find a β1 and return true” in the previous code fragment). This code might look like:
if (current token ∈ First(β1)) then {
    if (Parse γ() = false)
        then return false
    else if (current token ≠ δ) then {
        report an error finding δ in A → γδρ
        return false
    }
    current token ← next token()
    if (Parse ρ() = false)
        then return false
        else return true
}
The routine Parse A will contain a code fragment like this for each alternative right-hand side for A.
To construct a complete recursive descent parser, then, the strategy is clear. For each non-terminal, we construct a routine that recognizes that non-terminal. Each routine relies on the other routines to recognize non-terminals, and directly tests the terminal symbols that arise in its own right hand sides. Figure 3.4 shows a top-down recursive descent parser for the predictive grammar that we derived in Section 3.3.2. Notice that it repeats code for similar right hand sides. For example, the code in ExprP() under the tests for ‘+’ and for ‘-’ could be combined to produce a smaller parser.
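The structure that Figure 3.4 describes translates directly into a modern language. The sketch below is our own compact Python rendition, not the book's code: it collapses the result-flag style into boolean short-circuiting, uses ASCII * and / in place of × and ÷, and assumes a trivial token interface (a list of strings, with the sentinel string "eof" past the end).

```python
# A compact Python rendition of the recursive descent scheme of Figure 3.4
# (a sketch, not the book's code). Tokens are plain strings.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "eof"

    def advance(self):
        self.pos += 1

    def expr(self):              # Expr -> Term Expr'
        return self.term() and self.eprime()

    def eprime(self):            # Expr' -> + Term Expr' | - Term Expr' | epsilon
        if self.token in ("+", "-"):
            self.advance()
            return self.term() and self.eprime()
        return True              # the epsilon alternative always succeeds

    def term(self):              # Term -> Factor Term'
        return self.factor() and self.tprime()

    def tprime(self):            # Term' -> * Factor Term' | / Factor Term' | epsilon
        if self.token in ("*", "/"):
            self.advance()
            return self.factor() and self.tprime()
        return True

    def factor(self):            # Factor -> ( Expr ) | Number | Identifier
        if self.token == "(":
            self.advance()
            if not self.expr():
                return False
            if self.token != ")":
                return False     # syntax error: unmatched parenthesis
            self.advance()
            return True
        if self.token != "eof" and (self.token.isdigit() or self.token.isidentifier()):
            self.advance()       # a Number or Identifier token
            return True
        return False             # syntax error

def parse(tokens):
    p = Parser(tokens)
    return p.expr() and p.token == "eof"
```

As in Figure 3.4, each routine mirrors one non-terminal of the predictive grammar, and each alternative is selected by a single token of lookahead.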
Automating the Process Top-down recursive-descent parsing is usually considered a technique for hand coding a parser. Of course, we could build a parser generator that automatically emitted top-down recursive descent parsers for suitable grammars. The parser generator would first construct the necessary First sets for each grammar symbol, check each non-terminal to ensure that the First sets of its alternative right-hand sides are disjoint, and emit a suitable parsing routine for each non-terminal symbol in the grammar. The resulting parser would have the advantages of top-down recursive-descent parsers, such as speed, code-space locality, and good error detection. It would also have the advantages of a grammar-generated system, such as a concise, high-level specification and a reduced implementation effort.
3.4 Bottom-up Parsing
To parse a stream of words bottom-up, the parser begins by creating leaves in the parse tree for each word. This creates the base of the parse tree. To construct a derivation, it adds layers of non-terminals on top of the leaves, in a structure dictated by both the grammar and the input stream. The upper edge of this partially-constructed parse tree is called its upper frontier. Each layer extends the frontier upward, toward the tree’s root.
To do this, the parser repeatedly looks for a part of that upper frontier that matches the right-hand side of a production in the grammar. When it finds a
Main()
    token ← next token()
    if (Expr() = false)
        then return false
        else next compilation step

Expr()
    result ← true
    if (Term() = false)
        then result ← false
    else if (EPrime() = false)
        then result ← false
    return result

EPrime()
    result ← true
    if (token = ’+’) then
        token ← next token()
        if (Term() = false)
            then result ← false
        else if (EPrime() = false)
            then result ← false
    else if (token = ’-’) then
        token ← next token()
        if (Term() = false)
            then result ← false
        else if (EPrime() = false)
            then result ← false
    else result ← true    /* ε */
    return result

Term()
    result ← true
    if (Factor() = false)
        then result ← false
    else if (TPrime() = false)
        then result ← false
    return result

TPrime()
    result ← true
    if (token = ×) then
        token ← next token()
        if (Factor() = false)
            then result ← false
        else if (TPrime() = false)
            then result ← false
    else if (token = ÷) then
        token ← next token()
        if (Factor() = false)
            then result ← false
        else if (TPrime() = false)
            then result ← false
    else result ← true    /* ε */
    return result

Factor()
    result ← true
    if (token = ’(’) then
        token ← next token()
        if (Expr() = false)
            then result ← false
        else if (token ≠ ’)’) then
            report syntax error
            result ← false
        else token ← next token()
    else if (token = Number)
        then token ← next token()
    else if (token = Identifier)
        then token ← next token()
    else
        report syntax error
        result ← false
    return result

Figure 3.4: Recursive descent parser for expressions
Digression: Predictive parsers versus DFAs
Predictive parsing is the natural extension of dfa-style reasoning to parsers. A dfa makes its transition based on the next input character. A predictive parser requires that the expansions be uniquely determined by the next word in the input stream. Thus, at each non-terminal in the grammar, there must be a unique mapping from the first token in any acceptable input string to a specific production that leads to a derivation for that string. The real difference in power between a dfa and a predictively-parsable, or ll(1), grammar derives from the fact that one prediction may lead to a right-hand side with many symbols, whereas, in a regular grammar, it predicts only a single symbol. This lets predictive grammars include productions like
p → ( p )
which is beyond the power of a regular expression to describe. (Recall that a regular expression can recognize (+ Σ∗ )+, but this does not specify that the opening and closing parentheses must match.)
Of course, a hand-constructed top-down, recursive-descent parser can use arbitrary tricks to disambiguate production choices. For example, if a particular left-hand side cannot be predicted with a single symbol lookahead, the parser could use two symbols. Done judiciously, this should not cause problems.
match, it builds a node to represent the non-terminal symbol on the production’s left-hand side, and adds edges to the nodes representing elements of the right-hand side. This extends the upper frontier. This process terminates under one of two conditions.
1. The parser reduces the upper frontier to a single node that represents the grammar’s start symbol. If all the input has been consumed, the input stream is a valid sentence in the language.
2. No match can be found. Since the parser has been unable to build a derivation for the input stream, the input is not a valid sentence. The parser should issue an appropriate error message.
A successful parse runs through every step of the derivation. A failed parse halts when it can find no further steps, at which point it can use the context accumulated in the tree to produce a meaningful error message.
Derivations begin with the goal symbol and work towards a sentence. Because a bottom-up parser proceeds bottom-up in the parse tree, it discovers derivation steps in reverse order. Consider a production α→β, where β ∈ T. A bottom-up parser will “discover” the derivation step α→β before it discovers the step that derives α. Thus, if a derivation consists of a series of steps that produces a series of sentential forms
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ · · · ⇒ γn−1 ⇒ γn = sentence,
76 |
CHAPTER 3. PARSING |
the bottom-up parser will discover γn−1 ⇒ γn before it discovers γn−2 ⇒ γn−1. (It must add the nodes implied by γn−1 ⇒ γn to the frontier before it can discover any matches that involve those nodes. Thus, it cannot discover the nodes in an order inconsistent with the reverse derivation.) At each point, the parser operates on the frontier of the partially constructed parse tree; the current frontier is a prefix of the corresponding sentential form in the derivation. Because the sentential form occurs in a rightmost derivation, the missing suffix consists entirely of terminal symbols.
Because the scanner considers the words in the input stream in a left-to-right order, the parser should look at the leaves from left to right. This suggests a derivation order that produces terminals from right to left, so that its reverse order matches the scanner’s behavior. This leads, rather naturally, to bottom-up parsers that construct, in reverse, a rightmost derivation.
In this section, we consider a specific class of bottom-up parsers called lr(1) parsers. Lr(1) parsers scan the input from left to right, the order in which scanners return classified words. Lr(1) parsers build a rightmost derivation, in reverse. Lr(1) parsers make decisions, at each step in the parse, based on the history of the parse so far and, at most, a lookahead of one symbol. The name lr(1) derives from these three properties: left-to-right scan, reverse-rightmost derivation, and 1 symbol of lookahead.2 Informally, we will say that a language has the lr(1) property if it can be parsed in a single left-to-right scan, to build a reverse rightmost derivation, using only one symbol of lookahead to determine parsing actions.
3.4.1 Using handles
The key to bottom-up parsing lies in using an efficient mechanism to discover matches along the tree’s current upper frontier. Formally, the parser must find some substring βγδ of the upper frontier where
1. βγδ is the right-hand side of some production α → βγδ, and
2. α → βγδ is one step in the rightmost derivation of the input stream.
It must accomplish this while looking no more than one word beyond the right end of βγδ.
We can represent each potential match as a pair ⟨α→βγδ, k⟩, where α→βγδ is a production in G and k is the position on the tree’s current frontier of the right end of δ. If replacing the occurrence of βγδ that ends at k with α is the next step in the reverse rightmost derivation of the input string, then ⟨α→βγδ, k⟩ is a handle of the bottom-up parse. A handle concisely specifies the next step in building the reverse rightmost derivation.
A bottom-up parser operates by repeatedly locating a handle on the frontier of the current partial parse tree, and performing the reduction that it specifies.
2The theory of lr parsing defines a family of parsing techniques, the lr(k) parsers, for arbitrary k ≥ 0. Here, k denotes the amount of lookahead that the parser needs for decision making. Lr(1) parsers accept the same set of languages as lr(k) parsers.
     | Token | Frontier              | Handle                   | Action
 1.  | Id    |                       | — none —                 | extend
 2.  | −     | Id                    | ⟨Factor→Id, 1⟩           | reduce
 3.  | −     | Factor                | ⟨Term→Factor, 1⟩         | reduce
 4.  | −     | Term                  | ⟨Expr→Term, 1⟩           | reduce
 5.  | −     | Expr                  | — none —                 | extend
 6.  | Num   | Expr −                | — none —                 | extend
 7.  | ×     | Expr − Num            | ⟨Factor→Num, 3⟩          | reduce
 8.  | ×     | Expr − Factor         | ⟨Term→Factor, 3⟩         | reduce
 9.  | ×     | Expr − Term           | — none —                 | extend
10.  | Id    | Expr − Term ×         | — none —                 | extend
11.  | eof   | Expr − Term × Id      | ⟨Factor→Id, 5⟩           | reduce
12.  | eof   | Expr − Term × Factor  | ⟨Term→Term × Factor, 5⟩  | reduce
13.  | eof   | Expr − Term           | ⟨Expr→Expr − Term, 3⟩    | reduce
14.  | eof   | Expr                  | ⟨Goal→Expr, 1⟩           | reduce
15.  | eof   | Goal                  | — none —                 | accept

Figure 3.5: States of the bottom-up parser on x − 2 × y
When the frontier does not contain a handle, the parser extends the frontier by adding a single terminal to the right end of the frontier. To see how this works, consider parsing the string x − 2 × y using the classic expression grammar. At each step, the parser either finds a handle on the frontier, or it adds to the frontier. The state of the parser, at each step, is summarized in Figure 3.5, while Figure 3.6 shows the corresponding partial parse tree for each step in the process; the trees are drawn with their frontier elements justified along the top of each drawing.
As the example shows, the parser only needs to examine the upper frontier of the partially constructed parse tree. Using this fact, we can build a particularly clean form of bottom-up parser, called a shift-reduce parser. These parsers use a stack to hold the frontier; this simplifies the algorithm in two ways. First, the stack trivializes the problem of managing space for the frontier. To extend the frontier, the parser simply shifts the current input symbol onto the stack. Second, it ensures that all handles occur with their right end at the top of the stack; this eliminates any overhead from tracking handle positions.
Figure 3.7 shows a simple shift-reduce parser. To begin, it shifts an invalid symbol onto the stack and gets the first input symbol from the scanner. Then, it follows the handle-pruning discipline: it shifts symbols from the input onto the stack until it discovers a handle, and it reduces handles as soon as they are found. It halts when it has reduced the stack to Goal on top of invalid and consumed all the input.
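The shift-reduce discipline can be sketched in Python as follows. This is not Figure 3.7 itself: the handle finder below is hard-coded for the classic expression grammar using one token of lookahead, ASCII * and / stand in for × and ÷, and "$" plays the role of the book's invalid marker; a real lr(1) parser derives the same decisions from tables.

```python
# A sketch of a shift-reduce parser for the classic expression grammar.
# Tokens are category names: "Id", "Num", "+", "-", "*", "/".

def shift_reduce(tokens):
    """Return the reverse rightmost derivation (a list of productions, each a
    (lhs, rhs) pair) for the token stream, or None if the parse fails."""
    stack = ["$"]                               # bottom-of-stack marker
    derivation = []
    i = 0
    while True:
        look = tokens[i] if i < len(tokens) else "eof"
        handle = find_handle(stack, look)
        if handle is not None:
            lhs, rhs = handle
            del stack[len(stack) - len(rhs):]   # pop the handle...
            stack.append(lhs)                   # ...and push its left-hand side
            derivation.append(handle)
        elif look != "eof":
            stack.append(look)                  # no handle: extend the frontier
            i += 1
        else:
            break
    return derivation if stack == ["$", "Goal"] else None

def find_handle(stack, look):
    """Ad hoc handle finder with one token of lookahead, specific to the
    classic expression grammar; tables replace this in a real LR(1) parser."""
    top3, top1 = tuple(stack[-3:]), stack[-1]
    if top3 in (("Term", "*", "Factor"), ("Term", "/", "Factor")):
        return ("Term", top3)
    if top3 in (("Expr", "+", "Term"), ("Expr", "-", "Term")) and look not in ("*", "/"):
        return ("Expr", top3)
    if top1 in ("Id", "Num"):
        return ("Factor", (top1,))
    if top1 == "Factor":
        return ("Term", ("Factor",))
    if top1 == "Term" and look not in ("*", "/"):
        return ("Expr", ("Term",))
    if stack == ["$", "Expr"] and look == "eof":
        return ("Goal", ("Expr",))
    return None
```

On the token stream for x − 2 × y, this sketch performs exactly the nine reductions of Figure 3.5, in the same order, and accepts with Goal above the bottom marker.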
[Figure 3.6 shows the partial parse tree for each of steps 2–15 of Figure 3.5, drawn with its frontier elements justified along the top.]

Figure 3.6: Bottom-up parse of x − 2 × y