Topic 8 - Bottom-Up Translation

[4.5] - Bottom-Up Parsing

Consider this grammar:

S --> a T U e
T --> T b c | b
U --> d

and the rightmost derivation of the sentence: a b b c d e:

S ==> a T U e
==> a T d e
==> a T b c d e
==> a b b c d e

As mentioned in section 4.2, a bottom-up parser is an LR parser so it reads the input from left-to-right and performs a rightmost derivation in reverse order. There are four steps in the rightmost derivation of a b b c d e so a bottom-up parser performs the steps in reverse order:

  1. The parser examines the sentence ( a b b c d e ) for substrings that match the right-sides of productions in the grammar. There are three cases: the first (b) in the sentence; the second (b) in the sentence; or the (d). The parser chooses the first b and reduces it to the left-side of the T --> b production to produce the sentential form: a T b c d e .

  2. The parser examines the sentential form ( a T b c d e ) for substrings that match the right-sides of productions in the grammar. There are three cases: ( T b c ), (b), and (d). The parser chooses ( T b c ) and reduces it to the left-side of the production: T --> T b c to produce the sentential form: a T d e.

  3. The parser examines the sentential form ( a T d e ) for substrings that match the right-sides of productions in the grammar and finds only one case: (d). The parser reduces it to the left-side of the production: U --> d to produce the sentential form: a T U e.

  4. The parser examines the sentential form ( a T U e ) for substrings that match the right-sides of productions in the grammar and finds that the only case is the whole string: ( a T U e ). The parser reduces it to the left-side of the production: S --> a T U e to produce a sentential form containing only the start symbol, S.
Note that each step applies a production in reverse, replacing the right-side with the left-side, so we use the word reduce instead of produce.

Handles: The substring of the sentential form that the parser chooses to reduce in each step of the parse is called the handle for that step. In the previous example the handles are:

  1. the first (b) in ( a b b c d e ).
  2. the ( T b c ) substring in ( a T b c d e ).
  3. the (d) in ( a T d e ).
  4. the whole string, ( a T U e ).
In steps 1 and 2 of the example the parser has three possible handles to choose from; if it chooses the wrong one it won't be able to complete the reverse-ordered rightmost derivation. The main task of a bottom-up parser is to choose the correct handle at each step of the parse. There can be many choices at any step; e.g., the empty string can be inserted into a string of n symbols at any of n + 1 different locations, so just a single ε-production in a grammar gives many possible handles to choose from.
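
The four reductions of the example can be replayed mechanically. The following Python sketch is only an illustration: the position of each handle is supplied by hand (finding it is the real work of a parser), and the names STEPS, reduce_handle, and replay are invented for this example.

```python
# Each step gives the start index of the handle in the sentential form
# and the production (left side, right side) used to reduce it.
STEPS = [
    (1, "T", ["b"]),                 # first b in   a b b c d e
    (1, "T", ["T", "b", "c"]),       # T b c in     a T b c d e
    (2, "U", ["d"]),                 # d in         a T d e
    (0, "S", ["a", "T", "U", "e"]),  # whole string a T U e
]

def reduce_handle(form, start, left, right):
    """Replace the handle form[start:start+len(right)] by the left side."""
    end = start + len(right)
    if form[start:end] != right:
        raise ValueError("handle does not match the production's right side")
    return form[:start] + [left] + form[end:]

def replay(sentence):
    """Return every sentential form produced by the reductions in STEPS."""
    form = sentence.split()
    trace = [" ".join(form)]
    for start, left, right in STEPS:
        form = reduce_handle(form, start, left, right)
        trace.append(" ".join(form))
    return trace

trace = replay("a b b c d e")
for line in trace:
    print(line)
```

Read from bottom to top, the printed forms are the rightmost derivation shown at the start of this section.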

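Shift-Reduce Parsing: Most bottom-up parsers are implemented as shift-reduce parsers. Such a parser uses a stack to hold grammar symbols (it is convenient to think of a horizontal stack with its bottom on the left and its top on the right) and has four possible actions:

  1. shift: the next input symbol is shifted onto the top of the stack;
  2. reduce: the handle on the top of the stack is replaced by the left-side of the appropriate production;
  3. accept: the parser announces successful completion of parsing; and
  4. error: the parser discovers a syntax error.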

We use $ to mark the left-end (bottom) of the stack and also the end of the input string. Initially the stack is empty. Parsing ends successfully when the input is empty and the stack contains only the start symbol. As an example we use the following grammar:

E --> E + E
E --> E * E
E --> ( E )
E --> id

Figure 4.22 shows the actions of a shift-reduce parser to parse the input string id1 + id2 * id3 according to the grammar. Here we parse id1 * ( id2 + id3 ):

Stack              Input                    Action
$                  id1 * ( id2 + id3 ) $    shift
$ id1              * ( id2 + id3 ) $        reduce by E --> id
$ E                * ( id2 + id3 ) $        shift
$ E *              ( id2 + id3 ) $          shift
$ E * (            id2 + id3 ) $            shift
$ E * ( id2        + id3 ) $                reduce by E --> id
$ E * ( E          + id3 ) $                shift
$ E * ( E +        id3 ) $                  shift
$ E * ( E + id3    ) $                      reduce by E --> id
$ E * ( E + E      ) $                      reduce by E --> E + E
$ E * ( E          ) $                      shift
$ E * ( E )        $                        reduce by E --> ( E )
$ E * E            $                        reduce by E --> E * E
$ E                $                        accept

Shift-reduce parsers can be constructed for a large class of grammars, the LR grammars, but the construction is usually so complicated that it is left to parser-construction programs (see section 4.7). However, the next section shows that there is a small but important class of grammars for which shift-reduce parsers can easily be constructed by hand.

[4.6] - Operator-Precedence Parsing

If no production of a grammar has two or more adjacent nonterminals on its right-side and no production is an ε-production then one can easily hand-construct a shift-reduce parser for the grammar. Such a parser is called an operator-precedence parser.

The syntax of arithmetic expressions can usually be described with such a grammar:

E --> E + E
E --> E * E
E --> ( E )
E --> - E
E --> id

This grammar has 6 terminals:

+ - * ( ) id

and to eliminate ambiguity we must establish precedence relations between certain pairs of terminals. On this Web page precedence relations are denoted as follows:

Relation   Meaning
a << b     terminal a yields precedence to terminal b
a == b     terminal a has the same precedence as terminal b
a >> b     terminal a takes precedence over terminal b

Note that precedence relations are only established between the terminals of the grammar (and with the $ markers at both ends of a string); nonterminals are ignored. The customary precedence relations for the terminals of the foregoing grammar are shown in the following table (note that - is the unary minus operator; this grammar doesn't have a binary subtract operator):

Operator-precedence Relations

        id    -     (     *     +     )     $
id      err   err   err   >>    >>    >>    >>
)       err   err   err   >>    >>    >>    >>
-       <<    <<    <<    >>    >>    >>    >>
*       <<    <<    <<    >>    >>    >>    >>
+       <<    <<    <<    <<    >>    >>    >>
(       <<    <<    <<    <<    <<    ==    err
$       <<    <<    <<    <<    <<    err   acc

Note that err entries in this table mark syntax errors and the acc entry marks the accept state when successful completion of parsing can be announced.

Since no production of the grammar has two or more adjacent nonterminals on its right-side and there are no ε-productions, there must always be one or more terminals between any pair of nonterminals in any sentential form. The precedence relation between two terminals holds whether or not there is a nonterminal between them. For example, in the sentential form: $ E * ( E + E ) $ there are the following precedence relations:

$ << * << ( << + >> ) >> $

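The shift and reduce actions of the operator-precedence parser are governed by the precedence relation between the right-most terminal on the stack, a, and the current input symbol, b:

  1. If a << b or a == b, the parser shifts b onto the stack.
  2. If a >> b, a handle is on the top of the stack: the parser pops the stack until the topmost terminal on the stack is << the terminal most recently popped. The popped symbols, together with any adjacent nonterminals, form the handle to be reduced.
  3. If the entry for a and b is acc, the parser announces successful completion of parsing.
  4. If the entry for a and b is err, the parser reports a syntax error.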

To illustrate the actions of an operator-precedence parser one can repeat the examples of shift-reduce parsing given in section 4.5 using the table of Operator-precedence Relations in this section to determine the parser actions.
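
As a rough illustration, the following Python sketch drives a shift-reduce parse from the table of Operator-precedence Relations in this section. It assumes the usual rule (shift when a << b or a == b, reduce when a >> b) and keeps the nonterminal E on the stack explicitly. The names ROWS, REL, top_terminal, and parse are invented for this example.

```python
TERMINALS = ["id", "-", "(", "*", "+", ")", "$"]
# One row per stack terminal a; one column per input terminal b.
ROWS = {
    "id": "err err err >> >> >> >>",
    ")":  "err err err >> >> >> >>",
    "-":  "<< << << >> >> >> >>",
    "*":  "<< << << >> >> >> >>",
    "+":  "<< << << << >> >> >>",
    "(":  "<< << << << << == err",
    "$":  "<< << << << << err acc",
}
REL = {(a, b): rel
       for a, row in ROWS.items()
       for b, rel in zip(TERMINALS, row.split())}
PRODUCTIONS = [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"],
               ["-", "E"], ["id"]]

def top_terminal(stack):
    """Return the right-most terminal on the stack, skipping nonterminals."""
    return next(sym for sym in reversed(stack) if sym != "E")

def parse(tokens):
    """Return the list of reductions performed, or raise SyntaxError."""
    stack, rest, reductions = ["$"], list(tokens) + ["$"], []
    while True:
        a, b = top_terminal(stack), rest[0]
        rel = REL[(a, b)]
        if rel == "acc":
            if stack != ["$", "E"]:
                raise SyntaxError("no expression to parse")
            return reductions
        if rel == "err":
            raise SyntaxError(f"unexpected {b!r} after {a!r}")
        if rel in ("<<", "=="):
            stack.append(rest.pop(0))            # shift
            continue
        # rel is ">>": pop until the top terminal is << the last popped terminal.
        handle = []
        while True:
            sym = stack.pop()
            handle.insert(0, sym)
            if sym != "E" and REL[(top_terminal(stack), sym)] == "<<":
                break
        if stack[-1] == "E":                     # nonterminal left of the handle
            handle.insert(0, stack.pop())
        if handle not in PRODUCTIONS:
            raise SyntaxError("illegal handle: " + " ".join(handle))
        stack.append("E")
        reductions.append("E --> " + " ".join(handle))

reductions = parse("id * ( id + id )".split())
for r in reductions:
    print(r)
```

For the input id * ( id + id ) the reductions come out in the same order as in the shift-reduce example of section 4.5.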

Developing the Operator-Precedence Relations

Here we describe how the operator-precedence relations are developed.

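Binary Operators: Let θ1 and θ2 be two infix binary operators:

  1. If θ1 has higher precedence than θ2, make θ1 >> θ2 and θ2 << θ1.
  2. If θ1 and θ2 have equal precedence (they may be the same operator) and are left-associative, make θ1 >> θ2 and θ2 >> θ1.
  3. If θ1 and θ2 have equal precedence and are right-associative, make θ1 << θ2 and θ2 << θ1.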

Unary Operators: Unary operators usually have higher precedence than binary operators: -A+B usually means (-A)+B instead of -(A+B). Most unary operators are prefix operators (written before their operands) so they are right-associative: - - A is -(-A). Postfix unary operators are left-associative.

Leaves of Syntax Trees: The leaves of a syntax tree are ID tokens, numeric constants, and boolean constants. Leaves must be evaluated before their values can be used by the operators so they are given higher precedence than the operators.

Parentheses: For the parentheses we must have ( == ) so the handle of the E --> ( E ) production can be found. A pair of parentheses can't be removed until all operations between the pair have been performed. A pair of outer parentheses can't be removed until all pairs of inner parentheses have been removed. An operation outside a pair of parentheses can't be performed until the pair has been removed. These rules dictate the operator-precedence relations in the following table (θ is any operator):

        (     θ     )
(       <<    <<    ==
θ       <<          >>
)       err   >>    >>

Note that it's usually a syntax error to have a ) ( combination with no intervening operator. The relations in this table can also be used for brackets and any other grouping operators.

Function Calls: In most languages the production for a function call is:

expression --> id ( expression_list )

where expression_list is one or more expressions separated by commas. The handle for this production includes zero or more commas between the parentheses so the relations in the following table are required:

        id    (     ,     )
id      err   ==    >>    >>
(       <<    <<    ==    ==
,       <<    <<    ==    ==
)       err   err   >>    >>

Reading Array Elements: In many languages, the production for reading an element of an array is:

expression --> id [ expression_list ]

The precedence relations table for function calls can be used if its parentheses are replaced by brackets.

End Markers: The $-sign at the left-end of the stack is never in any handle so it should be << all following terminals. Similarly, the $-sign at the end of the input is never shifted on to the stack so all preceding terminals should be >> it.

Precedence Functions

Suppose the grammar for an operator-precedence parser has n terminals so the table of operator-precedence relations has n + 1 rows and n + 1 columns (including the $). Often n is so large that coding and de-bugging the table of (n + 1)^2 entries is a very difficult task. Fortunately, one can usually replace the table with a pair of precedence functions that are much easier to code and de-bug.

The idea is to try and define two functions, f and g, that map the terminals into integers such that:

  1. f (a) < g (b) whenever a << b;
  2. f (a) = g (b) whenever a == b; and
  3. f (a) > g (b) whenever a >> b.
If f and g can be defined then the table of precedence relations is no longer needed: instead of referring to the table to determine the precedence relation between two terminals, a and b, the parser code compares f (a) with g (b).

Functions f and g are called precedence functions. Each function has only one argument with only n + 1 values so the two functions are much easier to code and de-bug than the large precedence table.

Note that there is some loss in error-detection capability when the precedence functions are used: they never notice any syntax errors whereas the table of precedence relations does report some of the errors. A syntax error won't be discovered until the parser tries to reduce a handle that doesn't match the right-side of any production.

Precedence functions f and g do exist for the precedence-relations table near the beginning of this section of the notes. As an example, the table is repeated below with values of the precedence functions in the f -column and the g-row shown next to the corresponding terminals:

      g    6     6     6     4     2     1     0
 f         id    -     (     *     +     )     $
 5    id   err   err   err   >>    >>    >>    >>
 5    )    err   err   err   >>    >>    >>    >>
 5    -    <<    <<    <<    >>    >>    >>    >>
 5    *    <<    <<    <<    >>    >>    >>    >>
 3    +    <<    <<    <<    <<    >>    >>    >>
 1    (    <<    <<    <<    <<    <<    ==    err
 1    $    <<    <<    <<    <<    <<    err   acc

For every <<, ==, and >> entry in the table one can see that the comparison of the corresponding f and g function values matches the entry. If the parser uses the precedence functions instead of the table it must have a separate test for the accept state and it won't notice a syntax error until it tries to reduce an illegal handle.
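
The claim can be checked mechanically. The Python sketch below copies the table and the f and g values from this section and compares them entry by entry; the names ROWS, F, G, relation, and mismatches are invented for this example.

```python
TERMINALS = ["id", "-", "(", "*", "+", ")", "$"]
# Rows of the precedence table: stack terminal a versus input terminal b.
ROWS = {
    "id": "err err err >> >> >> >>",
    ")":  "err err err >> >> >> >>",
    "-":  "<< << << >> >> >> >>",
    "*":  "<< << << >> >> >> >>",
    "+":  "<< << << << >> >> >>",
    "(":  "<< << << << << == err",
    "$":  "<< << << << << err acc",
}
F = {"id": 5, ")": 5, "-": 5, "*": 5, "+": 3, "(": 1, "$": 1}
G = {"id": 6, "-": 6, "(": 6, "*": 4, "+": 2, ")": 1, "$": 0}

def relation(a, b):
    """Precedence relation between a and b as given by f(a) and g(b)."""
    if F[a] < G[b]:
        return "<<"
    if F[a] > G[b]:
        return ">>"
    return "=="

# Collect every <<, ==, or >> table entry that the functions get wrong.
mismatches = [(a, b)
              for a, row in ROWS.items()
              for b, rel in zip(TERMINALS, row.split())
              if rel in ("<<", "==", ">>") and relation(a, b) != rel]
print(mismatches)

# The functions cannot express err: the table rejects id followed by id,
# but the functions report an ordinary relation for that pair.
print(relation("id", "id"))
```

The second print shows the loss of error detection: the err entry for id followed by id becomes an ordinary << relation.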

[5.3] - Bottom-Up Evaluation Of S-Attributed Definitions

Section 4.5 describes a shift-reduce parser: the parser shifts input symbols on to a stack until it finds a handle on the top of the stack which it then reduces by popping off the handle and pushing the left-side of the appropriate production on to the stack. As an example, assume that:

A --> X Y Z

is a production in the grammar and the stack contains:

Z <-- top
Y
X
. . .
$ <-- bottom

If the shift-reduce parser decides that XYZ is indeed a handle then it pops off Z, Y, and X, and pushes A on to the stack:

A <-- top
. . .
$ <-- bottom

In an S-attributed definition all attributes are synthesized. An attribute associated with a grammar symbol should remain associated with that grammar symbol when the symbol is stacked so we give each item on the stack two fields: one field holding a grammar symbol and the other field holding the synthesized attributes of that grammar symbol (or a pointer to them.)

Every time the shift-reduce parser performs a reduction it reads the attributes of the symbols popped off the stack, computes the attributes associated with the nonterminal on the left-side of the production, and places them in the attribute field of the item pushed on to the stack.

To continue the foregoing example, assume nonterminals A, X, Y, and Z, are associated with synthesized attributes A.a, X.x, Y.y, and Z.z, respectively, and assume there is a semantic rule associated with the production as follows:

A --> X Y Z { A.a := f (X.x, Y.y, Z.z ) ; }

Before the reduction the stack contains:

Z Z.z <-- top
Y Y.y
X X.x
. . . . . .
$ . . . <-- bottom

When it reduces XYZ to A the parser reads Z.z, Y.y, and X.x from the items popped off the stack, computes A.a = f (X.x, Y.y, Z.z ) and pushes one item on the stack containing A and A.a :

A A.a <-- top
. . . . . .
$ . . . <-- bottom
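
A minimal Python sketch of this reduction follows. Each stack item is a (symbol, attribute) pair; the function name reduce and the sample rule passed for f are assumptions made only for illustration.

```python
def reduce(stack, left, right, rule):
    """Pop the handle `right` off `stack`; push `left` with its attribute."""
    n = len(right)
    handle = stack[-n:]
    if [symbol for symbol, _ in handle] != right:
        raise ValueError("handle is not on the top of the stack")
    del stack[-n:]
    attrs = [attr for _, attr in handle]      # X.x, Y.y, Z.z, left to right
    stack.append((left, rule(*attrs)))        # A.a := f(X.x, Y.y, Z.z)

# The stack bottom is on the left; attribute values are arbitrary samples.
stack = [("$", None), ("X", 2), ("Y", 3), ("Z", 4)]
reduce(stack, "A", ["X", "Y", "Z"], lambda x, y, z: x + y * z)
print(stack)
```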

Operator-Precedence Parsing

An operator-precedence parser finds handles by evaluating the precedence relations between terminals on the stack. Evaluation of these relations is complicated by the fact that the parser must skip over any nonterminals it finds on the stack. Since every stacked nonterminal must be above a terminal (or the $-item marking the bottom of the stack) one can simplify precedence-relation evaluation by placing each nonterminal with the terminal below it. Each stacked item has four fields:

  1. a field to hold a terminal symbol;
  2. a field to hold the attributes of the terminal symbol;
  3. a field to hold the nonterminal that is stacked above the terminal (or NULL if no such nonterminal exists); and
  4. a field to hold the attributes of the nonterminal (or NULL if the nonterminal doesn't exist).
Often the grammar contains only one nonterminal (e.g., expression ) in which case there is no need for field-3 in the stacked items.

As an example, assume a grammar for expressions where mulop-operators have precedence over addop-operators, nonterminal E has an associated synthesized attribute, E.syn, and every terminal has an attribute (.attr) attached to it. Assume the grammar has the following production and semantic rule:

E --> E1 mulop E2 { E.syn := f ( E1.syn, mulop.attr, E2.syn ) ; }

Assume the parser stack contains:

$ ... addop E1 mulop E2

If each stacked item contains field-1, field-2, and field-4 then the stack looks like:

field-1   field-2      field-4
mulop     mulop.attr   E2.syn    <-- top
addop     addop.attr   E1.syn
. . .     . . .        . . .
$         . . .        . . .     <-- bottom

Assume the next input symbol is addop or any other terminal with lower precedence than mulop. The parser can easily identify the handle (E mulop E ) by examining the precedence relations between the terminals in field-1 of the items on the top of the stack and between the topmost terminal and the input symbol. After it pops off the handle and evaluates E.syn = f ( E1.syn, mulop.attr, E2.syn ), the parser places E.syn in field-4 of the addop terminal:

field-1   field-2      field-4
addop     addop.attr   E.syn     <-- top
. . .     . . .        . . .
$         . . .        . . .     <-- bottom
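
The stack layout with field-1, field-2, and field-4 can be sketched in Python as follows. This is only an illustration: the Item class, reduce_binary, and the evaluation rule f are assumptions, and the sample attributes are plain numbers so that the result can be checked by hand.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Item:
    terminal: str      # field-1: the terminal symbol
    attr: Any = None   # field-2: the terminal's attribute
    syn: Any = None    # field-4: E.syn of the nonterminal stacked above it

def reduce_binary(stack, f):
    """Reduce E1 op E2, where op is the terminal in the top stack item."""
    op = stack.pop()                  # holds op.attr and E2.syn
    below = stack[-1]                 # holds E1.syn in its field-4
    below.syn = f(below.syn, op.attr, op.syn)

def f(e1, op, e2):
    """Sample semantic rule: evaluate the operator on two numbers."""
    return e1 * e2 if op == "*" else e1 + e2

# Stack for 1 + 2 * 3 once all three operands have been reduced to E.
stack = [Item("$", None, 1), Item("addop", "+", 2), Item("mulop", "*", 3)]
reduce_binary(stack, f)   # E --> E1 mulop E2: 2 * 3 is stored with addop
reduce_binary(stack, f)   # E --> E1 addop E2: 1 + 6 is stored with $
print(stack)
```

After the first reduction the result sits in field-4 of the addop item, exactly as in the second stack picture above.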

An S-Attributed Definition for Project 3

Here we show the semantic rules that should be associated with the productions in the grammar of coding project 3.

There is a single nonterminal, expr, in the grammar with a single attribute, expr.ptr, that points to the symbol table entry containing the lexeme and type-expression of the expr. The productions of the grammar are shown in the table below:

expr --> NUM                          {1}
expr --> BCONST                       {2}
expr --> LPAR expr1 RPAR              {3}
expr --> NOTOP expr1                  {4}
expr --> UNARYOP expr1                {5}
expr --> ID LBRK expr1 RBRK           {6}
expr --> expr1 ANDOP expr2            {7}
expr --> expr1 OROP expr2             {8}
expr --> expr1 ADDOP expr2            {9}
expr --> expr1 RELOP expr2            {10}
expr --> ID                           {11}
expr --> ID LPAR expr_list RPAR       {12}
expr --> expr1 MULOP expr2            {13}

Every appearance of the expr nonterminal on the right-side of a production is subscripted to distinguish it from the expr nonterminal on the left-side. The expr_list nonterminal in the right-side of production {12} is a list of one or more expr nonterminals separated by COMMA tokens. We assume that type-expressions use the format suggested here. The format of quadruples is shown here. The following items are labeled with numbers in braces that correspond to the numbered productions in the foregoing table: each item describes the semantic rules/actions that should be performed when the parser reduces by the corresponding production.

{1} Copy the pointer to the NUM-token entry into expr.ptr.

{2} Copy the pointer to the BCONST-token entry into expr.ptr.

{3} Copy expr1.ptr into expr.ptr.

{4} Check that GetType(expr1.ptr ) equals "b".

Call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and "b" as the type-expression.

Generate a NOT quadruple with GetLex(expr1.ptr ) as the source operand and newname as the result.

Copy the pointer to the new symbol table entry into expr.ptr.

{5} Check that GetType(expr1.ptr ) equals "i" or "r". If the lexeme of the UNARYOP token is the plus-sign then do nothing except copy expr1.ptr into expr.ptr.

Otherwise (if the lexeme of the UNARYOP token is the minus-sign) call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and GetType(expr1.ptr ) as the type-expression.

If the type-expression is "i" then generate a SUBI quadruple with "0" as the first source operand, GetLex(expr1.ptr ) as the second source operand, and newname as the result.

If the type-expression is "r" then generate a SUBR quadruple with "0.0" as the first source operand, GetLex(expr1.ptr ) as the second source operand, and newname as the result.

Finally, copy the pointer to the new symbol table entry into expr.ptr.

{6} Check that GetType(expr1.ptr ) equals "i". Check that the type-expression of the ID-token is "B", "I", or "R".

Call newtemp for the name of a new temporary variable, name1. Insert a new entry into the symbol table with name1 as the lexeme and "i" as the type-expression.

Generate a SUBI quadruple with GetLex(expr1.ptr ) as the first source field, the lexeme of the low-index of the array as the second source field, and name1 as the result.

Call newtemp for the name of another new temporary variable, name2. Insert a new entry into the symbol table with name2 as the lexeme and "b", "i", or "r" as the type-expression.

Generate an LDB, an LDI, or an LDR quadruple with name1 as the first source operand, the lexeme of the ID-token as the second source operand, and name2 as the result.

Copy the pointer to the symbol table entry of name2 into expr.ptr.

{7} Check that GetType(expr1.ptr ) and GetType(expr2.ptr ) equal "b".

Call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and "b" as the type-expression.

Generate an AND quadruple with GetLex(expr1.ptr ) as the first source operand, GetLex(expr2.ptr ) as the second source operand, and newname as the result.

Copy the pointer to the new symbol table entry into expr.ptr.

{8} Same as {7} except generate an OR quadruple instead of an AND quadruple.

{9} Copy expr1.ptr and expr2.ptr into temporary pointer variables, ptr1 and ptr2, respectively. Check that GetType(ptr1 ) equals "i" or "r". Check that GetType(ptr2 ) equals "i" or "r".

If GetType(ptr1 ) equals "i" and GetType(ptr2 ) equals "r" then the first source operand must be converted to a real. Do this by calling newtemp for the name of a new temporary variable, new1 ; inserting a new entry into the symbol table with new1 as the lexeme and "r" as the type-expression; generating a COPYI2R quadruple with GetLex(ptr1 ) as the source operand and new1 as the result; and then changing ptr1 to point to the new1 entry of the symbol table.

If GetType(ptr1 ) equals "r" and GetType(ptr2 ) equals "i" then the second source operand must be converted to a real. Do this by calling newtemp for the name of a new temporary variable, new2 ; inserting a new entry into the symbol table with new2 as the lexeme and "r" as the type-expression; generating a COPYI2R quadruple with GetLex(ptr2 ) as the source operand and new2 as the result; and then changing ptr2 to point to the new2 entry of the symbol table.

Both source operands are now of the same type so call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and GetType(ptr1 ) as the type-expression. Depending on the lexeme of the ADDOP token and GetType(ptr1 ) generate an ADDI, SUBI, ADDR, or SUBR quadruple with GetLex(ptr1 ) as the first source operand, GetLex(ptr2 ) as the second source operand, and newname as the result.

Copy the pointer to the newname -entry of the symbol table into expr.ptr.
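
The rules in {9} might be coded along the following lines. This Python sketch rests on several assumptions: symbol-table entries are modelled by a small Entry class, GetType and GetLex become attribute accesses, newtemp both names the temporary and inserts its entry, and quadruples are plain tuples appended to a list. None of these details come from the project specification.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Entry:              # hypothetical symbol-table entry
    lexeme: str
    type_expr: str

symtab = []               # hypothetical symbol table
quads = []                # emitted quadruples: (op, source1, source2, result)
_numbers = count(1)

def newtemp(type_expr):
    """Name a new temporary, insert its entry, and return the entry."""
    entry = Entry(f"t{next(_numbers)}", type_expr)
    symtab.append(entry)
    return entry

def to_real(ptr):
    """Emit COPYI2R for an integer operand; return the real temporary."""
    new = newtemp("r")
    quads.append(("COPYI2R", ptr.lexeme, None, new.lexeme))
    return new

def reduce_addop(ptr1, addop_lexeme, ptr2):
    """Semantic action for expr --> expr1 ADDOP expr2; returns expr.ptr."""
    for ptr in (ptr1, ptr2):
        if ptr.type_expr not in ("i", "r"):
            raise TypeError(f"{ptr.lexeme} must be an integer or a real")
    if ptr1.type_expr == "i" and ptr2.type_expr == "r":
        ptr1 = to_real(ptr1)
    elif ptr1.type_expr == "r" and ptr2.type_expr == "i":
        ptr2 = to_real(ptr2)
    result = newtemp(ptr1.type_expr)          # both operands now agree
    op = ("ADD" if addop_lexeme == "+" else "SUB") + ptr1.type_expr.upper()
    quads.append((op, ptr1.lexeme, ptr2.lexeme, result.lexeme))
    return result

x, y = Entry("x", "i"), Entry("y", "r")
result = reduce_addop(x, "-", y)
for quad in quads:
    print(quad)
```

For the integer x and the real y, the sketch first converts x and then subtracts in real arithmetic.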

{10} Copy expr1.ptr and expr2.ptr into temporary pointer variables, ptr1 and ptr2, respectively. Check that GetType(ptr1 ) equals "i" or "r". Check that GetType(ptr2 ) equals "i" or "r".

If GetType(ptr1 ) equals "i" and GetType(ptr2 ) equals "r" then the first source operand must be converted to a real. Do this by calling newtemp for the name of a new temporary variable, new1 ; inserting a new entry into the symbol table with new1 as the lexeme and "r" as the type-expression; generating a COPYI2R quadruple with GetLex(ptr1 ) as the source operand and new1 as the result; and then changing ptr1 to point to the new1 entry of the symbol table.

If GetType(ptr1 ) equals "r" and GetType(ptr2 ) equals "i" then the second source operand must be converted to a real. Do this by calling newtemp for the name of a new temporary variable, new2 ; inserting a new entry into the symbol table with new2 as the lexeme and "r" as the type-expression; generating a COPYI2R quadruple with GetLex(ptr2 ) as the source operand and new2 as the result; and then changing ptr2 to point to the new2 entry of the symbol table.

Both source operands are now of the same type so call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and "b" as the type-expression.

Copy the lexeme of the RELOP token into a temporary string variable, lex. If lex equals ">" or ">=" then exchange ptr1 with ptr2 and change lex to "<" or "<=", respectively.

Depending on lex and on GetType(ptr1 ) generate an LTI, LTEQI, EQI, NEQI, LTR, LTEQR, EQR, or NEQR quadruple with GetLex(ptr1 ) as the first source operand, GetLex(ptr2 ) as the second source operand, and newname as the result.

Copy the pointer to the newname -entry of the symbol table into expr.ptr.

{11} Check that the type-expression of the ID token equals either: "b", "i", "r", ">b", ">i", or ">r".

If the type-expression of the ID token equals either: "b", "i", or "r", then copy the pointer to the ID-token entry into expr.ptr.

If the type-expression of the ID token equals either: ">b", ">i", or ">r", then call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and "b", "i", or "r", as the type-expression, respectively. Generate a CALLB, CALLI, or CALLR quadruple with "0" in the first source field, the lexeme of the ID token in the second source field, and newname in the result field. Copy the pointer to the newname -entry of the symbol table into expr.ptr.

{12} Between the LPAR and RPAR tokens there should be a list of one or more expr nonterminals separated by COMMA tokens.

Trace through the expression list from left-to-right and generate a PARAMB, PARAMI, or PARAMR quadruple for each expression in the list. If expr_i.ptr is the attribute for expression i in the list then use GetType(expr_i.ptr ) to select the quadruple operation and put GetLex(expr_i.ptr ) into the first source field of the quadruple.

While tracing through the expression list count the number of expressions in the list and concatenate their type-expressions into a single string.

Check that the type-expression for the ID-token contains a '>' character followed by 'b', 'i', or 'r', and that the string before the '>' character agrees with the string built for the expression list.

Let last be the last character in the type-expression for the ID-token. Call newtemp for the name of a new temporary variable, newname. Insert a new entry into the symbol table with newname as the lexeme and last as the type-expression.

Depending on last generate a CALLB, CALLI, or CALLR quadruple with the number of expressions in the expression list in the first source field, the lexeme of the ID-token in the second source field, and newname as the result.

Copy the pointer to the newname -entry of the symbol table into expr.ptr.

{13} The four lexemes for the mulop token ("*", "/", "div", and "mod") have different type-checking rules.

The "*" lexeme has the same type-checking rules as the addop token so use the rules in {9} except generate a MULI or MULR quadruple.

The "/" lexeme is always a real division. If GetType(expr1.ptr ) equals "i" then generate a COPYI2R quadruple to convert expr1 to a real and store it in a new temporary variable. If GetType(expr2.ptr ) equals "i" then generate a COPYI2R quadruple to convert expr2 to a real and store it in a new temporary variable. Then generate a DIVR quadruple and copy the pointer to its real result into expr.ptr.

The "div" and "mod" lexemes are always integer operations and won't accept real source operands. Check to make sure that GetType(expr1.ptr ) and GetType(expr2.ptr ) equal "i". Then generate a DIVI or MOD quadruple and copy the pointer to its integer result into expr.ptr.

[4.7] - LR Parsers

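A large class of grammars can be parsed using LR(k ) parsers: the "L" stands for left-to-right scanning of the input, the "R" stands for constructing a rightmost derivation in reverse, and k is the number of input symbols of lookahead used to make parsing decisions. When (k ) is omitted, k is assumed to equal 1. LR parsing has several advantages:

  1. LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written.
  2. The LR method is the most general nonbacktracking shift-reduce parsing method known, yet it can be implemented as efficiently as other shift-reduce methods.
  3. The class of grammars that can be parsed using LR methods is a proper superset of the class that can be parsed with predictive parsers.
  4. An LR parser can detect a syntax error as soon as it is possible to do so on a left-to-right scan of the input.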

The main disadvantage of LR parsing is that it's too much work to construct a parser by hand: one needs a specialized tool - an LR parser generator.

The LR Parsing Algorithm

Figure 4.29 shows a block diagram of an LR parser: an input, an output, a stack, a driver program, and a parsing table with two parts (action and goto ). The driver program is the same for all LR parsers: it reads the input string one symbol at a time and maintains a stack of the form:

s0 X1 s1 X2 s2 . . . X(m-1) s(m-1) Xm sm

where each Xi is a grammar symbol, each si is a state, and sm is on the top of the stack. The action of the driver program depends on action [ sm, ai ] where ai is the current input symbol:

  1. If action [ sm, ai ] = shift s, the parser shifts the input symbol, ai, onto the stack, and then stacks state s. The current input symbol is now a(i+1).

  2. If action [ sm, ai ] = reduce A --> β, the parser executes a reduce move using the A --> β production of the grammar. If β has r grammar symbols, first 2r symbols are popped off the stack (r state symbols and r grammar symbols) so the top of the stack is now s(m-r), then A is pushed on the stack, and then state goto [ s(m-r), A ] is pushed on the stack. The current input symbol is still ai.

  3. If action [ sm, ai ] = accept, parsing is completed.

  4. If action [ sm, ai ] = error, the parser has discovered a syntax error.
Example 4.33. Figure 4.32 illustrates the actions of an LR parser parsing the sentence, id * id + id, when the grammar has the following productions and parsing table:

(1) E --> E + T
(2) E --> T
(3) T --> T * F
(4) T --> F
(5) F --> ( E )
(6) F --> id

State            action                      goto
         id   +    *    (    )    $       E    T    F
0        s5             s4                1    2    3
1             s6                  acc
2             r2   s7        r2   r2
3             r4   r4        r4   r4
4        s5             s4                8    2    3
5             r6   r6        r6   r6
6        s5             s4                     9    3
7        s5             s4                          10
8             s6             s11
9             r1   s7        r1   r1
10            r3   r3        r3   r3
11            r5   r5        r5   r5

where si means shift and stack state i,
rj means reduce by production numbered j,
acc means accept, and
blank means error.
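
The driver program is short enough to sketch in Python. The ACTION and GOTO dictionaries below are transcribed from the parsing table of example 4.33, with missing keys standing for the blank (error) entries; the function name parse and the choice to return the list of production numbers used are assumptions made for this illustration.

```python
# Production number -> (left side, number of right-side symbols).
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}
ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}

def parse(tokens):
    """Return the production numbers used, in the order of the reductions."""
    stack, rest, output = [0], list(tokens) + ["$"], []
    while True:
        act = ACTION[stack[-1]].get(rest[0])
        if act is None:                               # blank entry: error
            raise SyntaxError(f"unexpected {rest[0]!r} in state {stack[-1]}")
        if act == "acc":
            return output
        if act[0] == "s":                             # shift symbol, then state
            stack += [rest.pop(0), int(act[1:])]
        else:                                         # reduce by production j
            j = int(act[1:])
            left, r = PRODUCTIONS[j]
            del stack[-2 * r:]                        # pop r symbols, r states
            stack += [left, GOTO[stack[-1]][left]]
            output.append(j)

output = parse("id * id + id".split())
print(output)
```

Read in reverse, the production numbers give the rightmost derivation of the sentence.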

LR Grammars

An LR grammar is a grammar for which one can construct an LR parsing table in which every entry is uniquely defined. A grammar is LR if a left-to-right shift-reduce parser can recognize handles when they appear on the top of the stack.

An LR parser doesn't need to examine the entire stack for a handle: the state symbol on the top of the stack contains all the information it needs. It can also examine the next k input symbols to help make a decision. The cases k = 0 and k = 1 are of practical interest so we only consider those cases here.

Constructing SLR Parsing Tables

The text shows three methods of constructing LR parsing tables: simple LR or SLR is the simplest method but only works for the simplest of grammars.

An item of a grammar G is a production of G with a dot at some position of the right side; e.g., if T --> X Y Z is a production of G then four items of G are:

T --> . X Y Z
T --> X . Y Z
T --> X Y . Z
T --> X Y Z .

If G has an ε-production, T --> ε, then T --> . is an item of G. Intuitively, an item indicates how much of a production has been seen so far in the parsing process; e.g., T --> X . Y Z indicates that a string derivable from X has been seen so far on the input and we hope to see a string derivable from Y Z next on the input.

If G is a grammar with start symbol S, then G ', the augmented grammar of G, is G with a new start symbol S ' and production S ' --> S. The purpose of augmenting a grammar is to indicate to the parser when it has reached the accept state: the accept state occurs when the parser tries to reduce by S ' --> S.

The Closure Operation

If I is a set of items for a grammar then closure (I ) is the set of items constructed from I by the following two rules:

  1. Every item in I is in closure (I ).
  2. If T --> α . U β is in closure (I ) and if U --> γ is a production in the grammar then add item U --> . γ to closure (I ) if it is not already there. Apply this rule until no more new items can be added.
Example 4.34: Consider the augmented expression grammar:

E ' --> E
E --> E + T
E --> T
T --> T * F
T --> F
F --> ( E )
F --> id

If set I = { E ' --> . E } then closure (I ) contains the following seven items:

E ' --> . E
E --> . E + T
E --> . T
T --> . T * F
T --> . F
F --> . ( E )
F --> . id
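
The closure computation can be sketched in Python as follows. An item is represented here as a (left side, right side, dot position) tuple; that representation and the names GRAMMAR, closure, and show are assumptions made for this illustration.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]

def closure(items):
    """Add U --> . gamma for every nonterminal U that follows a dot."""
    result = set(items)
    changed = True
    while changed:                    # repeat until no new item is added
        changed = False
        for left, right, dot in list(result):
            if dot == len(right):
                continue              # dot at the right end: nothing follows
            for new_left, new_right in GRAMMAR:
                item = (new_left, new_right, 0)
                if new_left == right[dot] and item not in result:
                    result.add(item)
                    changed = True
    return result

def show(item):
    """Format an item in the notation of these notes."""
    left, right, dot = item
    return f"{left} --> " + " ".join(right[:dot] + (".",) + right[dot:])

items = closure({("E'", ("E",), 0)})
for line in sorted(show(item) for item in items):
    print(line)
```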

The Goto Operation

If I is a set of items and X is a grammar symbol then goto (I, X ) is the closure of the set of all items [T --> α X . β] such that [T --> α . X β] is in I.

Example 4.35: If I contains the following two items:

E ' --> E .
E --> E . + T

then goto (I, + ) contains the following five items:

E --> E + . T
T --> . T * F
T --> . F
F --> . ( E )
F --> . id

The Sets-of-Items Construction

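The canonical collection of sets of items, C, for an augmented grammar is constructed as follows:

  1. Initially C contains one set: closure ({[S ' --> . S ]}).
  2. For each set of items I in C and each grammar symbol X such that goto (I, X ) is not empty and not already in C, add goto (I, X ) to C.
  3. Repeat step 2 until no more sets of items can be added to C.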

Example 4.36: Figure 4.35 shows the canonical collection of 12 sets constructed for the augmented grammar of example 4.34. Note that the first set in C, I0, is the closure of {[E ' --> . E ]} computed in example 4.34. The goto function for these 12 sets is shown as a transition diagram in figure 4.36.
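
The goto operation and the sets-of-items construction can be sketched together in Python. Items are again (left side, right side, dot position) tuples; the names GRAMMAR, closure, goto, and canonical_collection are assumptions made for this illustration, and the sets are numbered in the order they happen to be found, which need not match figure 4.35.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]

def closure(items):
    """Add U --> . gamma for every nonterminal U that follows a dot."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for left, right, dot in list(result):
            for new_left, new_right in GRAMMAR:
                item = (new_left, new_right, 0)
                if (dot < len(right) and new_left == right[dot]
                        and item not in result):
                    result.add(item)
                    changed = True
    return result

def goto(items, symbol):
    """Move the dot over `symbol` wherever possible, then take the closure."""
    moved = {(left, right, dot + 1) for left, right, dot in items
             if dot < len(right) and right[dot] == symbol}
    return closure(moved)

def canonical_collection():
    """Return the list of all sets of items reachable from the start set."""
    symbols = sorted({s for _, right in GRAMMAR for s in right})
    collection = [frozenset(closure({("E'", ("E",), 0)}))]
    for items in collection:          # the list grows while it is scanned
        for symbol in symbols:
            target = frozenset(goto(items, symbol))
            if target and target not in collection:
                collection.append(target)
    return collection

# Example 4.35: I contains E' --> E . and E --> E . + T
example = goto({("E'", ("E",), 1), ("E", ("E", "+", "T"), 1)}, "+")
collection = canonical_collection()
print(len(example), len(collection))
```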

SLR Parsing Tables

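Given an augmented grammar, G ', construct the canonical collection, C, and the function FOLLOW(T ) for every nonterminal T in the grammar. Then for every set, Ii , in C construct state i in the parsing table as follows:

  1. If [T --> α . a β] is in Ii, a is a terminal, and goto (Ii, a ) = Ij, then set action [i, a ] to "shift j".
  2. If [T --> α .] is in Ii and T is not S ', then set action [i, a ] to "reduce T --> α" for every terminal a in FOLLOW(T ).
  3. If [S ' --> S .] is in Ii, then set action [i, $ ] to "accept".
  4. If goto (Ii, T ) = Ij for a nonterminal T, then set goto [i, T ] to j.
  5. Every entry not defined by rules 1 through 4 is an error entry.
  6. The initial state of the parser is the one constructed from the set of items containing [S ' --> . S ].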

If any conflicting actions are generated by the above rules the grammar is not SLR(1). The grammar might be ambiguous or a more complex method such as Canonical LR or LALR must be used: section 4.7 of the text describes these other methods.

Example 4.38: The parsing tables for the expression grammar of example 4.33 can be constructed using the foregoing rules. The result is shown in figure 4.31.
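
As a rough check, the SLR construction can be sketched in Python for the expression grammar. Several things here are assumptions: items are (left side, right side, dot position) tuples, the FOLLOW sets are written out by hand instead of being computed, and the grammar symbols are tried in an order chosen so that the states come out numbered as in the parsing table of example 4.33.

```python
GRAMMAR = [                      # list index = production number
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {"E'", "E", "T", "F"}
SYMBOLS = ["E", "T", "F", "(", "id", "+", "*", ")"]   # assumed search order
FOLLOW = {"E": {"+", ")", "$"},                       # written out by hand
          "T": {"+", "*", ")", "$"},
          "F": {"+", "*", ")", "$"}}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for left, right, dot in list(result):
            for new_left, new_right in GRAMMAR:
                item = (new_left, new_right, 0)
                if (dot < len(right) and new_left == right[dot]
                        and item not in result):
                    result.add(item)
                    changed = True
    return result

def goto(items, symbol):
    return closure({(l, r, d + 1) for l, r, d in items
                    if d < len(r) and r[d] == symbol})

def canonical_collection():
    collection = [frozenset(closure({("E'", ("E",), 0)}))]
    for items in collection:          # the list grows while it is scanned
        for symbol in SYMBOLS:
            target = frozenset(goto(items, symbol))
            if target and target not in collection:
                collection.append(target)
    return collection

def slr_table():
    """Build the action and goto tables; raise ValueError on a conflict."""
    states = canonical_collection()
    action, goto_table = {}, {}

    def set_action(i, a, value):
        if action.setdefault((i, a), value) != value:
            raise ValueError(f"conflict in state {i} on {a!r}: not SLR(1)")

    for i, items in enumerate(states):
        for left, right, dot in items:
            if dot < len(right):                      # dot before a symbol
                x = right[dot]
                j = states.index(frozenset(goto(items, x)))
                if x in NONTERMINALS:
                    goto_table[(i, x)] = j
                else:
                    set_action(i, x, f"s{j}")
            elif left == "E'":                        # [E' --> E .]
                set_action(i, "$", "acc")
            else:                                     # [T --> alpha .]
                n = GRAMMAR.index((left, right))
                for a in FOLLOW[left]:
                    set_action(i, a, f"r{n}")
    return action, goto_table

action, goto_table = slr_table()
print(action[(0, "id")], action[(2, "*")], action[(2, "+")], action[(1, "$")])
```

Entries missing from the action dictionary correspond to the blank (error) entries of the table.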

A Comparison of Predictive Parsers with Shift-Reduce Parsers

Nonrecursive predictive parsers are discussed in section 4.4 and shift-reduce parsers are discussed in section 4.5. Both parsers read the input from left-to-right and maintain a stack of grammar symbols but their parsing operations are decidedly different as shown in the following table:

Predictive Parser                                         Shift-Reduce Parser
Top-down (LL) Parser                                      Bottom-up (LR) Parser
Stack predicts what is to come                            Stack shows what has been seen so far
The stack initially contains the start-symbol             The stack is initially empty.
of the grammar.
The stack is empty when the accept state is reached.      The stack contains the start symbol of the grammar
                                                          when the accept state is reached.
Input tokens are popped off the stack.                    Input tokens are pushed on the stack.
Left sides of productions are popped off the stack.       Right sides of productions are popped off the stack.
Right sides of productions are pushed on the stack.       Left sides of productions are pushed on the stack.

Kenneth E. Batcher - 3/16/2005