Wed, 13 May 1992 17:11:20 GMT

Related articles |
---|

LL(1) vs LL(k) parrt@ecn.purdue.edu (1992-05-09) |

Re: LL(1) vs LL(k) jos@and.nl (1992-05-09) |

Re: LL(1) vs LL(k) j-grout@uiuc.edu (1992-05-09) |

Re: LL(1) vs LL(k) mickunas@m.cs.uiuc.edu (1992-05-13) |

Newsgroups: | comp.compilers |

From: | mickunas@m.cs.uiuc.edu (Dennis Mickunas) |

Keywords: | parse, LR(1), bibliography |

Organization: | University of Illinois, Dept. of Comp. Sci., Urbana, IL |

References: | 92-05-050 92-05-054 |

Date: | Wed, 13 May 1992 17:11:20 GMT |

Recent postings have correctly mentioned that a language with an

"end-marker" is LR(0) parsable. This is just an easy way of making the

language "prefix-free", which is one of a number of characterizations of

LR(0) languages (see for example, Michael Harrison's text [1], Section

13.3; also [1] cites other relevant papers, especially by Harrison &

Havel, and Geller & Harrison). However, there still seems to be some

mystery surrounding the mechanics of LR(0) parsing:

parrt@ecn.purdue.edu (Terence J Parr) writes:

*>Given any deterministic language L, L$ (L appended with EOF) can always*

*>be described with an LR(0) grammar [Knuth65].*

and John R. Grout, j-grout@uiuc.edu writes:

*>IMHO, this result was of limited value without a usable, deterministic*

*>parser associated with the grammar... obviously, the canonical parser for*

*>the LR(0) grammar cited by this result would almost always be*

*>non-deterministic.*

*>Years of hard work determined how to efficiently augment a*

*>non-deterministic LR(0) parser with look-ahead sets and operator*

*>precedence (especially DeRemer and Pennello's elegant formulation of LALR)*

*>to make it deterministic for most languages... then (and only then) could*

*>parsers based on LR(0) grammars be really useful.*

Of course, an LR(0) parser *cannot* have any "nondeterminism".

Surely Mr. Grout is confusing a "Canonical Set of LR(0) Items" (and

associated ACTION and GOTO tables) with an LR(0) parser itself.

Operationally, one may *attempt* to build an LR(0) parser by undertaking

DeRemer's construction of computing the Canonical Set of LR(0) Items. Any

resulting "conflicts" must be resolved by look- ahead; the need for such

look-ahead implies that one has *failed* to produce an LR(0) parser, i.e.

that the grammar was *not* LR(0) to begin with.

To say that a *language* is LR(0) (perhaps because it has an

end-marker) is to say that *there exists* an LR(0) grammar for the

language. The rub is that obtaining such an LR(0) grammar is easier said

than done. Human beings typically write grammars which are LR(1), but not

LR(0). Operationally, a typical human-written grammar will cause an LR(0)

parser construction algorithm to fail.

What no one has yet mentioned in this discussion is that once you

have obtained an LR(k) grammar, G, it is possible to obtain *mechanically*

an LR(0) grammar, G' (provided that the language is prefix-free). In

fact, the resulting LR(0) grammar "completely covers" the original LR(k)

grammar (meaning that using only table look-up, a G'-parse may be

transformed into the original G-parse). Those who are interested in cover

grammars should see papers by James Gray & Michael Harrison [2], or the

work by Anton Nijholt [3,4]; the formal LR(k) to LR(0) algorithms are

presented in [5,6].

In the remainder of this note, I'll try to demonstrate one informal

technique for reducing the look-ahead of an LR(k) grammar. This is not

the technique which is presented in [5,6], but it conveys the general

idea, and it usually works rather well in practice.

I'll assume that we are presented with a grammar, G, which is LR(1)

(but not LR(0)), and that the grammar is "augmented" with an end-marker

rule. For example, the following yacc-able grammar is the conventional,

unambiguous, left-recursive, LR(1) grammar for expressions.

%token ENDMARK ADDOP MULOP LPAR RPAR ID

%%

S : expr ENDMARK ;

expr : term ;

expr : expr ADDOP term ;

term : factor ;

term : term MULOP factor ;

factor : LPAR expr RPAR ;

factor : ID ;

Here's how to transform from G into an LR(0) grammar, G'.

The first step is to transform G into a grammar which is in

operator form (so that there are never two adjacent non-terminals in the

right-part of any rule). G is already in operator form.

The next step is to compute the FOLLOW(A) set for every non-

terminal symbol, "A". This is rather easily done once the grammar is in

operator form.

The next step is to generate a family of rules from each of the

original rules. If there is a rule:

A : alpha

(where alpha is any string) in G, then for every terminal symbol, "a" in

FOLLOW(A), include in G' a new rule:

A_a : alpha a

where "A_a" is a new non-terminal symbol, thought of as the "fusion" of

"A" and one of its FOLLOW terminal symbols, "a". Thus, for example, in

our sample expression grammar, we see that

FOLLOW(expr) = {ENDMARK, ADDOP, RPAR}

so the original rule in G:

expr : expr ADDOP term

gives rise to 3 new rules in G':

expr_ENDMARK : expr ADDOP term ENDMARK

expr_ADDOP : expr ADDOP term ADDOP

expr_RPAR : expr ADDOP term RPAR

In the final step, we "fuse" pairs of non-terminal/terminal symbols

whenever they appear in the right-parts of rules. Thus, the above 3 rules

are transformed to:

expr_ENDMARK : expr_ADDOP term_ENDMARK

expr_ADDOP : expr_ADDOP term_ADDOP

expr_RPAR : expr_ADDOP term_RPAR

The net result of this transformation is the following grammar, G',

which is LR(0). If you run G' through yacc, you'll see that 1) there are

no shift-reduce or reduce-reduce conflicts, and; 2) there is indeed a

single "LR(0) reduce state" for each of the production rules. (Recall, an

LR(0) reduce state is a state in which the *only* action is to reduce by

some rule; only *one* reduction is specified, and there are no shift

actions for such an LR(0) reduce state.) The proof that L(G')=L(G) is not

difficult (see [5,6] for the technique).

%token ENDMARK ADDOP MULOP LPAR RPAR ID

%%

S : expr_ENDMARK ;

expr_ENDMARK : term_ENDMARK ;

expr_ENDMARK : expr_ADDOP term_ENDMARK ;

expr_ADDOP : term_ADDOP ;

expr_ADDOP : expr_ADDOP term_ADDOP ;

expr_RPAR : term_RPAR ;

expr_RPAR : expr_ADDOP term_RPAR ;

term_ENDMARK : factor_ENDMARK ;

term_ENDMARK : term_MULOP factor_ENDMARK ;

term_ADDOP : factor_ADDOP ;

term_ADDOP : term_MULOP factor_ADDOP ;

term_MULOP : factor_MULOP ;

term_MULOP : term_MULOP factor_MULOP ;

term_RPAR : factor_RPAR ;

term_RPAR : term_MULOP factor_RPAR ;

factor_ENDMARK : LPAR expr_RPAR ENDMARK ;

factor_ENDMARK : ID ENDMARK ;

factor_ADDOP : LPAR expr_RPAR ADDOP ;

factor_ADDOP : ID ADDOP ;

factor_MULOP : LPAR expr_RPAR MULOP ;

factor_MULOP : ID MULOP ;

factor_RPAR : LPAR expr_RPAR RPAR ;

factor_RPAR : ID RPAR ;

Of course, the downside to the transformation is that the grammar

has increased rather dramatically in size. The LR(1) grammar, G, had 8

rules and induced a parser with 13 states using Berkeley yacc; the LR(0)

grammar, G', has 34 rules and induces a parser with 35 states.

But the real test of this transformation comes next. I have

constructed a rather straightforward LR(1) grammar for Pascal, adapted

from the syntax description in [7]. Terminal symbols are at the lexical

level of IDENTIFIER, UNSIGNED_INTEGER, UNSIGNED_REAL, STRING, ADDOP,

MULOP, RELOP and the rest of Pascal's special symbols and reserved words.

The grammar contains 148 production rules, and induces a parser with 340

states using Berkeley yacc. It has been cast into operator form (which

was not a terribly difficult task).

The mechanically transformed LR(0) grammar has 808 (!) production

rules, and induces a parser with 1662 (!) states using Berkeley yacc. (Of

course, almost half of those states are LR(0) reduce states. The reason

we use Berkeley yacc is that many versions of yacc cannot handle so many

rules.)

The yacc-able LR(1) Pascal grammar, the transformed LR(0) grammar,

as well as a modified version of this note are available in

pub/pascal.lr1

pub/pascal.lr0

pub/pascal.descr

via anonymous ftp from a.cs.uiuc.edu. Although pascal.lr0 is copyrighted,

it is made freely available with the only requirement that you cite its

source.

[1] Harrison, M. A., _Introduction to Formal Language Theory_,

Addison-Wesley, Reading, MA. (1978).

[2] Gray, J.N., and M.A. Harrison, "On the covering and reduction

problems for context-free grammars," _Journal of the ACM 19_,

pp. 675-698 (1972).

[3] Nijholt, A., "On the Covering of Parsable Grammars," _Journal

of Computer and System Sciences 15_, pp.99-110 (1977).

[4] Nijholt, A., "A Survey of Normal Form Covers for Context Free

Grammars," Informatica Rapport 49, Vrije Universiteit (1979).

[5] Mickunas, M.D., "On the Complete Covering Problem for LR(k)

Grammars," _Journal of the ACM 23_, pp.17-30 (1976).

[6] Mickunas, M.D., R.L. Lancaster, and V.B. Schneider, "Transforming

LR(k) Grammars to LR(1), SLR(1), and (1,1) Bounded Right-Context

Grammars," _Journal of the ACM 23_, pp.511-533 (1976).

[7] Wirth, N. and K. Jensen, _Pascal User Manual and Report_ 2nd Ed.,

Springer-Verlag, New York, Heidelberg, Berlin (1974).

M. Dennis Mickunas

Department of Computer Science 1304 W. Springfield Ave.

University of Illinois Urbana, Illinois 61801

mickunas@cs.uiuc.edu (217) 333-6351

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.