Re: LL(1) vs LL(k)

mickunas@m.cs.uiuc.edu (Dennis Mickunas)
Wed, 13 May 1992 17:11:20 GMT

          From comp.compilers

Related articles
LL(1) vs LL(k) parrt@ecn.purdue.edu (1992-05-09)
Re: LL(1) vs LL(k) jos@and.nl (1992-05-09)
Re: LL(1) vs LL(k) j-grout@uiuc.edu (1992-05-09)
Re: LL(1) vs LL(k) mickunas@m.cs.uiuc.edu (1992-05-13)
| List of all articles for this month |

Newsgroups: comp.compilers
From: mickunas@m.cs.uiuc.edu (Dennis Mickunas)
Keywords: parse, LR(1), bibliography
Organization: University of Illinois, Dept. of Comp. Sci., Urbana, IL
References: 92-05-050 92-05-054
Date: Wed, 13 May 1992 17:11:20 GMT

              Recent postings have correctly mentioned that a language with an
"end-marker" is LR(0) parsable. This is just an easy way of making the
language "prefix-free", which is one of a number of characterizations of
LR(0) languages (see for example, Michael Harrison's text [1], Section
13.3; also [1] cites other relevant papers, especially by Harrison &
Havel, and Geller & Harrison). However, there still seems to be some
mystery surrounding the mechanics of LR(0) parsing:


parrt@ecn.purdue.edu (Terence J Parr) writes:
>Given any deterministic language L, L$ (L appended with EOF) can always
>be described with an LR(0) grammar [Knuth65].


and John R. Grout, j-grout@uiuc.edu writes:
>IMHO, this result was of limited value without a usable, deterministic
>parser associated with the grammar... obviously, the canonical parser for
>the LR(0) grammar cited by this result would almost always be
>non-deterministic.


>Years of hard work determined how to efficiently augment a
>non-deterministic LR(0) parser with look-ahead sets and operator
>precedence (especially DeRemer and Pennello's elegant formulation of LALR)
>to make it deterministic for most languages... then (and only then) could
>parsers based on LR(0) grammars be really useful.


              Of course, an LR(0) parser *cannot* have any "nondeterminism".
Surely Mr. Grout is confusing a "Canonical Set of LR(0) Items" (and
associated ACTION and GOTO tables) with an LR(0) parser itself.
Operationally, one may *attempt* to build an LR(0) parser by undertaking
DeRemer's construction of computing the Canonical Set of LR(0) Items. Any
resulting "conflicts" must be resolved by look- ahead; the need for such
look-ahead implies that one has *failed* to produce an LR(0) parser, i.e.
that the grammar was *not* LR(0) to begin with.


              To say that a *language* is LR(0) (perhaps because it has an
end-marker) is to say that *there exists* an LR(0) grammar for the
language. The rub is that obtaining such an LR(0) grammar is easier said
than done. Human beings typically write grammars which are LR(1), but not
LR(0). Operationally, a typical human-written grammar will cause an LR(0)
parser construction algorithm to fail.


              What no one has yet mentioned in this discussion is that once you
have obtained an LR(k) grammar, G, it is possible to obtain *mechanically*
an LR(0) grammar, G' (provided that the language is prefix-free). In
fact, the resulting LR(0) grammar "completely covers" the original LR(k)
grammar (meaning that using only table look-up, a G'-parse may be
transformed into the original G-parse). Those who are interested in cover
grammars should see papers by James Gray & Michael Harrison [2], or the
work by Anton Nijholt [3,4]; the formal LR(k) to LR(0) algorithms are
presented in [5,6].


              In the remainder of this note, I'll try to demonstrate one informal
technique for reducing the look-ahead of an LR(k) grammar. This is not
the technique which is presented in [5,6], but it conveys the general
idea, and it usually works rather well in practice.


              I'll assume that we are presented with a grammar, G, which is LR(1)
(but not LR(0)), and that the grammar is "augmented" with an end-marker
rule. For example, the following yacc-able grammar is the conventional,
unambiguous, left-recursive, LR(1) grammar for expressions.


          %token ENDMARK ADDOP MULOP LPAR RPAR ID
          %%
          S : expr ENDMARK ;
          expr : term ;
          expr : expr ADDOP term ;
          term : factor ;
          term : term MULOP factor ;
          factor : LPAR expr RPAR ;
          factor : ID ;


Here's how to transform from G into an LR(0) grammar, G'.


              The first step is to transform G into a grammar which is in
operator form (so that there are never two adjacent non-terminals in the
right-part of any rule). G is already in operator form.


              The next step is to compute the FOLLOW(A) set for every non-
terminal symbol, "A". This is rather easily done once the grammar is in
operator form.


              The next step is to generate a family of rules from each of the
original rules. If there is a rule:


              A : alpha


(where alpha is any string) in G, then for every terminal symbol, "a" in
FOLLOW(A), include in G' a new rule:


              A_a : alpha a


where "A_a" is a new non-terminal symbol, thought of as the "fusion" of
"A" and one of its FOLLOW terminal symbols, "a". Thus, for example, in
our sample expression grammar, we see that


              FOLLOW(expr) = {ENDMARK, ADDOP, RPAR}


so the original rule in G:


              expr : expr ADDOP term


gives rise to 3 new rules in G':


              expr_ENDMARK : expr ADDOP term ENDMARK
              expr_ADDOP : expr ADDOP term ADDOP
              expr_RPAR : expr ADDOP term RPAR


              In the final step, we "fuse" pairs of non-terminal/terminal symbols
whenever they appear in the right-parts of rules. Thus, the above 3 rules
are transformed to:


              expr_ENDMARK : expr_ADDOP term_ENDMARK
              expr_ADDOP : expr_ADDOP term_ADDOP
              expr_RPAR : expr_ADDOP term_RPAR


              The net result of this transformation is the following grammar, G',
which is LR(0). If you run G' through yacc, you'll see that 1) there are
no shift-reduce or reduce-reduce conflicts, and; 2) there is indeed a
single "LR(0) reduce state" for each of the production rules. (Recall, an
LR(0) reduce state is a state in which the *only* action is to reduce by
some rule; only *one* reduction is specified, and there are no shift
actions for such an LR(0) reduce state.) The proof that L(G')=L(G) is not
difficult (see [5,6] for the technique).


          %token ENDMARK ADDOP MULOP LPAR RPAR ID
          %%
          S : expr_ENDMARK ;
          expr_ENDMARK : term_ENDMARK ;
          expr_ENDMARK : expr_ADDOP term_ENDMARK ;
          expr_ADDOP : term_ADDOP ;
          expr_ADDOP : expr_ADDOP term_ADDOP ;
          expr_RPAR : term_RPAR ;
          expr_RPAR : expr_ADDOP term_RPAR ;
          term_ENDMARK : factor_ENDMARK ;
          term_ENDMARK : term_MULOP factor_ENDMARK ;
          term_ADDOP : factor_ADDOP ;
          term_ADDOP : term_MULOP factor_ADDOP ;
          term_MULOP : factor_MULOP ;
          term_MULOP : term_MULOP factor_MULOP ;
          term_RPAR : factor_RPAR ;
          term_RPAR : term_MULOP factor_RPAR ;
          factor_ENDMARK : LPAR expr_RPAR ENDMARK ;
          factor_ENDMARK : ID ENDMARK ;
          factor_ADDOP : LPAR expr_RPAR ADDOP ;
          factor_ADDOP : ID ADDOP ;
          factor_MULOP : LPAR expr_RPAR MULOP ;
          factor_MULOP : ID MULOP ;
          factor_RPAR : LPAR expr_RPAR RPAR ;
          factor_RPAR : ID RPAR ;


              Of course, the downside to the transformation is that the grammar
has increased rather dramatically in size. The LR(1) grammar, G, had 8
rules and induced a parser with 13 states using Berkeley yacc; the LR(0)
grammar, G', has 34 rules and induces a parser with 35 states.


              But the real test of this transformation comes next. I have
constructed a rather straightforward LR(1) grammar for Pascal, adapted
from the syntax description in [7]. Terminal symbols are at the lexical
level of IDENTIFIER, UNSIGNED_INTEGER, UNSIGNED_REAL, STRING, ADDOP,
MULOP, RELOP and the rest of Pascal's special symbols and reserved words.
The grammar contains 148 production rules, and induces a parser with 340
states using Berkeley yacc. It has been cast into operator form (which
was not a terribly difficult task).


              The mechanically transformed LR(0) grammar has 808 (!) production
rules, and induces a parser with 1662 (!) states using Berkeley yacc. (Of
course, almost half of those states are LR(0) reduce states. The reason
we use Berkeley yacc is that many versions of yacc cannot handle so many
rules.)


              The yacc-able LR(1) Pascal grammar, the transformed LR(0) grammar,
as well as a modified version of this note are available in


                pub/pascal.lr1
                pub/pascal.lr0
                pub/pascal.descr


via anonymous ftp from a.cs.uiuc.edu. Although pascal.lr0 is copyrighted,
it is made freely available with the only requirement that you cite its
source.


[1] Harrison, M. A., _Introduction to Formal Language Theory_,
        Addison-Wesley, Reading, MA. (1978).


[2] Gray, J.N., and M.A. Harrison, "On the covering and reduction
        problems for context-free grammars," _Journal of the ACM 19_,
        pp. 675-698 (1972).


[3] Nijholt, A., "On the Covering of Parsable Grammars," _Journal
        of Computer and System Sciences 15_, pp.99-110 (1977).


[4] Nijholt, A., "A Survey of Normal Form Covers for Context Free
        Grammars," Informatica Rapport 49, Vrije Universiteit (1979).


[5] Mickunas, M.D., "On the Complete Covering Problem for LR(k)
        Grammars," _Journal of the ACM 23_, pp.17-30 (1976).


[6] Mickunas, M.D., R.L. Lancaster, and V.B. Schneider, "Transforming
        LR(k) Grammars to LR(1), SLR(1), and (1,1) Bounded Right-Context
        Grammars," _Journal of the ACM 23_, pp.511-533 (1976).


[7] Wirth, N. and K. Jensen, _Pascal User Manual and Report_ 2nd Ed.,
        Springer-Verlag, New York, Heidelberg, Berlin (1974).


M. Dennis Mickunas
Department of Computer Science 1304 W. Springfield Ave.
University of Illinois Urbana, Illinois 61801
mickunas@cs.uiuc.edu (217) 333-6351
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.