Re: LR-regular parsers for dummies ?

Chris F Clark <world!cfc@uunet.uu.net>
31 Oct 1999 23:57:22 -0500

          From comp.compilers

Related articles
LR-regular parsers for dummies ? dvandeun@vub.ac.be (1999-10-28)
Re: LR-regular parsers for dummies ? laski@ics.uci.edu (Ziemowit Laski) (1999-10-29)
Re: LR-regular parsers for dummies ? chrisd@reservoir.com (Chris Dodd) (1999-10-29)
Re: LR-regular parsers for dummies ? dvandeun@vub.ac.be (1999-10-31)
Re: LR-regular parsers for dummies ? world!cfc@uunet.uu.net (Chris F Clark) (1999-10-31)
Re: LR-regular parsers for dummies ? world!cfc@uunet.uu.net (Chris F Clark) (1999-10-31)
| List of all articles for this month |

From: Chris F Clark <world!cfc@uunet.uu.net>
Newsgroups: comp.compilers
Date: 31 Oct 1999 23:57:22 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 99-10-142 99-10-171 99-10-176
Keywords: parse, LR(1)

dvandeun@vub.ac.be (Dirk van Deun) writes:
> Real grammars may tend to be LR(1) as you say, but embedding actions
> in a bison script may destroy the original LR(1)-ness (because a
> reduce decision must be made sooner.) More powerful parser generators
> might make life easier for the programmer and allow him/her to simply
> copy the language specification into a parser generating script and
> add embedded actions anywhere, without any need for tweaking.


Ah, if this is the class of grammars you are most interested in, then
there are three "popular" choices.


First, there are the GLR (or Tomita/Lang) grammars. Bernard Lang
discovered a way of effectively doing Earley parsing using LR tables.
Tomita did a practical implementation.


The advantages of this method are that one uses essentially the same
LR tables as one would to drive a traditional LR parser generator. I
say essentially, because at conflicts one enters all the conflicting
actions in the table, rather than selecting one action (and printing a
conflict error message). At run-time, the parsing engine travels all
the conflicting directions in parallel and builds a "parse forest"
representing all the possible parses. As some of the conflicting
potential parses run into errors, you remove them form the forest.


When, you get done, if there was exactly one legal parse, the parse
forest gets pruned down to the correct parse tree. BTW, this holds
for any unambiguous CFL, any unambiguous CFL will produce a parse tree
using this method. If the grammar was ambiguous, the result of the
parse is a parse forest showing all the potential parses.


The disadvantage of this method (to my mind) is that it doesn't tell
you if your grammar is ambiguous or not. It can only tell you that
your grammar is ambiguous if you feed it an input that happens to
generate a forest and not a tree. If you still generate conflict
messages at the point of adding parallel actions to the table, you can
still know whether the grammar is LR or not. However, you need a
separate reasoning process if you know your grammar is not LR, but
think it still might be unambiguous but aren't sure. Given that the
general ambiguity problem for CFL's is undecidable, this may be the
best one can do.


It is also worth mentioning that the extra work to maintain the parse
forest, means that the algorithm is not linear (even if the forest
resolves to a tree). However, since Earley's algorithm can be shown
to operate in N**3 time (and Thomas Schoebel-Theuer has apparently
developed an N**2 version (my German is not sufficient to read all the
details his work to verify that)), the GLR parsers should be codable
to do the same.


Web searches on Tomita and GLR will turn up several parsers that
implement this technology. The CWI group (Jan Rekers et. al.) in
Amsterdam and Visual Parse++ are two examples.


Second, there are backtracking parsing techniques. Instead of
pursuing the paths in parallel, it is also possible to pursue the
paths using a depth first search.


The advantages of this approach is that if one picks the correct path
to begin with and the grammar is unambiguous, then the running time is
still linear (the worst case bounds for unambiguous grammars is again
N**2). In addition, one will always get a unique parse tree, even if
the grammar is unambiguous. (That can be considered either a benefit
or drawback, since you lose the information of the parses that were
not selected.) Unfortunately, some backtracking approaches can get
lost in pursuing infinite ambiguities if the wrong path is selected.


BTYacc is a version of yacc augmented to do backtracking.


Third, a variation on the backtracking schemes are the predicated
parsers. In these schemes, one adds syntactic (or semantic)
predicates that guide the parse. The predicates are generally
processed by backtracking--i.e. a depth first search of the predicates
in a well-defined order. However, once a predicate is matched,
parsing is then linear using the standard LL or LR algorithm.


The advantages and disadvantages of this approach are the same as the
backtracking approach, with one addition. The predicates can be
considered a specification of how the backtracking is to be performed
(thus giving the grammar writer control over the backtracking
process). A useful way of thinking of it, is that at any point where
predicates apply, the set of predicates become a little language that
is resolved in advance to determine how the parse is to proceed.
(Notice how that is similar to how the LR-regular approach might allow
users to write the separating sets.) The little language is processed
using a powerful parsing method (the choice is traditionally
backtracking, but one could do GLR instead) to give it extra "parsing
power". Note, that in practice predicates are only added to
productions that cannot be resolved by the more traditional techniques
of conflict resolution.


PCCTS/ANTLR and Yacc++ are predicated parsers.


I believe any of the above methods is sufficient to handle a wide
variety of non-LR languages (e.g. palindromes and a**i b**i c**i).


However, one caveat is worth mentioning, when it comes to actions ANY
parsing methodology more powerful than LL(1) may have to look-ahead in
the token stream past the next token to determine what action to
apply. Thus, if your grammar depends on having the actions be applied
precisely in order as the tokens are being read (i.e. it must be
on-line), then your grammar must be locally LL(1) (i.e. no look-ahead
or backtracking required; there can be ambiguities provided that all
the ambiguities imply the same action) at each action point and no
parsing method can help you. If you can wait to apply the actions
(i.e. the actions can be off-line), and one usually can, then any of
the above methods will work.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
------------------------------------------------------------------------------
Web Site in Progress: Web Site : http://world.std.com/~compres


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.