Re: An "open" letter to Karsten Nyblad (and other compiler compiler implementors)

Chris F Clark <cfc@shell01.TheWorld.com>
23 Jun 2005 22:05:13 -0400


From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 23 Jun 2005 22:05:13 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 05-06-099 05-06-104
Keywords: LR(1), parse
Posted-Date: 23 Jun 2005 22:05:13 EDT

Vit wrote:
> I only wonder, have not languages designers become cleverer in the
> recent decades to invent languages that do not require the full
> strength of LR(k)?


SM Ryan:
> Why not just use LR(k)? Inertia? Methods to avoid the combinatorial
> explosion of lookaheads have been known for years.


Many practical languages are both LALR(1) and nearly LL(1) (LL(k <=
~4) with some hacks to handle nested if-then-else). It doesn't
generally buy one much to go to full LR(k).
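
To make the if-then-else hack concrete, here is a minimal recursive-descent
sketch in Python (the grammar and token shapes are invented for illustration,
not taken from any particular language): the usual trick is simply to bind
each "else" greedily to the nearest open "if".

    # A minimal sketch of the usual dangling-else hack: bind "else" greedily
    # to the nearest unmatched "if".  Grammar and token shapes are invented.
    def parse_stmt(toks):
        if toks and toks[0] == "if":
            toks = toks[1:]
            cond, toks = toks[0], toks[1:]       # pretend the condition is one token
            then_branch, toks = parse_stmt(toks)
            else_branch = None
            if toks and toks[0] == "else":       # greedy: attaches to this (nearest) if
                else_branch, toks = parse_stmt(toks[1:])
            return ("if", cond, then_branch, else_branch), toks
        return ("stmt", toks[0]), toks[1:]

    tree, rest = parse_stmt(["if", "a", "if", "b", "s1", "else", "s2"])
    print(tree)  # ('if', 'a', ('if', 'b', ('stmt', 's1'), ('stmt', 's2')), None)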


However, that said, many recent languages (C++, Perl, and I believe
also Python and Haskell) are not strictly in this category. The first
two are notoriously hard to parse. C++ requires at least controlled
backtracking (or something equivalent, such as GLR parsing) to
implement the "if it looks like a declaration it is, otherwise it's an
expression" rule. Perl has all sorts of context-sensitive syntaxes
where, unless one knows the context (e.g. are we inside a regex or
not), the meaning of individual characters can vary dramatically. I've
heard that the only correct and complete parser for Perl is the one
written in Perl. The last two languages use indentation as part of
their syntax, and that doesn't have a convenient translation into
traditional context-free grammars.
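
To illustrate the kind of controlled backtracking involved, here is a
minimal sketch in Python. It is nothing like C++'s real grammar: the toy
statement form, the TYPE_NAMES set, and the node shapes are all invented,
but it shows the shape of the rule: try the declaration parse first, and
only if it fails re-parse the same tokens as an expression.

    # A toy version of the "if it looks like a declaration it is, otherwise
    # it's an expression" rule.  TYPE_NAMES and the node shapes are invented.
    TYPE_NAMES = {"T"}   # assumed context: which identifiers name types

    def parse_statement(toks):
        tree = complete_parse(parse_declaration, toks)
        if tree is not None:              # it looked like a declaration, so it is one
            return tree
        return complete_parse(parse_expression, toks)   # back up and re-parse

    def complete_parse(rule, toks):
        result, rest = rule(toks)
        if result is not None and rest == [";"]:
            return result
        return None                       # failure: caller may try another rule

    def parse_declaration(toks):
        # decl := type-name "(" declarator ")"   e.g. "T (x);" declares x of type T
        if len(toks) >= 4 and toks[0] in TYPE_NAMES and toks[1] == "(" and toks[3] == ")":
            return ("decl", toks[0], toks[2]), toks[4:]
        return None, toks

    def parse_expression(toks):
        # expr := name "(" name ")"              a call with a single argument
        if len(toks) >= 4 and toks[1] == "(" and toks[3] == ")":
            return ("call", toks[0], toks[2]), toks[4:]
        return None, toks

    print(parse_statement(["T", "(", "x", ")", ";"]))   # ('decl', 'T', 'x')
    print(parse_statement(["f", "(", "x", ")", ";"]))   # ('call', 'f', 'x')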


Of these different problem languages, Perl is the only one I would
expect to be even potentially easier to parse with a fixed-k LR(k)
parser, as it might be possible to increase the context enough to
resolve some of the syntax issues. So, in that sense, LR(k) is not a
solution.


On the other hand, LR(k) does make writing grammars easier. Sometimes
one has a construct for which the natural grammar isn't quite LALR(1)
and one just needs a little help from either the left context (LR(1))
or the right context (LALR(k)) to get past the problem patch. From
what Karsten wrote, it sounds like that is what he is building. It
sounds like he is borrowing a page from JavaCC (or perhaps it was
PCCTS that first introduced the idea (see the note below)), where
they have LOOKAHEAD(k) directives in the grammar to solve the same
problem in the LL realm.
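
As a rough sketch of that idea (this is hand-written Python, not JavaCC or
PCCTS syntax, and the tiny grammar is invented): the LL(1) decision fails
because both alternatives start with an identifier, but peeking k=2 tokens
picks the right alternative without any backtracking.

    # A sketch of the idea behind a LOOKAHEAD(k) directive.  Both alternatives
    # of `stmt` begin with an identifier, so one token of lookahead is not
    # enough, but peeking k=2 tokens decides the alternative up front.
    class Parser:
        def __init__(self, toks):
            self.toks, self.pos = toks, 0

        def peek(self, k=1):
            i = self.pos + k - 1              # k-th upcoming token (1-based)
            return self.toks[i] if i < len(self.toks) else None

        def advance(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def stmt(self):
            # stmt := IDENT "=" IDENT       (assignment)
            #       | IDENT "(" IDENT ")"   (call)
            if self.peek(2) == "=":           # the LOOKAHEAD(2) decision point
                name = self.advance(); self.advance(); value = self.advance()
                return ("assign", name, value)
            name = self.advance(); self.advance(); arg = self.advance(); self.advance()
            return ("call", name, arg)

    print(Parser(["x", "=", "y"]).stmt())       # ('assign', 'x', 'y')
    print(Parser(["f", "(", "y", ")"]).stmt())  # ('call', 'f', 'y')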


Note: I think PCCTS automatically extended k as needed, whereas
JavaCC gives (requires) the user control over where the larger k is
needed. However, it is worth noting that simply increasing k does not
necessarily resolve the problems: some grammars are simply not LL(k)
or LR(k) for any k (ambiguous grammars are specifically in that
group), while other grammars may be LL(k) or LR(k) only for some
larger k. The worst part of that problem is that it can be shown that
there is no algorithm which can tell one a priori, for an arbitrary
grammar, whether increasing k will eventually solve the problem or
whether one will simply continue to increase k forever (loop) and
never converge on a solution.


I am interested in this part because we have within Yacc++ support for
LR(k) parsing (and a little more). However, that simply divides the
class of grammars into 3 classes: those grammars that Yacc++ can
resolve and generate a parser for, those grammars which Yacc++ can
show it will never resolve (ones with irresolvable conflicts), and
those which Yacc++ loops on. The last class is problematic: one
doesn't know whether running Yacc++ for a little longer would move the
grammar into one of the other classes. As a result, we don't make the
LR(k) support a "published" option. In our opinion, it isn't
sufficiently helpful in resolving problems to justify the obnoxious
behavior of not telling the user what they need to fix. Therefore, we
keep our support at the LALR/LR(1) level and force the user to rewrite
their grammar. Karsten's idea of adding precedence-style directives is
a better compromise, and we may follow suit. Another person who is (or
was, at least for a while) pursuing this direction (a grammar
formalism larger than fixed-k LR(k)) is Ralph Boland.


It is worth mentioning that there are also other solutions to the
problem. Backtracking is one solution, with the downside that one is
no longer certain of linear parsing performance. Personally, I find
the controlled backtracking of "syntactic predicates" to be
particularly satisfying (we borrowed the idea from PCCTS and
implemented it in Yacc++). Bryan Ford's innovation of packrat
parsing is an interesting twist on the backtracking theme that
linearizes the resulting performance. (His ordered-or operator is
also a solution to the ambiguity problem, but one which in my opinion
sweeps part of the problem under the rug if used as the default with
no conflict messages--however, if used in a controlled manner it would
be no worse than predicates or precedence declarations. In fact, the
ordered-or can be mapped directly to a predicate solution. It's just
that one wants to use predicates judiciously if one is concerned with
an unambiguous and consistent language.)
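
Here is a small sketch of the packrat idea in Python, under an invented toy
grammar (expr <- num "+" expr / num): ordinary recursive descent with ordered
choice, where each rule's result is memoized per input position, which is
what keeps the backtracking linear.

    # A sketch of packrat parsing for the invented PEG  expr <- num "+" expr / num
    from functools import lru_cache

    TEXT = "1+2+3"

    @lru_cache(maxsize=None)        # the memo table for this rule, keyed by position
    def num(pos):
        if pos < len(TEXT) and TEXT[pos].isdigit():
            return ("num", TEXT[pos]), pos + 1
        return None

    @lru_cache(maxsize=None)
    def expr(pos):
        # Ordered choice: try  num "+" expr  first; only on failure fall back to  num.
        first = num(pos)
        if first and first[1] < len(TEXT) and TEXT[first[1]] == "+":
            rest = expr(first[1] + 1)
            if rest:
                return ("add", first[0], rest[0]), rest[1]
        return first                # second alternative (a bare num), or outright failure

    tree, end = expr(0)
    print(tree)   # ('add', ('num', '1'), ('add', ('num', '2'), ('num', '3')))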


GLR (or Earley) parsing is another solution. It has the advantage
that it can handle truly ambiguous grammars (so can general
backtracking). The disadvantage is that one doesn't know whether one's
grammar is ambiguous when using that method. One can only observe,
for the current input, whether one got a single parse tree out or a
parse forest (if one gets a parse forest, the grammar is ambiguous).
The GLR proponents then argue that one can apply other rules
to trim the parse forest down into an effective tree (not necessarily
turning it into a tree, but resolving the ambiguities between the
different alternatives).
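
For a concrete picture of the parse-forest point, here is a brute-force
Python enumerator (nothing like a real GLR or Earley engine) for the
deliberately ambiguous grammar E -> E "+" E | "a": the same input yields two
trees, i.e. a forest, where an unambiguous grammar would yield exactly one.

    # Enumerate every parse of "a+a+a" under the ambiguous grammar E -> E "+" E | "a".
    TEXT = "a+a+a"

    def parses(i, j):
        trees = []
        if j - i == 1 and TEXT[i] == "a":          # E -> "a"
            trees.append("a")
        for k in range(i + 1, j):                  # E -> E "+" E, split at each "+"
            if TEXT[k] == "+":
                for left in parses(i, k):
                    for right in parses(k + 1, j):
                        trees.append(("+", left, right))
        return trees

    forest = parses(0, len(TEXT))
    print(len(forest))    # 2, so this input exposes the ambiguity
    for tree in forest:
        print(tree)       # ('+', 'a', ('+', 'a', 'a')) and ('+', ('+', 'a', 'a'), 'a')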


Many of these ideas can be combined. For example, predicates and
precedence directives can be used to tell the parser generator where
one wants it to use a more general solution (either backtracking or a
parse forest approach). The choice of backtracking versus parse
forest approach is like the choice between depth-first and
breadth-first search, with backtracking representing the depth-first
approach. It's not quite an implementation issue, but for many
grammars it is close to it.


I haven't investigated the indentation problem. I suspect that a few
operators added to the grammar notation could capture that concept.
It doesn't look like an intrinsically hard problem, as one mostly
wants the equivalent of tokens that say "here we had a line break that
increased the indentation" and "here we had one that decreased it."
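
A small sketch of what those tokens might look like, in Python (the token
names and details are invented, though Python's own tokenizer works much the
same way): a stack of indentation levels turns changes in leading whitespace
into explicit INDENT and DEDENT tokens, which an ordinary context-free
grammar can then treat like open and close braces.

    # Turn changes in leading whitespace into explicit INDENT/DEDENT tokens.
    def indent_tokens(lines):
        stack = [0]                                   # indentation levels currently open
        for line in lines:
            if not line.strip():
                continue                              # blank lines don't affect nesting
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                yield ("INDENT", width)
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", width)
            yield ("LINE", line.strip())
        while len(stack) > 1:                         # close blocks still open at EOF
            stack.pop()
            yield ("DEDENT", 0)

    src = ["if x:", "    y = 1", "    if z:", "        y = 2", "done = True"]
    for tok in indent_tokens(src):
        print(tok)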


All in all, we are close to being able to formally parse all the
language ideas circa 1990 or so. In 1990, we had good formalisms to
roughly handle all the 1980 languages. So, we've lost a few years'
ground, but it's easier to invent problems than to formalize solutions.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark                  Internet : compres@world.std.com
Compiler Resources, Inc.     Web Site : http://world.std.com/~compres
23 Bailey Rd                 voice    : (508) 435-5016
Berlin, MA 01503 USA         fax      : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------


