Re: Parsing Expression Grammar

Chris F Clark <cfc@TheWorld.com>
27 Sep 2005 09:44:34 -0400

          From comp.compilers

Related articles
[24 earlier articles]
Re: Parsing Expression Grammar cfc@shell01.TheWorld.com (Chris F Clark) (2005-09-22)
Re: Parsing Expression Grammar schmitz@i3s.unice.fr (Sylvain Schmitz) (2005-09-22)
Re: Parsing Expression Grammar paul@parsetec.com (Paul Mann) (2005-09-22)
Re: Parsing Expression Grammar cfc@shell01.TheWorld.com (Chris F Clark) (2005-09-23)
Re: Parsing Expression Grammar gneuner2@comcast.net (George Neuner) (2005-09-25)
Re: Parsing Expression Grammar cfc@world.std.com (Chris F Clark) (2005-09-25)
Re: Parsing Expression Grammar cfc@TheWorld.com (Chris F Clark) (2005-09-27)
Re: Parsing Expression Grammar cfc@shell01.TheWorld.com (Chris F Clark) (2005-09-27)
Re: Parsing Expression Grammar gneuner2@comcast.net (George Neuner) (2005-09-30)
Re: Parsing Expression Grammar cfc@shell01.TheWorld.com (Chris F Clark) (2005-10-02)
| List of all articles for this month |

From: Chris F Clark <cfc@TheWorld.com>
Newsgroups: comp.compilers
Date: 27 Sep 2005 09:44:34 -0400
Organization: Compilers Central
References: 05-09-023 05-09-045 05-09-053 05-09-107 05-09-121 <sddek7cplld.fsf@world.std.com> <6.2.1.2.0.20050926062516.01e00bc0@mail.comcast.net>
Keywords: parse
Posted-Date: 27 Sep 2005 09:44:34 EDT

George wrote this very nice example which illustrates the points about
lookahead and context being discussed well.


> My point was that the addition of the action changes the semantics
> so the grammar written as
>
> rule1 : token1 token2 {action1} token3
>
> will actually be parsed as if it were written:
>
> rule1 : rule2 {action1} token3
> rule2 : token1 token2


Or something equivalent, as in the more likely implementation of:


                  rule1 : rule2 token3
                  rule2 : token1 token2 {action1}


> The question then becomes whether the two grammars are equivalent ...
> which is only true if the action is either stateless or reversible.


Or the parsing context is unique at that point. If there is no
conflict at that point, because the parse only has one choice at that
point, then the action can be performed irreversibly.


The context *will be* unique at that point if the grammar is LL(1).
This was my earlier point.


If one has an LL(1) grammar, one can perform the transformation of
splitting the rule to place the action code (any place after the first
item) and still have an LL(1) grammar. Moreover, if the unsplit LL(1)
grammar was SLR (LALR or LR), the split grammar will be too. The
LL(1) condition on first sets guarantees this, because the first item
of any LL(1) rule must be unambiguous. Thus, if you are in the middle
of an LL(1) rule, then you know that there is a unique rule that
applies at that point. You can, of course, make the same argument for
the k-th token of an LL(k) grammar.


We will see the importance of this case, when we deal with George's
next grammar.


> Having it anywhere can introduce conflicts. Consider
>
> rule1 : token1 {action1} token2
> rule2 : token1 token3
>
> The rules are not ambiguous, but because the action is not subject
> to look ahead, the parser can't tell which way to go after recognizing
> the common prefix. Logically action1 has to be executed before
> token2 can be recognized ... not parsed, recognized. That means
> the action must be taken *before* look ahead.


Actually the rules are ambiguous, because you cannot tell whether to
execute the action until you have seen the token after the actions.
In particular, this grammar is not LL(1). The first token of the rule
does not disambiguate whether to apply rule1 or rule2.


Unfortunately, many parser generators (including I must ruefully admit
Yacc++) do not flag this ambiguity, they simply read one token ahead
and implement the grammar wrong. Of course, there is no correct
implementation of this grammar, as it specifies something which cannot
be done. One cannot know whether to execute the action until the
lookahead has been read to see if the token read is a token2.
However, if the action somehow affects whether a token2 or token3 is
returned from the lexer, one has a circularity.


Effectively, by using the lookahead one is moving the action to a
later point in the grammar, changing the rules to:


                    rule1 : token1 token2 {action1}
                    rule2 : token1 token3


Note, because of this issue, I have considered introducing a new
syntax into Yacc++ to indicate actions that should not be "moved"
(i.e. have the tokens after them used as lookahead for
disambiguation). I just haven't figured out what that syntax should
be. That way if the rule was important for getting the token lexed
right and it appear in a context where the lookahead was important for
knowing whether to apply the action, one would get an error message
indicating the circularity.


> This is where PDA and recursive descent part company. Neither
> can use look ahead alone to decide what to do, but the RD parser
> in most cases can handle the problem more efficiently.
>
> The PDA answer has been to fork the machine, [logically] try all
> branches simultaneously, and prune the branches that fail. However,
> this is generally more costly than a recursive decent approach using
> backtracking. Forking in an RD parser is only necessary if there are
> conflicting actions, as in
>
> rule1 : token1 {action1} token2 token3
> rule2 : token1 {action2} token4 token5
>
> and the scope of the fork in RD is limited to recognition of the first
> token following the conflicting actions.


An LR (inclding SLR or LALR) parser generator does not need to fork
the PDA to solve this problem. If the conflict can be solved with a
fixed amount of lookahead, the parser simply looks that many tokens
ahead. This is the same as an LL(k) parser does. (I want to emphsize
this point. For fixed lookahead an LR parser does not need to fork
copies of the PDA. The parallelism can be embedded in the PDA, the
same way that DFA's encode NFA's.)


Effectively, by using the lookahead one is moving the action to a
later point in the grammar, changing the rules to:


                    rule1 : token1 token2 {action1} token3
                    rule2 : token1 token4 {action2} token5


As noted above, I consider that solution "wrong" for this case, as it
changes when the action occurs. However, it is in some sense
convenient. Especially if the rules are combined in different
contexts and the action is unambiguous in some and ambiguous in
others, but you want the effect of the action to be the same.


Now if fixed lookahead cannot solve the problem, then one needs
something more powerful that LL or LR parsing. Here is an example of
that case:


                    rule1 : token1 {action1} rule3 token3
                    rule2 : token1 {action2} rule3 token5
rule3 : token2 | token4 rule3


In this case, one needs either backtracking or GLR parsing or the
ability to move the action across non-terminals, as in:


                    rule1 : token1 rule3 token3 {action1}
                    rule2 : token1 rule3 token5 {action2}
rule3 : token2 | token4 rule3


Now, the above is the actual "correct" grammar for the case, with no
actions that require lookahead. However, I know of no parser
generator that will implicitly make "that" correction. Since the
correction radically changes the point at which the action is
executed, it is perhaps good that no parser generator does that.


The backtracking solution works like one has a predicated grammar. In
fact, I think of predicates as indicating what parts of the grammar
one wishes to backtrack over. Here is a predicated version of the
grammar:


                    rule1 : token1 (rule3 token3) >> {action1} rule3 token3
                    rule2 : token1 (rule3 token5) >> {action2} rule3 token5
rule3 : token2 | token4 rule3


The notation used above for predicates, says the text in parenthesis
preceding the >> operator is a predicate. That is the parser parses
the text in the parenthesis as "lookahead" before continuing the
parse. If the text is matched, the parse continues with the predicate
text rewound (placed back in the input, backtracked over) before the
rest of the rule is matched. If the text isn't matched, the parse
does not apply the rule and rewinds the text before attempting some
other rule. Note, in Bryan Ford's notation this is an & predicate.
There is also an inverse predicate << that says continue parsing only
if the predicate does not match. I don't recall Bryan Ford's notation
for that.


> Cloning the PDA also creates bookkeeping problems - like what to do
> with side effects like AST building - which can usually be avoided using
> the RD approach because the side effects can be deferred until the
> correct rule context is known.


As George correctly notes, backtracking is not the only solution. GLR
parsers don't require predicates and simply fork copies of the parser
PDA at these conflict points. His issues with the forking that GLR
does are generally well founded, but are probably overly cautious for
the typical case.


There is yet another option which hasn't been addressed. The GLR
problem with forking is that one ends up with potentially many forks,
since each input symbol can potentially introduce forks for each
different alternative. However, the way that LR parsers handle the
parallel branches of the PDA shows that there can be a low cost
solution. Instead of backtracking or forking parallel parsers, one
can simply compute the lookahead PDA using a variant of the LR
algorithm. If the grammar is unambiguous, this PDA exists as the
limit of a closure. This technique I call LR(infinite) or LR(closed)
parsing. (The only difficulty is that if the grammar is not
unambiguous, the construction method can loop and one cannot tell if
the loop will ever converge. This is a side-effect of the halting
problem.)


To close, it is worth mentioning that for predicated grammars, we
implemented them using backtracking in Yacc++. We have done
experiments with LR(closed) parsing using Yacc++, but the results have
not led us to think that it is a panacea that is ready for general use.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.