Re: A Grammar Writing Question

Chris F Clark <cfc@shell01.TheWorld.com>
Thu, 26 Jul 2007 18:21:38 -0400


From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Thu, 26 Jul 2007 18:21:38 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 07-06-053 07-07-081
Keywords: parse, theory
Posted-Date: 26 Jul 2007 23:22:53 EDT

Thank you, Lowell, for asking the following provocative questions,
which I've reordered a bit so that I can address them in the order I
want to reply:


> I've often suspected that the reason Ford couldn't prove that there
> are context-free languages that cannot be described with a PEG is
> that none exist.


That is certainly possible, but I don't think that's the issue.


> So the question is, can anyone point out a practical situation where
> the language truly requires an exhaustive parser (one that pursues
> all possible ambiguities)? Also, can anyone provide a practical
> example of an exponential grammar (O(n**x), n = string length, x>1)?


I believe the C++ rule of "if it can be a declaration it is, but
otherwise it is an expression" (do I have that right or reversed?)
requires solving arbitrary ambiguities. (The classic small instance
is a statement like "T(a);", which is either a declaration of "a"
with redundant parentheses or a call of "T".) In fact, I think Jim
Roskind effectively proved that in his C++ grammar exploration work.
He essentially showed that for any sequence of tokens there was
another token that could be added that would invert the
expression/declaration nature of the sequence considered. Worse, I
think that some of the devious examples he came up with required not
only semantic input (i.e. which identifiers were declared as what
coming into the sequence), but could change which identifiers were
being declared (as well as what they were being declared as).


Now, perhaps in hindsight, you would know not to follow C++'s example
in this particular case. However, I think it proves that perfectly
competent language designers can design unreasonably problematic
features into their languages without knowing that they are doing so.
The resulting grammars can have ambiguities of the very worst kinds.
If you face such things, you may need the most powerful tools and
methodologies you can get. However, you might want to avoid designing
such things in the first place.


> In my limited experience with grammar (re-)writing I have never come
> across a language situation that I couldn't resolve with an
> ordered-choice parser (OCP, one that tries alternates from an
> ordered list and accepts the first match, ignoring any other
> possible matches in the list.)


I think for sane languages this is not a problem. However, the key
issue here (for me) is preventing insanity. An OCP (PEG) does not
protect you from writing something which makes some fragment look
legal to the casual eye but which isn't actually legal, which is the
question you mention here:


> I've often seen criticisms of OCP (but don't have any references
> handy) on the grounds that they miss some of the valid language
> strings that the grammar allows. This is, of course, true if your
> objective is to start with a grammar and then ask for the complete
> language that it describes. But it seems to me that most practical
> situations have it the other way around. Given a specific problem,
> you first conceptualize the kinds of strings the language must
> contain and then try to write a grammar that generates them.


Ok, so you conceptualize some strings and write a grammar for some of
them. Now, you conceptualize some more strings and you start
adjusting your grammar. This is where the question of the entire
list of valid strings becomes important.


As you add new rules to your grammar, how are you certain that there
are no rules in your grammar that have become inaccessible for
interesting cases? (And remember that we are in general talking about a
recursive grammar where there may be several different ways of
reaching the nested rules.)


Now, you are probably asking: how is it possible for some rule to
become inaccessible? It just needs a prefix that is covered by some
higher-priority (earlier) ordered choice. Here is a simple contrived
case that illustrates the point.


grammar: part1 part2;


part1: "a" b" // I will use the / to indicate the OCP "or" operator
          / "a"
          ;


part2: "b" "c"
          / "d"
          ;


This grammar does not allow "a" "b" "c" as an input. What tells you
that? In contrast, if you pass this grammar to an LL tool (like
ANTLR), it should tell you that there is a "non-determinism" on "a"
in the part1 rule, or that you need LL(3) to compute it. Similarly,
with an LR tool, you will get a shift-reduce warning, or be told that
you need LR(2) to compute it.
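
To make that failure concrete, here is a quick hand-rolled sketch of
the committed-choice semantics in Python (my own illustration, not
the output of any particular PEG tool; all the function names are
mine):

    def match(s, pos, lit):
        # Return the position after lit if it matches at pos, else None.
        return pos + len(lit) if s.startswith(lit, pos) else None

    def part1(s, pos):
        # Ordered choice: try "a" "b" first and commit if it succeeds.
        p = match(s, pos, "a")
        if p is None:
            return None
        q = match(s, p, "b")
        return q if q is not None else p   # otherwise fall back to "a"

    def part2(s, pos):
        p = match(s, pos, "b")
        if p is not None:
            q = match(s, p, "c")
            if q is not None:
                return q                   # first alternative: "b" "c"
        return match(s, pos, "d")          # second alternative: "d"

    def grammar(s):
        pos = part1(s, 0)
        if pos is not None:
            pos = part2(s, pos)
        return pos is not None and pos == len(s)

    print(grammar("abbc"))  # True:  part1 takes "ab", part2 takes "bc"
    print(grammar("ad"))    # True:  part1 falls back to "a", part2 takes "d"
    print(grammar("abc"))   # False: part1 commits to "ab", stranding "c"

Once part1 commits to "a" "b", nothing ever revisits that choice, so
the "c" is stranded, which is exactly the silent hole described above.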


However, with a PEG, you need something that looks at the grammar and
tells you that you need something like:


part1: "a" b" (! "c" )
          / "a"
          ;


And, I haven't seen any PEG tools that give you that warning. Perhaps
they do and I just don't know enough about them. However, I suspect
that giving such a warning is impossible in general, because it
requires determining the intersection of two PEG grammar fragments
(i.e. part1 -> "a" "b" and what follows vs. part1 -> "a" and what
follows).
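
Continuing the Python sketch from above, the (! "c") predicate is one
extra negative-lookahead test, and the point is that you have to know
to add it by hand:

    def part1_fixed(s, pos):
        # As before, but the first alternative is "a" "b" (! "c").
        p = match(s, pos, "a")
        if p is None:
            return None
        q = match(s, p, "b")
        if q is not None and match(s, q, "c") is None:  # the (! "c") test
            return q
        return p                                # fall back to plain "a"

    def grammar_fixed(s):
        pos = part1_fixed(s, 0)
        if pos is not None:
            pos = part2(s, pos)
        return pos is not None and pos == len(s)

    print(grammar_fixed("abc"))  # True: part1 yields "a", part2 takes "bc"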


> P.S. My limited experience suggests another possible benefit to OCP
> grammar writing. The ambiguities that cause trouble for an OCP
> usually look to me like they would create a lot of thrashing, dead
> ends and backtracking for an exhaustive parser. Working out those
> kinks in the grammar not only makes it work with an OCP, but might
> also greatly improve it for use by any other parser as well. Any
> comments on this?


I think working out those kinks is good. I'm just not sure how one
reliably finds those kinks without tool support. (Perhaps someone has
a reliable method; I'm willing to discuss it if they think they do.)


And before one says this looks like a strawman, I want to relate a
real-life case that happened to me recently. I have my own
hand-written copy of unix2dos that I've lost the source code to, but
which has some features I can't get elsewhere (some variants on
Multics =-name matching that allow me to convert C++ filenames easily
between different platforms). As a result, I sat down to rewrite one,
using a grammar as the basis for the internal text within lines.


For this I wrote a grammar that tried to handle whitespace at the end
of the line specially, such that while doing the \r \n conversions, I
could also trim trailing whitespace on a line. Therein lies the rub:
the "trivial" grammar I wrote to express my intent was not conflict
free. (I will post this later if people are interested, along with
more explanation.) Moreover, naively adding the stock precedence
declarations did resolve the conflicts, but produced a grammar that
didn't parse the text that I wanted parsed.
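
The behavior itself is trivial to state operationally; it is only the
grammar formulation that got me into trouble. Roughly, in Python
again, and only as a sketch of the intent, not my actual code:

    def convert_line(line, to_dos=True):
        # Trim trailing whitespace, then emit the desired line ending;
        # whitespace embedded in the body of the line is untouched.
        body = line.rstrip(" \t\r\n")
        return body + ("\r\n" if to_dos else "\n")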


This is another example of what I was trying to express above. The
rules of the grammar describe what I wanted. However, the grammar
wasn't LR. Forcing it to be treated as such by precedence rules,
which are another way of forcing choices, simply generated an
incorrect parser. I should have listened to the warnings and
rewritten the grammar, which of course is what I did after a quick
investigation showed me that I was being too naive and cavalier in my
grammar writing.


The question which I ask is: without those warnings, would I have
found my grammar errors, and would I have known which rules were the
suspect ones? My feeling is that I would have eventually done so, but
not in the 15 minutes it took me.


Particularly worth noting to me is that I did this exactly the way
Lowell suggested one should write grammars (which I generally think is
the right way). I had some concepts of the kinds of strings I wanted
to parse: line bodies with embedded whitespace and line ends with
preceding whitespace. And I could easily come up with rules
describing each. However, the composition of those two rules yielded
an ill-formed grammar, at least for deterministic linear time
parsing--the grammar was GLR, but then I wouldn't have known whether
it was unambiguous (well, actually I know it wasn't, but I wouldn't
have known so if it had been more complex). I'm much happier having
rewritten the grammar so that it is comfortably LR with no precedence
directives. I would have been happier still if I could have made it LL.
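
To see the ambiguity concretely: a run of k whitespace characters
before the line end can be split k+1 ways between the line body and
the trailing-whitespace rule. A quick Python illustration of my
reconstruction of the problem (not the original grammar):

    def splits(text):
        # Enumerate the ways to divide text into body + trailing-ws
        # when the body may itself end in whitespace: every split
        # point inside the final whitespace run is a distinct parse.
        n = len(text)
        k = 0
        while k < n and text[n - 1 - k] in " \t":
            k += 1
        return [(text[:n - i], text[n - i:]) for i in range(k + 1)]

    for body, trail in splits("ab  \t"):
        print(repr(body), "+", repr(trail))
    # four distinct parses of one five-character line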


That is the point: one wants to write grammars fairly naively. But
then one wants a formal tool to check and prove one hasn't composed
nonsense. We don't have good tools for that beyond LL and LR. As far
as I know, we specifically don't have good tools for that when writing
OCP/PEG/predicated grammars, and we don't have good tools for GLR
grammars either. Therefore, whenever possible, try to write an LL or
LR grammar first, and only when you can't figure out how to write
such a grammar, escape to the less secure methods.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark                  Internet : compres@world.std.com
Compiler Resources, Inc.     Web Site : http://world.std.com/~compres
23 Bailey Rd                 voice    : (508) 435-5016
Berlin, MA 01503 USA         fax      : (978) 838-0263 (24 hours)

