A big easy subset of the CFGs

"Matt Timmermans" <mtimmerm@microstara.com>
18 Jun 1998 14:28:58 -0400

From comp.compilers

Related articles
*A big easy subset of the CFGs mtimmerm@microstara.com (Matt Timmermans)* (1998-06-18)**
Re: A big easy subset of the CFGs mickunas@cs.uiuc.edu (1998-06-19)
Re: A big easy subset of the CFGs cfc@world.std.com (Chris F Clark) (1998-06-19)
Re: A big easy subset of the CFGs mtimmerm@microstar.no-spam.com (Matt Timmermans) (1998-06-24)
Re: A big easy subset of the CFGs scavadini@hotmail.com (Salvador Valerio Cavadini) (1998-06-24)
Re: A big easy subset of the CFGs clark@quarry.zk3.dec.com (Chris Clark USG) (1998-07-01)

| List of all articles for this month |

From:	"Matt Timmermans" <mtimmerm@microstara.com>
Newsgroups:	comp.compilers
Date:	18 Jun 1998 14:28:58 -0400
Organization:	IGS - Information Gateway Services
Keywords:	parse, question

Hi All,

I need a parser generator, and I'll probably have to write it myself,
because I need it to handle a subset of CFGs that I've never heard of
before, and may have to (more daunting) design myself.

If anyone knows of some appropriate class of languages that has been
studied in the literature and matches my requirements, thus freeing me
from the task of designing it myself, I would be most grateful.

What I need is this:

A subset of CFGs bigger than LR(1). In particular, it should include most
of what you can do with LR(1)+lex, using characters as terminals.

Ambiguity of a grammar has to be decidable in randomized polynomial
time, i.e., including regular expression intersections is OK, even
though it goes exponential in the worst case, because it almost never
does.

It should have non-bizarre ambiguity rules so that I can make
comprehensible error messages. For instance, LR(1) has non-bizarre
ambiguity rules, but LALR(1) does not.

For LR(1), I can say "It must be possible recognize a production
without looking at more than 1 token following the production". Then
I can give error messags for shift-reduce conflicts like "If the
<non-terminal> in rule <rule> begins with <example> followed by a
<terminal>, then it is not possible to decide whether
<tail-of-the-example> is a <non-terminal> in one of <rules>, or if
<example-tail+terminal> is part of one of <rules>".

Or something like that.

The point is that I can't report LALR(1) conflicts that are not also
LR(1) conflicts in the same way, because LALR(1)-only conflicts only
make sense if you know how LALR(1) parser generators "merge" LR(1)
states. That makes LALR(1) ambiguity rules "bizarre". I need to be
able to make nice error messages without reference to objects (like
states) that don't exist in the grammar the user wrote.

Is there anything out there that meets these three requirements?

I have something in mind that meets my requirements and that I can
probably manage with much work. References previous studies of this
idea, or even a common name for it, would help immensely.

The idea is like LR(1), but rather than looking ahead a single token,
it looks ahead a single non-terminal or terminal, i.e., "It must be
possible to recognize a production without looking past the following
term (in the rules in which it appears)". This meets the criteria,
but produces an unwelcome effect in naive implementations -- the
parser might not be able to recognize anything until it gets to the
end of the input, at which point the last/deepest non-terminal gets
reduced and causes a huge reduction cascade back to the beginning.
I'm willing to live with that, because small reduction cascades are
necessary to do LR(1)+lex, and I think I can get rid almost all large
ones with a sufficiently clever implementation.

This LR(term) idea is too obvious to be new. Does anyone know what
it's really called?

--

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

A big easy subset of the CFGs

"Matt Timmermans" <mtimmerm@microstara.com>18 Jun 1998 14:28:58 -0400

"Matt Timmermans" <mtimmerm@microstara.com>
18 Jun 1998 14:28:58 -0400