Re: regular expression operators in CF grammars

"Sönke Kannapinn" <>
10 Aug 2002 02:13:52 -0400

          From comp.compilers

Related articles
[13 earlier articles]
Re: regular expression operators in CF grammars (Joachim Durchholz) (2002-07-02)
Re: regular expression operators in CF grammars (=?Windows-1252?Q?S=F6nke_Kannapinn?=) (2002-07-04)
Re: regular expression operators in CF grammars (SLK Parsers) (2002-07-04)
Re: regular expression operators in CF grammars (Ralph Boland) (2002-07-04)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-07-15)
Re: regular expression operators in CF grammars (2002-07-21)
Re: regular expression operators in CF grammars (Sönke Kannapinn) (2002-08-10)
Re: regular expression operators in CF grammars (Sönke Kannapinn) (2002-08-10)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-08-23)
Re: regular expression operators in CF grammars (tj bandrowsky) (2002-09-03)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-09-12)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-09-14)
RE: regular expression operators in CF grammars (Quinn Tyler Jackson) (2002-09-14)
| List of all articles for this month |

From: "Sönke Kannapinn" <>
Newsgroups: comp.compilers
Date: 10 Aug 2002 02:13:52 -0400
Organization: [Posted via] Germany GmbH
References: 02-05-142 02-05-151 02-06-024 02-06-056 02-06-090 02-07-010 02-07-025 02-07-043
Keywords: parse
Posted-Date: 10 Aug 2002 02:13:51 EDT

Chris F Clark writes:
> > ... "Regular Expressions into Finite Automata" by Anne
> > Brueggemann-Klein (TCS 120, 1993).
> ... can I have a little better bibliographic data ...?

Sorry for replying with great delay. (School summer holidays in
Berlin.) And sorry for my having been too brief. I meant

@article{ Brueggemann-Klein:93
, author = {Br\"ugge\-mann{-}Klein, Anne}
, title = {Regular Expressions into Finite Automata}
, journal = tcs
, year = 1993
, volume = 120
, pages = {197--213}

where "tcs" abbreviates "Theoretical Computer Science".

In general, I think it could be helpful if I try to clarify a few
points on ELR parsing in this posting.
Let me come back to a statement of your's from your second-latest
posting where you talked about ambiguous rhs regular expressions.

> There are several possibilities for dealing with that problem.
> One of the simplest, is not attempting to distinguish where the
> constituents come from, after all the constituents do *not*
> represent non-terminals, but merely fragments of the rhs
> definition of a non-terminal. As I said, it is possible to
> consider the erasure of such boundaries a fundamental part of
> regular expressions.
> More importantly, the erasure of the boundaries is important for
> regular expressions to have the ability to decrease look-ahead
> requirements.

I would like to say two things here:

1. I do not really accept that in compiler construction, rhs regular
expression constituents do not represent non-terminals but merely
fragments of the rhs definition. From an abstract point of view,
compilers realize functions that assign each sentence of some source
language a (semantically equivalent) sentence of some target language
(be it machine language or some intermediate language). Now, source
languages contain an infinite number of sentences, and we want to
model these functions with finite means. So there seems to be no
other useful way to do this than to use structural induction. An
unambiguous finite context-free grammar is a useful skeleton.

Looking at EBNF grammars, we also have a finite number of grammar
rules. But each rhs regular expression itself describes a potentially
infinite language (of rhs expansions). So, again, when we want to
assign semantics with finite means we need a useful finite and
unambiguous structural skeleton. And the rhs regular expressions'
natural inductive structure is perfect for that - why should we
deliberately forget about it?

So in my eyes, it is more or less natural to ask for rhs regular
expressions to be (strongly) unambiguous.

2. Regular expressions do *not* establish additional boundaries with
respect to look-ahead requirements if you choose the ecfg-to-cfg
transformation scheme in a way that is adequate for the LR parsing
technique: The decision whether to reduce according to some EBNF rhs
regex must be postponed until it is clear that the *whole* regex
matches the next input. This is achieved by transforming, e.g., an
ecfg rule containing a regular *-operator

A -> <alpha> <beta>* <gamma>

into cfg rules

A -> <alpha> A'
A' -> <beta> A'
A' -> <gamma>

The cfg obtained through transformation from the ecfg "applies" the
*-operator (by reducing A' -> <beta> A') only after seeing <gamma>
in the input.

You may want to look up the work of Madsen and Kristensen (1976) for a
complete LR-adequate ecfg-to-cfg transformation scheme. Other
transformation schemes are also conceivable as long as they
essentially yield unambiguous right-linear sub-grammars - cf.
Heilbrunner (1979).

A "right-biased" transformation strategy such as the one sketched
above does *not* introduce new look-ahead boundaries "in the middle"
of rhs regex's. All the (sub-)reductions that "apply" regex operators
happen "at the end" of the regex using the look-ahead requirements
there. With respect to the ecfg, the reduction of a (complex) handle
to an ecfg rule's lhs will be performed by a sequence of related
(sub-)reductions of the according cfg; these sub-reductions
effectively constitute a sort of backward walk through the rhs regular
expression from which it can be derived *how* the handle is matched by
the regex's inductive operator structure. Therefore, semantic actions
can be associated with the regex's operators. Reconstruction of any
handle's inductive structure is always uniquely possible as long as
all rhs regex's are (strongly) unambiguous.

Now let me comment on the example grammar parts that you presented in
your latest posting.

> > You only gain in power if you allow *ambiguous* rhs regular
> > expressions (non-stronly-unambiguous, to be precise), or ambiguous
> > rhs NFA's. Are you able to write down a compiler-related example
> > where you really make wise use of this freedom?
> [...] I can describe the rough outline of what
> the problem was and why having an ambiguous regular expression was a
> key to solving it.
> The original grammar had productions something like the following:
> obj_id: id | obj_id "." id; // a recursive definition of id ("." id)*
> expr: obj_id
> | simple_func
> | obj_func
> | expr "+" expr
> ...
> ;
> simple_func: id "(" asgn_list ")"; // simple funcs have complex params
> asgn_list: (obj_id "=" expr ";")+;
> obj_func: obj_id "(" expr_list? ")"; // obj funcs have simple params
> expr_list: expr ("," expr)*;
> The problem is this grammar isn't LR(k) for any k. [...]

So far, so good.

> The solution that seemed
> best was to use regular expressions (relying on their ambiguity) and
> replace the following two rules:
> obj_id: id ("." id)*;
> obj_func: id ("." id)* "(" expr_list? ")";
> Doing so, we pushed the lookahead requirements to the point where the
> parser generator could resolve them. More importantly, we did so with
> minimal changes to the structure of the grammar. There may be an
> LR(1) grammar for the language in question. However, I doubt that it
> reflects the author's intended structure as well as these minor
> modifications did. Moreover, we could recover the structure by using
> a grammar predicate, getting the best of both worlds.
> obj_func: id ("." id)* "(" expr_list? ")" >>
> obj_id "(" expr_list? ")";
> Since, I don't presume you are familiar with the notation I use for
> predicates. The predicate says, if you can match with the regular
> expression before the >> operator, apply the regular expression after
> the >> operator. Thus, we pattern match with the ambiguous rule and
> then apply the non-ambiguous rule to have a nice representation.

I think I understand what you did here and why you did so. I will
not comment on your concept of "grammar predicates" here. But as far
as I understand "regular expression ambiguity", your grammar doesn't
contain any such ambiguity. To me, from what you have said, it seems
you believe that

obj_func: id ("." id)* "(" expr_list? ")"

has an ambiguous rhs regular expression. With my understanding of
"regular expression ambiguity", this is not the case. Any regex
"a (b c)* d e? f" is clearly strongly unambiguous because, for
each of the regex's extensions, there is always only one unique way
how to apply the regex's operators in order to produce an extension.

(In contrast, "a**" is weakly unambiguous, but not strongly; and
"a* a*" is neither weakly nor strongly unambiguous.)

> However, I can still see the advantage to treating regular
> expression operators as ambiguous in some cases.

I also do not really understand what you mean by "treating regular
expression operators as ambiguous".

Anyway, I hope it has become a little clearer what *I* mean by
"unambiguous regular expression" and what my idea of an "ELR(k)
grammar" is.

BTW: The (corrected) grammar that you presented is in fact an
ELR(1) grammar as far as I (and Madsen and Kristensen 1976)
understand that term. So an ELR(1) parser is able to unveil the
inductive structure of all handles during reductions and make it
accessible to the user's semantic action code.

> Thanks for the stimulating exchange,
> -Chris

Thank you, too.

Note: The short references to literature that I used in this
posting refer to the full references provided at the end of my
earlier posting.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.