Re: ANTLR Grammar Question

Chris F Clark <cfc@shell01.TheWorld.com>
Wed, 16 May 2007 10:45:48 -0400

From comp.compilers

Related articles
ANTLR Grammar Question akhailtash@gmail.com (Amal) (2007-05-14)
*Re: ANTLR Grammar Question cfc@shell01.TheWorld.com (Chris F Clark)* (2007-05-16)**
Re: ANTLR Grammar Question idbaxter@semdesigns.com (Ira Baxter) (2007-05-18)

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	Wed, 16 May 2007 10:45:48 -0400
Organization:	The World Public Access UNIX, Brookline, MA
References:	07-05-052
Keywords:	PCCTS, lex, comment
Posted-Date:	17 May 2007 02:07:59 EDT

Amal <akhailtash@gmail.com> writes:

> TK_TEXT
> :
> ( {input.LA(2)!='e' && input.LA(3)!='n' && input.LA(4)!='d'}?=>
> '$'

I think your problem is here you want || not &&. If this is the case,
more below.

> | ' '
> | '\t'
> | '\r'
> | '\n'
> | ~('$' | ' ' | '\t' | '\r' | '\n')
> )+
> ;

This is a very typical thorny problem with regular expressions. It
comes up all the time when you want to swallow all the text (sigma*) upto
some multiple character end-marker. Another example of this is C
comments which can contain any character sequence except "*/" (a two
character end-marker). And the problem gets a little worse when you
want to do things with some character sequences (e.g. count newlines
or translate multi-character sequences like the backslash convention
the four characters \ x f c become one character \xfc) within the
stream.

In theory, one should just write:
      TK_TEXT: .* - "$end" ; // . is the typical notation for sigma

That works (see below) in theory because regular expressions are
closed under complement and interestion, thus you can take the set
difference. Unfortunately, very few (perhaps no) lexer/parser
generators implement a difference operator (directly).

Predicates, as you have used, are the often supported substitute,
since they support intersection. However, you want a "negative"
predicate (i.e. do this only if not that) and ANTLR may not support
those yet. Thus, you would like to write:

      TK_TEXT: ("$end" << .)* ; // << is the Yacc++ operator for
                                                          // a negative predicate

So, you need a positive predicate more like:

      TK_TEXT: (~"$end" >> .)* ; // >> is the Yacc++ operator for
                                                            // a positive predicate

Unforutnately, in most tools ~(complement character) does not
distribute over strings. Thus, ~"$end" is either not accepted by ones
tool or means ~$ "end" (this is it only complements the first
character.

Thus, you tried to calculate this by hand, and you made the same
mistake we all make (at least from time to time) and forgot (or didn't
know) to apply DeMorgan's law: ~(a and b) becomes (~a || ~b) and wrote:

      TK_TEXT: (~$ ~e ~n ~d >> .)* ;

When what you needed was:

      TK_TEXT: ((~$ | "$" ~e | "$e" ~n | "$en" ~d) >> .)* ;

BTW, when you have this, you don't acutally need a predicate and can
write:

      TK_TEXT: (~$ | "$" ~e | "$e" ~n | "$en" ~d)* ;

Now, this is a very common idiom and we actually have special notation
in Yacc++ just for this case (and had it before we supported
predicates or not rangesw), and encouraged people to write:

      TK_TEXT: ((((@* | "$")* | "e")* | "n")* | "d");
                        // @ is a special kind of sigma that doesn't swallow characters
                        // that are processed on the "other" branches.

BTW, one of the workarounds for lack of predicates are start states
(or lexer classes). And, you will find examples of how to handle C
comments in LEX using start states.

below: sigma* - "$end" doesn't really work, because it doesn't mean
the quite right thing. It means any string but $end and not any
string but. Closer to the right meaning is:

      TK_TEXT: .* - (.* "$end") ;

This would work if $end was the very last characters of the input.
Even more correct, would be:

      TK_TEXT: .* - (.* "$end" .*) ;

This means any string that doesn't contain "$end" (and if you combine
that with the longest match rule that is often used for regular
expressions), it probably specifies the right thing. However, it is
probably clearer to use a predicate formulation.

However, what should work for tools that have "lookahead" operators
ala LEX, is:

      TK_TEXT: .* / "$end" ; // / is the lookahead operator in LEX

This should mean parse sigma*, but stop when you get a lookahead of
"$end". Note, to make this completely correct, you probably need the
non-greedy *? operator of perl, so that you stop at the first $end. In
retorspective, this would probably be the clearest expression of all,
saying exactly what you meant (if you could only say it that way in
ANTLR).

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

[The underappreciated re2c lexer generator has a difference operator. -John]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: ANTLR Grammar Question

Chris F Clark <cfc@shell01.TheWorld.com>Wed, 16 May 2007 10:45:48 -0400

Chris F Clark <cfc@shell01.TheWorld.com>
Wed, 16 May 2007 10:45:48 -0400