Re: Precedence Rules for '$' and '^'

jamin.hanson@googlemail.com
Mon, 17 Sep 2007 00:46:54 -0700

          From comp.compilers

Related articles
Precedence Rules for '$' and '^' jamin.hanson@googlemail.com (2007-09-12)
Re: Precedence Rules for '$' and '^' jo@durchholz.org (Joachim Durchholz) (2007-09-13)
Re: Precedence Rules for '$' and '^' jo@durchholz.org (Joachim Durchholz) (2007-09-13)
Re: Precedence Rules for '$' and '^' jamin.hanson@googlemail.com (2007-09-14)
Re: Precedence Rules for '$' and '^' rsc@swtch.com (Russ Cox) (2007-09-14)
Re: Precedence Rules for '$' and '^' jo@durchholz.org (Joachim Durchholz) (2007-09-15)
Re: Precedence Rules for '$' and '^' cfc@shell01.TheWorld.com (Chris F Clark) (2007-09-17)
Re: Precedence Rules for '$' and '^' jamin.hanson@googlemail.com (2007-09-17)
| List of all articles for this month |

From: jamin.hanson@googlemail.com
Newsgroups: comp.compilers
Date: Mon, 17 Sep 2007 00:46:54 -0700
Organization: Compilers Central
References: 07-09-03507-09-048 07-09-052 07-09-061
Keywords: lex
Posted-Date: 18 Sep 2007 08:10:16 EDT

On 15 Sep, 21:41, Joachim Durchholz <j...@durchholz.org> wrote:
> jamin.han...@googlemail.com schrieb:
>
> > However, the real question is that if you allow '^' and '$' to occur
> > anywhere in a regex (boost::regex works this way),
>
> I may be missing something, but it seems to me that such a rule
> wouldn't match anything if it has a nonempty pattern before the ^ or
> after the $.
>
> I.e. asd^jkl, while a perfectly valid regex, will never match, or will it?


No that regex won't ever match.


> > how you handle '^'
>
> > and '$' clashes, because you may have declared a '$' rule before a '^'
> > rule, yet my code always checks '^' before '$' regardless.
>
> Some example would help.


I will have to hunt around for some good examples! Do you know of any
example lex specs what actually use ^ and $ at all?


>Rule order definitely doesn't affect lexing,
> at least not unless you provide mechanisms that go beyond regular
> expressions.


Yes they do! Rule ordering is used to determine which rule matched in
the event of ambiguity. That's why the more general rules such as [a-
z]+ (for example) come AFTER keywords in lex specs.


> > As you have to check both possibilities on lookup (otherwise how can
> > you ever match them ;-) ), the right thing to do appears to be to
> > suppress the '^' if it occurs at a position in the rules that a '$'
> > has already occurred at.
>
> In my book, the Right Thing would be to spit out a warning somewhere.


As I mentioned, boost::regex deals with ^ and $ anywhere in a regex
without complaining.


> > Note that using PERL rules is not the answer, as lexers use left-most
> > longest and compile to a DFA. PERL uses leftmost precedence and uses
> > NFA.
>
> I'm not sure how "leftmost longest" and "leftmost precedence" relate.


Here's the difference: [-+]?(\d+[.]?|\d*[.]\d+) in PERL will match 0.
when presented with 0.1


This is because it gives up as soon as the leftmost rule matches.
This is in contrast with a leftmost longest strategy (typically used
with a DFA representation) which will keep going until it can't match
anymore regardless of the ordering of the ored expressions.


> DFA and NFA are certainly not at odds, since they are equivalent.


In theory, but not the way PERL implements lookup.


> I'd still look at Perl whenever I'm unsure, simply because that's how
> most people expect regexes to work. That doesn't mean you have to do
> everything as Perl does, particularly if you have good reasons to do
> otherwise :-)


PERL is the wrong model for lexers. Extended POSIX is closer.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.