Re: How to rewrite a regexp without word boundaries?

Chris F Clark <cfc@shell01.TheWorld.com>
Mon, 13 Jul 2009 03:26:02 -0400

          From comp.compilers

Related articles
How to rewrite a regexp without word boundaries? dave_140390@hotmail.com (2009-07-05)
Re: How to rewrite a regexp without word boundaries? h.b.furuseth@usit.uio.no (Hallvard B Furuseth) (2009-07-05)
Re: How to rewrite a regexp without word boundaries? dave_140390@hotmail.com (2009-07-06)
Re: How to rewrite a regexp without word boundaries? haberg_20080406@math.su.se (Hans Aberg) (2009-07-07)
Re: How to rewrite a regexp without word boundaries? h.b.furuseth@usit.uio.no (Hallvard B Furuseth) (2009-07-07)
Re: How to rewrite a regexp without word boundaries? andrew@tomazos.com (Andrew Tomazos) (2009-07-07)
Re: How to rewrite a regexp without word boundaries? cfc@shell01.TheWorld.com (Chris F Clark) (2009-07-13)
Re: How to rewrite a regexp without word boundaries? hu47121@usenet.kitty.sub.org (2009-08-16)
Re: How to rewrite a regexp without word boundaries? dot@dotat.at (Tony Finch) (2009-08-16)
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers,comp.theory
Date: Mon, 13 Jul 2009 03:26:02 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 09-07-003 09-07-004 09-07-008 09-07-010
Keywords: lex
Posted-Date: 13 Jul 2009 06:55:46 EDT

I think Hallvard B Furuseth and Hans Aberg are close to a solution,
since the intersection of 2 regular expressions is a regular
expression (at least for automata theoretic ones), for Perl ones with
captures and backreferences I don't know if that holds (or not).
However, since you are generating an FA, you're dealing with automata
theoretic ones and are safe.


However, if I were attacking this problem, and I might have reason to
over the next year (so I would be glad to correspond further on it), I
would do some case-analysis to prune off easy cases. So, we'll call
our original RE A\bB, such that A & B are both regular expressions.


If your analysis of A or B can tell you what characters A can end with
or B can begin with (and if you are making an FA, you should be able
to get that information), since there has to be a set of states that
correspond to where the \b is to be matched. The edges into and out of
that state will be labelled with the valid characters upon which the
transition can occur. If the ones leading in have characters in the
\w set, the ones leading out must have characters in the \W set and
vice-versa. Note, if one or the other side has characters from both
sets \w and \W, split the state into 2 states, with the \W characters
leading to 1 state and the \w characters leading to the other. I'll
leave the rest of this idea as an exercise to the reader. I'm sure
there are some thorny details that I haven't worked out.


Hope this helps,
-Chris


******************************************************************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.