Re: Regular expression search algorithm

"Aharon Robbins" <arnold@skeeve.com>
3 Sep 2002 00:32:34 -0400

          From comp.compilers

Related articles
Regular expression search algorithm harvinder.singh@patni.com (Harry) (2002-07-15)
Re: Regular expression search algorithm joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Re: Regular expression search algorithm ralph@inputplus.co.uk (Ralph Corderoy) (2002-07-21)
Re: Regular expression search algorithm peter.tillier@btinternet.com (Peter S Tillier) (2002-07-24)
Re: Regular expression search algorithm arnold@skeeve.com (Aharon Robbins) (2002-09-03)
regular expression search algorithm forsyth@minster.york.ac.uk (1993-03-18)
Re: regular expression search algorithm pardo@cs.washington.edu (1993-03-19)
| List of all articles for this month |

From: "Aharon Robbins" <arnold@skeeve.com>
Newsgroups: comp.compilers
Date: 3 Sep 2002 00:32:34 -0400
Organization: Pioneer Consulting, Ltd.
References: 02-07-036 02-07-069 02-07-102
Keywords: lex
Posted-Date: 03 Sep 2002 00:32:34 EDT

Peter S Tillier <peter.tillier@btinternet.com> wrote:
> awk, egrep and others usually use DFAs, which tend to be slower to compile
> and faster to execute than NFAs, but don't support back references.
> gawk uses both - DFA most of the time, but NFA when back references
> are needed.


Gawk doesn't do back references. (Regexes of the form (abc)foo\1 )
They're not in the awk language.


Gawk uses dfa matching only for the cases of:


x ~ y
x !~ y


Other matches, such as split(), match(), and regex-based field and record
splitting, require knowing the start and end position of the match,
which the dfa matcher doesn't provide.


The dfa code is that from GNU grep/egrep, and the NFA code is GNU regex.
(Which I think does NFA stuff deep down under the hood, but as the code
is A Big Ball Of Mud, I'm not sure. I treat it strictly as a black box,
and I loathe the interface.)


Fortunately, there's improvement on the horizon. GNU Libc now has a
written-from-scratch version of the GNU regex routines that attempt to
use a dfa-based algorithm where possible, and which falls back to an
NFA algorithm when not. This code can be gotten from the glibc public
CVS archive at sources.redhat.com.


In the in-testing version of gawk I've removed the dfa code entirely and
now use the new version of regex. (The code in gawk is now cleaner as
a result, which is a pleasant side-effect.) Said version is currently
undergoing the trial-by-fire of portability to different Unix systems,
non-Unix systems, and various compilers. So it's getting closer to
being ready for prime time, but isn't quite there yet.


For what it's worth. :-)


Arnold Robbins
gawk's maintainer
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold@skeeve.com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 928 569 9018
Nof Ayalon Cell Phone: +972 51 297-545
D.N. Shimshon 99785 ISRAEL


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.