Re: Alternatives to Regexps

Chris F Clark <cfc@world.std.com>
13 Jul 1998 23:44:40 -0400

          From comp.compilers

Related articles
[4 earlier articles]
Re: Alternatives to Regexps d.rourke@arpc.com (Daniel Rourke) (1998-07-10)
Re: Alternatives to Regexps bear@sonic.net (Ray Dillinger) (1998-07-10)
Re: Alternatives to Regexps bpr@best.com (Brian Rogoff) (1998-07-10)
Re: Alternatives to Regexps mav@naxos.esat.kuleuven.ac.be (Maurizio Vitale) (1998-07-10)
Re: Alternatives to Regexps lord@emf.emf.net (1998-07-11)
Re: Alternatives to Regexps bromage@cs.mu.OZ.AU (1998-07-11)
Re: Alternatives to Regexps cfc@world.std.com (Chris F Clark) (1998-07-13)
Re: Alternatives to Regexps torbenm@diku.dk (Torben Mogensen) (1998-07-13)
Re: Alternatives to Regexps thetick@magelang.com (Scott Stanchfield) (1998-07-20)
| List of all articles for this month |

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 13 Jul 1998 23:44:40 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 98-07-057 98-07-075
Keywords: DFA

John Carter (ece@dwaf-hri.pwv.gov.za) wrote:
>I'm seeking pointers to alternatives to standard regular expression
>languages.


To which, Torben Mogensen <torbenm@diku.dk> replied:
>You can use context free grammars also at the lexical level.


First, I would like to second this notion (at least for parsing
applications). The lexer generator which is part of Yacc++ uses LR
grmmars to perform lexical analysis. The ANTLR 2.x toolkits use LL
grammars in their lexer. There has been at least one paper on
foliding lexing analysis into parsing and skipping a lexer
altogether--the anagram parser follows this model. There are probably
other examples that I have forgotten. (By the way, all of the tools I
mentioned support the normal regular expression operators in their CFG
notations.)


The one issue in using some form of CFG for lexical analysis is that
most tools which augment the finite state machine with an auxillary
data structure (i.e. some kind of parse stack) are a bit more
complicated to backtrack. A simple regular expression model (like the
one in LEX) is easy to backtrack in, which makes the longest match
rule very intuitive in LEX. Therefore, if you are using a CFG tool
make certain that it supports any backtracking you may need, and that
you can understand how the backtracking works and what it will match
for a given set of rules.


By the way, in the lexing arena most of the regular expression
utilities do follow some formal model (regular expressions, LL, or LR
parsing) without quirky extensions (that is without extensions that
cannot be formally rewritten into formal model--even things like
yyless can be mapped if one is careful). For compiler writing, the
ability to understand exactly what is being lexed is generally worth
giving up some expressive power that might be gained by clever but
quirky extensions.


Second, it is worth mentioning that the regexp packages in Emacs,
Perl, grep are using their extended regular expressions for a
different purpose than lexing. In fact, there are at least three
distinct contexts where some form of regexp is useful.


1) lexing as covered above
2) searching for a substring in a string (emacs, grep)
3) comparing a string with a pattern (perl, snobol)


Part of the reason that the regexps have quirky extensions in the
second and third uses is that these extensions allow the problems in
these areas to be addressed using regexps (where simple regular
expressions would not be adequate). The most obvious example which
comes to mind is the use of back references (things like \(...\) ->
\1). If the regexps of emacs and perl did not allow back references,
they would probably have to extend their regular expressions to
include non-terminals like CFGs have. The resulting notation would
probably more cumbersome and less useful (particularly in one-shot
applications).


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : cfc@world.std.com
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
------------------------------------------------------------------------------
Web Site in Progress: Web Site : http://world.std.com/~compres
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.