20 Sep 1999 12:05:38 -0400

Related articles |
---|

"Regular expressions" for stack automata? Marko.Makela@HUT.FI (Marko =?ISO-8859-1?Q?M=E4kel=E4?=) (1999-09-10) |

Re: "Regular expressions" for stack automata? Marko.Makela@HUT.FI (Marko =?ISO-8859-1?Q?M=E4kel=E4?=) (1999-09-16) |

Re: "Regular expressions" for stack automata? world!cfc@uunet.uu.net (Chris F Clark) (1999-09-20) |

Re: "Regular expressions" for stack automata? ilya@math.ohio-state.edu (1999-09-24) |

Re: "Regular expressions" for stack automata? qjackson@wave.home.com (Quinn Tyler Jackson) (1999-10-04) |

From: | Chris F Clark <world!cfc@uunet.uu.net> |

Newsgroups: | comp.compilers |

Date: | 20 Sep 1999 12:05:38 -0400 |

Organization: | The World Public Access UNIX, Brookline, MA |

References: | 99-09-030 99-09-050 |

Keywords: | lex, parse |

Marko (answering himself) wrote:

*> Marko> Now, my questions: Has this field been researched? Is there*

*> Marko> any grep-like tool that uses stack automata instead of finite*

*> Marko> state automata?*

*>*

*> I was pointed out in an e-mail message that QED (the ancestor of ed*

*> and vi, and probably the first tool that introduced regular expression*

*> based search and replace functions) has this. The construct is very*

*> simple: named regular expressions. From the QED manual (see*

*> <URL:http://cm.bell-labs.com/cm/cs/who/dmr/qed.html>):*

*>*

*> The regular expression "\E(bal)" defined by*

*>*

*> e(bal)/[^`']*(`\E(bal)')*[^`']*/*

*>*

*> matches any string balanced with respect to single quotes "`" and "'".*

*>*

*> By the way, I was very impressed that the regular expressions have*

*> remained practically unchanged for a so long time.*

*>*

*> John> [Many implementations of REs such as GNU grep extend the*

*> John> REs in interesting ways with backreferences. -John]*

*>*

*> True, but I haven't seen anything like the \E(re) construct of QED*

*> anywhere, and \1..\9 cannot express the same thing. I wonder how well*

*> Perl regular expressions would perform with this construct. I would*

*> expect especially the non-greedy closure operators *? and +? to be*

*> utterly inefficient with this construct.*

You will find the \E(bal) concept only in lexer generators that use

something like LL or LR parsing technology to lex. There are at least

two examples of this that I know of, ANTLR 2.x (LL lexing) and Yacc++

(LR lexing).

The reason you will only find the extension in lexers generated by

these more powerful tools is that you need a stack automata and

"regular expressions" (as opposed to regexp's, i.e. what perl, emacs,

and grep have) are defined mathematically to be equivalent to simple

finite state machines with no stack. Finite state machines with a

stack are equivalent to context free languages. (And there is plenty

of studies of CFL's, but not listed under "regular expressions" since

to the theorist they are different beasts.)

Now, regular expressions (RE's) haven't seemed to advance in a long

time because it has been proven that all regular expressions can be

reduced to simple FSM's and vice-versa so no additional operators

would theoretically be necessary (or useful) to someone studying them

as automata. As a result, most interesting extensions to RE's don't

get press in the books that teach them (automata or compiler) as the

extensions are simply equivalent to something already covered (or they

aren't RE's at all).

However, if you read works in the area of computational linguistics

you will find that there is a more lively interest in regular

expressions. The linguists use RE's to model language translations

and have systems that use a variety of extensions to RE's. (They are

often interested in FST's (finite state transducers), which are

essentially RE's with outputs or "synchronized" FSM's where two

machines (one for the input and one for the output) are

interconnected.)

Now, let's return to regexp's (i.e. what perl, grep, and emacs have).

You are right that the backreference capability is not the same as

stack automata. In fact, the two facilities are orthogonal. Both

facilities solve different problems (and there are expressions you can

write in one that you can't write in the other). Equally important is

that the backreference capacity (like the stack) takes you out of the

domain of pure regular expressions. That is you can't translate a

backreference into a set of states in a FSM (at least not at regexp

compile time). In fact, you will find that most regexp packages do

not generate FSM's (like most lever generators do) but instead

"interpret" the regexp while doing the match. If you use this

approach, the backreference facility is easy to implement. That's

because you don't even try to generate an FSM for it, but merely do

the match against the appropriate backreferenced string at the correct

stage of the interpretation.

The use of stack automata goes the other way. Stack automata are

usually implemented by converting the regular expressions to FSM's.

That does not mean that a regexp package could not include a stack and

push things onto it at the appropriate points in the interpretation.

However, one often wants to take more care with stack automata--they

introduce ambiguity problems that one would like to know about before

running them on an input and getting an incorrect parse (i.e. the

regexp package chose the wrong ambiguous case to match with). As far

as I know, the only way of detecting those ambiguities is to run them

through an appropriate parser generator and seeing if it detects

conflicts. Even that method is not precise as most parser generators

will reject some unambiguous grammars that are simply beyond the

generator's resolution capacity. There is a proof, in fact, that the

problem cannot be solved in general and that one must either accept

some ambiguous grammars or reject some unambiguous grammars.

The ambiguity problem does not plague pure regular expressions,

because all regular expressions that match the same set of strings are

considered equivalent (and there is no feature in regular expressions

to distinguish how the parts of the regular expression were lined up

against the string).

However, a similar ambiguity problem also plagues backreferences, but

most regexp packages sweep it under the rug. They have the ambiguity

problem because the backreferences allow one to determine how the

parts of the regexp lined up against the string. That breaks the

equivalence of the underlying regular expressions (i.e. "a \(a*\)" is

not equivalent to "\(a+\)"). Note that the perl regexps introduce the

non-greedy verions of the regular expression operators to give the

user control over how the underlying regular expressions line up.

This is probably more that you wanted to know.

-Chris

*****************************************************************************

Chris Clark Internet : compres@world.std.com

Compiler Resources, Inc. CompuServe : 74252,1375

3 Proctor Street voice : (508) 435-5016

Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

------------------------------------------------------------------------------

Web Site in Progress: Web Site : http://world.std.com/~compres

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.