Fri, 19 Mar 1993 19:09:38 GMT

Related articles |
---|

regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14) |

Re: regular expressions (bug-report) Paul.Klint@cwi.nl (1993-03-15) |

Re: regular expressions (bug-report) mab@wdl39.wdl.loral.com (1993-03-15) |

regular expressions (bug-report) jos@thinx.convex.com (Jos Horsmeier) (1993-03-16) |

Re: regular expressions (bug-report) pardo@cs.washington.edu (1993-03-19) |

Re: regular expressions wendt@CS.ColoState.EDU (1993-03-22) |

Re: regular expressions (bug-report) henry@zoo.toronto.edu (1993-03-25) |

Re: regular expressions (bug-report) kanze@us-es.sel.de (1993-03-26) |

Newsgroups: | comp.compilers |

From: | pardo@cs.washington.edu (David Keppel) |

Keywords: | lex, DFA |

Organization: | Computer Science & Engineering, U. of Washington, Seattle |

References: | 93-03-046 93-03-050 |

Date: | Fri, 19 Mar 1993 19:09:38 GMT |

*>megatest!plethorax!djones@uu4.psi.com (Dave Jones) writes:*

*>[This textbook regexp matcher has a bug...]*

mab@wdl39.wdl.loral.com (Mark A Biggar) writes:

*>[Most Unix-based RE pkgs interpretive backtracking.]*

An NFA package can, in principal, have at most as many states in the queue

as there are total states in the automaton. For an NFA, this number is

"reasonable" (e.g., linear in the size of the regular expression).

A DFA RE algorithm can represent one DFA state as a list of the

correspnding NFA states. A straightforward construction of the DFA from

the NFA, however, yields a huge DFA (the number of DFA states is, I think,

exponential in the number of NFA states).

`egrep', (and, in principle, `gnu?egrep') use a DFA-based mechanism. Al

Aho made several important observations. One is that you only *need* to

keep one DFA state at a time. On a transition, the next state can be

computed dynamically from the current state and a simulation of the

correspnding NFA states running on the NFA automaton. A second

observation is that the DFA can keep a cache of recently used DFA states

so that new states only need to be generated "rarely". A third

observation is that checking for a cache hit can be made cheap.

In principal, the running time of Aho's algorithm can be bad, since every

transition can require construction of a new DFA state. In practice, the

DFA traverses a limited number of states, and a small cache does a good

job. Combined with a Boyer-Moore prematcher, the cache mostly suffers

capacity misses (for common search patterns).

A related tool is `agrep', Udi Mamber's approximate string matching

algorithm. I don't know the details of the algorithm, but it does several

clever things very well. Source via anonymous ftp from sites that carry

linux, also (according to `archie') in

`cs.dal.ca:/pub/comp.archives/agrep',

`files1zrz.zrz.tu-berlin.de:/pub/unix/tools/agrep',

`gatekeeper.dec.com:/contrib/src/pa/agrep',

`harpo.seas.ucla.edu:/mnt/fs01/agrep',

and, probably, lots of other places.

;-D on ( Unleaded expressions ) Pardo

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.