Sun, 12 Mar 1995 19:52:29 GMT

Related articles |
---|

A tree-based regular expression matching algorithm cmcconne@reed.edu (1995-03-12) |

Newsgroups: | comp.compilers |

From: | cmcconne@reed.edu (Carl McConnell) |

Keywords: | AST, DFA |

Organization: | Reed College, Portland, Oregon |

Date: | Sun, 12 Mar 1995 19:52:29 GMT |

I recently came up with a regular expression matching algorithm that I

thought others might find interesting. It uses the abstract syntax

tree (AST) for the regular expression rather than an NFA. It is

essentially an AST-based analog of the traditional NFA simulation

algorithm. The running time of this algorithm is the same as for NFA

simulation: O(rs), where r is the length of the regular expression and

s is the length of the string being matched. However, its space

consumption is worse than that of NFA simulation: O(rs), compared to

O(r) for NFA simulation. Furthermore, the AST-based algorithm requires

the ability to seek back over the input, rather than being able to

examine each character just once. Thus, the AST algorithm is

basically inferior to NFA simulation. However, I think it's

interesting conceptually.

If anyone has seen it published somewhere, I'd be interested in

hearing about it.

Carl McConnell

=== algorithm description begins ===

The intuition behind the algorithm is that regular expressions are

somewhat analogous to the usual control structures. For example,

the regular expression R* vaguely resembles the loop

"while match(R) do nothing", where "match(R)" is an as-yet undefined

function that can match a regular expression against the input.

Likewise, the regular expression R|S resembles the conditional

"match(R) or match(S)", and RS the conditional "match(R) and

match(S)". Thus, the regular expression a*ab resembles

the expression "(while match(a) do nothing) and match(a) and match(b)",

where the loop returns a boolean. (This analogy is just meant to be

suggestive, as it obviously breaks down pretty fast.)

Of course, the problem is how to account for the non-determinism.

Continuing with the previous regular expression, in matching against

the string aaab, the "loop" for a* mustn't read all the a's, or else

match(a) will fail when it could have succeeded. To get around this,

we define match() to return a set of positions in the input stream at

which subsequent matching may begin. (The drawback here is that the

number of positions is unbounded; instead of a finite state machine,

we essentially have an infinite data machine.) As arguments,

match(R, stream, positions) takes a regular expression, an input

stream, and a set of positions in that input stream where matching may

begin. (We'll assume the first position in the input stream is 1.)

This leads to the definition for match() given below. The invocation

is match(R, stream, {1}), and the result is the set of positions for

which the input stream matches R: if p is a position, then the

contents of the input stream from 1 to p-1 match R. We can find the

usually desired longest match by using the maximum element of the

result. If the result is empty, there were no matches.

match(R|S, stream, positions) =

match(R, stream, positions) | match(S, stream, positions)

match(RS, stream, positions) =

match(R, stream, positions) & match(S, stream, positions)

match(R*, stream, positions) =

closure(R, stream, positions)

match(c, stream, positions) =

matchCharacter(c, stream, positions)

where "|" is set union and "&" set intersection, and the auxiliary functions

are

closure(R, stream, positions)

newPositions := match(R, stream, positions);

if (newPositions = positions)

return positions;

return closure(R, stream, positions | newPositions);

matchCharacter(c, stream, positions)

newPositions := {};

for each position p in positions do

if (c = stream[p])

newPositions := newPositions | {p+1};

return newPositions;

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.