|DFA-based regexp matching with substring addressing email@example.com (2000-01-23)|
|From:||firstname.lastname@example.org (Ville Laurikari)|
|Date:||23 Jan 2000 10:12:44 -0500|
|Organization:||Helsinki University of Technology, CS lab|
Many regular expression matching/searching libraries have a feature
often called "substring addressing". For example, when the
expression "foo([0-9]*)bar" is matched against the string "foo123bar",
the positions corresponding to the parentheses are calculated. Here,
the opening parenthesis would have position 3 (assuming the first
character is at position 0), and the closing parenthesis would have
For example, Python's re-library supports substring addressing. The
implementation method (a backtracking matcher not based on
automatons), however, has some shortcomings. In particular, the
longest overall match is not always produced:
>>> import re
What I want is a matcher which has a commutative union operator and
always produces the longest possible match, *and* supports substring
addressing. However, I have not been able to find one. I did find
this paper, though (in fact, it was recommended to someone asking about
the same problem in comp.compilers back at 1991, IIRC):
Nakata, I. and Sassa, M.:
Regular Expressions with Semantic Rules and Their Application
to Data Structure Directed Programs,
Advances in Software Science and Technology, Vol. 3, pp. 93-108
The algorithms presented there do not handle all cases correctly,
e.g. "(a*)aaa". I believe I have heard of another paper/article
addressing this problem, but it slips my mind.
Since I really wanted a working solution, I went ahead and designed an
algorithm which compiles regular expressions to DFA-based automatons.
Extra deque-handling operations are attached to the transitions of the
DFA, to keep track of the possible indices in the input string
corresponding the parentheses of the regular expression. The matcher
does exactly what I want, for example, "(ab|a)(blip)?" matched against
"ablip" will produce substring addressing like this: "(a)(blip)".
I believe it is quite fast too, and enables building lexers which
support substring addressing rather easily, among other things.
Now, after this lengthy introduction, I can get to the question:
Is this algorithm of mine anything new? Is it maybe worth a paper?
I was quite dumbfounded by finding almost no references addressing
this problem, but maybe I was just looking in all the wrong places?
Thanks in advace,
Return to the
Search the comp.compilers archives again.