Re: Deterministic Finite Automata: Sub-element captures


From: Russ Cox <rsc@swtch.com>
Newsgroups: comp.compilers
Date: Wed, 2 Sep 2009 22:48:00 -0700
Organization: Compilers Central
References: 09-09-013
Keywords: lex, DFA
Posted-Date: 03 Sep 2009 13:10:27 EDT

The closest you can get with a DFA is to take the union of the NFAs of
the various regexps you care to match and tag each NFA with a distinct
matching state. Then when you construct the DFA, you can use the
distinct NFA matching states to tell which regexps are matched by
a particular DFA state. So you could have one regexp for the
hex numbers, another for floating point, and so on. Capturing the
submatch boundaries within each regular expression requires more work:
in general a DFA doesn't track enough information to do explicit
capture tracking. You really need to back off to an NFA (in the
formal sense, not the popular sense) so that you can manipulate the
states individually during the subset simulation.
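

A hedged sketch of the tagging, with names of my own invention: give
each NFA accepting state the index of the regexp it came from, and
since a DFA state is just a set of NFA states, reporting which
regexps match is a scan over that set.

    #include <stdio.h>

    enum { MAXN = 16 };

    typedef struct {
        int match_id;        /* -1 if not accepting, else regexp index */
    } NfaState;

    typedef struct {
        int states[MAXN];    /* NFA states making up this DFA state */
        int n;
    } DfaState;

    /* Report every regexp whose tagged accepting state is present. */
    static void report_matches(const NfaState *nfa, const DfaState *d)
    {
        for (int i = 0; i < d->n; i++)
            if (nfa[d->states[i]].match_id >= 0)
                printf("regexp %d matches\n", nfa[d->states[i]].match_id);
    }

    int main(void)
    {
        /* Hypothetical union of two regexps: NFA state 3 accepts
           regexp 0 (say, hex numbers), state 5 accepts regexp 1
           (floating point). */
        NfaState nfa[6] = {{-1}, {-1}, {-1}, {0}, {-1}, {1}};
        DfaState d = {{1, 3, 5}, 3};
        report_matches(nfa, &d);
        return 0;
    }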


I like to think of a linearized version of the NFA as a program
for a peculiar virtual machine that allows non-deterministic choice,
by letting multiple lightweight threads run over the NFA at once.
An individual NFA state can be viewed as a program counter (PC) in
this program, and the set of NFA states that corresponds to a
single DFA state during the subset simulation can be thought of
as a thread run queue: there's one run queue being executed
and a second run queue being prepared for the next position
in the input.
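

In code the analogy is tiny; here is a minimal sketch in C (the
State type is the one from the article linked below, the rest of
the names are mine):

    typedef struct State State;  /* an NFA state; see the link below */

    typedef struct {
        State *pc;               /* this thread's position in the NFA */
    } Thread;

    typedef struct {
        Thread *t;               /* clist: the queue being executed;   */
        int n;                   /* nlist: the queue being prepared    */
    } ThreadList;                /* for the next input position        */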


In the code at http://swtch.com/~rsc/regexp/regexp1.html,
the State* is the program counter, clist is the current run queue,
and nlist is the next run queue.


The non-determinism in the NFA is accommodated by sometimes
adding multiple PCs--that is, multiple thread states--to nlist.
Reasonable performance is ensured by never adding redundant
entries (same State*) to the list. Thus nlist holds at most
O(# states in NFA) entries, producing the O(# states in NFA *
length of text) run-time bound.
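

A sketch of that dedup, assuming a generation stamp on each state
(the linked code does something similar; the field names here are
mine):

    #include <stddef.h>

    typedef struct State State;
    struct State {
        int c;              /* character to match, or special values */
        State *out, *out1;
        int lastlist;       /* stamp: last run queue this state joined */
    };

    typedef struct { State **s; int n; } List;

    static int listid;      /* bumped once per input position */

    static void addstate(List *l, State *s)
    {
        if (s == NULL || s->lastlist == listid)
            return;         /* same State* already queued this step */
        s->lastlist = listid;
        l->s[l->n++] = s;   /* (a Split state would recurse on both
                               branches instead of being appended) */
    }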


So far I am sure I've said nothing you didn't already know,
except maybe the weird thread analogy. But given this basic
framework, it is easy to add submatch tracking.


To add submatch tracking, you expand the representation of a
thread on the run queue. Before, a thread was just an NFA state
(a PC); now it also carries the submatch boundaries crossed during
its execution so far (capture registers, if you will).
For example, if the regexp is (a(a*)ab), there are four capture
registers, one for each parenthesis. Each parenthesis becomes
an empty transition in the NFA tagged with a register index.
As a thread crosses that transition, it stores the current text
position in the indexed register. In this implementation, it is
still important to bound the length of the run queue, and again
the solution is to toss out multiple threads with the same PC,
even if they have different capture register sets. The capture
register sets cannot influence the future computation of the
threads, so there's no need for both threads: keep the ``better''
one and discard the other. This pruning gives the same run-time
bound as in the non-capturing NFA.
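

A sketch of the expanded thread (the register count and names are
my own): the capture registers ride along with the PC, are
inherited by copy whenever a thread is spawned, and are overwritten
when the thread crosses one of the tagged empty transitions.

    enum { MAXSUB = 8 };          /* capture registers per thread */

    typedef struct State State;   /* NFA states now include tagged
                                     empty "save" transitions */

    typedef struct {
        State *pc;
        const char *sub[MAXSUB];  /* text positions stored so far */
    } Thread;

    /* Crossing save(k): Thread is passed by value, so the parent's
       registers are inherited and only slot k is overwritten. */
    static Thread cross_save(Thread t, int k, const char *textpos)
    {
        t.sub[k] = textpos;
        return t;
    }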


The only thing left is to define what a ``better'' thread is.
The rule in backtracking implementations is to prioritize each
non-deterministic choice: in "a|b", the higher-priority choice
is the "a"; in "a*", it is to continue the repetition; in "a*?"
it is to stop the repetition; and so on. The choice corresponds
precisely to which path the backtracking explores first.
The simulation I just described explores them all simultaneously,
but it is not hard to maintain the run queue so that the threads
appear in their backtracking priority order.
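

A sketch of that queue discipline, reusing the generation-stamp
idea above (Split is the two-way choice state; all names are mine):
the preferred branch is enqueued first, so its descendants stay
ahead of the alternative's.

    enum { Char, Split, Match };   /* instruction kinds (sketch) */

    typedef struct State State;
    struct State {
        int op;
        int c;
        State *out, *out1;         /* for Split, out is preferred */
        int lastlist;              /* dedup stamp, as above */
    };

    typedef struct { State *pc; } Thread;
    typedef struct { Thread *t; int n; } ThreadList;

    static int listid;

    static void addthread(ThreadList *l, Thread t)
    {
        if (t.pc->lastlist == listid)
            return;                /* an earlier thread with this PC is
                                      higher priority: drop this one */
        t.pc->lastlist = listid;
        if (t.pc->op == Split) {
            /* For "a|b", out is the "a"; for "a*", out continues the
               repetition; a lazy "a*?" compiles with the branches
               swapped.  Enqueueing out first keeps the whole queue
               in priority order. */
            addthread(l, (Thread){t.pc->out});
            addthread(l, (Thread){t.pc->out1});
            return;
        }
        l->t[l->n++] = t;
    }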


To see how the priority works, suppose a particular thread's
execution path were summarized by a bit string recording the
choices it made. For simplicity, let's call 0 the high-priority
choice and 1 the low-priority choice. With this representation,
comparing the strings lexicographically suffices to determine
which thread is ``better:'' 0000 beats 0100 beats 0111 beats 1000.
It's not necessary to record this history explicitly. Instead,
if the run queue is maintained and traversed in priority order
at each step, each new run queue will end up in priority order.
Suppose there are four threads with histories (a) 0000, (b) 0100,
(c) 0111, and (d) 1000. Then the run queue order is a, b, c, d.
Executing thread (a) can only lead to new threads with histories
beginning with 0000: 00000 and 00001. Executing thread (b) can
only lead to 01000 and 01001. And so on. Because each thread
has a distinct summary, the relative priorities of the successor
threads can't cross: any new thread created during the exploration
from state (a) will necessarily have a higher priority than a thread
created during the exploration from state (b). So as long as the
current run queue is kept in priority order and the choices are
explored in priority order, the next run queue will also be in
priority order. If a thread is already on the next run queue
with a given PC, then any other thread with the same PC is lower
priority and can be discarded.


Since the threads are executed in priority order, once a match
is found, all the remaining threads on the run queue can be
killed early--they can't possibly lead to a higher priority
match. Then once the higher priority threads have all died off
naturally, the most recent match found is necessarily the best one.
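

A sketch of the early cut, reusing the types above but with the
thread carrying its sub[] registers (all identifiers are mine):
because clist is in priority order, the first Match seen at a given
position is the best match reachable there, and the rest of the
queue can simply be abandoned.

    /* One simulation step: consume byte c, building nlist from clist. */
    static void step(ThreadList *clist, int c,
                     ThreadList *nlist, Thread *best)
    {
        for (int i = 0; i < clist->n; i++) {
            Thread t = clist->t[i];
            switch (t.pc->op) {
            case Char:
                if (c == t.pc->c) {
                    t.pc = t.pc->out;
                    addthread(nlist, t);
                }
                break;
            case Match:
                *best = t;    /* best match so far, captures included */
                return;       /* early cut: every thread after this
                                 one on clist is lower priority */
            }
        }
    }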


An alternate approach is to put capturing parentheses around the
entire expression, so that you can tell where each thread's match
begins, and kill off only the threads that start later in the
text. Then keep the match that ends latest in the text. This
produces the so-called ``leftmost longest'' match in a single
linear-time pass through the text, something even most DFA
implementations don't do (it's possible, but that's another story).
Of course, you still need a rule for pruning threads, and you
might as well use the backtracking priorities; you'll end up with
the leftmost longest match and the submatches that a backtracking
engine would have chosen during that match.
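

A sketch of that selection rule, assuming capture register 0
records each thread's match start (the implicit outer open
parenthesis): a candidate replaces the current best only if it
starts earlier, or starts at the same place and ends later.

    #include <stddef.h>

    /* Leftmost-longest: earlier start wins; ties go to the later end. */
    static int better_match(const char *nstart, const char *nend,
                            const char *ostart, const char *oend)
    {
        if (ostart == NULL)
            return 1;                  /* no match recorded yet */
        if (nstart != ostart)
            return nstart < ostart;    /* leftmost */
        return nend > oend;            /* then longest */
    }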


The C program at http://swtch.com/~rsc/regexp/nfa-perl.y.txt
implements the algorithm I've just described. I believe that the
regular expression library Rob Pike wrote in the early 1980s was
the first to use this algorithm. It didn't bother trying to mimic
the backtracking priorities but otherwise matches my description.
That library got adapted to become Eighth Edition Unix's regexp(3)
library; Henry Spencer wrote a public domain workalike that used
backtracking instead, and the rest is history. Neither looked
closely enough at the innards of the other library to realize the
difference in implementations. It didn't help that both approaches
call themselves NFA algorithms, nor did it help that Pike's code
wasn't made freely available until 1992 (as part of the text editor
sam). In the interim it became conventional wisdom that to track
submatches one must resort to backtracking. In 2000, Ville
Laurikari discovered an algorithm with the same time and space
complexity as the above and published it (http://laurikari.net/tre).
I suspect that deep down they are variants of a single algorithm,
but I haven't tried to work out the correspondence.


As you might imagine, this NFA simulation is more expensive than
the simple one-memory-lookup-per-input-character DFA, which is
why most lexers don't tell you anything beyond the particular regexp
that matched.


Russ




P.S. Tracking a capture register set is a simple generalization of an
idea that was known to at least one of Aho, Hopcroft, or Ullman in 1974
but then disappeared from the literature, including their later books.
Chapter 9 (Pattern Matching Algorithms) of _The Design and Analysis
of Computer Algorithms_ describes the usual NFA simulation and then
gives these two exercises:


    9.6  Let x = a_1 a_2 ... a_n be a given string and alpha a regular
         expression. Modify Algorithm 9.1 [the NFA simulation] to find
         the least k and, having found k, the (a) least j and (b) the
         greatest j such that a_j a_{j+1} ... a_k is in the set denoted
         by alpha. [Hint: Associate an integer j with each state in S_i.]


   *9.7  Let x and alpha be as in Exercise 9.6. Modify Algorithm 9.1
         to find the least j and, having found j, the greatest k such
         that a_j a_{j+1} ... a_k is in the set denoted by alpha.


The star on 9.7 is theirs, presumably an indication of difficulty
(I filed away a photocopy of that section but don't have the
entire book handy).




P.P.S. I didn't mention the POSIX rules for choosing submatches;
they were designed to be easy to specify, but I doubt anyone on
the committee knew how to implement them efficiently. The usual
suggestion is to run a backtracking engine until it has enumerated
all possible matches and then pick the one that satisfies the
POSIX rules; this turns an algorithm that is exponential only in
the worst case into one that is exponential even in the common case,
which is why basically no one implements it. I believe that the POSIX
submatch can be found in a single linear-time pass through the text
in the spirit of the above algorithm, but you have to process the
text backward to hit the matches in an order that enables the right
``better'' comparison. http://swtch.com/~rsc/regexp/nfa-posix.y.txt
has an implementation, though I have not tested it extensively.


