Re: Algorithm(s) to convert textual regular expressions to a transition table?

Kaz Kylheku <kaz@kylheku.com>
Thu, 22 Jan 2015 01:35:51 +0000 (UTC)

          From comp.compilers

Related articles
Algorithm(s) to convert textual regular expressions to a transition ta costello@mitre.org (Costello, Roger L.) (2015-01-20)
Re: Algorithm(s) to convert textual regular expressions to a transitio niciodata.eu@gmail.com (ioan) (2015-01-21)
Re: Algorithm(s) to convert textual regular expressions to a transitio jeremy.p.wright@gmail.com (2015-01-21)
Re: Algorithm(s) to convert textual regular expressions to a transitio kaz@kylheku.com (Kaz Kylheku) (2015-01-22)
| List of all articles for this month |

From: Kaz Kylheku <kaz@kylheku.com>
Newsgroups: comp.compilers
Date: Thu, 22 Jan 2015 01:35:51 +0000 (UTC)
Organization: Aioe.org NNTP Server
References: 15-01-035
Keywords: lex
Posted-Date: 22 Jan 2015 09:59:57 EST

On 2015-01-20, Costello, Roger L. <costello@mitre.org> wrote:
> I have read the section in Modern Compiler Design that discusses lexical
> analysis. I understand the subset algorithm it describes (neat algorithm!).
> With pencil and paper I can take a set of regular expressions and follow the
> subset algorithm to generate a transition table.


This is not necessary. Regular expressions can also be compiled to
an NFA graph, which can be simulated. This has different performance
tradeoffs. For some regexes, the subset construction algorithm blows up
to a large number of states, whereas the NFA graph is small.


Regular expressions can also be retained as an abstract syntax tree,
and evaluated directly, using the technique of Brzozowski derivatives.


In the TXR language, I have a hybrid approach whereby a regular expression's
abstract syntax tree is analyzed, and then the appropriate strategy is chosen:
compile to NFA graph, or treat with derivatives. The strategy depends on
whether the syntax tree contains the exotic operators complement or
intersection. If they are absent then the NFA will work; the esoteric operators
are not easily compilable to an NFA. (The NFA for them exists, but
requires a substantial transformation of the regex.)


> But writing actual code to do
> this conversion will require more algorithms, I think. Is there an algorithm
> that describes how to read in a set of strings that represent regular
> expressions and create data structures that are well-suited to processing by
> the subset algorithm?


The subset algorithm is usually described as operating on the NFA graph.


The NFA graph simulation dynamically evaluates the algorithm: as the input
arrives, it calculates the transitive closures: *sets* of states which are
reachable when that input symbol is processed.


The subset construction essentially pre-computes all the possible state sets,
leaving just a finite machine that doesn't have to perform the closure
computations that the NFA simulator performs.


> More broadly, what are the set of algorithms needed to
> go from a set of strings that represent regular expressions to the precomputed
> transition table? Can you refer me to a book or article that does a good job
> in describing this?


In my compiler, I first go from regex syntax to Lisp-based abstract
syntax tree. This AST can then be compiled.


The API's are open. From the shell prompt:


    $ txr -p '(regex-parse "a.*b")'
    (compound #\a (0+ wild) #\b)


It's all prefix Lisp. (compound x y z) denotes xyz, (0+ x) is x*,
the symbol wild denotes the . and such. #\a denotes a character object,
the ASCII/Unicode 'a'.


The regex compiler function accepts either this AST, or the original
textual form:


    $ txr -p "(regex-compile '(compound #\a (0+ wild) #\b))"
    #/a.*b/


The printed output isn't informative. The compiler outputs a compiled regex
object which is not printable. (We would have to look at its C representation,
and inspect it in a C debugger to see its insides.) When we print that object
using the TXR language itself, a stashed copy of the AST source code is pulled
from that object to produce the understandable #/a.*b/ notation.
That is to say, the object printer does not traverse the compiled NFA to
reproduce a.*b: that is not really tractable!!!


$ txr -p "(typeof (regex-compile '(compound #\a (0+ wild) #\b)))"
sys:regex
$ txr -p "(typeof #/abc/)" # this regex is compiled at S-exp read time
sys:regex


The abstract syntax tree is produced using easy parsing techniques. In my
case, a Yacc grammar (or rather regex subgrammar in a larger language).


The grammar is here: http://www.kylheku.com/cgit/txr/tree/parser.y
The relvant productions are regex, regexpr, regbranch, regterm and others.


The AST to regex object compiler is in regex.c. The regex_compile function
is the master API. It checks whether the regex must be treated with
derivatives and either uses dv_compile_regex, or nfa_compile_regex.


dv_compile_regex doesn't do much: it expands the character set notations
into charset objects that are faster to evaluate (NFA uses them too).
It also expands the nongreedy operator R1%R2, which I designed
as a syntactic sugar that transforms into other operations.


The nfa_compile_regex function is based almost directly on the graphical
representations of the proces (bubble and arrow diagrams) in the
Red Dragon Book (_Compilers: Principles, Techniques and Tools_, 1988).


Because the input is a Lisp-like data structure, the function is
very easy to understand. Basically the whole compiler logic is in that
one simple function, supported by subroutines to make and connect
NFA states of various kinds, and produce character sets.


Character sets are tricky under Unicode! I implemented radix trees of several
different types (distinguished by their height), all with character bitmaps at
the leaf level. The choice of radix tree depends on the range of the character
set. Inverted character sets are handled as an uninverted representation,
which is flagged as inverted.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.