
Related articles:
  Algorithm(s) to convert textual regular expressions to a transition ta, costello@mitre.org (Costello, Roger L.) (2015-01-20)
  Re: Algorithm(s) to convert textual regular expressions to a transitio, niciodata.eu@gmail.com (ioan) (2015-01-21)
  Re: Algorithm(s) to convert textual regular expressions to a transitio, jeremy.p.wright@gmail.com (2015-01-21)
  Re: Algorithm(s) to convert textual regular expressions to a transitio, kaz@kylheku.com (Kaz Kylheku) (2015-01-22)

From: Kaz Kylheku <kaz@kylheku.com>
Newsgroups: comp.compilers
Date: Thu, 22 Jan 2015 01:35:51 +0000 (UTC)
Organization: Aioe.org NNTP Server
References: 15-01-035
Keywords: lex
Posted-Date: 22 Jan 2015 09:59:57 EST

On 2015-01-20, Costello, Roger L. <costello@mitre.org> wrote:

> I have read the section in Modern Compiler Design that discusses lexical
> analysis. I understand the subset algorithm it describes (neat algorithm!).
> With pencil and paper I can take a set of regular expressions and follow the
> subset algorithm to generate a transition table.

This is not necessary. Regular expressions can also be compiled to

an NFA graph, which can be simulated. This has different performance

tradeoffs. For some regexes, the subset construction algorithm blows up

to a large number of states, whereas the NFA graph is small.

Regular expressions can also be retained as an abstract syntax tree,

and evaluated directly, using the technique of Brzozowski derivatives.
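For concreteness, here is a minimal Python sketch of the derivatives
technique, matching directly against the AST (the AST encoding, and the
tiny three-character alphabet standing in for the wildcard, are my
assumptions for illustration, not TXR's representation):

```python
# Brzozowski derivatives over a tiny regex AST:
#   ("char", c), ("cat", r, s), ("alt", r, s), ("star", r),
# plus the constants "empty" (matches nothing) and "eps" (matches "").

def nullable(r):
    """Does r match the empty string?"""
    if r == "eps":
        return True
    if r == "empty" or r[0] == "char":
        return False
    if r[0] == "star":
        return True
    if r[0] == "cat":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])   # "alt"

def deriv(r, c):
    """Derivative of r with respect to character c: the regex matching
    the rest of any string that r matches starting with c."""
    if r in ("empty", "eps"):
        return "empty"
    if r[0] == "char":
        return "eps" if r[1] == c else "empty"
    if r[0] == "alt":
        return ("alt", deriv(r[1], c), deriv(r[2], c))
    if r[0] == "star":
        return ("cat", deriv(r[1], c), r)
    # "cat": derive the head; if it is nullable, also derive past it
    d = ("cat", deriv(r[1], c), r[2])
    if nullable(r[1]):
        return ("alt", d, deriv(r[2], c))
    return d

def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)

# a.*b, with the wildcard modeled as an alternation over {a, b, x}
wild = ("alt", ("char", "a"), ("alt", ("char", "b"), ("char", "x")))
ab = ("cat", ("char", "a"), ("cat", ("star", wild), ("char", "b")))
print(matches(ab, "axxb"))   # True
print(matches(ab, "axx"))    # False
```

No state machine is ever built: each input character rewrites the AST
into the regex for the remaining input.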

In the TXR language, I have a hybrid approach whereby a regular expression's

abstract syntax tree is analyzed, and then the appropriate strategy is chosen:

compile to NFA graph, or treat with derivatives. The strategy depends on

whether the syntax tree contains the exotic operators: complement or
intersection. If they are absent, then the NFA approach works; those
operators are not easily compilable to an NFA. (An equivalent NFA exists,
but obtaining it requires a substantial transformation of the regex.)
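In Python, that dispatch might look like the following sketch (the
operator names compl and and, and the stub backends, are my assumptions
for illustration, not TXR's actual identifiers):

```python
# Sketch of the strategy dispatch: scan the AST for complement or
# intersection and pick a compilation backend accordingly.
def has_exotic_ops(ast):
    """True if the regex AST contains complement or intersection."""
    if not isinstance(ast, tuple):
        return False                    # leaf: a character or symbol like wild
    op, args = ast[0], ast[1:]
    if op in ("compl", "and"):          # hypothetical operator names
        return True
    return any(has_exotic_ops(a) for a in args)

def dv_compile(ast):                    # stub standing in for dv_compile_regex
    return ("derivative-matcher", ast)

def nfa_compile(ast):                   # stub standing in for nfa_compile_regex
    return ("nfa-matcher", ast)

def compile_regex(ast):
    return dv_compile(ast) if has_exotic_ops(ast) else nfa_compile(ast)

print(compile_regex(("compound", ("char", "a"), ("0+", "wild")))[0])
print(compile_regex(("and", ("char", "a"), ("compl", ("char", "b"))))[0])
```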

> But writing actual code to do
> this conversion will require more algorithms, I think. Is there an algorithm
> that describes how to read in a set of strings that represent regular
> expressions and create data structures that are well-suited to processing by
> the subset algorithm?

The subset algorithm is usually described as operating on the NFA graph.

The NFA graph simulation dynamically evaluates the algorithm: as the input

arrives, it calculates the transitive closures: *sets* of states which are

reachable when that input symbol is processed.

The subset construction essentially pre-computes all the possible state sets,

leaving just a finite machine that doesn't have to perform the closure

computations that the NFA simulator performs.
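The relationship between the two can be shown with a toy Python sketch
(the dict-based NFA encoding and the hand-built machine for a.*b over a
three-character alphabet are assumptions for illustration, not TXR's
representation):

```python
# An NFA as a dict mapping (state, char) to successor sets; the label
# None marks epsilon edges.

def eps_closure(nfa, states):
    """All states reachable from `states` through epsilon (None) edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def move(nfa, states, c):
    """Union of the successors of `states` on input character c."""
    out = set()
    for s in states:
        out |= nfa.get((s, c), set())
    return out

def nfa_match(nfa, start, accepting, text):
    """Simulate the NFA: compute the closures on the fly, per symbol."""
    cur = eps_closure(nfa, {start})
    for c in text:
        cur = eps_closure(nfa, move(nfa, cur, c))
    return bool(cur & accepting)

def subset_construct(nfa, start, accepting, alphabet):
    """Precompute every reachable state set; each becomes one DFA state."""
    start_set = eps_closure(nfa, {start})
    table, work = {}, [start_set]
    while work:
        S = work.pop()
        if S in table:
            continue
        table[S] = {}
        for c in alphabet:
            T = eps_closure(nfa, move(nfa, S, c))
            table[S][c] = T
            if T not in table:
                work.append(T)
    return start_set, table, {S for S in table if S & accepting}

def dfa_match(start_set, table, accept_sets, text):
    """Run the precomputed table: no closure work at match time."""
    S = start_set
    for c in text:
        S = table[S][c]
    return S in accept_sets

# Hand-built NFA for a.*b over the alphabet {a, b, x}; state 3 accepts.
nfa = {(0, 'a'): {1},
       (1, None): {2},                  # epsilon out of the .* loop
       (1, 'a'): {1}, (1, 'b'): {1}, (1, 'x'): {1},
       (2, 'b'): {3}}
start_set, table, accepts = subset_construct(nfa, 0, {3}, "abx")
print(nfa_match(nfa, 0, {3}, "axxb"),
      dfa_match(start_set, table, accepts, "axxb"))   # True True
```

The two matchers always agree; the DFA just pays the closure cost once,
up front, for every reachable state set.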

> More broadly, what are the set of algorithms needed to
> go from a set of strings that represent regular expressions to the precomputed
> transition table? Can you refer me to a book or article that does a good job
> in describing this?

In my compiler, I first go from regex syntax to Lisp-based abstract

syntax tree. This AST can then be compiled.

The APIs are open. From the shell prompt:

$ txr -p '(regex-parse "a.*b")'

(compound #\a (0+ wild) #\b)

It's all prefix Lisp: (compound x y z) denotes xyz, (0+ x) is x*, the
symbol wild denotes the wildcard ., and #\a denotes a character object,
the ASCII/Unicode 'a'.

The regex compiler function accepts either this AST, or the original

textual form:

$ txr -p "(regex-compile '(compound #\a (0+ wild) #\b))"

#/a.*b/

The printed output isn't informative. The compiler outputs a compiled regex

object which is not printable. (We would have to look at its C representation,

and inspect it in a C debugger to see its insides.) When we print that object

using the TXR language itself, a stashed copy of the AST source code is pulled

from that object to produce the understandable #/a.*b/ notation.

That is to say, the object printer does not traverse the compiled NFA to
reproduce a.*b: that is not really tractable!

$ txr -p "(typeof (regex-compile '(compound #\a (0+ wild) #\b)))"

sys:regex

$ txr -p "(typeof #/abc/)" # this regex is compiled at S-exp read time

sys:regex

The abstract syntax tree is produced using easy parsing techniques. In my

case, a Yacc grammar (or rather regex subgrammar in a larger language).

The grammar is here: http://www.kylheku.com/cgit/txr/tree/parser.y

The relevant productions are regex, regexpr, regbranch, regterm and others.

The AST to regex object compiler is in regex.c. The regex_compile function

is the master API. It checks whether the regex must be treated with

derivatives and either uses dv_compile_regex, or nfa_compile_regex.

dv_compile_regex doesn't do much: it expands the character set notations

into charset objects that are faster to evaluate (NFA uses them too).

It also expands the nongreedy operator R1%R2, which I designed
as syntactic sugar that expands into other operations.

The nfa_compile_regex function is based almost directly on the graphical
representations of the process (bubble-and-arrow diagrams) in the
Red Dragon Book (_Compilers: Principles, Techniques and Tools_, 1988).

Because the input is a Lisp-like data structure, the function is

very easy to understand. Basically the whole compiler logic is in that

one simple function, supported by subroutines to make and connect

NFA states of various kinds, and produce character sets.

Character sets are tricky under Unicode! I implemented radix trees of several

different types (distinguished by their height), all with character bitmaps at

the leaf level. The choice of radix tree depends on the range of the character

set. Inverted character sets are handled as an uninverted representation,

which is flagged as inverted.
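A minimal sketch of the inverted-set idea (a flat Python set stands in
for the radix trees of bitmaps; the class name and API are mine, for
illustration only):

```python
# A character set stored uninverted, with an "inverted" flag consulted
# at membership-test time.
class CharSet:
    def __init__(self, chars, inverted=False):
        self.members = set(chars)   # real impl: bitmaps at radix-tree leaves
        self.inverted = inverted    # set stays uninverted; only the flag flips

    def __contains__(self, c):
        # Membership in an inverted set is just the negated test.
        return (c in self.members) != self.inverted

digits = CharSet("0123456789")                      # like [0-9]
nondigits = CharSet("0123456789", inverted=True)    # like [^0-9]
print("5" in digits, "5" in nondigits)              # True False
```

This avoids ever materializing the complement set, which under Unicode
would span over a million code points.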
