Regular Expression -> Minimal DFA algorithm

markh@csd4.csd.uwm.edu (Mark)
Tue, 18 May 1993 05:07:37 GMT

          From comp.compilers

Related articles
Regular Expression -> Minimal DFA algorithm markh@csd4.csd.uwm.edu (1993-05-18)
Regular Expression -> Minimal DFA algorithm II markh@csd4.csd.uwm.edu (1993-05-20)
| List of all articles for this month |

Newsgroups: comp.theory,comp.compilers
From: markh@csd4.csd.uwm.edu (Mark)
Keywords: DFA
Organization: Computing Services Division, University of Wisconsin - Milwaukee
Date: Tue, 18 May 1993 05:07:37 GMT

      Early this morning, last year, I decided to write a demo program to
illustrate a certain algorithm that efficiently converts Regular
Expressions to NFA's. This year, this afternoon, I extended it to make
conversion from Regular Expressions to minimal DFA's (The Second Annual
One Day Formal Language Project).


      I'm about to describe the Reg.Ex. -> NFA algorithm below. The
conversion from NFA to DFA is standard.


      Regular expressions can consist of the following:


                      (0) 0 ------- to denote the empty set.
                      (1) 1 ------- to denote the empty string.
                      (2) x ------- any valid C identifier, or string literal.
                      (3) [A] ----- denotes the regular expression 1 | A.
                      (4) A+ ------ denotes the regular expression A A*.
                      (5) A* ------ ITERATION (the set 1 | A | A A | A A A | ...)
                      (6) A | B --- ALTERATION (the union of A and B)
                      (7) A B ----- CONCATENATION


For example, these are valid regular expressions:


                        (a [b+ a*])+ | c* a b


                        a* (b a*)*


The output is the minimal deterministic finite automaton listed in
equational form.


      You can get a copy of the software by anonymous ftp at csd4.csd.uwm.edu
in the directory pub/regex. You WILL need an ANSI-C compiler to compile
it. The library functions in <stdlib.h> (particularily realloc()) are
assumed to have the semantics indicated by ANSI.


THE ALGORITHM:
      If E is a regular expression, then the reduction rules below will apply to
it, for when E takes on the following forms:


E = x -> x 1
E = [A] -> 1 | A
E = A+ -> A (1 | E)
E = A* -> 1 | A E
E = 0 | A -> A
E = A | 0 -> A
E = A | B -> a | b, where A and B reduce respectively to a and b.
E = 0 A -> 0
E = 1 A -> A
E = [A] B -> B | A B
E = A+ B -> A (B | E)
E = A* B -> B | A E
E = (A | B) C -> A C | B C
E = (A B) C -> A (B C)


They are NOT applied recursively to subexpressions in E, except where
explicitly noted with the alternation A | B, but are only applied to the
top-level expression.


By induction, the following can be proven to be the general normal form for E:


E = sum of terms of the form 1 and x A (0 = empty sum).


The expression, A, in each term of the form x A in the sum is considered a
state in the NFA. Each newly generated state is reduced to normal form in
the same way, and this in turn may generate more new states. When all the
states that were generated from the top-level expression have been fully
reduced, the result is an NFA representing the original expression. This
process will always halt. In fact, I believe it is LINEAR with respect to
the size of the expression's parse tree with a coefficient <= 2.


The starting state of the NFA is the top-level expression itself.


For example, take this expression


E = a* (b a*)*


for convenience, represent it as follows:


E = x0 x1, where x0 = a*, x1 = x2*, x2 = b x0


then E reduces as follows:


E = x0 x1 = a* x1 -> x3 | x1, where x3 = a E


To derive E's normal form, x1 and x3 have to be reduced:
x3: already reduced.
x1 = x2* -> 1 | x2 x1


Likewise, to derive x1's normal form, x2 x1 needs to be reduced:
x2 x1 = (b x0) x1 -> b (x0 x1) = b E.


Therefore x1 reduces to: x1 -> 1 | x4 -> 1 | b E
Therefore E reduces to: E -> x3 | x1 -> a E | 1 | b E.


The software I wrote takes these exact steps to derive an NFA from the
regular expresison E. And it just so happens that for this example, E is
the only "state" in the NFA, so this is the minimal DFA too.


Compare with the method described in Berry and Sethi ("From Regular
Expressions to Deterministic Automata" Theoretical Computer Science 1986).


After the conversion, standard techniques are used to convert the NFA into
the minimal DFA.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.