Tue, 18 May 1993 05:07:37 GMT

Related articles |
---|

Regular Expression -> Minimal DFA algorithm markh@csd4.csd.uwm.edu (1993-05-18) |

Regular Expression -> Minimal DFA algorithm II markh@csd4.csd.uwm.edu (1993-05-20) |

Newsgroups: | comp.theory,comp.compilers |

From: | markh@csd4.csd.uwm.edu (Mark) |

Keywords: | DFA |

Organization: | Computing Services Division, University of Wisconsin - Milwaukee |

Date: | Tue, 18 May 1993 05:07:37 GMT |

Early this morning, last year, I decided to write a demo program to

illustrate a certain algorithm that efficiently converts Regular

Expressions to NFA's. This year, this afternoon, I extended it to make

conversion from Regular Expressions to minimal DFA's (The Second Annual

One Day Formal Language Project).

I'm about to describe the Reg.Ex. -> NFA algorithm below. The

conversion from NFA to DFA is standard.

Regular expressions can consist of the following:

(0) 0 ------- to denote the empty set.

(1) 1 ------- to denote the empty string.

(2) x ------- any valid C identifier, or string literal.

(3) [A] ----- denotes the regular expression 1 | A.

(4) A+ ------ denotes the regular expression A A*.

(5) A* ------ ITERATION (the set 1 | A | A A | A A A | ...)

(6) A | B --- ALTERATION (the union of A and B)

(7) A B ----- CONCATENATION

For example, these are valid regular expressions:

(a [b+ a*])+ | c* a b

a* (b a*)*

The output is the minimal deterministic finite automaton listed in

equational form.

You can get a copy of the software by anonymous ftp at csd4.csd.uwm.edu

in the directory pub/regex. You WILL need an ANSI-C compiler to compile

it. The library functions in <stdlib.h> (particularily realloc()) are

assumed to have the semantics indicated by ANSI.

THE ALGORITHM:

If E is a regular expression, then the reduction rules below will apply to

it, for when E takes on the following forms:

E = x -> x 1

E = [A] -> 1 | A

E = A+ -> A (1 | E)

E = A* -> 1 | A E

E = 0 | A -> A

E = A | 0 -> A

E = A | B -> a | b, where A and B reduce respectively to a and b.

E = 0 A -> 0

E = 1 A -> A

E = [A] B -> B | A B

E = A+ B -> A (B | E)

E = A* B -> B | A E

E = (A | B) C -> A C | B C

E = (A B) C -> A (B C)

They are NOT applied recursively to subexpressions in E, except where

explicitly noted with the alternation A | B, but are only applied to the

top-level expression.

By induction, the following can be proven to be the general normal form for E:

E = sum of terms of the form 1 and x A (0 = empty sum).

The expression, A, in each term of the form x A in the sum is considered a

state in the NFA. Each newly generated state is reduced to normal form in

the same way, and this in turn may generate more new states. When all the

states that were generated from the top-level expression have been fully

reduced, the result is an NFA representing the original expression. This

process will always halt. In fact, I believe it is LINEAR with respect to

the size of the expression's parse tree with a coefficient <= 2.

The starting state of the NFA is the top-level expression itself.

For example, take this expression

E = a* (b a*)*

for convenience, represent it as follows:

E = x0 x1, where x0 = a*, x1 = x2*, x2 = b x0

then E reduces as follows:

E = x0 x1 = a* x1 -> x3 | x1, where x3 = a E

To derive E's normal form, x1 and x3 have to be reduced:

x3: already reduced.

x1 = x2* -> 1 | x2 x1

Likewise, to derive x1's normal form, x2 x1 needs to be reduced:

x2 x1 = (b x0) x1 -> b (x0 x1) = b E.

Therefore x1 reduces to: x1 -> 1 | x4 -> 1 | b E

Therefore E reduces to: E -> x3 | x1 -> a E | 1 | b E.

The software I wrote takes these exact steps to derive an NFA from the

regular expresison E. And it just so happens that for this example, E is

the only "state" in the NFA, so this is the minimal DFA too.

Compare with the method described in Berry and Sethi ("From Regular

Expressions to Deterministic Automata" Theoretical Computer Science 1986).

After the conversion, standard techniques are used to convert the NFA into

the minimal DFA.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.