Re: Non-deter. regular expr. -> deter. regular exp


From: mark@omnifest.uwm.edu (Mark Hopkins)
Newsgroups: comp.compilers
Date: 27 Mar 1996 00:07:21 -0500
Organization: Omnifest
References: 96-03-125 96-03-152
Keywords: DFA, theory

      As far as I know, there is no such thing as a "non-deterministic"
regular expression, since every r.e. can be recognized by a
deterministic finite state automaton. So I have to interpret your
question to mean: Is there a purely algebraic way to extract a DFA
from a regular expression? There is.


      A regular expression can be represented as a finite sum of terms,
each of which is a finite product of factors. A factor is either a
terminal or a starred regular expression. The type of algebra that
applies to regular expressions is called a semi-ring. Its axioms are:


                        a(bc) = (ab)c              a + (b+c) = (a+b) + c
                        a1 = a = 1a                a + 0 = a = 0 + a
                        a0d = 0                    a + b = b + a
                        a(b+c)d = abd + acd


In addition, you have the property: a + a = a, which makes the algebra
an idempotent semi-ring. An expression in an idempotent semi-ring can
be represented as a finite set of terms, with each term a finite
sequence of factors and "factor" defined as above. Here, I'm using
the notation:


                                                a + b <--> a union b
                                                    1 <--> empty string
                                                    0 <--> empty set


      Therefore, by using the axioms in the following forms:


Set Manipulations:
                        a --> 0 + a
                        (a + b) + c --> (a + c) + b      to sort
                        a + a --> a                      to remove duplicates
                        a + 0 --> a                      to remove trailing zeros
                        a + (b + c) --> (a + b) + c      to associate to the left
Sequence Manipulations:
                        a --> 1a
                        a1 --> a                         to remove trailing ones
                        (ab)c --> a(bc)                  to associate to the right
Distributivity:
                        a0 --> 0
                        a(b + c) --> ab + ac


you arrive at a normal form for expressions that can be mapped one-to-one to
the "finite set of finite sequence" form described above.


Example:  a* ((b + c) d) + e (1 (f + g))
          --> ((((0 + ((1 a*) b) d) + ((1 a*) c) d) + (1 e) f) + (1 e) g)
            = a* b d + a* c d + e f + e g
          <-->
              { (a*, b, d), (a*, c, d), (e, f), (e, g) }


by repeated applications of the steps above, where I'm assuming that the
four terms are already sorted.
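
As a minimal sketch in Python (assuming an expression is represented directly
as a finite set of factor sequences; ZERO, ONE, sym, star, plus_ and seq are
illustrative names, not anything standard), the sorting, duplicate-removal
and distributivity steps above are handled by the set and tuple operations
themselves:

  # A normal-form expression is a frozenset of terms; a term is a tuple of
  # factors; a factor is either a terminal (a one-character string) or a
  # pair ('*', E), where E is itself a normal-form expression.

  ZERO = frozenset()           # 0 : the empty set
  ONE  = frozenset({()})       # 1 : the set holding only the empty sequence

  def sym(x):                  # a single terminal
      return frozenset({(x,)})

  def star(e):                 # e*, as a one-factor term
      return frozenset({(('*', e),)})

  def plus_(e, f):             # e + f : union of the term sets (so + is idempotent)
      return e | f

  def seq(e, f):               # e f : distribute concatenation over the terms
      return frozenset(s + t for s in e for t in f)

  # The example above:  a* ((b + c) d) + e (1 (f + g))
  a, b, c, d, e, f, g = map(sym, "abcdefg")
  E = plus_(seq(star(a), seq(plus_(b, c), d)),
            seq(e, seq(ONE, plus_(f, g))))
  # E is now the four-term set { (a*, b, d), (a*, c, d), (e, f), (e, g) }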


      Afterwards you can strip off any leading 0's and 1's with the reductions:


                                                  0 + a --> a
                                                        1a --> a


      You can convert each expression into a system of equations to which it is
the minimal solution. Each equation in the system will have the form:


                Taylor's Theorem:
                            E = E_0 + x dE/dx + y dE/dy + ... + z dE/dz


where x, y, ..., z are the terminals, and E_0 is either 0 or 1. This is
called right-linear form, since the variables (here, the derivatives
dE/dx, ..., dE/dz) appear only at the right-hand end of each term. The
right-hand side is "deterministic" in the sense that, once it is reduced to
normal form, no more than one term contains a given terminal.


      E_0 is obtained by substituting 0 in for all the terminals in E.


      d/dx is a "partial differential operator" and is defined by the following:


                              d1/dx = d0/dx = 0
                              dx/dx = 1, dy/dx = 0 if y is a terminal other than x.
                              d(A*)/dx = dA/dx A*
                              d(AB)/dx = dA/dx B + A_0 dB/dx
                              d(A+B)/dx = dA/dx + dB/dx


Note the variation in the product rule. You can easily extend this to the
other operations [A] = 1 + A and A+ = A A*, by the following:


                        d[A]/dx  = d(1 + A)/dx = d1/dx + dA/dx = 0 + dA/dx = dA/dx
                        d(A+)/dx = d(A A*)/dx = dA/dx A* + A_0 dA/dx A*
                                 = (1 + A_0) dA/dx A*
                                 = dA/dx A*


since 1 + A_0 is either 1 + 0 = 1, or 1 + 1 = 1.
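
In the set-of-sequences sketch earlier, neither operation needs a factor of
its own; assuming the same hypothetical constructors, they can simply be
expanded away before normalizing:

  def opt(e):                  # [A] = 1 + A
      return plus_(ONE, e)

  def plus_closure(e):         # A+ = A A*
      return seq(e, star(e))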


      Define the total differential dE as the sum of all the terms x dE/dx. Then
you can write:
                                              E = E_0 + dE
with
                                          d0 = d1 = 0
                                          dx = x, for all terminals x
                                          d(A*) = d(A+) = dA A*
                                          d([A]) = dA
                                          d(AB) = dA B + A_0 dB
                                          d(A+B) = dA + dB


Also, you can define E_0 by the following table:


                                              0_0 = x_0 = 0
                                              1_0 = [A]_0 = (A*)_0 = 1
                                              (A+)_0 = A_0
                                              (A+B)_0 = A_0 + B_0
                                              (AB)_0 = A_0 B_0


which is equivalent to making the above-mentioned substitutions.
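
As a concrete sketch of E_0 and d/dx over the set-of-sequences representation
from the earlier Python fragment (const, deriv and d_term are, again, just
illustrative names):

  def const(e):
      # E_0: a term contributes 1 exactly when every one of its factors is a
      # starred subexpression (x_0 = 0, (A*)_0 = 1); otherwise it contributes 0.
      return ONE if any(all(not isinstance(f, str) for f in t) for t in e) else ZERO

  def d_term(t, x):
      # d(f1 f2 ... fn)/dx = d(f1)/dx f2...fn + (f1)_0 d(f2 ... fn)/dx
      if not t:
          return ZERO                           # the empty term is 1, and d1/dx = 0
      head, rest = t[0], t[1:]
      rest_nf = frozenset({rest})               # the remaining factors, as one term
      if isinstance(head, str):                 # a terminal: dx/dx = 1, dy/dx = 0; y_0 = 0
          return rest_nf if head == x else ZERO
      _, A = head                               # a star factor: d(A*)/dx = dA/dx A*
      first = seq(seq(deriv(A, x), star(A)), rest_nf)
      return plus_(first, d_term(rest, x))      # (A*)_0 = 1, so the second summand stays

  def deriv(e, x):
      # d(A + B)/dx = dA/dx + dB/dx
      out = ZERO
      for t in e:
          out = plus_(out, d_term(t, x))
      return out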


      So apply Taylor's Theorem and reduce the result to normal form.


Example: a* (b a*)*
                  (a* (b a*)*)_0 = 0* (0 0*)* = 1 (0)* = 1


                  d(a* (b a*)*) = a (a* (b a*)*) + (b a*) (b a*)*
                                          --> a (a* (b a*)*) + b (a* (b a*)*)


This establishes a* (b a*)* as a solution to a system consisting of only
one equation, of the form:


                                                  x = 1 + a x + b x
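
Using the hypothetical const/deriv sketch from earlier, this one-equation
system can be checked directly:

  A = star(sym('a'))
  R = seq(A, star(seq(sym('b'), A)))       # a* (b a*)*
  assert const(R) == ONE                   # R_0 = 1
  assert deriv(R, 'a') == R                # dR/da = a* (b a*)*
  assert deriv(R, 'b') == R                # dR/db = a* (b a*)*
  # i.e. a single state x satisfying  x = 1 + a x + b x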


In general, there will be more than one expression and more than one equation.
When arriving at the result:


                                      E = E_0 + x dE/dx + y dE/dy + ... + z dE/dz


each of the derivatives dE/dx, dE/dy and dE/dz will likewise have to be
reduced, if they haven't been already. This is where the extra equations will
come from.


      The automaton is read off from this as follows:


                (1) Each expression is a state
                (2) State E is final if E_0 = 1.
                (3) State E has a transition to state F on x if dE/dx = F.
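
A short sketch of steps (1)-(3), reusing the hypothetical const/deriv
functions from earlier (build_dfa and its variable names are likewise only
illustrative):

  def build_dfa(e, alphabet):
      # States are normal-form expressions, generated by repeated differentiation.
      states, trans, todo = {e}, {}, [e]
      while todo:
          s = todo.pop()
          for x in alphabet:
              t = deriv(s, x)               # dE/dx = F  gives  E --x--> F
              trans[(s, x)] = t
              if t not in states:
                  states.add(t)
                  todo.append(t)
      finals = {s for s in states if const(s) == ONE}    # E is final iff E_0 = 1
      return states, trans, finals

On the example a* (b a*)* over the alphabet {a, b}, this yields a single
final state with both transitions looping back to it, matching
x = 1 + a x + b x; in general a non-final dead state also shows up, for the
derivatives that come out to 0.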


---------------


      To actually prove that each expression is a solution to its corresponding
system of equations (and that it's also the least solution), you need to
extend the algebra by adding in the following axioms:


                        1 + a a* = a*
                        if a + bx + xc <= x then b* a c* <= x


where (u <= v) is defined by (u + v = v).


      From this you can ultimately show that


                                                      0* = 1* = 1
                                            a* = 1 + a a* = 1 + a* a


and other similar properties, as well as showing that the following equations
have the following least solutions:


                        x = a + bx                    x = b* a
                        x = a + xc                    x = a c*
                        x = a + bx + xc               x = b* a c*
                        x = a + xdx                   x = a (d a)* = (a d)* a
                        x = a + bx + xc + xdx         x = b* a c* (d b* a c*)*
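
For instance, that b* a solves the first equation follows directly from
1 + a a* = a*:

                        b* a = (1 + b b*) a = a + b (b* a)

and the second axiom (taken with c = 0, using 0* = 1) is what makes it the
least such solution.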


For the example above, this implies that:


                                        x = 1 + ax + bx = 1 + (a + b)x


has (a + b)* as its least solution, whose equality to a* (b a*)* is also
provable from the two axioms above.

