The Equational Approach: (1) Regular Expressions -> DFA's.

markh@csd4.csd.uwm.edu (Mark)
Sat, 2 Oct 1993 17:47:13 GMT

          From comp.compilers

Related articles
The Equational Approach: (1) Regular Expressions -> DFA's. markh@csd4.csd.uwm.edu (1993-10-02)
| List of all articles for this month |

Newsgroups: comp.compilers,comp.theory
From: markh@csd4.csd.uwm.edu (Mark)
Keywords: parse, theory
Organization: Computing Services Division, University of Wisconsin - Milwaukee
Date: Sat, 2 Oct 1993 17:47:13 GMT

(1.1) Introduction
      This article is a lead into a brief series of articles that illustrate
the algebraic and equational approach to formal language theory, and its
applications, such as constructing efficient and intuitive parsers, and
performing optimizations. I will demonstrate why it is actually better to
view languages, grammars, and automata as systems of equations, by
exploiting this view throughout. This is basis of methods I've been using
for quite a while now to write language software of all sorts, and for
performing symbolic manipulations on languages and automata.


    Since the question indicated by the topic has been brought up a few
times in comp.compilers, the initial article is also being posted there.
Follow-ups will be send exclusively to comp.theory. Another question that
will be indirectly answered down the line is how to generate parsers for
context free grammars written in EBNF notation.


      A couple months back, I described a method for reducing regular
expressions to finite automata that involves performing algebraic
manipulations on regular expressions. The goal of these calculations is
to get a system of equations of the form:


                                    Q = 1 + x1 Q1 + x2 Q2 + ...
or Q = x1 Q1 + x2 Q2 + ...


which is the resulting automaton in equational form.


      What's described below is a formalization of the method, and terms.
There are slight differences between the algorithm described here and the
one originally presented, this version tends to be a bit more powerful in
that it can handle a wider range of regular expression operators.


      A regular expression over an alphabet, X, is built out of the following
items and operations:


                      (0) 0 ------- to denote the empty set.
                      (1) 1 ------- to denote the empty string in X*.
                      (2) x ------- an element of X.
                      (3) [A] ----- denotes the regular expression 1 + A.
                      (4) A+ ------ denotes the regular expression A A*.
                      (5) A* ------ ITERATION (the set 1 + A + A A + A A A + ...)
                      (6) A + B --- ALTERATION (the union of A and B)
                      (7) A B ----- CONCATENATION


the following operation may also be included:


                      (8) A - B --- DIFFERENCE (the set difference of A and B).


For example, these are valid regular expressions:


                        (a [b+ a*])+ + c* a b (E1)


                        a* (b a*)* (E2)


The goal in converting a regular expression to a DFA is to get a series of
linear equations between a number of regular expressions { Q0, Q1 ...
Q{q-1} ), with Q0 being identified as the original expression. For
example, systems of equations corresponding to the regular expressions
above are:


                        (a [b+ a*])+ + c* a b (S1)
                        Q0 = a Q1 + c Q2 (Q0 = (a [b+ a*])+ + c* a b)
                        Q1 = 1 + a Q1 + b Q1 (Q1 = (a + b)*)
                        Q2 = a Q3 + c Q2 (Q2 = c* a b)
                        Q3 = b Q4 (Q3 = b)
                        Q4 = 1 (Q4 = 1)


                        a* (b a*)* (S2)
                        Q0 = 1 + a Q0 + b Q0


The right hand side of each equation is a sum of terms of one of the forms:


                                                            x Q or 1


Because all the Q's appear only once in each term, and as a factor on the
right hand side of the term, the system is known as a Right Linear System.


(1.2) Automaton = System of Equations
      If you've ever had to deal with the task of writing a program to do
symbolic manipulations on circuits, one thing would have stood out in
time: a circuit diagram is actually a concise visual encoding of a
mathematical system of relations. Each component stands for a equation
(or more generally, a relation) between the currents and voltages of its
terminals, each line represents a variable representing a voltage.


      The same is true for automata, though this fact is not widely
appreciated. A Finite Automaton is equivalent to a Right Linear System of
equations with the following relations:


                              States ............... Expressions Q0, Q1,..., Q{q-1}
                              Start State .......... Q0
                              Q a final state ...... Q = 1 + ...
                              Arcs Q --[x]--> Q' ... Q = ... + x Q' + ...


The transition function (or more generally, the transition relation) of
the automaton is thus effectively coded as the system itself.


      As an exercise, draw the automata corresponding to the systems S1 and
S2 listed above using this correspondence.


      If you've ever studied Quantum Field Theory, you'll notice a passing
resemblance in this way of approaching matters to what is done there
(where the terms in perturbation expansions are represented by diagrams,
with rules for manipulating them).


      Formally, if the transition function of the automaton is d: Q x X -> Q,
where Q is the set of states, and if \Q is defined to be 1 when Q is a
final state, and 0 otherwise, then the system of equations derived from the
automaton can be represented compactly as:


                                        Q = \Q + sum for all x in X: (x d(Q, x))


      Since this correspondence is so readily transparent, I won't even
bother to make the distinction and will consider an automaton to be
nothing more than a fancy way of depicting a system of equations.


      This correspondence, incidentally, generalizes to Push Down Automata,
as you will see later.


(1.3) The Factoring Property.
      Every regular expression over the alphabet X is a set of words taken
from X. If this expression is denoted E, then make the following
definitions:


                                  \E = 1 if 1 is contained in the regular expression E
                                        = 0 else


and for all symbols x in X:


                                            x\E = { w: x w is contained in E }


The latter operator bears a degree of resemblance to the partial
derivative operator and so the same name is sometimes used (and a notation
less constrained by the ASCII character set, more resembling the
operator).


      Take any element of the regular expression E. It's a word in X*.
Either this word is the empty word (in which case \E = 1), or it begins
with a symbol, x, and has the form x w (in which case w is in the set x\E,
and x w in the set x (x\E)). Thus the FACTORING PROPERTY:


                                    E = \E + sum for all x in X: x (x\E)


Any method used to perform this decomposition will, as a consequence,
convert a regular expression into a system of Right Linear Equations -- a
finite automaton.


(1.4) Factoring Regular Expressions
      From the definitions of \E and x\E, you can derive the basic properties
(which could have also served to define the operators) with respect to which
the calculations can be carried out. Experience proves that it's actually
better NOT to calculate each of the x\E's individually, but rather en masse.
Therefore, define the Total Differential Operator:


                                                  dE = sum x (x\E)


A consequence of this definition is that:


                                                    E = \E + dE


The following properties are then evident (and should be very easy for you
to prove -- which is your exercise :)):


                      E \E x\E dE
                      0 0 0 0
                      1 1 0 0
                      x 0 1 x
                      y 0 0 y
                    [A] 1 x\A dA
                      A* 1 x\A A* dA A*
                      A+ \A x\A A* dA A*
                    A B \A \B \A x\B + x\A B \A dB + dA B
                  A + B \A + \B x\A + x\B dA + dB
                  A - B \A - \B x\A - x\B dA - dB


where it's assumed that x and y are distinct. Note that 1 - 1 = 0, 1 - 0 = 0,
0 - 0 = 0, BUT 0 - 1 = 0 in this algebra.


(1.5) Calculation Method
      The calculation method comes straight out of the table, after noting the
following rules:


                                (a) (A B) C = A (B C)
                                          x = x 1, where x is any symbol
                                          1 A = A
                                (b) Terms can be ordered and grouped in any way.
                                          (A + B) C = A C + B C
                                          A + A = A
                                          0 + A = A
                                          x A + x B = x (A + B)
                                (c) x A - x B = x (A - B)
                                          A - A = 0
                                          0 - A = 0


To decompose a regular expression into its factors, make the following
transformations:


                                    [A] = 1 + dA
                                      A* = 1 + dA A*
                                      A+ = \A + dA A*
                                    A B = (\A \B) + \A dB + dA B
                                A + B = (\A + \B) + (dA + dB)


using the table to calculate the terms \A, \B, dA, dB. It is to be
understood that these reductions only apply to the expression itself, NOT
TO ANY OF ITS SUBEXPRESSIONS. Apply rules (b), and (c) as needed to
collect all the like terms, and apply rules (a), (b), (c) as needed to get
the final result. This system of reduction rules, by the way, will always
lead to the desired result in a finite number of steps, because the terms
being operated on by the dA and \A operators grow progressively smaller as
you carry out the reduction.


The result of this calculation is to calculate the state of the resulting
finite automata corresponding to the original expression. To calculate
the other states, you will then have to decompose each of the expressions
appearing in the factoring that have not already been decomposed. An
implementation will take care to store the expressions encountered in a
hash table in order to avoid duplications.


(1.6) Examples
                        (a* b)* a* (E3)


        d ( (a* b)* a* ) = d(a* b) (a* b)* a* + da a*
                                          = (a a* b + b) (a* b)* a* + a a*
                                          = a (a* b (a* b)* + a*) a* + b (a* b)* a


        d ( a* b (a* b)* a* + a*) = a a* b (a* b)* a* + b (a* b)* a* + a a*
                                                            = a (a* b (a* b)* a* + a*) + b (a* b)* a*


        \ ( (a* b)* a*) = \(a* b)* \a* = 1 1 = 1
        \ ( a* b (a* b)* + a*) = \(a* b (a* b)*) + \a* = \(a* b (a* b)*) + 1 = 1


Resulting from this is a 2-state finite automaton given by the system of
equations:


                    Q0 = 1 + a Q1 + b Q0 (Q0 = (a* b)* a*) (S3)
                    Q1 = 1 + a Q1 + b Q0 (Q1 = a* b (a* b)* a* + a*)


A minimalization step can be added to eliminate equivalent terms, which
will lead to a 1-state minimal DFA (Q0 = 1 + a Q0 + b Q0). In the
presence of the difference operator it will also be necessary to eliminate
useless states (those that generate no words), before minimalization.


      Another example that uses the difference operator is the following:


                                                (a + b)* - b* a a b* (E4)


For convenience, label the following expressions (I'm cheating by looking
ahead):
                                Q3 = (a + b)*
                                Q0 = (a + b)* - b* a a b* = Q3 - b* a a b*
                                Q1 = (a + b)* - a b* = Q3 - a b*
                                Q2 = (a + b)* - b* = Q3 - b*


then:
                              \Q3 = \(a + b)* = 1
                              d Q3 = (a + b)(a + b)* = a Q3 + b Q3


                              \Q0 = \Q3 - \(b* a a b*) = 1 - 0 = 1
                              d Q0
                          = d Q3 - d (b* a a b*)
                          = a Q3 + b Q3 - (b b* a a b* + a a b*)
                          = a (Q3 - a b*) + b (Q3 - b* a a b*)
                          = a Q1 + b Q0


                              \Q1 = \Q3 - \(a b*) = 1 - 0 = 1
                              d Q1
                          = d Q3 - d(a b*)
                          = a Q3 + b Q3 - a b*
                          = a (Q3 - b*) + b Q3
                          = a Q2 + b Q3


                              \Q2 = \Q3 - \b* = 1 - 1 = 0
                              d Q2
                          = d Q3 - d b*
                          = a Q3 + b Q3 - b b*
                          = a Q3 + b ( Q3 - b*)
                          = a Q3 + b Q2


The result is a 4 state automaton:


                                  Q0 = 1 + a Q1 + b Q0 (S4)
                                  Q1 = 1 + a Q2 + b Q3
                                  Q2 = a Q3 + b Q2
                                  Q3 = 1 + a Q3 + b Q3


(1.7) Complexity
      This algorithm (and all algorithms for converting R.E.'s to DFA's) will
generate at least an exponential number of states in the worst case. In
particular, the regular expression:


                                                (a + b)* a (a + b)^n


has a 2^(n+1) state minimal DFA. There are algorithms that will convert
regular expressions into NFA's in polynomial (and possibly even linear)
time, the R.E. -> NFA algorithm I presented one here a while back does.
But what's also common, as an approach, is to use a LAZY implementation of
the R.E. -> DFA conversion. That is, instead of converting the regular
expression to a DFA before processing input, defer expansion of a state
until it is actually encountered in an input.


      The conversion to an NFA can be accomplished in the same way as
described above, if the difference operator (A - B) is not used, and if
the rule:


                                                  x A + x B = x (A + B)


is no longer used. For example, converting the expression:


                                        (a + b)* a (a + b) (a + b) (a + b)


you get the following system (exercise: do the calculations):


                                            Q0 = a Q0 + b Q0 + a Q1
                                            Q1 = a Q2 + b Q2
                                            Q2 = a Q3 + b Q3
                                            Q3 = a Q4 + b Q4
                                            Q4 = 1


whereas if you attempted to combine a Q0 and a Q1 into a single term a (Q0
+ Q1) and expand Q0 + Q1 (and each of the terms spawned from combining the
terms on the right hand side) the automaton will blow up into 16 states.


(1.8) Derivation Sequences = Inequalities
      One can define an inequality relation on regular expressions by the
following:


                                              A >= B iff A = A + B


(which is just a way of saying B is a subset of A, without mentioning sets).


Then the following properties hold:


                                  (a) A >= \A, A >= x x\A, A >= dA
                                  (b) A >= B if and only if \A >= \B and x\A >= x\B
                                  (c) A >= 0
                                  (d) if A >= B and B >= A then A = B
                                  (e) if A >= B and B >= C then A >= C
                                  (f) A >= A


which effectively serves as an alternate definition.


      When you carry out a reduction of a state Q, to derive a word w in the
language, you're actually performing a series of inequalities to prove
that Q >= w. For example, with the finite automaton S4


                                  Q0 = 1 + a Q1 + b Q0 (S4)
                                  Q1 = 1 + a Q2 + b Q3
                                  Q2 = a Q3 + b Q2
                                  Q3 = 1 + a Q3 + b Q3


you can make the following derivation:


                          Q0 >= a Q1 >= a a Q2 >= a a 1 = a a


thus, Q0 >= a a.


      A consequence of this definition is the following property:


                                            E = { w in X*: E >= w }


that is, every regular expression is equal to the set of words derived
from it.


(1.9) Infinite Automata and Recursive Systems of Equations
      Suppose you attempt to treat the following set as a regular expression
and factor it:


                                          S = { a^n b^n: n >= 0 }.


This set is effectively an infinite regular expression given by the
following recursive equation:


                              S = 1 + a b + a a b b + a a a b b b + ...
                                  = 1 + a (1 + a b + a a b b + ... ) b
                                  = 1 + a S b


If you attempt to construct a DFA for it, you will end up with an
infinite number of states. In fact, we'll do that. Define the following
classes of expressions:


                                          S(n) = S b^n n >= 0
                                          T(n) = b^n


Then the following holds:


                                      \T(0) = 1
                                      \T(n) = 0, if n > 0
                                      dT(0) = 0
                                      dT(n) = b T(n - 1) if n > 0


                                      \S(n) = \S \b^n = \(1 + a S b) \T(n) = 1 \T(n)
                                                  = \T(n)
                                      dS(n) = dS b^n + d(b^n)
                                                  = a S b b^n + dT(n)
                                                  = a S(n + 1) + dT(n)


Therefore, you get the following (infinite) system of equations:


                                      S(0) = 1 + a S(1)
                                      S(n) = b T(n - 1) + a S(n + 1) if n > 0
                                      T(0) = 1
                                      T(n) = b T(n - 1), if n > 0


This represents the infinite automaton:


                                    [[S(0)]]
                                          | a
                                          v b
                                        S(1) ---------> [[T(0)]]
                                          | a ^
                                          v b | b
                                        S(2) ---------> T(1)
                                          | a ^
                                          v b | b
                                        S(3) ---------> T(2)
                                          | a ^
                                          v b | b
                                        ... ...


It's easy to show this is minimal. The only possible equivalent states here
are (apart from equality):


                                            S(n) = S(n'), 0 < n < n'
                                            T(n) = T(n'), 0 < n < n'


Let N be the smallest integer, n, such that either of the equivalences above
holds. If
                                                    S(N) = S(n'),
then
                          b T(N) + a S(N+1) = b T(n') + a S(n'+1),
and thus
                                                    T(N) = T(n').


If
                                                    T(N) = T(n')
then
                                            b T(N-1) = b T(n'-1).
Thus
                                                T(N-1) = T(n'-1),


which is impossible by assumption. This Automaton represents a special
case of what I call an Indexed Infinite Automaton (or, committing a major
abuse of terms, an Infinite Finite Automaton), and the grammar is an Index
Grammar of degree 1 and dimension 1. The automaton has the topology of a
1-ary tree.


      If you ever studied Quantum Theory, you should have experienced a
feeling of deja vu. It actually looks like we're talking about quantum
states, energy levels, and operators, and that resemblance you'll see is a
little bit more than a passing resemblance. The states S(0) and T(0) look
like ground states, the transition S(n) >= a S(n + 1) like the absorption
of a quantum of energy, and the transition T(n) >= b T(n - 1), the
emission of a quantum of energy.


      If you introduce some quantum notation (bras and kets) and apply a
similar set of formal properties to these operators, the result will be
that you've just come up with a way to represent solutions to recursive
systems of equations.


      As you'll see later on, the expression above will be represented by the
expression:
                                            <0| (<1| a)* (b |1>)* |0>


and thus, you can actually solve the equation S = 1 + a S b. The <n|'s
are Creation operators, and the |n>'s are Anhilliation operators. There
are Commutation and Orthogonality relations. For example, you can perform
the following manipulations:


                                            <0| (<1| a) (b |1>) |0>
                                        = <0| <1| |1> |0> a b
                                        = <0| |0> a b
                                        = a b


and
                                      <0| (<1| a) |0> = <0| <1| |0> a = <0| 0 a = 0


      To summarize development in one term: you now have a generalization of
Regular Expressions, known as Context Free Expressions...
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.