Re: thompson's construction: superfluous epsilon transitions ?

torbenm@diku.dk (Torben Ęgidius Mogensen)
15 Apr 2004 12:26:50 -0400

          From comp.compilers

Related articles
thompson's construction: superfluous epsilon transitions ? spur4444@yahoo.com (2004-04-03)
Re: thompson's construction: superfluous epsilon transitions ? torbenm@diku.dk (2004-04-15)
Re: thompson's construction: superfluous epsilon transitions ? torbenm@diku.dk (2004-04-21)
| List of all articles for this month |

From: torbenm@diku.dk (Torben Ęgidius Mogensen)
Newsgroups: comp.compilers
Date: 15 Apr 2004 12:26:50 -0400
Organization: Department of Computer Science, University of Copenhagen
References: 04-04-022
Keywords: parse
Posted-Date: 15 Apr 2004 12:26:50 EDT

spur4444@yahoo.com (Spur) writes:
> I have seen Thompson construction diagrams of two kinds -
> minimalistic, and with additional Eps edges, probably for clarity.
>
> For example, the following are two ways to represent the concatenation
> of a and b:
>
> (S0)---a-->(S1)---b-->(F)
>
> (S0)---a-->(S1)---Eps-->(S2)---b-->(F)
>
> Similarly, there are two ways to represent a|b, one with just 2 states
> and two edges (a and b) between them, and another with 6 states,
> padded with Eps edges all along.
>
> Thus, I want to ask whether the Kleene transition (a*) can be made
> simpler ?
> Why can't it be just:
>
> <~~~~Eps~~~~~~|
> (S0)---a,Eps-->(F)
>
> That is, from the start state edges to the final state on a or Eps,
> and back on Eps ?
>
> It looks like this will accept a* too, doesn't it ? So why isn't it
> presented as an alternative to the "normal" 4 state representation ?
>
> The only problem I see here is the "Eps loop", i.e. Eps leads from S0
> to F, and Eps leads from F to S0. Is this illegal in NFAs ? I guess it
> can be a problem because to any accepting path and infinite string of
> S0->F->S0->F... may be added since the transitions are Eps. But can't
> we define the accepting string as the minimal and solve this problem ?




Thompsons construction is compositional, i.e., you combine NFA's
without knowing anything about what regular expressions they
represent. The classical construction has the following invariants:


  1) Each NFA used in the construction has unique start and end states.


  2) There are no transitions from the NFA to its start state.


The second invariant allows the concatenation rule to combine the end
state of the first NFA with the start state of the second NFA. Let us
assume that the invariant did not hold: For example let us assume that
NFA 1 is
        ___
      / a \
    V \
(S0)-a->(S1)


and NFA 2 is
        ___
      / b \
    V \
(S2)-b->(S3)


If we concatenating these by combining S1 and S2 to S4, we would get


        ___ ___
      / a \ / b \
    V \ V \
(S0)-a->(S4)-b->(S3)


which would recognize erroneous strings (e.g., abbaab).


If you combine start and end states in the alternative construction
you might get wrong results if there are transitions out of the end
states into the component NFA's, but you could get around this by
adding an extra invariant that says this doesn't occur. You will have
to make sure that your constructions preserve all invariants, though.


To conclude, you can have variants of Thompsons construction with
different number of invariants. More invariants allow more states to
be merged when NFA's are combined, but may require additional states
and epsilon transitions to preserve the invariants. The optimum
depends on the mix of different constructions (e.g., how often Kleene
star is used compared to concatenation or alternative).


You will find special cases where the NFA's you combine obey stronger
properties than the invariants guarantee, and in these cases you can
save a few states or transitions. Your examples of a|b or a* are in
this category.


Torben Mogensen


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.