Tue, 5 Jan 2010 01:58:15 +0000 (UTC)

From: | Kaz Kylheku <kkylheku@gmail.com> |

Newsgroups: | comp.compilers |

Date: | Tue, 5 Jan 2010 01:58:15 +0000 (UTC) |

Organization: | A noiseless patient Spider |

References: | 10-01-013 |

Keywords: | lex, DFA |

Posted-Date: | 05 Jan 2010 13:36:03 EST |

On 2010-01-03, Peng Yu <pengyu.ut@gmail.com> wrote:

*> Suppose that I have two regular sets A and B that are generated by two*

*> regular expressions. I'm wondering if there is always a regular*

*> expression that can generated the difference between the two sets,*

*> i.e., A-B. If there is such regular expression, is there a mechanical*

*> way to generated such regular expression.*

The following paper, which is actually not a paper but a set of solutions

to a midterm examination, has some important clues.

It gives two algoirthms which are defined over NFA's: one for

computing the complement of an NFA (~A), and one for computing the

intersection of two NFA's, A&B.

The difference A-B can be given by this equivalence:

A-B <=> A&(~B)

I.e. applying both algorithms. In fact this is one of the questions

in this midterm:

http://cs164fa09.pbworks.com/f/midterm1_solutions.pdf

Actually, by de Morgan's we should be be able to eliminate &, right?

A&(~B) <=> ~(~A|~(~B)) <=> ~(~A|B)

So it looks like all we need is the complement to do difference.

Constructing the complement over NFA's is far simpler

than going from NFA -> DFA and back.

Complement is simple, in particular: the complemented NFA inherits all

of states and transitions of the original NFA. The meaning of the

states is inverted: non-accepting states in the original become

acceptance states, and vice versa. A new failure state is added which

is an acceptance state (due to the inverted sense of the

complement). Extra transitions are added such that for any state in

the original NFA which has no transition on some input character C, a

transition is added for that state and character which takes it to

this failure state. This rule is applied to the failure state itself,

which self-transitions on all inputs, serving as a ``honeypot'' for

arbitrarily long suffixes. I.e. every explicit failure in the

original NFA (no transition on some character) is routed to this

failure state which has accept semantics under the complement, and

whenever the original NFA reaches an accepting state, in the

complement NFA, this event corresponds to reaching a non-accepting

state. Thus success becomes failure and vice versa.

This still leaves you with the problem of reconstructing a regular

expression from an NFA.

A workable approach for doing the complement, difference, etc,

directly over the regular expression may be to exploit derivatives

somehow.

The derivative of a language L with respect so some string u, is the

set of suffixes of all strings in L which have u as a prefix, with the

u prefix removed. E.g if the languages is { abc, def, deg }, its

derivative with respect to de is { f, g }.

In consideration of a language being generated by a regex, the derivative

concept can be applied to the regex. For instance the derivative of

the regex abb* with respect to ab is the regex b* (which can derive the

empty string). The derivative can sometimes be the empty regex e,

or a special null value which means ``empty set'' or ``matches nothing''.

E.g. the derivative of the regex a* with respect to the string b is

this null value, which may be notated 0. We can write this as

d(b,a*) = 0.

Derivatives can be used to match regular expressions, as an alternative to NFA

construction, and quite readily support operations like complement and

intersection. They can be translated directly to DFA.

It looks like there should be a way to construct a complement, difference or

intersection using derivatives somehow.

Useful paper on derivatives: _Regular Expression Derivatives

Re-Examined_, Owens, Reppy, Turon:

http://www.ccs.neu.edu/home/turon/re-deriv.pdf (and its references).

Using pencil and paper, I think I found the start of a workable

algorithm for computing the compelment of a regex using derivatives,

directly. A similar approach should work for intersections.

Using a four letter {a, b, c, d} alphabet, I computed the following

example complement:

complement(a(abc)*d) -> (|[bcd]|a(abc)*[bc]|a(abc)*d[abcd]+|

aab[abd][abcd]*|aa[acd][abcd]*)

On informal inspection this seems right (do you see any mistakes).

I.e. an empty string mismatches, since the regex derives only nonempty

strings, so we include it. The the one-letter strings b, c and d do

not match, because the regex has a null derivative with respect to

these. An a followed by by zero or more repetitions of "abc",

followed by something other than an a or d (i.e. [bc]) does not match

and so is a branch of of the complement. And so forth.

The ``pre-algorithm'', loosely speaking, is to consider the space of

possible inputs, letter by letter. For each letter, compute the

derivative of the regex with respect to that letter. For all letters

deriving a 0 (null), you can construct a set; this is the set matched

by the complement regex at that character position: add this as a

disjunction to the complement regex, remembering that in the branch

you must match the prior characters leading up to that position). The

* operator is handled specially, along these lines. When the regex is

of the form (r1)*.r2 (I.e. a kleene followed by more regex material)

the complement should match zero or more r1, and then a character

which may not begin r1 or r2. Another branch matches (r1*), and then

continues the process. Of course, cases also have to be included for

mismatching /into/ r1, character by character. I don't have all this

worked out in detail, but just playing with pencil and paper

derivations, the concept seems promising.

In the above example, the rule gives us the two branches a(abc)*[bc],

and a(abc)*d[abcd]. In the first one, the complement regex matches

leading a, the string abc zero or more times, but then a "bc" (letters

which do not begin another "abc" repetition, nor match the "d" suffix

which follows). In the second one, the complement matches a leading a

with zero or more of the "abc" repeat, then a d which branches to the

next part of the pattern, but then an extra letter (wildcard of the

whole alphabet) which spoils the match. How that last one works is

that the derivative of d with respect to a, b, and c derives 0, so we

match a d (the complement of those three). Then we must match the

complement of the epsilon regex which is any letter, one or more

times. I.e. the complement regex must match all strings which are

prefixes of the regular regex, but have incompatible suffixes: d

followed by junk.

The mismatches through the (abc) part of *(abc) are reflected in

the branches aa[acd][abcd]* and aab[abd][abcd]*. I.e. the complement matches

the leading a, and then "a" (possible start of "abc") followed by something

other than b, and also the leading a, then "ab", followed by something other

than c, failing to complete the "abc". Both branches have to allow arbitrary

amount of junk, all of which mismatches the original regex and so matches

the complement.

Without a question, character class syntax with copmlements is essential, since

for an actual character set, the complement spaces are huge; e.g. we want to

encode [abcd] simply as . ; the equivalent of literally writing [abcd] even in

ASCII will kill us, not to mention Unicode.

In our above example, we really want to write it like this:

complement(a(abc)*d) -> (|[^a]|a(abc)*[^ad]|a(abc)*d.+|aab[^c].*|aa[^b].*)

Now it works for an alphabet which includes symbols other than just a, b, c

and d.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.