Fri, 17 Apr 2009 11:05:31 +0200

Related articles |
---|

[4 earlier articles] |

Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-03-31) |

Re: additional regular expression operators rpboland@gmail.com (Ralph Boland) (2009-03-31) |

Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-14) |

Re: additional regular expression operators zayenz@gmail.com (MZL) (2009-04-15) |

Re: additional regular expression operators anton@mips.complang.tuwien.ac.at (2009-04-16) |

Re: additional regular expression operators gah@ugcs.caltech.edu (glen herrmannsfeldt) (2009-04-16) |

Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-17) |

Re: additional regular expression operators mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-04-17) |

From: | torbenm@pc-003.diku.dk (Torben =?iso-8859-1?Q?=C6gidius?= Mogensen) |

Newsgroups: | comp.compilers |

Date: | Fri, 17 Apr 2009 11:05:31 +0200 |

Organization: | Department of Computer Science, University of Copenhagen |

References: | 09-03-111 09-03-123 09-04-001 |

Keywords: | lex |

Posted-Date: | 17 Apr 2009 06:12:28 EDT |

Ralph Boland <rpboland@gmail.com> writes:

*> I have never found a useful example of set intersection though.*

In lexers for programming languages, such examples are, indeed, rare

-- as are any nontrivial regular expression. The most complex are

usually for floating-point numbers, comments and strings.

But you also use regular languages to specify properties for state

machines, and if you want to check that two properties are both

fulfilled, you can do so by checking with the intersection property

instead of once for each property.

*> I have found so far only one paper on implementing these operators*

*> and it is complex. I will take a look at derivative based algorithms.*

Here is a short introduction:

We define three functions of regular expressions:

Nullable(r) is true if r accepts the empty string, false otherwise.

First(r) is the set of characters that can start strings in L(r).

Next(r,c) is a regular expression s that accepts a string w if and

only if r accepts the string cw.

We can define these structurally on regular expressions in this way:

Nullable(\epsilon) = true

Nullable('a') = false

Nullable(r|s) = Nullable(r) or Nullable(s)

Nullable(rs) = Nullable(r) and Nullable(s)

Nullable(r*) = true

Nullable(r+) = Nullable(r)

Nullable(r&s) = Nullable(r) and Nullable(s)

where r&s is the intersection of r and s.

First(\epsilon) = {}

First('a') = {'a'}

First(r|s) = First(r) U First(s)

First(rs) = First(r) U First(s), if Nullable(r)

First(rs) = First(r), if not Nullable(r)

First(r*) = First(r)

First(r+) = First(r)

First(r&s) = First(r) \cap First(s)

where \cap is the set intersection operator.

Next(\epsilon, c) = \phi

Next('a', 'a') = \epsilon

Next('a', c) = \phi, if c != 'a'

Next(r|s, c) = Next(r, c)| Next(s, c)

Next(rs, c) = Next(r, c)s | Next(s, c), if Nullable(r)

Next(rs, c) = Next(r, c)s, if not Nullable(r)

Next(r*, c) = Next(r, c)r*

Next(r+, c) = Next(r, c)r*

Next(r&s, c) = Next(r, c) & Next(s, c)

where \phi is the regular expression that accepts no strings. We can

use the following reduction rules:

\phi | r = r

r | \phi = r

\phi r = \phi

r \phi = \phi

(\phi)* = \epsilon

(\phi)+ = \phi

\phi & r = \phi

r & \phi = \phi

We can now build a DFA from a regular expression by marking satets

with regular expressions:

- The initial state is marked with r

- There are outgoing edges from r on all symbols c in First(r)

- The transition from r on c is to Next(r,c)

- A state marked r is acception if Nullable(r)

To avoid getting an infinite DFA, you need to reduce regular

expressions so you won't get an infinite number of equivalent regular

expressions. The more reduction rules you use, the smaller your

resulting DFA will be. A good idea is to define some (arbitrary)

total order on regular expressions and make sure r<s in r|s and r&s.

If r=s, both can be reduced to r. Some useful rules are:

(r|s)|t = r(s|t)

(rs)t = r(st)

\epsilon r = r

r \epsilon = r

\epsilon* = \epsilon

r(r*) = r+

r(r+) = r+

r+ | \epsilon = r*

r* | \epsilon = r*

Torben

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.