Tue, 31 Mar 2009 18:07:38 -0700 (PDT)

Related articles |
---|

additional regular expression operators rpboland@gmail.com (Ralph Boland) (2009-03-29) |

Re: additional regular expression operators m.helvensteijn@gmail.com (2009-03-30) |

Re: additional regular expression operators zaimoni@zaimoni.com (2009-03-30) |

Re: additional regular expression operators haberg_20080406@math.su.se (Hans Aberg) (2009-03-30) |

Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-03-31) |

Re: additional regular expression operators rpboland@gmail.com (Ralph Boland) (2009-03-31) |

Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-14) |

Re: additional regular expression operators zayenz@gmail.com (MZL) (2009-04-15) |

Re: additional regular expression operators anton@mips.complang.tuwien.ac.at (2009-04-16) |

Re: additional regular expression operators gah@ugcs.caltech.edu (glen herrmannsfeldt) (2009-04-16) |

Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-17) |

Re: additional regular expression operators mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-04-17) |

From: | Ralph Boland <rpboland@gmail.com> |

Newsgroups: | comp.compilers |

Date: | Tue, 31 Mar 2009 18:07:38 -0700 (PDT) |

Organization: | Compilers Central |

References: | 09-03-111 09-03-123 |

Keywords: | lex |

Posted-Date: | 04 Apr 2009 09:44:40 EDT |

On Mar 31, 5:30 am, torb...@pc-003.diku.dk (Torben Fgidius Mogensen)

wrote:

*> Ralph Boland <rpbol...@gmail.com> writes:*

*> > I plan to provide support for at least three additional*

*> > unary operators to the standard three (?,*, and +)*

*> > (plus new binary operators but that's a topic for another day).*

*> > For a regular expression R they are:*

*>*

*> > R! :does not accept the empty string but otherwise*

*> > accepts every expression that R does. Very useful.*

*>*

*> > R~ :accepts exactly the set of strings that R does not accept.*

*> > note that R = R~~.*

*>*

*> > R% : equivalent to the string (R~)R*

*>*

*> I can see the usefulness of the first two, but not really the third.*

*> Do you have an example?*

It's useful in pattern matching where we want to find the first

occurrence of pattern R. Admittedly this only gives us the

point where R ends and we still need to find where R starts.

It is also useful in scanners where we have comments.

If a comment starts with <: and ends with :> the

definition of a comment would be:

<:(:>)%.

The rarity of applications of the % operator could admittedly

be used to argue that the % operator is not worth implementing.

After all, we could have instead written: <:(:>)~:>.

*>*

*> In general, I find R\S (strings in R but not in S) more useful than*

*> the above, and you can easily define the above using this: R! = R\e,*

*> where e is the empty string and R~ = (.*)\R, where . represents any*

*> character.*

I intend to implement set difference but as it is not a unary operator

it was not the subject of my posting. I don't have an explicit symbol

for the empty string and am not planning to add it. If find the '!'

operator very useful and would still want it even if I could write R\e

as you propose. To me, allowing an e symbol is only useful if you

plan to get rid of both '?' and '!' operators, which is not my

preference. While it is true that R~ = (.*)\R I prefer to support

both operators.

*> > I have put these operators after the expression but in fact I prefer*

*> > to have them in front*

*> > because:*

*> > 1) I prefer to have unary operators in front of their expressions*

*> > as in "-5" rather than "5-".*

*>*

*> That is a matter of taste, but you would IMO need more compelling*

*> reasons to change what has been standard for half a century.*

Yes, I know. I prefer unary operators to be prefix but momentum

is very much against me. I always expected in the end to conform

to the norm and so far this forum has provided me with no reason

to do otherwise.

*>*

*> > 2) Parsing is easier and faster if the unary operator precedes*

*> > the expression.*

*>*

*> Not really. If that was the case, early mathematical calculators*

*> would have used prefix sine, log, etc., but even calculators with*

*> infix arithmetic and precedence levels (such as Texas SR51) used*

*> postfix unary operators.*

An interesting point but contrary to my experience. If I write a

grammar the prefix unary operators are easy to handle with both LL(1)

based parsing and LR(1) based parsing. With postfix operators I have

problems with needing to recognize an expression too soon. Examples

of problems with LL(1) parsing are easy to construct. I haven't

written an LR(1) grammar in a while so I can't remember the exact

problems I had but I do remember having them and thinking: "This would

be so much simpler if the unary operators were prefix". I will be

developing a system where the user can define his own operators. My

utilities will then need to generate the grammars based upon the

defined operators (and the standard ones). It will matter that the

grammars be as simple and regular as possible.

*> > Suggestions as to further operators to support welcome. Suggestions*

*> > as to what are the best symbols to use for the new operators welcome.*

*>*

*> As I said, R\S is very useful. I have also often found use for "a*

*> sequence of Rs separated by <symbol>", such as a list of names*

*> separated by commas. Something like R_s, where s is the symbol could*

*> be a possible notation, e.g. ([A-Za-z]*)_',' for the above example.*

*> R_s expands to (Rs)*R, but since R occurs twice in this, you can*

*> potentially reduce the size of the regexp a lot.*

This is the list operator which for me will be a binary operator with

regular expressions on each side. Admittedly one of the operands is

usually a symbol but I see no reason for this restriction. This is a

very useful operator that should be included in any regular expression

based utility. There are also efficiency reasons for providing the

list operator; just try applying the list operator multiple times as

in (R $ S $ T $ U) where $ is the list operator and then expanding the

operator out where R # S = (RS)*S. Then look as what the

corresponding minimum finite state machine looks like.

*> Also, you sometimes want a sequence of things in arbitrary order. So*

*> something like {R,S,T}% as an abbreviation of RST|RTS|SRT|STR|TRS|TSR*

*> would be useful. But it can lead to extremely large expanded*

*> expressions or automatons.*

I agree that this is useful but I prefer the notation: R @ S @ T.

Note that: R @ S @ T != (R @ S) @ T != R @ (S @ T).

*> Note that once you have complement or difference, the construction of*

*> automatons get more complicated, as you can't construct an NFA*

*> compositionally and then convert this to a DFA, you have to construct*

*> a DFA directly. There are standard methods for this based on*

*> derivatives of regular expressions, though, so it is not a major*

*> issue. It just means that you can't easily modify an existing tool.*

I am writing my tools from scratch. I am aware that complement

and the other set based operators are a major task to implement.

But complement and set difference are too important to leave out.

I have never found a useful example of set intersection though.

I have found so far only one paper on implementing these operators

and it is complex. I will take a look at derivative based algorithms.

Thanks for your comments.

Ralph Boland

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.