Re: Regular expressions speedup

Cleo Saulnier <cleos@nb.sympatico-dot-ca.remove>
13 Aug 2005 00:26:26 -0400

          From comp.compilers

Related articles
[3 earlier articles]
Re: Regular expressions speedup haberg@math.su.se (2005-08-10)
Re: Regular expressions speedup bonzini@gnu.org (Paolo Bonzini) (2005-08-10)
Re: Regular expressions speedup dot@dotat.at (Tony Finch) (2005-08-10)
Re: Regular expressions speedup kszabo@bcml120x.ca.nortel.com (2005-08-10)
Re: Regular expressions speedup jburgy@gmail.com (2005-08-10)
Re: Regular expressions speedup torbenm@diku.dk (2005-08-13)
Re: Regular expressions speedup cleos@nb.sympatico-dot-ca.remove (Cleo Saulnier) (2005-08-13)
| List of all articles for this month |

From: Cleo Saulnier <cleos@nb.sympatico-dot-ca.remove>
Newsgroups: comp.compilers
Date: 13 Aug 2005 00:26:26 -0400
Organization: Aliant Internet
References: 05-08-02305-08-029 05-08-046
Keywords: lex
Posted-Date: 13 Aug 2005 00:26:26 EDT

jburgy@gmail.com wrote:
> Cleo Saulnier wrote:
>
>>Well, I've looked at DFA and it's great if you don't have
>>backreferences. But I think I figured out how to resolve this
>>problem. However, now I have a new problem with the backreferences.
>>
>>(a|b)*(bab)+?
>>
>>input string: aaababbabbab
>>
>>Assume that I need the two backreferences.
>>
>>The 1st backreference will match: b (full match is: aaababbab)
>>and the 2nd one will match: bab (at end of string)
>>
>>I'm currently at a loss how to make sure the above is matched and not
>>have it stop prematurely because of the lazy matching at say aaabab
>
>
> Hi there,
>
> I may not be reading your post correctly but it sounds like you're
> talking about submatches, not backreferences. A backreference is when
> you use a submatch (or capturing parenthesis) as an element in your
> regexp. The canonical example is searching for repeated words, e.g.:
>
> \<([A-Za-z]+) +\1\>
>
> (assuming \< \> represent word boundaries. If that's your problem,
> please ignore me. If you're trying to resole the inherent ambiguity
> that arises when matching subpatterns, I suggest you read the only
> serious treatment I found of it: Ville Laurikari's master thesis
> "Efficient submatch addressing for regular expressions"
>
> ville.laurikari.net/regex-submatch.pdf
>
> Ville went on to release an excellent regexp matching library called
> TRE
>
> laurikari.net/tre/
>
> Maybe you want to collaborate with him rather than reinvent your own
> wheel. Or maybe you're doing it as a learning experience and I wish you
> good luck :)


Thanks, this is a first rate reply. You are right about everything in
your post. I did mean capturing pars, but I did intend on using them
later in the same regex (but showing that was getting a bit complex if
you think about the process involved). I just wanted to show how the
difficulty of capturing is complicated enough as is and just gets
compounded with backreferences.


I was thinking multiple stacks (to keep track of possible submatches),
but the memory requirements would be a little extreme in some cases.


I'm doing this for two reasons: One, yes that I want a learning
experience. I really enjoy this stuff. And I want to use it in my
canonical LR(1) parser for token parsing. And two, I wanted to write a
regex library with the least legalities to it and be as simple as
possible to use. So I put it out in the public domain (except my name
should remain in the source). But I'm not fussy.


I'll read up on the info you gave me and see where to go from there.
However, I've found that a well written regex (the string) can be very
fast as NFA, especially if backreferences are used... compared to the
work involved in accomplishing the same task with DFA.


Cle


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.