Re: Regular expression string searching & matching

Ben Hanson <jamin.hanson@googlemail.com>
Wed, 7 Mar 2018 11:53:15 -0800 (PST)

          From comp.compilers

Related articles
Regular expression string searching & matching clint.olsen@gmail.com (Clint O) (2018-03-04)
Re: Regular expression string searching & matching jamin.hanson@googlemail.com (Ben Hanson) (2018-03-07)
Re: Regular expression string searching & matching jamin.hanson@googlemail.com (Ben Hanson) (2018-03-07)
Re: Regular expression string searching & matching clint.olsen@gmail.com (Clint O) (2018-03-08)
Re: Regular expression string searching & matching clint.olsen@gmail.com (Clint O) (2018-03-10)
Re: Regular expression string searching & matching jamin.hanson@googlemail.com (Ben Hanson) (2018-03-10)
Re: Regular expression string searching & matching jamin.hanson@googlemail.com (Ben Hanson) (2018-03-11)
Re: Regular expression string searching & matching clint.olsen@gmail.com (Clint O) (2018-03-12)
[8 later articles]
| List of all articles for this month |

From: Ben Hanson <jamin.hanson@googlemail.com>
Newsgroups: comp.compilers
Date: Wed, 7 Mar 2018 11:53:15 -0800 (PST)
Organization: Compilers Central
References: 18-03-016
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="18151"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 07 Mar 2018 14:59:08 EST

[/][*]([^*]|[*]+[^*/])*[*]+[/]


is what you are looking for. I ran into this when developing my lexer
generator library lexertl in C++. Having a debug::dump() function
really helped me grok what was going on.


The trick of course is realising that you have to exclude the
characters that follow (i.e. the [^*/] part). That is the bit that
clobbers the greedy behaviour. I've had to remind myself of that on
more than one occasion recently!


Hope that helps.


Regards,


Ben


On Sunday, 4 March 2018 14:41:17 UTC, Clint O wrote:
> Hi:
>
> I'm working on an implementation of Brzozowski derivatives[1] to translate
> regular expressions into a DFAs for potential use as a RE debugger and
> programming tool. I've been using the paper "Regular-expression derivatives
> reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.
>
> One of the interesting things about using derivatives is that it allows for
> some elegant RE set notation. So, rather than just the union -
>
> r: a | b # a OR b
>
> we also have set intersection and complement as well.
>
> I've got things working pretty well, and I'm trying to test things out.
> Unfortunately, I haven't seen much discussion on how string search works,
> except for what little I know about the behavior of flex and grep.
>
> Match is pretty straightforward: You begin in the starting state q0 and read
> the input and follow the state transitions. If you get to the end of the
> string without hitting the error state and you are in an accepting state
> (Brzozowski refers to states as "nullable"), then this string is accepted by
> this RE.
>
> Search is a different beast: Here we have a string of input and we are
> interested in knowing if any substring in the input matches our RE (like
> grep).
>
> The naive approach I have taken has been to start the DFA at every possible
> position in the string, and start simulating the DFA. If I hit any accepting
> states, I record the start, stop offsets. If I hit the error state or end of
> string, I move to the next starting position. I then merge all of the
> intervals at the end to cover the cases where I hit an accepting state
> multiple times for a given start point. Note: I assume this is the faithful
> way to do it because it's my understanding that we should aim for the longest
> possible match. If I stop at the first accepting state, I will miss
> potentially longer matches.
>
> I hit an interesting case when trying out finding C-comments. Owens mentions
> that with complement you can find non-nested C-comments with:
>
> /\* ~(.* \*/ .*) \*/
>
> and this works for me. However, I wasn't able to find an RE _without_ using
> complement that exhibits identical behavior because of the inherent greedy
> nature of REs that make it scan through successive non-overlapping comments in
> the text until the final closing '*/'.
>
> There was an interesting stackoverflow discussion[3] on this with links to
> explanatory regex interpreters, and when I tried:
>
> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
>
> for me it continues through the input until it finds the _last_ '*/'.
>
> I also tried constructing my own versions and came up with similar results.
>
> Questions:
>
> 1) Am I applying the RE/DFA search algorithm correctly?
>
> 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
> an RE->DFA implementation? A quick search made it sound like only backtracking
> implementations can implement this behavior?
>
> Thanks,
>
> -Clint
>
>
> [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
> of the ACM, 11(4), 481-494.
>
> [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
> re-examined. Journal of Functional Programming 19 (2), 173-190
>
> [3]
> https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.