Re: re2c-1.0 released!

Ulya Trofimovich <skvadrik@gmail.com>
Sun, 3 Sep 2017 09:00:34 +0100

          From comp.compilers

Related articles
re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-08-27)
Re: re2c-1.0 released! 398-816-0742@kylheku.com (Kaz Kylheku) (2017-09-02)
Re: re2c-1.0 released! 398-816-0742@kylheku.com (Kaz Kylheku) (2017-09-02)
Re: re2c-1.0 released! anton@mips.complang.tuwien.ac.at (2017-09-02)
Re: re2c-1.0 released! gneuner2@comcast.net (George Neuner) (2017-09-02)
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-03)
Re: re2c-1.0 released! jamin.hanson@googlemail.com (Ben Hanson) (2017-09-03)
Re: re2c-1.0 released! jamin.hanson@googlemail.com (Ben Hanson) (2017-09-03)
Re: re2c-1.0 released! 398-816-0742@kylheku.com (Kaz Kylheku) (2017-09-03)
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-03)
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-03)
Re: re2c-1.0 released! jamin.hanson@googlemail.com (Ben Hanson) (2017-09-04)
[1 later articles]
| List of all articles for this month |

From: Ulya Trofimovich <skvadrik@gmail.com>
Newsgroups: comp.compilers
Date: Sun, 3 Sep 2017 09:00:34 +0100
Organization: Compilers Central
References: 17-08-007 17-09-004
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="38489"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, design
Posted-Date: 03 Sep 2017 11:55:06 EDT
Content-Language: en-US

> I don't know what this means ... every NFA has an equivalent DFA, so
> if a problem can be solved with NFA, it can be solved with DFA.


My wording is bad, sorry. I should have put it this way: regular
expression parsing (and submatch extraction as a special case of it) can
be solved on nondeterministic finite state transducers (NFST). And NFST
are not equivalent to DFST, despite the fact that NFA are equivalent to DFA.


Traditional NFA can be trivially converted to NFST by augmenting their
transitions with output symbols and extending the simulation procedure
to write these output symbols. In case of submatch extraction it means
equipping each path through NFA with a set of registers -- a pair per
capturing group -- that are updated during simulation.


However, NFST determinization is non-trivial. That was my point.


>> It may seem that only a lack of effort prevents developers of lexer
>> generators like Flex from implementing it (as well as fixing the
>> ever-broken trailing contexts [3]).
>
>Lack of justification due to, I suspect, lack of demand from
>sophisticated users of the tools, who use them for their intended
>purpose.


Well, the simple use case that I had in my mind was re2c's own lexer
(yes, re2c is self-hosted): it has a lexeme of the form
"{[0-9]+,[0-9]+}" that is used to denote bounded repetition of the given
regular expression. Without submatch extraction the rule looks like this:


        "{" [0-9]+ "," [0-9]+ "}" { ... strchr(token, ',') ... }


It is necessary to call 'strchr' to retrieve the position of the comma.
It would be nice if the lexer stored the position in a variable as it goes:


        "{" [0-9]+ @p "," [0-9]+ "}" { ... p /* points to ',' */ ... }


But there are other, more complex use cases, like parse URI or HTTP
messages:


http://re2c.org/examples/example_10.html
http://re2c.org/examples/example_11.html


This technique is already applied by re2c users, so at least someone
somewhere finds it handy. ;)




Ulya


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.