Re: compiling case insensitive regular expressions

Russ Cox <rsc@swtch.com>
Thu, 4 Nov 2010 15:15:24 -0400

          From comp.compilers

Related articles
compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-01)
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-03)
Re: compiling case insensitive regular expressions benhanson2@icqmail.com (2010-11-03)
Re: compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-04)
Re: compiling case insensitive regular expressions rsc@swtch.com (Russ Cox) (2010-11-04)
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-05)
Re: compiling case insensitive regular expressions cr88192@hotmail.com (BGB) (2010-11-06)
| List of all articles for this month |

From: Russ Cox <rsc@swtch.com>
Newsgroups: comp.compilers
Date: Thu, 4 Nov 2010 15:15:24 -0400
Organization: Compilers Central
References: 10-11-004
Keywords: lex, practice
Posted-Date: 04 Nov 2010 22:25:04 EDT

> I need to compile regular expressions which are case insensitive,
> there are two cases, the part which must be matched case insensitvely
> might be just a portion but it can be the entire RE as well. B The RE
> will be Unicode enabled and must be compiled to AFD.
>
> I am wondering what is the "best practice" in this field, is it
> generally more efficient to precompute for each RE its "case
> insensitive RE" (by replacing each symbol by a ORed version of it,
> i.e; hello becomes (h|H)(e|E)(l|L)(l|L)(o|O)..) and compile that in
> place of the original or is it as good to simply replace the input
> symbols by their lowercase version and compile a "lowercase only"
> version of the RE?


My experience has been that you almost always want to do it in
the RE, not by fiddling with the input stream.


* If only some of the regexp is case-insensitive, transforming the
    input is incorrect.


* A tight byte-at-a-time DFA inner loop slows down noticeably
    (10% or so on a typical x86) when you have the extra conditional
    branch to turn upper case into lower.


* Computing "ToLower" of an arbitrary Unicode code point is
    quite expensive. If you bake it into the RE and then the DFA
    then the work happens during table construction, and you get
    much faster execution.


All that said, it is also the case that turning /Hello, world/ into
/[Hh][Ee][Ll][Ll][Oo], [Ww][Oo][Rr][Ll][Dd]/ is perhaps not the best
use of memory, especially if you have an inefficient representation
of character classes and an efficient representation of literal strings.


In RE2 (http://code.google.com/p/re2/) I introduced a bit in the parsed
RE representation to mean "ASCII case-insenstiive" so that I could
continue to store the parsed form as the more compact literal strings
where possible. Even so, ASCII is really the only easy case. If you
type in /=CE=9A=CE=B1=CE=BB=CE=B7=CE=BC=CE=AD=CF=81=CE=B1 =CE=BA=CF=8C=CF=
=83=CE=BC=CE=B5/ then RE2 turns it into a sequence of
character classes.


One final warning: even ASCII is not completely straightforward:
according to the Unicode spec, a case-insensitive match for /sky/ should
match =C5=BFKy - that's a long s, a Kelvin symbol, and an ordinary y.


Russ


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.