Re: compiling case insensitive regular expressions

glen herrmannsfeldt <gah@ugcs.caltech.edu>
Fri, 5 Nov 2010 04:20:11 +0000 (UTC)

          From comp.compilers

Related articles
compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-01)
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-03)
Re: compiling case insensitive regular expressions benhanson2@icqmail.com (2010-11-03)
Re: compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-04)
Re: compiling case insensitive regular expressions rsc@swtch.com (Russ Cox) (2010-11-04)
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-05)
Re: compiling case insensitive regular expressions cr88192@hotmail.com (BGB) (2010-11-06)
| List of all articles for this month |

From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Newsgroups: comp.compilers
Date: Fri, 5 Nov 2010 04:20:11 +0000 (UTC)
Organization: A noiseless patient Spider
References: 10-11-004 10-11-006 10-11-009
Keywords: lex, performance
Posted-Date: 07 Nov 2010 00:39:15 EDT

Armel <armelasselin@hotmail.com> wrote:
(snip, I wrote)


>> Another way [...] is to supply a bit mask for each character being
> compared. Only bits with a '1' in the mask are used in the comparison.


> I used that technic to build a LL(1) based Z80 disassembler/decompiler years
> ago, to decode "structural bits" (vs. values bits) in instructions, but I
> don't think its applicable to a AFD-based regular expression engine, before
> knowing which bits can be ignored the AFD will first need to determine a
> class for the character. it will simply result in a "tolower(c)" (or
> toupper(c) ) called on each character of the input, but expressed and coded
> in another way. The preparation phase still has to verify that there are no
> two paths leaving a state with 'a' and 'A' for example (as they would become
> a single path once the mask applied).


Yes, it would have to be built into the evaluation engine. In the
case of the Paracel FDF, it is part of the hardware, and also part of
PSL (Pattern Specification Language).


Also, the PSL compiler will figure out the optimal mask for a given
character position. If, for example, one put [bc] where the
characters only differ by one bit, the compiler would generate 'b'
with a mask of X'3e'. (The default mask for ASCII masks the high bit.
Case insensitivity is the default for lower case query characters.)


It seems to me that it could result in more compact compiled regular
expressions, that also might execute faster.


-- glen


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.