Re: is lex useful? (Scott Nicol)
27 Jun 1996 11:36:48 -0400

          From comp.compilers


From: (Scott Nicol)
Newsgroups: comp.compilers
Date: 27 Jun 1996 11:36:48 -0400
Organization: Information Advantage
References: 96-06-073 96-06-105 96-06-108
Keywords: lex, i18n

(Scott Nicol) writes:
> - No support for wide (>8 bit) character sets. Even 8-bit support is
> fairly recent. The obvious implementation for wide characters
> (expand tables to 16 bits) isn't practical, because you would
> increase the tables sizes (which are already huge) 256x.

says...
>Dennis Ritchie (I think) wrote a paper on wide-character regular
>expression matching a few years ago; it used to be included in the
>papers that came with Plan 9, but I haven't seen an online copy in
>several years.
>As far as I can remember, the solution he used for Plan 9 involved
>building sparse tables. This is a fairly obvious thing to do, and the
>extra indirection used to provide sparseness imposes some performance
>penalty, but it saves gobs of space.

I seem to remember reading this paper, but I can't find it now. I must have
moved too many times :-(.

Yes, it is possible to do a sparse-table implementation, and I can think
of a very simple approach (split into high-bit/low-bit tables). I don't
think this has been done for Lex, yet.
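
For what it's worth, here is a minimal sketch of one way to read the high/low
split, assuming 16-bit characters and one table row per DFA state. The names
sparse_row, next_state, PAGE_ENTRIES and NO_TRANSITION are made up for
illustration; this is not taken from Lex or from the Plan 9 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-level transition row for one DFA state.
 * The high byte of a 16-bit character selects a 256-entry page;
 * the low byte indexes into that page. Pages with no transitions
 * are left NULL, so they cost one pointer instead of 256 entries. */
#define PAGE_ENTRIES 256
#define NO_TRANSITION (-1)

typedef struct {
    const int16_t *pages[PAGE_ENTRIES]; /* NULL means "page is empty" */
} sparse_row;

/* Look up the next state for 16-bit character c. */
static int next_state(const sparse_row *row, uint16_t c)
{
    const int16_t *page = row->pages[c >> 8]; /* first level: high byte */
    if (page == NULL)
        return NO_TRANSITION;                 /* nothing on this page */
    return page[c & 0xFF];                    /* second level: low byte */
}
```

The extra pointer dereference is the indirection cost mentioned in the quoted
text. In exchange, a state whose rules only mention characters below 256 needs
the 256 page pointers plus a single page, rather than 65536 entries.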

This still won't fix another i18n problem that I forgot to include in my
original post. Hard-coded REs (as in Lex) won't work in a multi-locale
environment. In other words, you may code the following in lex:

[[:alpha:]]+ return IDENTIFIER;

but lex will transform the [[:alpha:]] into a hard-coded table. If you
then run the scanner under a different locale (where [[:alpha:]] is
different), then the scanner will still be working in the original locale,
because that is what is written in its tables.

Scott Nicol
Information Advantage, Inc
[I actually exchanged some mail with Vern about this a while ago, and
suggested that he make up a character class table by enumerating all of
the possible character values and using isalpha() etc. to figure out what's
what. Admittedly, this would be kind of slow with 16 bit characters, but
with 8 bit characters it works fine. -John]

