Re: Multibyte/Wide Character Sets and Lex.

colas@aye.inria.fr (Colas Nahaboo)
9 Feb 1996 17:17:18 -0500

From comp.compilers

From:	colas@aye.inria.fr (Colas Nahaboo)
Newsgroups:	comp.compilers
Date:	9 Feb 1996 17:17:18 -0500
Organization:	Koala Project, Bull Research France
References:	96-02-065
Keywords:	lex, i18n

"Orbach, Julian ACUS" <juliano@SYDPO4.AUS.unisys.com> writes:
|> [I don't know of any lex that handles wider than 8 bit characters.
|> The extension from 8 to 16 bit lexers isn't straightforward, since
|> most 8 bit lexers use the character codes as array indices.

re2c does not handle 16-bit wide chars either, but it is the only lexer
generator I know of that does not use arrays (it uses big C switch
statements instead, which may be a worse idea in your case, I don't
know...). You may contact the authors; see:

ftp://csg.uwaterloo.ca/pub/peter/re2c.0.5.tar.gz

Citing the README:

re2c is a tool for generating C-based recognizers from regular
expressions. re2c-based scanners are efficient: for programming
languages, given similar specifications, an re2c-based scanner is
typically almost twice as fast as a flex-based scanner with little or
no increase in size (possibly a decrease on CISC architectures).
Indeed, re2c-based scanners are quite competitive with hand-crafted
ones.

Unlike flex, re2c does not generate complete scanners: the user must
supply some interface code. While this code is not bulky (about
50-100 lines for a flex-like scanner; see the man page and examples in
the distribution) careful coding is required for efficiency (and
correctness). One advantage of this arrangement is that the generated
code is not tied to any particular input model. For example,
re2c-generated code can be used to scan data from a null-byte
terminated buffer as illustrated below.

re2c was developed for a particular project (constructing a fast REXX
scanner of all things!) and so while it has some rough edges, it
should be quite usable. More information about re2c can be found in
the (admittedly skimpy) man page; the algorithms and heuristics used
are described in an upcoming LOPLAS article (included in the
distribution). Probably the best way to find out more about re2c is
to try the supplied examples. re2c is written in C++, and is
currently being developed under Linux using gcc 2.5.8.
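
To make the "switch statements instead of arrays" point concrete, here
is a small hand-written sketch in the direct-coded style, scanning a
null-byte terminated buffer. It is not re2c output and not taken from
the re2c distribution; the token names and the scan() function are
invented for this illustration.

```c
#include <stddef.h>

/* Token kinds for this illustration (hypothetical, not from re2c). */
enum token { TOK_EOF, TOK_NUMBER, TOK_PLUS, TOK_OTHER };

/*
 * Direct-coded scanner sketch: control flow is driven by switch
 * statements on the current byte, not by indexing a transition table.
 * The buffer must end with a NUL byte, which acts as the end-of-input
 * sentinel. On return, *cursor points just past the token.
 */
enum token scan(const char **cursor)
{
    const char *p = *cursor;
    enum token tok;

    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                            /* skip blanks */

    switch (*p) {
    case '\0':
        tok = TOK_EOF;                  /* sentinel: do not advance */
        break;
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        p++;
        for (;;) {                      /* consume the remaining digits */
            switch (*p) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                p++;
                continue;               /* next iteration of the for loop */
            }
            break;                      /* not a digit: leave the loop */
        }
        tok = TOK_NUMBER;
        break;
    case '+':
        p++;
        tok = TOK_PLUS;
        break;
    default:
        p++;                            /* any other single byte */
        tok = TOK_OTHER;
        break;
    }

    *cursor = p;
    return tok;
}
```

A caller keeps a `const char *` into the buffer and calls scan() until
it returns TOK_EOF. Note that every case label here is still a single
8-bit value, so going to 16-bit characters would mean much larger
switches, not just wider ones.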

--
Colas Nahaboo, Koala, Dyade (Bull) INRIA Sophia,
http://www.inria.fr/koala/colas
[Proposed flex kludge: flex doesn't handle wide characters, but it
sure knows how to handle strings of characters. So map the wide
characters into sequences of 8-bit bytes, like multibyte characters,
and lex that. -John]
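
A minimal sketch of the mapping step in that kludge, assuming the wide
characters are independent 16-bit values and choosing the UTF-8 byte
layout as the multibyte form. The note above does not name an encoding,
so UTF-8 is an assumption here, and the function names are invented for
this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one 16-bit character as 1 to 3 bytes using the UTF-8 layout.
 * Each 16-bit value is treated on its own (no surrogate-pair handling).
 * `out` must have room for 3 bytes. Returns the number of bytes written.
 */
size_t wide_to_bytes(uint16_t wc, unsigned char *out)
{
    if (wc < 0x80) {                    /* 7 bits: one byte, unchanged */
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {                   /* up to 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    /* up to 16 bits: three bytes */
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}

/*
 * Convert `n` wide characters to a byte string that an 8-bit lexer can
 * scan. `dst` must hold at least 3 * n + 1 bytes. The result is
 * NUL-terminated; the return value is its length without the NUL.
 */
size_t wide_buffer_to_bytes(const uint16_t *src, size_t n,
                            unsigned char *dst)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += wide_to_bytes(src[i], dst + len);
    dst[len] = '\0';
    return len;
}
```

With the input converted this way, a lexer rule for a wide character
becomes a rule for its byte sequence; for example, 0x3042 is matched as
the three bytes E3 81 82. The cost is that character classes over wide
characters have to be rewritten as patterns over byte sequences.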
