Re: Multibyte lexical analysis

From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Date: 9 Aug 1997 21:32:44 -0400
Organization: SP Systems, Toronto
References: 97-08-011
Keywords: i18n

Rod Chamberlin <rod@querix.co.uk> wrote:
>Does anybody know of any work that has been done on lexical analysis of
>multibyte character streams? These are considerably more difficult to
>tokenize, since a single byte of input does not necessarily represent a
>single multibyte character.


In practice, for almost any serious processing of multibyte character
streams, the first thing that has to be done is to convert them to a
fixed-width representation, where each character (as opposed to byte)
is represented as a single value of some sufficiently-wide integer
type. This can be done as a batch operation (convert whole
line/buffer/file before processing) or on the fly (convert as needed
by processing), but it's pretty much necessary. Processing
variable-width characters (which is what assorted buzzwords like
"multibyte character" really mean) directly is prohibitively awkward
for all but the simplest purposes.
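
As a concrete sketch of the batch variant, the conversion step in C with
the standard mbrtowc() routine looks roughly like this (the widen() helper
and the buffer sizes are purely illustrative):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Convert a locale-encoded byte buffer into fixed-width characters,
       one wchar_t per character.  Minimal error handling: stop at the
       first invalid or incomplete sequence. */
    static size_t widen(const char *bytes, size_t nbytes,
                        wchar_t *out, size_t outmax)
    {
        mbstate_t st;
        size_t nchars = 0;

        memset(&st, 0, sizeof st);
        while (nbytes > 0 && nchars < outmax) {
            size_t n = mbrtowc(&out[nchars], bytes, nbytes, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                break;          /* invalid or truncated sequence */
            if (n == 0)
                n = 1;          /* embedded NUL still consumes one byte */
            bytes += n;
            nbytes -= n;
            nchars++;
        }
        return nchars;
    }

    int main(void)
    {
        char line[4096];
        wchar_t wide[4096];

        setlocale(LC_CTYPE, "");        /* decode per the user's locale */
        while (fgets(line, sizeof line, stdin)) {
            size_t n = widen(line, strlen(line), wide, 4096);
            /* hand wide[0..n-1] to the tokenizer here */
            printf("%lu characters\n", (unsigned long)n);
        }
        return 0;
    }

The on-the-fly variant just moves the same mbrtowc() call into the
scanner's input routine instead of doing it a line at a time.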


>Furthermore, characters that might in a single-byte context be token
>terminators can appear in multibyte strings (i.e. the second/subsequent
>byte of a multibyte sequence can be any printable character, including
>symbols like !"#$%^&*()).


Well-designed variable-width encodings don't have this problem,
actually. Unfortunately, there are some poorly-designed encodings out
there.
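
For example, in UTF-8 every non-initial byte of a multibyte sequence has
its high bit set, so it can never be mistaken for an ASCII delimiter like
')' or '"'; Shift-JIS, by contrast, does allow second bytes in the ASCII
range.  The UTF-8 property is a one-line test:

    /* A UTF-8 continuation byte always matches 10xxxxxx, so it cannot
       collide with any single-byte (ASCII) token terminator. */
    int is_utf8_continuation(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }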


> ...One of the major problems is that the character classifications
> must be taken from the locale, rather than a fixed set. This seems
> to completely exclude lex from doing the job since its scanners are
> table driven.


Not necessarily. One ends up doing a mapping from input characters
(not bytes) into internal codes anyway, to keep the table sizes down.


Using a character as a table index is okay when there are 256
characters, but not very good when there are tens of thousands.
However, in practice, the finite-state machines only care which of a
small number of equivalence classes the character falls in. For
example, if keyword recognition is done separately, almost all
alphabetic characters are in one equivalence class (a few may need to
be dealt with separately because they show up in syntaxes like hex
numbers), and a single code meaning "alphabetic" can represent any of
them for finite-state-machine purposes. This cuts table sizes
enormously, which matters not only for memory consumption but also for
things like cache hit rate.
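
In outline (the class names and table sizes below are made up for
illustration), the scanner ends up indexing its transition table by a
small class code rather than by the character itself:

    #include <wchar.h>

    /* Illustrative equivalence classes and sizes only. */
    enum { C_OTHER, C_ALPHA, C_HEXALPHA, C_DIGIT, C_SPACE, C_PUNCT, N_CLASSES };
    enum { N_STATES = 16 };
    #define MAXCHAR 0x10000                 /* assume 16-bit characters */

    static unsigned char charclass[MAXCHAR];            /* character -> class */
    static unsigned char delta[N_STATES][N_CLASSES];    /* state x class -> state */

    /* The automaton never looks at the character itself, only at its
       class, so charclass[] is the only table whose size grows with the
       character set. */
    int next_state(int state, wchar_t c)
    {
        unsigned i = (unsigned)c < MAXCHAR ? (unsigned)c : 0;
        return delta[state][charclass[i]];
    }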


If you're doing such a mapping anyway, it's not fundamentally hard to
build the mapping table at startup time, incorporating locale
information into the mapping.
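
A sketch of that startup step, reusing the illustrative names above and
the standard <wctype.h> classification functions (which do consult the
locale selected by setlocale()):

    #include <locale.h>
    #include <wctype.h>

    enum { C_OTHER, C_ALPHA, C_HEXALPHA, C_DIGIT, C_SPACE, C_PUNCT, N_CLASSES };
    #define MAXCHAR 0x10000
    static unsigned char charclass[MAXCHAR];

    /* Build the character -> class mapping once, at scanner startup,
       asking the current locale how each character is classified.
       Hex letters get their own class so patterns like 0x1F still work. */
    void init_charclass(void)
    {
        unsigned c;

        setlocale(LC_CTYPE, "");
        for (c = 0; c < MAXCHAR; c++) {
            if (iswdigit((wint_t)c))
                charclass[c] = C_DIGIT;
            else if ((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))
                charclass[c] = C_HEXALPHA;
            else if (iswalpha((wint_t)c))
                charclass[c] = C_ALPHA;
            else if (iswspace((wint_t)c))
                charclass[c] = C_SPACE;
            else if (iswpunct((wint_t)c))
                charclass[c] = C_PUNCT;
            else
                charclass[c] = C_OTHER;
        }
    }

(The declarations are repeated from the previous fragment only to keep
this one self-contained.)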


I'm not aware of an existing version of lex that does anything like
this, although it's not an area I've paid careful attention to. It
might be possible, although not trivial, to do the code mapping as
part of input and use an existing lex (assuming your token syntax is
simple enough that it needs no more than 256 equivalence classes of
characters).
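
For what it's worth, flex lets you override its YY_INPUT macro, so one
could imagine doing the mapping there: read one character, map it to its
class code, and hand the scanner a single class byte per character, with
the token patterns then written over class codes rather than real
characters.  An untested sketch, where class_of() stands for a mapping
like the one built above:

    /* For flex: goes inside the %{ ... %} block of the definitions
       section.  class_of() is a hypothetical character -> class mapping;
       its result must fit in one byte. */
    #include <stdio.h>
    #include <wchar.h>

    extern int class_of(wchar_t c);

    #define YY_INPUT(buf, result, max_size)              \
        {                                                \
            wint_t wc = fgetwc(yyin);                    \
            if (wc == WEOF)                              \
                (result) = YY_NULL;                      \
            else {                                       \
                (buf)[0] = (char)class_of((wchar_t)wc);  \
                (result) = 1;                            \
            }                                            \
        }

The ugly part is spelling the patterns in terms of class bytes, which is
mechanical but hardly readable.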
--
Henry Spencer
henry@zoo.toronto.edu

