Multibyte lexical analysis

Rod Chamberlin <>
7 Aug 1997 15:16:47 -0400

          From comp.compilers

Related articles
Multibyte lexical analysis (Rod Chamberlin) (1997-08-07)
Re: Multibyte lexical analysis (1997-08-09)
Re: Multibyte lexical analysis (1997-08-09)
Re: Multibyte lexical analysis (Henry Spencer) (1997-08-09)

From: Rod Chamberlin <>
Newsgroups: comp.compilers
Date: 7 Aug 1997 15:16:47 -0400
Organization: QueriX(UK) Ltd
Keywords: i18n, lex, question

Does anybody know of any work that has been done on lexical analysis of
multibyte character streams? These are considerably more difficult to
tokenize, since a single byte of input does not necessarily represent a
complete character.

Furthermore, characters that would be token terminators in a
single-byte context can appear inside multibyte sequences (i.e. the
second or a subsequent byte of a multibyte sequence can be any printable
character, including symbols like !"#$%^&*()). However, a multibyte
identifier must be able to contain any multibyte character, including
those whose bytes coincide with these symbols. One of the major problems
is that the character classifications must be taken from the locale at
run time, rather than from a fixed set. This seems to completely exclude
lex from doing the job, since its scanners are driven by tables that are
fixed when the scanner is generated.
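
To make the trailing-byte problem concrete, here is a minimal sketch in C.
It assumes a Shift-JIS-style double-byte encoding, which the question does
not name: lead bytes are taken to lie in 0x81-0x9F and 0xE0-0xFC, and the
trail byte can fall in the printable ASCII range. The function names are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumption: Shift-JIS-style lead-byte ranges. Other encodings differ. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Single-byte identifier characters: letters, digits, underscore. */
static int is_ident_byte(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Naive scanner: classifies one byte at a time. High bytes are accepted
 * as identifier characters, but a trail byte that happens to look like
 * punctuation ends the token in the middle of a character. */
size_t scan_ident_bytewise(const unsigned char *s, size_t len)
{
    size_t pos = 0;
    assert(s != NULL);
    while (pos < len && (s[pos] >= 0x80 || is_ident_byte(s[pos])))
        pos++;
    return pos;
}

/* Multibyte-aware scanner: a lead byte consumes its trail byte
 * unconditionally, so the trail byte is never classified on its own. */
size_t scan_ident_multibyte(const unsigned char *s, size_t len)
{
    size_t pos = 0;
    assert(s != NULL);
    while (pos < len) {
        if (is_lead_byte(s[pos])) {
            if (pos + 1 >= len)
                break;              /* truncated character: stop here */
            pos += 2;
        } else if (is_ident_byte(s[pos])) {
            pos++;
        } else {
            break;                  /* genuine single-byte terminator */
        }
    }
    return pos;
}
```

For the six input bytes 'x', 0x95, 0x5C, '1', '+', 'y' (where 0x95 0x5C is
one double-byte character whose trail byte equals the backslash), the
bytewise scanner returns 2 and splits the character, while the
multibyte-aware scanner returns 4 and stops at the '+'.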

Whilst it is perfectly possible to write a scanner by hand to do this,
hand-written scanners are more difficult to maintain than their
lex-style counterparts.
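
As a rough idea of what the hand-written route might look like, the sketch
below scans one identifier using the standard C functions mbrtowc() and
iswalnum(), so that both the decoding and the character classification
come from the current locale. The function name scan_ident_locale is
hypothetical, and the sketch assumes the caller has already selected a
locale with setlocale().

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns the byte length of the identifier at the
 * start of s, where "identifier character" means whatever the current
 * locale classifies as alphanumeric, plus underscore. */
size_t scan_ident_locale(const char *s, size_t len)
{
    mbstate_t st;
    size_t pos = 0;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    while (pos < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s + pos, len - pos, &st);

        /* (size_t)-1: invalid sequence, (size_t)-2: truncated sequence,
         * 0: NUL character. All of them end the token. */
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            break;
        if (!(iswalnum((wint_t)wc) || wc == L'_'))
            break;                  /* locale says: not an identifier char */
        pos += n;                   /* consume the whole character */
    }
    return pos;
}
```

A caller would typically run setlocale(LC_ALL, "") once at start-up and
then call scan_ident_locale() on the remaining input. This covers only one
token class; a complete scanner would need the same treatment for every
rule, which is the maintenance burden mentioned above.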

If anyone could give me any pointers towards anything that has been done
on this in the past, it would be greatly appreciated.


| Rod Chamberlin | Tel +44 1703 232345 |
| Software Engineer | Mob +44 468 387834 |
| QueriX | Fax +44 1703 399685 |
