Re: Multibyte lexical analysis

ok@cs.rmit.edu.au (Richard A. O'Keefe)
9 Aug 1997 20:12:55 -0400

          From comp.compilers


From: ok@cs.rmit.edu.au (Richard A. O'Keefe)
Newsgroups: comp.compilers
Date: 9 Aug 1997 20:12:55 -0400
Organization: Comp Sci, RMIT University, Melbourne, Australia.
References: 97-08-011
Keywords: i18n, lex

Rod Chamberlin <rod@querix.co.uk> writes:
>Does anybody know of any work that has been done on lexical analysis of
>multibyte character streams?


Quintus Prolog supported multibyte characters a little over 10 years
ago. The systems where it did so used a coding in which 0xxxxxxx was an
ASCII code and every byte of a multibyte sequence had the form 1xxxxxxx,
so we just "cheated" and had the tokenizer treat every byte with the
high bit set as a letter.
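To make the trick concrete, here is a minimal sketch of a byte-level
tokenizer that classes every byte with the high bit set as a letter. It
is not the Quintus code; the names tokenize, is_letter, and is_digit are
made up for illustration, and the sample input assumes UTF-8, which has
the same "ASCII is 0xxxxxxx, everything else is 1xxxxxxx" property.

```python
# Sketch only: every byte >= 0x80 is classed as a letter, so multibyte
# sequences are absorbed into identifiers without ever being decoded.

WHITESPACE = b" \t\r\n"

def is_digit(b: int) -> bool:
    return 0x30 <= b <= 0x39            # '0'..'9'

def is_letter(b: int) -> bool:
    return (
        0x41 <= b <= 0x5A               # 'A'..'Z'
        or 0x61 <= b <= 0x7A            # 'a'..'z'
        or b == 0x5F                    # '_'
        or b >= 0x80                    # the "cheat": any high-bit byte
    )

def tokenize(data: bytes) -> list:
    """Split a byte string into (kind, bytes) pairs."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b in WHITESPACE:
            i += 1                      # skip ASCII white space
        elif is_letter(b):
            j = i + 1
            while j < n and (is_letter(data[j]) or is_digit(data[j])):
                j += 1
            tokens.append(("ident", data[i:j]))
            i = j
        elif is_digit(b):
            j = i + 1
            while j < n and is_digit(data[j]):
                j += 1
            tokens.append(("number", data[i:j]))
            i = j
        else:
            tokens.append(("punct", data[i:i + 1]))
            i += 1
    return tokens
```

For example, tokenize("naïve = 42".encode("utf-8")) yields an identifier
holding the six bytes of "naïve", then "=" and "42". The price of the
cheat is that multibyte punctuation and multibyte white space also end
up inside identifiers.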


The Plan 9 system has a C compiler that works directly with the 8-bit
code stream.


The free Gambit-C Scheme system accepts a range of encodings for the
16-bit characters it supports.


C++ and C9x follow Java's lead in saying that the input is notionally
wide characters, with \uXXXX denoting a 16-bit character and
\UXXXXXXXX denoting a 32-bit character, either of which may appear in
identifiers and strings. If nothing else, the escape notation is itself
a multibyte coding.
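As an illustration of reading that notation as a multibyte coding, the
sketch below expands \uXXXX and \UXXXXXXXX escapes in ASCII text into the
characters they name. The function name expand_ucns is invented here,
and this is not how any particular compiler does it; in particular it
does not notice when the backslash is itself escaped.

```python
import re

# Exactly four hex digits after \u, exactly eight after \U.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def expand_ucns(text: str) -> str:
    """Replace each \\uXXXX or \\UXXXXXXXX escape with its character."""
    def repl(match):
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(digits, 16))
    return _UCN.sub(repl, text)
```

With this, expand_ucns(r"caf\u00E9") gives "café": six ASCII bytes of
escape stand for one character, which is all a multibyte coding is.
Malformed escapes such as \u12 are left untouched.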


IBM's mainframe compilers for PL/I and Fortran accepted 16-bit characters
in comments and strings years ago, and still do.


The Unicode book gives the rules for what numbers and identifiers should
look like if you support Unicode.
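A rough sketch of what a category-based identifier rule can look like
follows. The exact category sets below are an assumption made for
illustration (letters and letter numbers to start; combining marks,
decimal digits, and connector punctuation to continue), so check the
Unicode book itself before relying on them. The name is_identifier is
invented.

```python
import unicodedata

# Assumed sets of Unicode general categories, for illustration only.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}      # letters, letter numbers
EXTEND = START | {"Mn", "Mc", "Nd", "Pc"}         # marks, digits, connectors

def is_identifier(s: str) -> bool:
    """True if s is a start character followed by extend characters."""
    if not s:
        return False
    if unicodedata.category(s[0]) not in START:
        return False
    return all(unicodedata.category(c) in EXTEND for c in s[1:])
```

Under these assumed sets "naïve" and "x_1" are identifiers while "2x"
and "a b" are not. Note that "_" is connector punctuation, so a language
that allows a leading underscore has to add it to the start set.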
--
Richard A. O'Keefe; http://www.cs.rmit.edu.au/%7Eok; RMIT Comp.Sci.