Re: UCS Identifiers and compilers

Ray Dillinger <>
Thu, 11 Dec 2008 20:03:16 -0800

          From comp.compilers


Organization: Organized? Seems unlikely.
References: 08-12-061
Keywords: i18n
Posted-Date: 12 Dec 2008 10:33:26 EST

William Clodius wrote:

> As a hobby I have started work on a language design, and one of the
> issues that has come to concern me is the impact of incorporating
> UCS/Unicode into the language, particularly in identifiers, on its
> usefulness and on the complexity of implementation.

> 1. Do many of your users make use of letters outside the ASCII/Latin-1
> sets?

Yes, in particular users in Asia. Also, a large subset of users use a
lot of mathematical notation.

> 2. What are the most useful development environments in terms of dealing
> with extended character sets?

Emacs with Mule. Your mileage may vary.

> 3. Visually how well do alternative character sets mesh with a language
> with ASCII keywords and left to right, up and down display, typical of
> most programming languages? eg. how well do scripts with ideographs,
> context dependent glyphs for the same character, and alternative spatial
> ordering work, or character sets with characters with glyphs similar to
> those used for ASCII (the l vs 1 and O vs. 0 problem multiplied)

That's a problem with Unicode, on a couple of different levels. First,
you have to be able to identify strings that are "the same" for
purposes of variable names, and that means you need to ensure that
everything is compared in the same canonicalization form.
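The comparison-in-one-canonicalization-form point can be sketched in
Python with the standard unicodedata module (canonical_identifier is a
hypothetical helper name, not anything from the standard library):

```python
import unicodedata

def canonical_identifier(name: str) -> str:
    """Normalize an identifier to NFC so visually identical
    spellings compare equal (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)

# "e" with acute: one precomposed codepoint vs. "e" + U+0301
composed   = "caf\u00e9"
decomposed = "cafe\u0301"
assert composed != decomposed          # raw codepoint sequences differ
assert canonical_identifier(composed) == canonical_identifier(decomposed)
```

A compiler would apply such a normalization once, at lexing time, so
the symbol table never sees two spellings of the same name.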

Second, you have a lot of glyphs that look alike, and you need to keep
them straight. Your first and best method of doing this is to
disallow so-called "compatibility characters," which essentially
repeat particular characters at different codepoints in Unicode. I
advise throwing these out of both source code and string literals.
Also, Unicode has ligatures which have compatibility decompositions
into sequences of simpler characters, and you have to have a plan for
dealing with them. The simplest plan is to disallow them entirely.
It is a valid choice to have your users deal exclusively with the
simpler characters and treat ligatures solely as a typesetting
issue. Another plan is below.
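The "disallow them entirely" policy is easy to implement, because the
Unicode character database tags compatibility decompositions. A sketch
in Python (the helper names are illustrative assumptions):

```python
import unicodedata

def has_compatibility_decomposition(ch: str) -> bool:
    """True if the codepoint is a compatibility character or
    ligature: its decomposition mapping carries a <tag>."""
    return unicodedata.decomposition(ch).startswith("<")

def check_source_char(ch: str) -> None:
    """Sketch of a lexer-level rejection rule."""
    if has_compatibility_decomposition(ch):
        raise ValueError(
            f"compatibility character U+{ord(ch):04X} not allowed")

assert has_compatibility_decomposition("\uFB01")  # LATIN SMALL LIGATURE FI
assert has_compatibility_decomposition("\u2460")  # CIRCLED DIGIT ONE
assert not has_compatibility_decomposition("a")
```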

I believe that things work best when canonicalization issues are kept
below the level of user code as much as possible. In other words, your
"character" type should be analogous to curses' cchar_t, which contains
a spacing Unicode codepoint plus a well-formed sequence of accents,
variant selectors, etc., and your user should deal on the whole with
these characters rather than individual codepoints. If you're dealing
with ligatures, you may choose to have more than one spacing codepoint
in a character, with all but the first preceded by a U+034F COMBINING
GRAPHEME JOINER. That gives you leave to keep the codepoints within
each character, and the choice of which codepoints to use to represent
a character, in a canonicalized form which drastically simplifies
issues of identity and comparison. As a bonus, case operations
(capitalization, etc.) never change the length of a string except when
dealing with German ß, which uppercases to "SS".
> 4. How does the incorporation of the larger character sets affect your
> lexical analysis? Is hash table efficiency affected? Do you have to deal
> with case/accent independence and if so how useful are the UCS
> recommendations for languages?

On the whole, you can allocate the tables for the whole Unicode set
in the process space on a modern machine. If your grammar is
reasonably simple, you won't even see much swapping during the
parse. Alternatively, you can use standard 256-entry tables to parse
the UTF-8 representation directly, with a grammar that's aware of the
byte encodings. So there are two viable strategies: bite the bullet
of increased memory usage, or get a bit tricky and go below the
whole-codepoint level when writing your grammar. Either is reasonably
efficient.
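The 256-entry-table idea can be sketched as follows: classify each
possible lead byte once, and the lexer can then walk raw UTF-8 bytes
without first decoding to codepoints. (A Python sketch for clarity; a
real lexer would fold this into its DFA transition tables.)

```python
# 256-entry table mapping a lead byte to its sequence length;
# 0 marks continuation bytes and bytes invalid as leads.
SEQ_LEN = [0] * 256
for b in range(0x00, 0x80): SEQ_LEN[b] = 1   # ASCII
for b in range(0xC2, 0xE0): SEQ_LEN[b] = 2   # 2-byte lead
for b in range(0xE0, 0xF0): SEQ_LEN[b] = 3   # 3-byte lead
for b in range(0xF0, 0xF5): SEQ_LEN[b] = 4   # 4-byte lead

def utf8_lengths(data: bytes):
    """Walk a UTF-8 byte string using only the lead-byte table."""
    i, out = 0, []
    while i < len(data):
        n = SEQ_LEN[data[i]]
        if n == 0:
            raise ValueError(f"invalid lead byte 0x{data[i]:02X}")
        out.append(n)
        i += n
    return out

# "a" is 1 byte, "ß" is 2, "€" is 3 in UTF-8:
assert utf8_lengths("aß€".encode("utf-8")) == [1, 2, 3]
```

A byte-level grammar extends the same trick: the transitions out of a
lead byte go through the required number of continuation-byte states.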


Note: if you want more discussion of Unicode issues in a programming
language than most people can deal with, check the archives from our
discussions on integrating Unicode with the Scheme language at
