|Programming language specification languages email@example.com (2001-09-20)|
|Re: Programming language specification languages firstname.lastname@example.org (2001-09-25)|
|Re: Programming language specification languages email@example.com (Joachim Durchholz) (2001-10-06)|
|I'm russian! was character sets firstname.lastname@example.org (2001-10-13)|
|Re: I'm russian! was character sets email@example.com (Thomas Maslen) (2001-10-20)|
|Unicode, was: I'm Russian! firstname.lastname@example.org (Ray Dillinger) (2001-11-25)|
|Re: Unicode, was: I'm Russian! email@example.com (Martin von Loewis) (2001-11-26)|
|From:||Ray Dillinger <firstname.lastname@example.org>|
|Date:||25 Nov 2001 22:37:06 -0500|
|References:||01-09-087 01-09-106 01-10-021 01-10-061 01-10-105|
|Posted-Date:||25 Nov 2001 22:37:06 EST|
Thomas Maslen wrote:
> >> Of course, scripting languages intended for the hand of the end user
> >> *should* be able to support 16-bit characters (which, today, means
> >> Unicode).
> >Yes. String, character and character array types and constants must
> >support 16 bit representation. At least.
> Yup, at least.
> Until recently, Unicode and ISO 10646 only defined characters in the range
> U+0000..U+FFFF (the "Basic Multilingual Plane", i.e. the first 16 bits).
> However, Unicode 3.1 now defines characters in three more 16-bit planes:
> U+10000..U+1FFFF, U+20000..U+2FFFF and U+E0000..U+EFFFF. For details, see
> the "New Character Allocations" section of
> All is not lost, because the 16-bit representation of Unicode was designed
> with this in mind, and it can represent U+0000..U+10FFFF (i.e. a little over
> 20 bits) using "surrogate pairs":
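For readers who have not met them, the surrogate-pair arithmetic the
quoted text refers to can be sketched in a few lines of Python. This is
only an illustration of the standard UTF-16 mapping; the function names
are made up for this example and do not come from the thread.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into two 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)
    print(f"{high:04X} {low:04X}")
```

The high surrogates occupy U+D800..U+DBFF and the low surrogates
U+DC00..U+DFFF, which is why that range is reserved in the BMP.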
I'm actually a trifle irritated at the Unicode Standard. It has
become far more complex than any application I have for characters
requires it to be. I don't mind it being bigger than ASCII; in fact I
applauded when I heard that there was going to be a more inclusive
character set standard. However, all the things that make it more
*complicated* than ASCII are things that militate against its ever
being used as source code representation.
First, there are multiple characters that look the same. We are all
familiar with the problems posed by lower-case "L" and the digit "1",
and by the upper-case "O" and the digit "0". Unicode has multiplied
these problems by hundreds, making it possible to create pages of code
that look correct but will not compile in any reasonable system.
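A small Python sketch of the look-alike problem, assuming the standard
`unicodedata` module; the pair of letters is just one example chosen for
illustration. The two characters usually render identically, yet they
are different code points, so an identifier typed with one will not
match the same identifier typed with the other.

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, usually drawn exactly like the Latin letter


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    print(describe(latin_a))
    print(describe(cyrillic_a))
    print("equal:", latin_a == cyrillic_a)
```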
Second, the characters have "directionality". This interferes with
the programmer's understanding of the sequence of characters in a page
of source code, causing yet more debugging problems.
Third, "Endian" issues can happen as Unicode documents migrate across
platforms; the efforts of the committee to provide a "graceful"
solution instead require particular and special handling in all
migrations, and code that can recognize either endianness. When
Endian conversions take place, the byte order of the files changes
while no semantic change has taken place, confusing or requiring
special code in every "diff" or version-control system.
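To make the endianness point concrete, here is a minimal Python sketch
using the standard `codecs` module. The sample text is arbitrary. The
same two characters produce different byte sequences in the two byte
orders, so a byte-wise diff sees a change where the text has none, and
a reader needs the byte order mark (or outside knowledge) to decode
either form.

```python
import codecs

text = "if"

big = text.encode("utf-16-be")      # most significant byte first
little = text.encode("utf-16-le")   # least significant byte first


def decode_with_bom(data: bytes) -> str:
    """Decode UTF-16 data that starts with a byte order mark."""
    # The generic "utf-16" codec reads the BOM to pick the byte order.
    return data.decode("utf-16")


if __name__ == "__main__":
    print(big.hex(), little.hex())
    print(decode_with_bom(codecs.BOM_UTF16_BE + big))
    print(decode_with_bom(codecs.BOM_UTF16_LE + little))
```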
Fourth, the system that was supposed to finally save us from having
multiple different character lengths mixed together as we mixed
alphabets has gone utterly mad; now there are 8-, 16-, 20-, and 32-bit
representations for characters within Unicode.
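As a rough illustration of the mixed lengths, the following Python
sketch counts the bytes each encoding form needs for a few sample
characters. The samples and the helper name are chosen only for this
example.

```python
from typing import Dict

ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

samples = [
    "A",            # U+0041, ASCII
    "\u0416",       # U+0416, Cyrillic
    "\u20AC",       # U+20AC, euro sign
    "\U00010400",   # U+10400, outside the BMP
]


def encoded_lengths(ch: str) -> Dict[str, int]:
    """Return the number of bytes each encoding form needs for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for ch in samples:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only the 32-bit form is fixed width; the 8-bit form varies from one to
four bytes, and the 16-bit form needs two units for anything above
U+FFFF.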
Fifth, there are "holes" in the sequence of Unicode character codes,
and applications have to be aware of them. This makes iterating over
the code points a major pain in the butt.
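A sketch of what hole-aware iteration looks like in Python, assuming
the standard `unicodedata` module. The generator name is invented for
this example, and which code points count as unassigned depends on the
Unicode version that ships with the Python build in use.

```python
import unicodedata
from typing import Iterator


def assigned_code_points(start: int, stop: int) -> Iterator[int]:
    """Yield code points in [start, stop) that are not holes.

    A hole here means a surrogate (category "Cs") or an unassigned or
    noncharacter code point (category "Cn").
    """
    for cp in range(start, stop):
        if unicodedata.category(chr(cp)) not in ("Cs", "Cn"):
            yield cp


if __name__ == "__main__":
    # The whole surrogate block is skipped, so this prints 0.
    print(len(list(assigned_code_points(0xD800, 0xE000))))
```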
Sixth, I don't want to add to every system I produce all the code and
cruft that supporting the complexities and subtleties of Unicode would
require. It's just not worth it.
If a simple, 32-bit "extended ASCII" code comes along, I'll be the first
to support it. But Unicode as we now see it is a crock.