Programming language specification languages email@example.com (2001-09-20)
Re: Programming language specification languages firstname.lastname@example.org (2001-09-25)
Re: Programming language specification languages email@example.com (Joachim Durchholz) (2001-10-06)
I'm russian! was character sets firstname.lastname@example.org (2001-10-13)
Re: I'm russian! was character sets email@example.com (Thomas Maslen) (2001-10-20)
Unicode, was: I'm Russian! firstname.lastname@example.org (Ray Dillinger) (2001-11-25)
Re: Unicode, was: I'm Russian! email@example.com (Martin von Loewis) (2001-11-26)
From: Martin von Loewis <firstname.lastname@example.org>
Date: 26 Nov 2001 21:56:48 -0500
Organization: Humboldt University Berlin, Department of Computer Science
References: 01-09-087 01-09-106 01-10-021 01-10-061 01-10-105 01-11-103
Posted-Date: 26 Nov 2001 21:56:48 EST
Ray Dillinger <email@example.com> writes:
> First, there are multiple characters that look the same. We are all
> familiar with the problems posed by lower-case "L" and the digit "1",
> and by the upper-case "O" and the digit "0". Unicode has multiplied
> these problems by hundreds, making it possible to create pages of code
> that look correct but will not compile in any reasonable system.
I think this is a made-up problem. Mixing, say, a Latin A and a
Cyrillic A simply won't happen in a real-life program (unlike the
1/l problem, which does happen). People writing Cyrillic identifiers
will use the Cyrillic A consistently, and anybody else will have
trouble even typing these identifiers in the first place.
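
For readers who want to see the two look-alike letters side by side, here is a minimal Python sketch. The helper names (describe, scripts_used) and the sample identifier are made up for illustration, and taking the first word of a character's Unicode name is only a rough heuristic for its script, not a real script lookup.

```python
import unicodedata

latin_a = "\u0041"     # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def scripts_used(identifier):
    """Guess the scripts in an identifier from the first word of each
    letter's Unicode name (a heuristic chosen for this sketch)."""
    return {unicodedata.name(ch).split()[0]
            for ch in identifier if ch.isalpha()}

# Hypothetical identifier that mixes a Cyrillic A into Latin letters.
mixed = "D" + cyrillic_a + "TA"

print(describe(latin_a))      # U+0041 LATIN CAPITAL LETTER A
print(describe(cyrillic_a))   # U+0410 CYRILLIC CAPITAL LETTER A
print(latin_a == cyrillic_a)  # False
print(sorted(scripts_used(mixed)))
```

A compiler or lint tool could apply a check of this kind to flag mixed-script identifiers, should the problem ever show up in practice.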
For a long time, many programmers will restrict themselves to ASCII
for identifiers, because you can't properly use a library with, say,
Kanji identifiers without a Kanji keyboard.
> Second, the characters have "directionality". This interferes with
> the programmer's understanding of the sequence of characters in a
> page of source code, causing yet more debugging problems.
Strings always have the "right" directionality in Unicode. It would be
pointless to display Arabic characters in source code in a
left-to-right fashion; nobody could read it anymore. This is not a
problem with Unicode; it is inherent in the languages.
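
A small Python sketch of what directionality means at the data level. Characters sit in memory in the order they are typed and read, and each one carries a bidirectional class that a display layer can use to lay the text out. The sample string and the helper name bidi_classes are illustrative assumptions.

```python
import unicodedata

# Hypothetical string content: Latin "abc", a space, then the Arabic
# letters alef, beh, teh, stored in reading (logical) order.
text = "abc \u0627\u0628\u062A"

def bidi_classes(s):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in s]

print(bidi_classes(text))
# ['L', 'L', 'L', 'WS', 'AL', 'AL', 'AL']

# Indexing follows storage order, regardless of how the text is drawn.
print(unicodedata.name(text[4]))  # ARABIC LETTER ALEF
```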
> Third, "Endian" issues can happen as unicode documents migrate across
> platforms; the efforts of the committee to provide a "graceful"
> solution instead require particular and special handling in all
> migrations, and code that can recognize either endianness.
I think this problem will disappear in the long run as everybody will
use UTF-8 for Unicode files.
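
A brief Python sketch of why byte order matters for UTF-16 but not for UTF-8. The sample character is an arbitrary choice. Python's "utf-16" codec reads a leading byte order mark to pick the endianness, which is one way of handling the "recognize either endianness" requirement.

```python
import codecs

ch = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, an arbitrary sample

utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")
utf8 = ch.encode("utf-8")

print(utf16_be)  # b'\x04\x16'
print(utf16_le)  # b'\x16\x04'
print(utf8)      # b'\xd0\x96' (a byte sequence, so no byte order to choose)

# With a byte order mark in front, the generic codec accepts both forms.
from_be = (codecs.BOM_UTF16_BE + utf16_be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + utf16_le).decode("utf-16")
print(from_be == from_le == ch)  # True
```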
> Fourth, the system that was supposed to finally save us from having
> multiple different character lengths mixed together as we mixed
> alphabets has gone utterly mad; now there are 8, 16, 20, and 32-bit
> representations for characters within Unicode.
Again, in the long run, I expect that programmers can continue to
assume simple indexing of characters in a string (leaving
normalization issues aside). Libraries will either use a fixed-width
representation for internal storage, or transparently offer random
access on a variable-length representation. Unlike earlier multi-byte
encodings, this is possible for Unicode with little effort; you can
use the same indexing algorithm for all documents.
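
The two library strategies can be sketched in Python as follows. Both function names are hypothetical, and the input is assumed to be valid UTF-8 or UTF-32. The UTF-8 version is a plain linear scan; a real library would more likely cache offsets to make repeated access fast.

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return the i-th code point of valid UTF-8 data by scanning for
    lead bytes (any byte that is not of the form 10xxxxxx)."""
    if i < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # start of a new code point
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:     # next code point begins here
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")

def utf32_index(data: bytes, i: int) -> str:
    """Return the i-th code point of little-endian UTF-32 data:
    fixed width, so the offset is simply 4 * i."""
    return data[4 * i:4 * i + 4].decode("utf-32-le")

sample = "a\u0416\u20AC\U0001D11Ez"   # 1-, 2-, 3-, 4- and 1-byte in UTF-8
as_utf8 = sample.encode("utf-8")
as_utf32 = sample.encode("utf-32-le")

print([utf8_index(as_utf8, k) for k in range(5)] == list(sample))    # True
print([utf32_index(as_utf32, k) for k in range(5)] == list(sample))  # True
```

The lead-byte test works the same way on every UTF-8 document, which is the property the older multi-byte encodings lacked.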
> Fifth, there are "holes" in the sequence of unicode character codes
> and applications have to be aware of them. This makes iterating over
> the code points into a major pain in the butt.
Why would you want to do that?
> Sixth, I don't want to add all the code and cruft to every system I
> produce, that I would have to add to support the complexities and
> subtleties of Unicode. It's just not worth it.
Right. Instead, all the cruft is in the system libraries (just like it
is for ASCII).
> If a simple, 32-bit "extended ascii" code comes along, I'll be the
> first to support it. But Unicode as we now see it is a crock.
There won't be anything else. Just assume that Unicode is a simple,
32-bit "extended ascii" today, and make every input you get fit that
view of the world.
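
One possible reading of "make every input fit that view", sketched in Python: decode whatever arrives at the boundary and keep only a flat array of code points internally. The function name to_code_points and the default encoding are assumptions made for this example.

```python
from array import array

def to_code_points(raw: bytes, encoding: str = "utf-8") -> array:
    """Decode input bytes and return the code points as unsigned
    integers (type code 'L' is at least 32 bits wide)."""
    text = raw.decode(encoding)
    return array("L", (ord(ch) for ch in text))

cps = to_code_points("A\u0410".encode("utf-8"))
print([hex(cp) for cp in cps])  # ['0x41', '0x410']

# Legacy input is mapped into the same representation on the way in.
legacy = to_code_points(b"\xc4", "latin-1")
print(hex(legacy[0]))  # 0xc4
```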