Re: Internal Representation of Strings


From: Hans-Peter Diettrich <DrDiettrich1@aol.com>
Newsgroups: comp.compilers
Date: Tue, 24 Feb 2009 17:25:22 +0100
Organization: Compilers Central
References: 09-02-051 09-02-077 09-02-092 09-02-104 09-02-112 09-02-118
Keywords: i18n
Posted-Date: 24 Feb 2009 20:55:32 EST

Marco van de Voort wrote:


>>> in general, UTF-8 takes less space than UTF-16 (and mixes much
>>> better with code designed for ASCII), but many languages prefer
>>> UTF-16, potentially because it works better when treated as an
>>> array.
>> This IMO is a typical misconception of English-only speakers, which
>> has caused a lot of trouble in the evolution of programming
>> languages :-(
>
> Latin only? But afaik even for Cyrillic and the Semitic language group it
> doesn't matter.
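
To make the quoted trade-off concrete, here is a minimal C sketch of
mine (not from the posters above; it assumes a 16-bit unsigned short
as the UTF-16 code unit):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* "Gruesse" spelled with u-umlaut and sharp s:
         5 characters, but 7 bytes in UTF-8. */
      const char *u8 = "Gr\xC3\xBC\xC3\x9F" "e";
      printf("UTF-8:  %zu bytes\n", strlen(u8));   /* prints 7 */

      /* The same 5 characters as UTF-16 code units: 10 bytes, but
         u16[2] really is the third character. That indexing property
         only holds as long as the text stays inside the BMP. */
      const unsigned short u16[] = { 'G', 'r', 0x00FC, 0x00DF, 'e' };
      printf("UTF-16: %zu bytes\n", sizeof u16);   /* prints 10 */
      return 0;
  }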


The standard libraries were restricted to the ASCII character set in
the past, which caused trouble in all languages with additional or
entirely different characters. The introduction of codepages then
allowed at least 256 characters to be used, and nowadays most people
extend that frame only up to the Unicode BMP, ignoring the other
planes for reasons of memory and runtime requirements.
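
To see what staying inside the BMP costs, consider a character from
another plane, e.g. U+1D11E (musical symbol G clef). A sketch of
mine, not part of the original post:

  /* U+1D11E as a UTF-16 surrogate pair: two 16-bit units for one
     character, so "one unit == one character" (UCS-2 thinking)
     miscounts or splits it. */
  const unsigned short clef16[] = { 0xD834, 0xDD1E };

  /* The same character in UTF-8 is just a 4-byte sequence;
     byte-oriented string code passes it through unchanged. */
  const char *clef8 = "\xF0\x9D\x84\x9E";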


Unicode itself was a big leap, but it introduced new problems. The
UCS-2 standard ignored (deliberately?) the number of characters in
the well-known Chinese "character" sets. Name me a programming
language with proper support for native-language text in string
literals, and for a Unicode source code representation. The C
extension with wide string literals (L"...") is a bad hack, because
it ignores the existence of various possible character encodings. It
would have been easier to stay with 8-bit characters and move to
UTF-8 as the single encoding for strings, so that every Unicode-aware
editor could be used to edit source code in any natural language, and
everything could be stored in a single UTF-8 source file encoding.
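
A small sketch (mine) of why the encoding ambiguity matters: what
L"..." yields is implementation-defined, while a UTF-8 string in a
plain char array means the same bytes everywhere:

  #include <stdio.h>
  #include <wchar.h>

  int main(void)
  {
      /* wchar_t is 16 bits on Windows but 32 bits on most Unix
         systems, and the wide execution encoding is not pinned
         down by the C standard. */
      printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

      /* The same text as UTF-8 in a plain char string:
         "caf" followed by U+00E9, identical on every platform. */
      const char *cafe = "caf\xC3\xA9";
      puts(cafe);   /* renders correctly on any UTF-8 terminal */
      return 0;
  }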


Things are not much better outside programming languages: HTML and
related standards also use inconsistent encodings, differently
restricted character sets in dedicated places, and different escape
sequences.
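
For instance (my illustration, not part of the original post), one
and the same character, the Euro sign U+20AC, appears in at least
three unrelated escape syntaxes across the web standards:

  const char *html_ref = "&#x20AC;";   /* HTML numeric character reference */
  const char *url_enc  = "%E2%82%AC";  /* percent-encoded UTF-8 in a URL */
  const char *js_esc   = "\\u20ac";    /* JavaScript/JSON string escape */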


DoDi

