Re: Internal Representation of Strings

Hans-Peter Diettrich <DrDiettrich1@aol.com>
Sat, 14 Feb 2009 16:41:55 +0100

From comp.compilers

From:	Hans-Peter Diettrich <DrDiettrich1@aol.com>
Newsgroups:	comp.compilers
Date:	Sat, 14 Feb 2009 16:41:55 +0100
Organization:	Compilers Central
References:	09-02-051
Keywords:	storage
Posted-Date:	14 Feb 2009 16:48:48 EST

Tony wrote:

> What are some good ways/concepts of internal string representation?

Depends on your preferences ;-)

> Are/should string literals, fixed-length strings and dynamic-length strings
> handled differently?

Mixing string types complicates the evaluation of string expressions,
and IMO should be avoided.

> My first tendency is to avoid like the plague
> NUL-terminated strings (aka, C strings)

Delphi appends a hidden trailing zero character to every string for
compatibility with API and external library calls, while zero
characters have no special meaning inside the "payload" (text). Another
compatibility issue arises from (external) data structures that contain
fixed-length strings.

> and to opt for some kind of array
> with a length at the beginning followed by the characters that could be
> encapsulated at the library level with appropriate functions.

That's a fairly common approach, but it raises a question about
references to such data structures: should a reference point to the
first character, or to the beginning of the data structure?
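
One way to answer that question is to let the reference point to the
first character and keep the length in a header just before it. The
following is only a minimal sketch of that layout with a hidden
trailing zero; StrHeader, str_new, str_header, str_length and str_free
are hypothetical names, not taken from Delphi or any other runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data. */
typedef struct {
    size_t length;   /* payload bytes, not counting the hidden NUL */
} StrHeader;

/* Create a string from len bytes of src. The returned pointer refers to
   the first character, not to the header. Returns NULL on failure. */
char *str_new(const char *src, size_t len)
{
    StrHeader *h = malloc(sizeof(StrHeader) + len + 1);
    if (h == NULL)
        return NULL;
    h->length = len;
    char *text = (char *)(h + 1);   /* payload starts right after the header */
    if (len > 0)
        memcpy(text, src, len);     /* embedded zero bytes are copied as-is */
    text[len] = '\0';               /* hidden terminator for C APIs */
    return text;
}

/* Step back from the character pointer to reach the header. */
StrHeader *str_header(const char *s)
{
    assert(s != NULL);
    return (StrHeader *)s - 1;
}

/* The length comes from the header, so no scan for NUL is needed. */
size_t str_length(const char *s)
{
    return str_header(s)->length;
}

/* Free the whole block, which begins at the header. */
void str_free(char *s)
{
    if (s != NULL)
        free(str_header(s));
}
```

With this layout the pointer can be passed unchanged to functions that
expect a C string, while the library still finds the length at a
negative offset. The price is that the pointer cannot be handed to
free() directly.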

> But just a
> length seems like not enough information: the capacity (array length) also
> would be nice to have around. All thoughts, old and novel, welcome.

The capacity of dynamic strings can be bundled with the allocated block
size (memory management). Static strings can use the same structure, for
binary compatibility.
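
Standard C offers no portable way to ask the allocator for the size of
a block, so the sketch below stores the capacity explicitly in the
header instead of taking it from the memory manager. DynStr, dynstr_new
and dynstr_append are hypothetical names, the doubling growth policy is
an assumption, and overflow checks are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dynamic string: header and characters in one block. */
typedef struct {
    size_t length;     /* bytes in use */
    size_t capacity;   /* payload bytes available, excluding the hidden NUL */
    char   text[];     /* C99 flexible array member */
} DynStr;

/* Allocate an empty string with room for capacity bytes. */
DynStr *dynstr_new(size_t capacity)
{
    DynStr *s = malloc(sizeof(DynStr) + capacity + 1);
    if (s == NULL)
        return NULL;
    s->length = 0;
    s->capacity = capacity;
    s->text[0] = '\0';
    return s;
}

/* Append n bytes of src. Returns the (possibly moved) string, or NULL on
   allocation failure, in which case the original string stays valid. */
DynStr *dynstr_append(DynStr *s, const char *src, size_t n)
{
    assert(s != NULL && src != NULL);
    size_t needed = s->length + n;
    if (needed > s->capacity) {
        size_t cap = s->capacity ? s->capacity : 8;
        while (cap < needed)
            cap *= 2;                       /* assumed growth policy */
        DynStr *grown = realloc(s, sizeof(DynStr) + cap + 1);
        if (grown == NULL)
            return NULL;
        s = grown;
        s->capacity = cap;
    }
    memcpy(s->text + s->length, src, n);
    s->length = needed;
    s->text[needed] = '\0';
    return s;
}
```

Because realloc may move the block, every caller has to continue with
the returned pointer, which is one reason why pointers into dynamic
strings are fragile.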

More thoughts:

Windows, Java and perhaps other implementations allow sharing of
entire strings, or of substrings of existing strings, so that copies
can often be avoided. Such features require some compiler support, i.e.
the language must distinguish between in-place and copy-on-write
changes to a string, and the compiler must generate code to manage
such strings. Reference counting can be used to distinguish shared
strings from strings with only one reference (which can be changed in
place), and to implement a kind of garbage collection that frees a
string once it is no longer referenced. Windows also uses the
reference count for flagging constant string literals.
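
The interplay of reference counts and copy-on-write might look roughly
like the sketch below. It is an illustration only: RcStr and the rc_*
functions are hypothetical, the fixed 32-byte buffer merely keeps the
example short, and using -1 as the marker for constant literals is an
assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    long   refs;      /* > 0: number of owners; -1: constant, never freed */
    size_t length;
    char   text[32];  /* fixed size only to keep the sketch short */
} RcStr;

/* Create a heap string with one owner; input is truncated to fit. */
RcStr *rc_new(const char *src, size_t len)
{
    RcStr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    if (len >= sizeof s->text)
        len = sizeof s->text - 1;
    s->refs = 1;
    s->length = len;
    memcpy(s->text, src, len);
    s->text[len] = '\0';
    return s;
}

/* Take another reference; constants are shared without counting. */
RcStr *rc_share(RcStr *s)
{
    if (s->refs > 0)
        s->refs++;
    return s;
}

/* Drop a reference; the last owner frees the block. */
void rc_release(RcStr *s)
{
    if (s->refs > 0 && --s->refs == 0)
        free(s);
}

/* Return a string that the caller may change in place. */
RcStr *rc_make_unique(RcStr *s)
{
    if (s->refs == 1)
        return s;                         /* sole owner: no copy needed */
    RcStr *copy = rc_new(s->text, s->length);
    if (copy != NULL)
        rc_release(s);                    /* give up the shared reference */
    return copy;
}

/* Copy-on-write store of one character. *ps may be replaced by a private
   copy. Returns 1 on success, 0 on a bad index or allocation failure. */
int rc_set_char(RcStr **ps, size_t i, char c)
{
    if (i >= (*ps)->length)
        return 0;
    RcStr *u = rc_make_unique(*ps);
    if (u == NULL)
        return 0;
    u->text[i] = c;
    *ps = u;
    return 1;
}
```

In a language with built-in strings the compiler would emit the calls
to rc_share, rc_release and rc_make_unique itself, which is the kind of
compiler support meant above.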

Experience with .NET has revealed performance problems with string
operators and their intermediate results. A compiler should therefore
provide additional (hidden) procedures for string concatenation that
are not limited to two arguments. When strings are implemented as
classes, the classes should be sealed (no subclasses), so that the
compiler knows all necessary details of the implementation.
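
A hidden concatenation procedure for any number of operands could look
roughly like this sketch, which a compiler might call for an expression
such as a + b + c + d instead of building three intermediate strings.
The name concat_n and its signature are assumptions made for
illustration; plain NUL-terminated C strings are used to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate n C strings with a single allocation and a single copy of
   each operand. Returns a malloc'ed string, or NULL on failure. */
char *concat_n(size_t n, const char *const parts[])
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(parts[i] != NULL);
        total += strlen(parts[i]);       /* first pass: final length */
    }

    char *result = malloc(total + 1);
    if (result == NULL)
        return NULL;

    char *out = result;
    for (size_t i = 0; i < n; i++) {     /* second pass: copy operands */
        size_t len = strlen(parts[i]);
        memcpy(out, parts[i], len);
        out += len;
    }
    *out = '\0';
    return result;
}
```

With length-prefixed strings the first pass would only add up the
stored lengths, so each operand is touched exactly once.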

In Java the implicit conversion to strings in mixed-type expressions is
very dangerous: 1 + 2 + "3" yields "33", while "1" + 2 + 3 yields "123".

Dynamic strings and pointers into such strings do not fit well
together, because a reallocation can leave such pointers dangling.

Codepages and MBCS (including UTF-8) are also problematic when
processing strings of different encodings. Some kind of encoding tag
should be stored with such strings, either explicitly or implied by
dedicated class types.
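
An explicit encoding tag could be as simple as an enumeration stored
next to the length, as in the following sketch. The Encoding values,
EncStr and enc_to_utf8 are hypothetical; only Latin-1 and UTF-8 are
handled here, and anything else is rejected rather than guessed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { ENC_LATIN1, ENC_UTF8, ENC_OTHER } Encoding;

/* Hypothetical string descriptor that carries its encoding. */
typedef struct {
    Encoding             encoding;
    size_t               length;   /* in bytes */
    const unsigned char *bytes;
} EncStr;

/* Write s as UTF-8 into out, which has room for cap bytes. Returns the
   number of bytes written, or -1 if the encoding is not supported or the
   buffer is too small. */
long enc_to_utf8(const EncStr *s, unsigned char *out, size_t cap)
{
    size_t n = 0;
    switch (s->encoding) {
    case ENC_UTF8:                       /* already UTF-8: plain copy */
        if (s->length > cap)
            return -1;
        memcpy(out, s->bytes, s->length);
        return (long)s->length;
    case ENC_LATIN1:                     /* each byte is one code point */
        for (size_t i = 0; i < s->length; i++) {
            unsigned char b = s->bytes[i];
            if (b < 0x80) {
                if (n + 1 > cap)
                    return -1;
                out[n++] = b;
            } else {                     /* U+0080..U+00FF need two bytes */
                if (n + 2 > cap)
                    return -1;
                out[n++] = (unsigned char)(0xC0 | (b >> 6));
                out[n++] = (unsigned char)(0x80 | (b & 0x3F));
            }
        }
        return (long)n;
    default:                             /* unknown tag: refuse to guess */
        return -1;
    }
}
```

The tag lets a library decide whether a conversion is needed before two
strings are compared or concatenated, instead of silently mixing bytes
of different encodings.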

Strings are often abused for storing binary data. IMO it's a good idea
to provide a dedicated string type for this purpose, with the usual
management functions, but no further automatic behavior (no encoding
conversion, no string operators...). IMO a distinction should be made
between strings as containers of physical (character, byte...) codes
(a convenient kind of dynamic array), and strings containing text (of a
natural language).

Unicode makes a character type quite useless, because logical
characters can consist of sequences of physical character codes. As with
MBCS types, references into a string require special functions for
moving from one logical character position to the next one; both pointer
and index arithmetic are inappropriate with such strings.
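
For UTF-8, such a stepping function might look like the sketch below.
The names utf8_seq_len, utf8_next and utf8_count are hypothetical, and
the code steps over code points only: a logical character such as "e"
followed by a combining accent (U+0301) still counts as two steps, so a
full solution would also need the Unicode segmentation rules.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the UTF-8 sequence introduced by lead, or 0 if lead is a
   continuation byte or otherwise not a valid lead byte. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Return the byte index of the code point that follows the one at index
   i in s[0..len). Malformed or truncated input advances by one byte, so
   the function always makes progress. */
size_t utf8_next(const unsigned char *s, size_t len, size_t i)
{
    assert(s != NULL);
    if (i >= len)
        return len;
    size_t n = utf8_seq_len(s[i]);
    if (n == 0 || i + n > len)
        return i + 1;
    return i + n;
}

/* Count code points by stepping, never by plain index arithmetic. */
size_t utf8_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i = utf8_next(s, len, i))
        count++;
    return count;
}
```

For the six bytes 65 CC 81 E2 82 AC ("e", U+0301, U+20AC) the function
visits the indices 0, 1 and 3, although a reader sees only two
characters.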

IMO the environment (platform, libraries...) should be considered when
choosing the most appropriate string encoding and character type, in
order to reduce the number of encoding conversions required. While
16-bit characters are appropriate on Windows platforms (UTF-16, BSTR),
most other platforms will be happier with UTF-8 strings.

DoDi
