Re: Internal Representation of Strings

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Sat, 14 Feb 2009 12:43:57 +0100

          From comp.compilers

Related articles
Internal Representation of Strings tony@my.net (Tony) (2009-02-14)
Re: Internal Representation of Strings mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-02-14)
Re: Internal Representation of Strings haberg_20080406@math.su.se (Hans Aberg) (2009-02-14)
Re: Internal Representation of Strings DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-14)
Re: Internal Representation of Strings marcov@stack.nl (Marco van de Voort) (2009-02-14)
Re: Internal Representation of Strings anton@mips.complang.tuwien.ac.at (2009-02-14)
Re: Internal Representation of Strings cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-14)
Re: Internal Representation of Strings lkrupp@pssw.nospam.com.invalid (Louis Krupp) (2009-02-14)
[32 later articles]
| List of all articles for this month |

From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.compilers
Date: Sat, 14 Feb 2009 12:43:57 +0100
Organization: cbb software GmbH
References: 09-02-051
Keywords: storage
Posted-Date: 14 Feb 2009 16:46:17 EST

On Sat, 14 Feb 2009 01:21:26 -0600, Tony wrote:


> What are some good ways/concepts of internal string representation?


To have several related string types bound by a subtype relation, so that
the user would not see representation differences.


> Are/should string literals, fixed-length strings and dynamic-lenght strings
> handled differently?


Yes, but transparently to the user.


> My first tendency is to avoid like the plague
> NUL-terminated strings (aka, C strings) and to opt for some kind of array
> with a length at the beginning followed by the characters that could be
> encapsulated at the library level with appropriate functions.


The issue is IMO deeper than merely NUL controversy. The problem is about
encoded vs. raw strings. C-ish NUL nonsense arise from mixing "string", an
array of code points from 1 to 255 with "encoded string", arrays of octets
used to represent the former. NUL merely is an artefact of abstraction
inversion when code points and octets get confused.


If you want to support UTF-8 and other encoding stuff you should carefully
separate raw arrays of storage elements from arrays of code points. What
you would like to call "string" in your language is up to you.


> But just a length seems like not enough information: the capacity
> (array length) also would be nice to have around. All thoughts, old
> and novel, welcome.


Typically you have some index type to be used with the array. An
instance of the array with the index constrained to some contiguous
range is a string object. The nature of the constraint would define
the representation. If the constraint is statically known (as with
literals), it makes no sense to keep it (index range bounds) in the
object. So you inevitably need many representations. On the other
side, when you pass a statically constrained string to a subprogram
expecting a run-time constrained one, you would need to generate dope
to the string body. If you don't want to copy string body you have to
pass the dope separately to the body etc.


Another issue is empty strings. You have to be able to express the
constraint of null range. The problem is that null range cannot be
always specified as lower-upper bounds pair. The technique used for
instance in Ada, where N..N-1 gives an empty array is quite
problematic. When the index type has only one value there is no way to
construct an empty array of. Another approach would be to use the
lower bound and the range length. This has a drawback that the length
must be a numeric type, while the array index could be an enumeration,
which does not have "+" and "-". And it is still problematic when the
index type has no values. Empty array would still exist, but there
would be no lower bound to specify.


--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.