Re: Internal Representation of Strings

"cr88192" <cr88192@hotmail.com>
Mon, 16 Feb 2009 09:27:29 +1000

          From comp.compilers

Related articles
[2 earlier articles]
Re: Internal Representation of Strings haberg_20080406@math.su.se (Hans Aberg) (2009-02-14)
Re: Internal Representation of Strings DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-14)
Re: Internal Representation of Strings marcov@stack.nl (Marco van de Voort) (2009-02-14)
Re: Internal Representation of Strings anton@mips.complang.tuwien.ac.at (2009-02-14)
Re: Internal Representation of Strings cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-14)
Re: Internal Representation of Strings lkrupp@pssw.nospam.com.invalid (Louis Krupp) (2009-02-14)
Re: Internal Representation of Strings cr88192@hotmail.com (cr88192) (2009-02-16)
Re: Internal Representation of Strings tony@my.net (Tony) (2009-02-15)
Re: Internal Representation of Strings DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-16)
Re: Internal Representation of Strings bartc@freeuk.com (Bartc) (2009-02-16)
Re: Internal Representation of Strings wclodius@lost-alamos.pet (2009-02-16)
Re: Internal Representation of Strings ArarghMail902@Arargh.com (2009-02-17)
Re: Internal Representation of Strings bartc@freeuk.com (Bartc) (2009-02-18)
[25 later articles]
| List of all articles for this month |

From: "cr88192" <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Mon, 16 Feb 2009 09:27:29 +1000
Organization: albasani.net
References: 09-02-051
Keywords: storage
Posted-Date: 16 Feb 2009 04:27:30 EST

"Tony" <tony@my.net> wrote
> What are some good ways/concepts of internal string representation?


depends on how they will be being used.
I opt to minimize interop issues with existing code.


> Are/should string literals, fixed-length strings and dynamic-lenght
> strings handled differently? My first tendency is to avoid like the
> plague NUL-terminated strings (aka, C strings) and to opt for some
> kind of array with a length at the beginning followed by the
> characters that could be encapsulated at the library level with
> appropriate functions.


IMO, this should not be necessary (avoiding ASCIIZ style strings).
the reason is that, if strings are actually being used for text, NUL
should almost never happen in practice.


However, if one is using UTF-8, there is a trick used by the JVM and
friends, which maps the NUL character to a different representation. I
ended up adding a similar hack in my case for UFT-16, where the NUL is
hidden in another codepoint (I used 10FFFE I think, I forget, though
in UTF-16 there is no good or obvious option...). the main reason for
this hack though is mostly to allow converting strings with embedded
NULs into UTF-16 without truncating the string.


As for UTF-8 vs UTF-16, I personally by-far opt for UTF-8, mostly
because it saves a good deal of space in most cases (and also,
"proper" handling of UTF-16 isn't really any easier than UTF-8, where
I don't consider treating UTF-16 text as a big array of 16-bit values
as proper, especially if there is a risk of characters outside the
16-bit range, ...).


In general, I keep UTF-8 and UTF-16 strings separate, but it is
possible that some library functions could allow both equally, for
example, by either using automatic recognition (problematic), or by
having UTF-16 strings generally start with a BOM (inconvinient).


Something used in my case is actually, that, every on-heap object has
an associated type (this is transparent, as the heap-pointer points
after the header), and so I can use different type names for different
string types.


Note that it is problematic, as conversions between UTF-8 and UTF-16
may not be in all cases lossless (one may convert from one to the
other and back again and in some cases get a different string, ...).


As for length-prefixed strings, there is this often critical and
awkward limitation: the strings will often have an arbitrary max
length limit.


So, for many cases where frameworks have used length prefixes, this ended up
being an arbitrary limit at 127, 255, 32767, or 65535 characters...
this requires either using a size field larger than the maximum sane size of
a string (for example, 32 or 64 bits...), or the use of a variable-length
coding (more what I would opt for, where a small string only needs a single
byte for the length, but a very long string can still be handled).




> But just a length seems like not enough information: the capacity
> (array length) also would be nice to have around. All thoughts, old
> and novel, welcome.


This goes into the awkward area where many frameworks opt for
representing their strings as objects. the cost to this though is
that it may add a good amount of overhead to working with strings, but
also a few partial advantages (for example, it is much easier to
create substrings, get multiple string literals to share the same
character array, ...).


But, on this front, another case is whether or not the implementation
should make use of "ropes" for the strings, where the idea is that any
non-trivial string becomes a rope, ...


as another thing:
it makes sense to make a contrast between "strings" and "character buffers".


So, in my thinking, strings, as such, are always assumed to be
immutable. so, if one wants to modify a string, they copy it to a
character buffer, and then later retrieve the new string from this
buffer.


So, for example, the string is UTF-8, but could be copied into a
UTF-16 or UTF-32 buffer, worked on, and then copied back into a UTF-8
or UTF-16 string when done.


One advantage of this approach is that it allows effectively merging
duplicate copies of the same strings (one can keep a big hash tables
of seen strings, and so if a string is seen for one already in the
hash, it is reused, and if not, it is added).


As a cost though, one has to wait for the GC to collect strings (I use
a weak hash for all this), which can hurt performance in some cases
(dealing with a large number of unique short-lived strings creating a
good deal of strain on the garbage collector).




For many tasks, I may also use what I call a "rotator", where the
strings (and often many other things) are essentially allocated in a
big circular buffer, and it is assumed that the code will be done with
them before they get overwritten (although it is crude, it is fairly
effective and convinient). this reduces the amount of garbage
generated for short-lived strings and objects in many cases.


So, if the string needs to be preserved, it is copied from the rotator
onto the main heap, and becomes the GC's responsibility.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.