Working with utf
Frits van Bommel
fvbommel at REMwOVExCAPSs.nl
Thu Jun 14 06:43:20 PDT 2007
Regan Heath wrote:
> I think what we want for this is a String class which internally stores the data as utf-8, 16 or 32 (making its own decision or being told which to use) and provides slicing of characters as opposed to code points.
Your time away from D (6 months was it?) is showing...
There's such a string implementation at
http://www.dprogramming.com/dstring.php (Though IIRC it's a struct, not
a class ;) )
Features:
* Indexing and slicing always work on code point indices, not code
unit indices.
* Contents are stored as chars, wchars, or dchars (whichever is
sufficiently large to store every code point in the string as a single
code unit).
* The size taken to store a string instance is equal to (char[]).sizeof:
2 * size_t.sizeof.
* The upper two bits of the size_t containing the length are used as a
flag for what type the pointer refers to (char/wchar/dchar). This cuts
the maximum length to a quarter, but that still allows strings up to 1
GiB on 32-bit machines, and much bigger on 64-bit machines.
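The length-plus-flag trick from the last bullet can be sketched as follows. This is only an illustration in Python, not the dstring source (which is written in D): the 32-bit word size, the tag values and the names `pack` and `unpack` are all assumptions made for the example.

```python
# Hypothetical layout for illustration only; the real dstring is written
# in D and its actual bit assignments may differ.
WORD_BITS = 32                      # pretend size_t is 32 bits wide
LEN_BITS = WORD_BITS - 2            # the upper two bits hold the type tag
LEN_MASK = (1 << LEN_BITS) - 1      # the low 30 bits hold the length
MAX_LENGTH = LEN_MASK               # 2**30 - 1 code units

TAG_CHAR, TAG_WCHAR, TAG_DCHAR = 0, 1, 2   # assumed tag values


def pack(length, tag):
    """Combine a length and a type tag into one word."""
    if not 0 <= length <= MAX_LENGTH:
        raise ValueError("length does not fit in 30 bits")
    if tag not in (TAG_CHAR, TAG_WCHAR, TAG_DCHAR):
        raise ValueError("unknown tag")
    return (tag << LEN_BITS) | length


def unpack(word):
    """Split a packed word back into (length, tag)."""
    return word & LEN_MASK, word >> LEN_BITS


word = pack(5, TAG_WCHAR)
print(hex(word))        # 0x40000005
print(unpack(word))     # (5, 1)
```

Giving up two bits of a 32-bit length leaves 30 bits, which is where the factor of four and the limit of roughly a billion code units come from.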
One remaining problem is that a single "logical" character sometimes
needs multiple code points: a base character plus extra ones for
diacritics (accents). The above-mentioned size limitation can
theoretically be a problem too, but is probably a rare one. (When was
the last time you needed more than a billion characters in a single
string?)
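To make the diacritics caveat concrete, here is a small Python illustration (Python strings are also indexed by code point, so the behaviour is comparable; it says nothing about how dstring itself handles such input). The same accented letter can be one code point or two, and slicing by code point can split the second form.

```python
import unicodedata

composed = "\u00e9"          # U+00E9, e with acute as a single code point
decomposed = "e\u0301"       # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(len(composed))         # 1
print(len(decomposed))       # 2
print(composed == decomposed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True

# Slicing by code point can separate a base letter from its accent.
print(decomposed[:1] == "e")                  # True
print(unicodedata.combining(decomposed[1]))   # 230 (non-zero: a combining mark)
```

Normalizing to NFC before slicing helps for characters that have a precomposed form, but not every base-plus-mark sequence has one.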
Basically though, this allows you to pretend it's a dchar[] without the
memory penalty (it only uses dchars internally if you use characters
outside the BMP (> 0xFFFF), which are rarely needed in non-Asian languages).
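A rough Python sketch of that memory argument follows. The helper `code_unit_size` is hypothetical, and the thresholds are an assumption: a char is taken to hold one code point only up to 0x7F (a single UTF-8 code unit) and a wchar up to 0xFFFF; dstring's actual rules may differ.

```python
def code_unit_size(text):
    """Bytes per code unit needed so that every code point is one unit.

    Assumed thresholds: char up to 0x7F, wchar up to 0xFFFF, else dchar.
    """
    highest = max(map(ord, text), default=0)
    if highest <= 0x7F:
        return 1    # char
    if highest <= 0xFFFF:
        return 2    # wchar
    return 4        # dchar


# Columns: code points, bytes per unit, bytes used, bytes as a plain dchar[]
for sample in ("hello", "h\u00e9llo", "hello \U0001D11E"):
    size = code_unit_size(sample)
    print(len(sample), size, len(sample) * size, len(sample) * 4)
# 5 1 5 20
# 5 2 10 20
# 7 4 28 28
```

Under these assumptions a single character outside the BMP widens the whole string to 4-byte units, so the saving applies only to strings that contain none.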