Working with utf

Frits van Bommel fvbommel at REMwOVExCAPSs.nl
Thu Jun 14 06:43:20 PDT 2007


Regan Heath wrote:
> I think what we want for this is a String class which internally stores the data as utf-8, 16 or 32 (making its own decision or being told which to use) and provides slicing of characters as opposed to code points.

Your time away from D (6 months was it?) is showing...
There's such a string implementation at 
http://www.dprogramming.com/dstring.php (Though IIRC it's a struct, not 
a class ;) )

Features:
* Indexing and slicing always work on code point indices, not code 
units.
* Contents are stored as chars, wchars, or dchars (whichever is the 
smallest type large enough to store every code point in the string as a 
single code unit).
* The size taken to store a string instance is equal to (char[]).sizeof: 
2 * size_t.sizeof.
* The upper two bits of the size_t containing the length are used as a 
flag for what type the pointer refers to (char/wchar/dchar). This 
divides the maximum length by 4, but that still allows strings up to 1 
GiB on 32-bit machines, and much bigger on 64-bit machines.
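
To make the last two points concrete, here's a rough sketch in Python 
(not D, and not dstring's actual code). All the names are made up for 
illustration, and I'm assuming a 32-bit size_t, that "char" is only 
picked for pure-ASCII contents, and an arbitrary numbering of the tag 
values:

```python
# Hypothetical sketch; none of these names come from dstring itself.
SIZE_BITS = 32                     # assumption: 32-bit size_t
TAG_SHIFT = SIZE_BITS - 2          # the upper two bits hold the type tag
LEN_MASK = (1 << TAG_SHIFT) - 1    # the lower 30 bits hold the length

# Assumed tag numbering; the real flag values may differ.
TAG_CHAR, TAG_WCHAR, TAG_DCHAR = 0, 1, 2


def pick_tag(text):
    """Pick the narrowest unit that holds every code point in one unit."""
    highest = max(map(ord, text), default=0)
    if highest <= 0x7F:            # ASCII: one UTF-8 code unit each
        return TAG_CHAR
    if highest <= 0xFFFF:          # inside the BMP: one 16-bit unit each
        return TAG_WCHAR
    return TAG_DCHAR               # anything else needs 32-bit units


def pack(length, tag):
    """Combine a length and a type tag into one size_t-sized word."""
    if not 0 <= length <= LEN_MASK:
        raise ValueError("length does not fit in the lower 30 bits")
    return (tag << TAG_SHIFT) | length


def unpack(word):
    """Split a packed word back into (length, tag)."""
    return word & LEN_MASK, word >> TAG_SHIFT
```

With two bits reserved for the tag, LEN_MASK works out to 2**30 - 1, 
which is where the "up to 1 GiB" figure comes from.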

It can still be a problem that one "logical" character sometimes needs 
multiple code points: a base character followed by extra code points 
for diacritics (accents). Indexing by code point can then split such a 
character. The size limitation mentioned above can theoretically be a 
problem too, but is probably a rare one. (When was the last time you 
needed more than a billion characters in a single string?)
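
A small illustration of the diacritics issue, using Python's standard 
unicodedata module rather than anything D-specific (Python's str is 
also indexed by code point, so it behaves the same way here):

```python
import unicodedata

composed = "\u00e9"    # 'é' as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)   # 'e' + U+0301

# Same logical character, different number of code points.
composed_len = len(composed)        # 1
decomposed_len = len(decomposed)    # 2

# Slicing by code point index can cut the accent off its base letter.
first = decomposed[:1]              # plain 'e', accent lost

# A non-zero combining class marks U+0301 as a combining mark.
is_combining = unicodedata.combining(decomposed[1]) != 0
```

Normalizing to NFC first helps when a precomposed form exists, but not 
every base-plus-accent combination has one.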

Basically though, this allows you to pretend it's a dchar[] without the 
memory penalty. It only uses dchars internally if you use characters 
outside the BMP (> 0xFFFF), which are rarely needed in non-Asian 
languages.
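
To put rough numbers on the memory penalty, a back-of-the-envelope 
sketch in Python. The function names are made up, and the 1/2/4-byte 
widths assume char is only used for pure-ASCII contents:

```python
def adaptive_bytes(text):
    """Payload size if the narrowest sufficient unit type is used."""
    highest = max(map(ord, text), default=0)
    if highest <= 0x7F:
        width = 1      # char
    elif highest <= 0xFFFF:
        width = 2      # wchar
    else:
        width = 4      # dchar
    return width * len(text)


def dchar_bytes(text):
    """Payload size if everything is always stored as dchar[]."""
    return 4 * len(text)


ascii_cost = (adaptive_bytes("hello"), dchar_bytes("hello"))
bmp_cost = (adaptive_bytes("h\u00e9llo"), dchar_bytes("h\u00e9llo"))
```

Here ascii_cost is (5, 20) and bmp_cost is (10, 20): a single non-ASCII 
character doubles the payload, but it only reaches the dchar[] size 
once something outside the BMP shows up.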


