Making all strings UTF ranges has some risk of WTF

Michel Fortin michel.fortin at michelf.com
Thu Feb 4 19:42:04 PST 2010


On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> Rainer Deyke wrote:
>> Don wrote:
>>> I suspect that string, wstring should have been the primary types and
>>> had a .codepoints property, which returned a ubyte[] resp. ushort[]
>>> reference to the data. It's too late, of course. The extra value you get
>>> by having a specific type for 'this is a code point for a UTF8 string'
>>> seems to be very minor, compared to just using a ubyte.
>> 
>> If it's not too late to completely change the semantics of char[], then
>> it's also not too late to dump 'char' completely.  If it /is/ too late
>> to remove 'char', then 'char[]' should retain the current semantics and
>> a new string type should be added for the new semantics.
> 
> One idea I've had for a while was to have a universal string type:
> 
> struct UString {
>      union {
>          char[] utf8;
>          wchar[] utf16;
>          dchar[] utf32;
>      }
>      enum Discriminator { utf8, utf16, utf32 };
>      Discriminator kind;
>      IntervalTree!(size_t) skip;
>      ...
> }

That's a nice concept, but it seems to me that it adds a lot of overhead to 
improve a rather niche area. It's not often that you need to access 
characters by index. Generally, when you do, it's because you've already 
parsed the string and want to return to a previous location, in which case 
you're better off saving the range or the code-unit index during the first 
parse, rather than the code-point index.
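To illustrate, here's a minimal sketch (the `valuePart` helper is hypothetical, not from any library) of saving the code-unit index found during a scan so that coming back to it later is a cheap slice, with no re-decoding:

```d
import std.string : indexOf;

/// Hypothetical helper: returns the part after the first ':' by saving
/// the code-unit index found during the scan.
string valuePart(string s)
{
    ptrdiff_t unitIndex = s.indexOf(':'); // code-unit index, cheap to reuse
    return s[unitIndex + 1 .. $];         // slicing by code units is O(1)
}

void main()
{
    // 'é' occupies two UTF-8 code units, so ':' sits at code-unit
    // index 4, not at code-point index 3.
    string s = "clé: valeur";
    assert(s.indexOf(':') == 4);
    assert(valuePart(s) == " valeur");
}
```

Had we saved the code-point index 3 instead, getting back to the same spot would require decoding the string from the start again.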

But I have to say I'm quite satisfied with the way D handles strings in 
general. Easy access to code points combined with direct access to the 
underlying data is quite handy. I think it fits very well with a low-level 
language.
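As a small sketch of that duality (the `codePointCount` helper is mine, for illustration): indexing a `string` gives you the raw UTF-8 code units directly, while a `foreach` with a `dchar` loop variable decodes code points on the fly.

```d
/// Illustrative helper: count code points by letting foreach decode.
size_t codePointCount(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;  // dchar loop variable => UTF-8 decoding
    return n;
}

void main()
{
    string s = "héllo";
    assert(s.length == 6);           // direct data access: 6 code units
    assert(codePointCount(s) == 5);  // decoded view: 5 code points
}
```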

I'd say that in general, when manipulating strings, I rarely need to bother 
with code points. Most of the time I'm just searching for ASCII-range 
markers while parsing, so I can search for them directly as code units, 
not worrying at all about multi-byte characters. If I'm looking for a 
substring, I can search by code units too. It's only for fancier operations 
(case-insensitive searching, character transformations) that it becomes 
necessary to work with code points.
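The code-unit scan is safe because UTF-8 guarantees that no multi-byte sequence ever contains an ASCII byte. A minimal sketch (the `findEquals` splitter is hypothetical):

```d
/// Hypothetical splitter: finds the first ASCII '=' by scanning code units.
/// Safe in UTF-8: bytes of multi-byte sequences are all >= 0x80, so they
/// can never be mistaken for an ASCII marker.
size_t findEquals(string s)
{
    size_t i;
    while (i < s.length && s[i] != '=') ++i;
    return i;  // code-unit index of '=', or s.length if absent
}

void main()
{
    string line = "größe=42";        // key contains multi-byte characters
    size_t i = findEquals(line);
    assert(line[0 .. i] == "größe"); // slicing at a code-unit boundary
    assert(line[i + 1 .. $] == "42");
}
```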

That's why I'm a little wary about your changes in that area: I fear 
they'll make the common case of working with code units more difficult to 
deal with. But I won't judge before I see.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/




More information about the Digitalmars-d mailing list