Making all strings UTF ranges has some risk of WTF

Michel Fortin michel.fortin at michelf.com
Wed Feb 3 20:16:47 PST 2010


On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> It's no secret that string et al. are not a magic recipe for writing 
> correct Unicode code.
> 
> [...]
> 
> What would you do? Any ideas are welcome.

UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8 
string and want to search for an occurrence of that string in another 
UTF-8 string, you don't have to decode each multi-byte code point: a 
binary comparison is enough. If you're counting the number of 
code points, then all you need is to count the code units that are not 
continuation bytes (of the form 10xxxxxx). If on the other hand you're 
applying a character-by-character transformation, then you need to 
fully decode each character, unless you're only interested in 
transforming characters from the lower non-multibyte subrange of the 
encoding (which happens quite often).
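
A minimal sketch in Python of the three cases above, working directly on 
UTF-8 code units. The function names are made up for illustration, and 
the code assumes both inputs are valid UTF-8 in the same normalization 
form.

```python
def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle in haystack, or -1.

    A plain binary search is enough: in valid UTF-8 a lead byte can
    never be mistaken for a continuation byte, so a match always
    starts on a code point boundary.
    """
    return haystack.find(needle)


def count_code_points(data: bytes) -> int:
    """Count code points without decoding them.

    Every code point contributes exactly one byte that is not a
    continuation byte (continuation bytes look like 10xxxxxx).
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def upper_ascii(data: bytes) -> bytes:
    """Uppercase a-z only; bytes >= 0x80 pass through untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


text = "na\u00efve caf\u00e9".encode("utf-8")   # 10 code points, 12 bytes
offset = find_utf8(text, "caf\u00e9".encode("utf-8"))
count = count_code_points(text)
shouted = upper_ascii(text).decode("utf-8")
```

Only the last function touches the characters themselves, and even 
there the multi-byte sequences are copied through without being decoded.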

I don't think there's a one-size-fits-all way to iterate over 
string arrays. Fully decoding each code point is clearly the most costly 
method; it shouldn't be required when it's not necessary.

I think we need to be able to represent char[] and wchar[] as a range 
of dchar to deal with cases where you want to iterate over Unicode code 
points, but I'd let the programmer ultimately decide what to do.
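
To make the idea concrete, here is a rough Python sketch of a lazy 
code point view over an array of UTF-8 code units. It is not D's range 
interface; `code_points` is a hypothetical name, and the decoder assumes 
well-formed input (it rejects bad lead bytes but does not validate 
continuation bytes or truncated sequences).

```python
from typing import Iterator


def code_points(data: bytes) -> Iterator[int]:
    """Lazily yield code points from UTF-8 bytes, leaving data untouched."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: single-byte (ASCII)
            size, cp = 1, b
        elif b >> 5 == 0b110:        # 110xxxxx: two-byte sequence
            size, cp = 2, b & 0x1F
        elif b >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
            size, cp = 3, b & 0x0F
        elif b >> 3 == 0b11110:      # 11110xxx: four-byte sequence
            size, cp = 4, b & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        # Fold in the low six bits of each continuation byte.
        for c in data[i + 1 : i + size]:
            cp = (cp << 6) | (c & 0x3F)
        i += size
        yield cp


raw = "a\u00e9\u20ac\U0001F600".encode("utf-8")
decoded = list(code_points(raw))
```

The underlying array stays available as-is, so code that only needs 
code units pays nothing for the view.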

As for .length, I'll say that removing this property would make it hard 
to write low-level code. For instance, if I copy a string into a 
buffer, I need to know the length in bytes (array.length * 
array[0].sizeof), not the number of characters. So it doesn't make 
much sense to disable .length.
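
A small Python sketch of why the code unit count is the number the 
low-level code needs. The sample string and the variable names are 
assumptions for illustration only.

```python
s = "h\u00e9\U0001F600"            # 3 code points: 'h', e-acute, an emoji

utf8 = s.encode("utf-8")           # code units are 1 byte each
utf16 = s.encode("utf-16-le")      # code units are 2 bytes each

code_point_count = len(s)
utf8_units = len(utf8)
utf16_units = len(utf16) // 2

# Size of the destination buffer = number of code units * size of one unit.
utf8_bytes = utf8_units * 1
utf16_bytes = utf16_units * 2

# Copying needs the byte size; the code point count would be too small.
buf = bytearray(utf16_bytes)
buf[:] = utf16
```

The character count alone would undersize the buffer in both encodings.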

So my answer would be mostly to leave things as they are.

Perhaps the char[] and wchar[] as dchar ranges could be aliased to 
string and wstring, but that'd definitely be a blow to the philosophy 
of strings as simple arrays. You'd also still need to be able to access 
the actual array underneath. And will all the implicit conversions 
still work? I'm really not sure it's worth it, but perhaps.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



