Making all strings UTF ranges has some risk of WTF
Michel Fortin
michel.fortin at michelf.com
Wed Feb 3 20:16:47 PST 2010
On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> It's no secret that string et al. are not a magic recipe for writing
> correct Unicode code.
>
> [...]
>
> What would you do? Any ideas are welcome.
UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8
string and want to search for an occurrence of that string in another
UTF-8 string, you don't have to decode each multi-byte code point: a
binary comparison is enough. If you're counting the number of
code points, then all you need is to count the code units that are not
continuation bytes (top two bits other than 10). If on the other hand you're
applying a character-by-character transformation, then you need to
fully decode each character, unless you're only interested in
transforming characters from the lower non-multibyte subrange of the
encoding (which happens quite often).
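
To make the three cases above concrete, here is a minimal sketch in
Python (chosen only for illustration, it is not D). It assumes both
inputs are well-formed UTF-8 in the same normalization form; the
function names are hypothetical. It searches by plain byte comparison,
counts code points by counting every code unit that is not a
continuation byte (10xxxxxx), and transforms only the ASCII subrange
without decoding anything.

```python
def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle in haystack, or -1. No decoding."""
    return haystack.find(needle)


def count_code_points(data: bytes) -> int:
    """Count code units that are not continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def upper_ascii(data: bytes) -> bytes:
    """Uppercase a-z only; code units >= 0x80 pass through untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


text = "naïve café".encode("utf-8")    # 10 code points, 12 code units
offset = find_utf8(text, "café".encode("utf-8"))
count = count_code_points(text)
shouted = upper_ascii(text).decode("utf-8")
```

Note that the offset returned by the search is in code units, not in
code points, which is exactly what you want if you then slice the
underlying array.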
Clearly, I don't think there's a one-size-fits-all way to iterate over
string arrays. Fully decoding each code point is the most costly
method; it shouldn't be required when it's not necessary.
I think we need to be able to represent char[] and wchar[] as a range
of dchar to deal with cases where you want to iterate over Unicode code
points, but I'd let the programmer ultimately decide what to do.
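
As a rough sketch of what "a range of code points over an array of
code units" could look like, here is a lazy decoder written in Python
for illustration only. The name code_points is hypothetical, and the
code assumes well-formed UTF-8: it performs no validation, so it is
not a substitute for a real decoder. The underlying byte array stays
directly accessible, and decoding happens only when the caller asks
for it.

```python
from typing import Iterator


def code_points(data: bytes) -> Iterator[int]:
    """Lazily yield code points from well-formed UTF-8 (no validation)."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us the sequence length and its payload bits.
        if lead < 0x80:
            size, value = 1, lead
        elif lead < 0xE0:
            size, value = 2, lead & 0x1F
        elif lead < 0xF0:
            size, value = 3, lead & 0x0F
        else:
            size, value = 4, lead & 0x07
        # Each continuation byte contributes its low six bits.
        for b in data[i + 1 : i + size]:
            value = (value << 6) | (b & 0x3F)
        yield value
        i += size


sample = "aé€😀"
decoded = list(code_points(sample.encode("utf-8")))
```

Code that only needs the raw code units keeps using the array itself
and never pays for this loop.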
As for .length, I'll say that removing this property would make it hard
to write low-level code. For instance, if I copy a string into a
buffer, I need to know the length in bytes (array.length *
sizeof(array[0])), not the number of characters. So it doesn't make
much sense to disable .length.
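
A small illustration of the difference, again in Python and only as a
sketch: the buffer has to be sized from the number of code units times
the size of one code unit, which differs from the number of characters
as soon as the text leaves ASCII. The variable names are hypothetical.

```python
text = "héllo wörld"

utf8 = text.encode("utf-8")         # 1-byte code units
utf16 = text.encode("utf-16-le")    # 2-byte code units, no BOM

char_count = len(text)              # code points
utf8_units = len(utf8)              # code units == bytes for UTF-8
utf16_units = len(utf16) // 2       # code units
utf16_bytes = utf16_units * 2       # length * size of one code unit

# Copying needs the byte size, not the character count.
buffer = bytearray(utf8_units)
buffer[:] = utf8
```

Sizing the buffer from char_count would truncate the copy by two
bytes here, because "é" and "ö" each take two UTF-8 code units.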
So my answer would be mostly to leave things as they are.
Perhaps the char[] and wchar[] as dchar ranges could be aliased to
string and wstring, but that'd definitely be a blow to the philosophy
of strings as simple arrays. You'd also still need to be able to access
the actual array underneath. And will all the implicit conversions
still work? I'm really not sure it's worth it, but perhaps.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/