Making all strings UTF ranges has some risk of WTF

grauzone none at example.net
Wed Feb 3 19:05:00 PST 2010


Andrei Alexandrescu wrote:
> What can be done about that? I see a number of solutions:
> 
> (a) Do not operate the change at all.
> 
> (b) Operate the change and mention that in range algorithms you should 
> check hasLength and only then use "length" under the assumption that it 
> really means "elements count".
> 
> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define 
> a different name for that. Any other name (codeUnits, codes etc.) would 
> do. The entire point is to not make algorithms believe strings have a 
> .length property.
> 
> (d) Have std.range define a distinct property called e.g. "count" and 
> then specialize it appropriately. Then change all references to .length 
> in std.algorithm and elsewhere to .count.
> 
> What would you do? Any ideas are welcome.

Change the type of string literals from char[] (or whatever the string 
type is in D2) to a wrapper struct defined in object.d:

struct string {
    char[] raw;  // the underlying UTF-8 code units
}

Now string.length no longer compiles, and you don't need the workarounds 
in (b) or (c).
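
For illustration, a minimal sketch of the effect, assuming the struct 
above shadows the current alias (demo is just a made-up name):

void demo(string s) {
    // auto n = s.length;       // compile error: the wrapper defines no
    //                          // .length, so no algorithm can misuse it
    auto units = s.raw.length;  // explicit: number of UTF-8 code units
}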

From here, you could do one of two things:
1. add accessor methods to string, as string classes in other languages do
2. leave the wrapper struct as it is (just add the required range 
primitives), and require the user to go through either a) the range API 
(with UTF-8 decoding etc.) or b) the raw "byte" string via string.raw. A 
sketch of the range primitives follows below.
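
Here is a minimal sketch of option 2, using the real Phobos helpers 
std.utf.decode and std.utf.stride, and naming the struct Str only to 
avoid clashing with the existing alias:

import std.utf : decode, stride;

struct Str {
    immutable(char)[] raw;   // direct access to the UTF-8 code units

    // Input-range primitives: iteration yields decoded dchar code
    // points, never raw code units.
    @property bool empty() { return raw.length == 0; }

    @property dchar front() {
        size_t i = 0;
        // decode one code point; throws UTFException on malformed input
        return decode(raw, i);
    }

    void popFront() {
        raw = raw[stride(raw, 0) .. $];  // skip one code point's units
    }
}

unittest {
    import std.algorithm : count;
    auto s = Str("h\u00e9llo");
    assert(s.raw.length == 6);  // 6 code units ("é" is 2 bytes in UTF-8)
    assert(count(s) == 5);      // but 5 code points via the range API
}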

I really liked how strings were simply char[]s, but now with immutable 
there's a lot of noise around them anyway, and there's no real value in 
strings being array slices anymore. Making the user deal directly with 
UTF-8 was probably a bad idea to begin with.


