Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Wed Feb 3 18:53:55 PST 2010


Chad J wrote:
> Andrei Alexandrescu wrote:
>> ...
>>
>> What can be done about that? I see a number of solutions:
>>
>> (a) Do not operate the change at all.
>>
>> (b) Operate the change and mention that in range algorithms you should
>> check hasLength and only then use "length" under the assumption that it
>> really means "elements count".
>>
>> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
>> a different name for that. Any other name (codeUnits, codes etc.) would
>> do. The entire point is to not make algorithms believe strings have a
>> .length property.
>>
>> (d) Have std.range define a distinct property called e.g. "count" and
>> then specialize it appropriately. Then change all references to .length
>> in std.algorithm and elsewhere to .count.
>>
>> What would you do? Any ideas are welcome.
>>
>>
>> Andrei
> 
> I'm leaning towards (c) here.
> 
> To me the .length on char[] and wchar[] are kinda like doing this:
> 
> struct SomePOD
> {
>     int a, b;
>     double y;
> }
> 
> 
> SomePOD pod;
> auto len = pod.length;
> assert(len == 16); // true.
> 
> 
> I'll admit it's not a perfect analogy.  What I'm playing on here is that
> the .length on char[] and wchar[] returns the /size of/ the string in
> bytes rather than the /length/ of the string in number of (well-formed)
> characters.
> 
> Unfortunately .sizeof is supposed to return the size of the string's
> reference (8 bytes on x86 systems) and not the size of the string, IIRC.
> So that's taken.
> 
> So perhaps a .bytes or .nbytes property.  Maybe make it work for arrays
> of structs and things like that too.  A tuple (or any container) of
> non-homogeneous elements could probably benefit from this property as well.
> 
> Given such a property being available, I wouldn't miss .length at all.
> It's quite misleading.

I hear you. Actually, to either quench or add to the confusion, .length
for wstring returns the length in 16-bit units, not bytes.
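
To make the distinction concrete, here is a minimal sketch (not part of
the original message) comparing .length across the three string types.
The sample text and variable names are illustrative assumptions; the
sketch relies only on the built-in .length and .sizeof properties and on
std.utf.count, which counts code points.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    // Illustrative sample: U+00E9 lies inside the BMP, U+1D11E outside it.
    string  s = "\u00E9\U0001D11E"; // UTF-8:  2 + 4 code units
    wstring w = "\u00E9\U0001D11E"; // UTF-16: 1 + 2 code units (surrogate pair)
    dstring d = "\u00E9\U0001D11E"; // UTF-32: 1 + 1 code units

    // .length counts code units of the array's element type, not bytes
    // and not characters.
    assert(s.length == 6);
    assert(w.length == 3);
    assert(d.length == 2);

    // The size in bytes is the code-unit count times the code-unit size.
    assert(w.length * wchar.sizeof == 6);

    // Decoding gives the same code-point count for all three encodings.
    assert(count(s) == 2);
    assert(count(w) == 2);
    assert(count(d) == 2);

    writeln(s.length, " ", w.length, " ", d.length);
}
```

Only for dstring does .length coincide with the code-point count here;
for string it happens to equal the byte count because a UTF-8 code unit
is one byte wide.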

Andrei
