Making all strings UTF ranges has some risk of WTF
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Feb 3 18:53:55 PST 2010
Chad J wrote:
> Andrei Alexandrescu wrote:
>> ...
>>
>> What can be done about that? I see a number of solutions:
>>
>> (a) Do not operate the change at all.
>>
>> (b) Operate the change and mention that in range algorithms you should
>> check hasLength and only then use "length" under the assumption that it
>> really means "elements count".
>>
>> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
>> a different name for that. Any other name (codeUnits, codes etc.) would
>> do. The entire point is to not make algorithms believe strings have a
>> .length property.
>>
>> (d) Have std.range define a distinct property called e.g. "count" and
>> then specialize it appropriately. Then change all references to .length
>> in std.algorithm and elsewhere to .count.
>>
>> What would you do? Any ideas are welcome.
>>
>>
>> Andrei
>
> I'm leaning towards (c) here.
>
> To me the .length on char[] and wchar[] are kinda like doing this:
>
> struct SomePOD
> {
>     int a, b;
>     double y;
> }
>
>
> SomePOD pod;
> auto len = pod.length;
> assert(len == 16); // true.
>
>
> I'll admit it's not a perfect analogy. What I'm playing on here is that
> the .length on char[] and wchar[] returns the /size of/ the string in
> bytes rather than the /length/ of the string in number of (well-formed)
> characters.
>
> Unfortunately .sizeof is supposed to return the size of the string's
> reference (8 bytes on x86 systems) and not the size of the string, IIRC.
> So that's taken.
>
> So perhaps a .bytes or .nbytes property. Maybe make it work for arrays
> of structs and things like that too. A tuple (or any container) of
> non-homogeneous elements could probably benefit from this property as well.
>
> Given such a property being available, I wouldn't miss .length at all.
> It's quite misleading.

I hear you. Actually, to either quench or add to the confusion, .length
for wstring returns the length in 16-bit units, not bytes.

Andrei
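
To illustrate the distinction between code units, bytes, and characters,
here is a minimal D sketch. The sample character (U+1D11E, outside the
Basic Multilingual Plane) is an assumption chosen so that all three
counts differ; walkLength is taken from std.range as it exists in
current Phobos, which may not match the library as it stood when this
thread was written.

```d
import std.range : walkLength;

void main()
{
    // One code point, U+1D11E (musical symbol G clef), in three encodings.
    string  s = "\U0001D11E";   // UTF-8
    wstring w = "\U0001D11E"w;  // UTF-16
    dstring d = "\U0001D11E"d;  // UTF-32

    // .length counts code units, whose width depends on the element type.
    assert(s.length == 4);  // four 8-bit units
    assert(w.length == 2);  // two 16-bit units (a surrogate pair)
    assert(d.length == 1);  // one 32-bit unit

    // The size in bytes is the code-unit count times the unit size.
    assert(w.length * wchar.sizeof == 4);

    // .sizeof describes the array reference (length plus pointer),
    // not the string contents.
    static assert(string.sizeof == 2 * size_t.sizeof);

    // Counting characters (code points) requires decoding the string.
    assert(s.walkLength == 1);
    assert(w.walkLength == 1);
}
```

In this sketch none of the three .length values for char[] and wchar[]
data equals the character count, and only the UTF-8 one happens to equal
the byte count.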