Making all strings UTF ranges has some risk of WTF

Chad J chadjoan at __spam.is.bad__gmail.com
Wed Feb 3 18:50:58 PST 2010


Andrei Alexandrescu wrote:
> ...
> 
> What can be done about that? I see a number of solutions:
> 
> (a) Do not operate the change at all.
> 
> (b) Operate the change and mention that in range algorithms you should
> check hasLength and only then use "length" under the assumption that it
> really means "elements count".
> 
> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
> a different name for that. Any other name (codeUnits, codes etc.) would
> do. The entire point is to not make algorithms believe strings have a
> .length property.
> 
> (d) Have std.range define a distinct property called e.g. "count" and
> then specialize it appropriately. Then change all references to .length
> in std.algorithm and elsewhere to .count.
> 
> What would you do? Any ideas are welcome.
> 
> 
> Andrei

I'm leaning towards (c) here.

To me, .length on char[] and wchar[] is kinda like doing this:

struct SomePOD
{
    int a, b;
    double y;
}


SomePOD pod;
auto len = pod.length; // hypothetical: pretend structs had a .length
                       // that returned their size in bytes
assert(len == 16);     // true: two 4-byte ints plus an 8-byte double


I'll admit it's not a perfect analogy.  What I'm playing on here is
that .length on char[] and wchar[] returns the /size of/ the string in
code units (bytes for char[], 2-byte units for wchar[]) rather than its
/length/ in number of (well-formed) characters.
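
For example, a minimal sketch (the Cyrillic literal is just an
arbitrary multi-byte string):

void main()
{
    string s = "Привет"; // 6 characters, 12 UTF-8 code units

    // .length counts code units -- bytes, for char[]/string.
    assert(s.length == 12);

    // Counting well-formed characters requires decoding;
    // foreach over dchar decodes the UTF-8 as it iterates.
    size_t chars = 0;
    foreach (dchar c; s)
        ++chars;
    assert(chars == 6);
}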

Unfortunately .sizeof is supposed to return the size of the array
reference itself (pointer plus length: 8 bytes on 32-bit x86) and not
the size of the data, IIRC.  So that's taken.
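
A quick check of that claim (assuming the usual slice layout of
pointer plus length):

void main()
{
    string s = "hello";

    // .sizeof measures the slice itself, not the character data:
    // one pointer plus one size_t (8 bytes on 32-bit, 16 on 64-bit).
    static assert(s.sizeof == (void*).sizeof + size_t.sizeof);

    assert(s.length == 5); // code units, unrelated to .sizeof
}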

So perhaps a .bytes or .nbytes property.  Maybe make it work for arrays
of structs and things like that too.  A tuple (or any container) of
heterogeneous elements could probably benefit from this property as well.
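
Here's a minimal sketch of what such a property might look like for
arrays (the name .nbytes and the free-function approach are my own
assumptions, not an existing Phobos API):

/// Hypothetical: size of an array's data in bytes.
/// A free function whose first parameter is an array can be
/// called with property syntax on that array.
size_t nbytes(T)(T[] arr)
{
    return arr.length * T.sizeof;
}

unittest
{
    char[]  c = "hello".dup;
    wchar[] w = "hello"w.dup;
    assert(c.nbytes == 5);  // 5 code units * 1 byte each
    assert(w.nbytes == 10); // 5 code units * 2 bytes each
}

The tuple case would need separate handling, since the element sizes
vary there.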

With such a property available, I wouldn't miss .length at all.
It's quite misleading.


