D's confusing strings (was Re: D on hackernews)

Wed Sep 21 11:20:55 PDT 2011

Jonathan M Davis, in message (digitalmars.D:144896), wrote:
> On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
>> Timon Gehr, in message (digitalmars.D:144889), wrote:
>> > unicode natively. Yet the 'D strings are strange and confusing' argument
>> > comes up quite often on the web.
>> 
>> Well, I think they are. The ptr+length stuff is amazing, but the
>> behavior of strings in phobos is weird.
>> 
>> mini-quiz: what should std.range.drop(some_string, 1) do?
>> hint: what it actually does is not what the documentation of phobos
>> suggests*...
> 
> What do you mean? It does exactly what it says that it does.

It does say it uses the slice operator if the range is sliceable, and
the documentation of isSliceable fails to specify that a narrow string
is not sliceable.
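
To make the quiz concrete, here is a minimal sketch, assuming a current Phobos in which the trait is spelled hasSlicing (the isSliceable name is kept above as written in the thread). The sample string is an assumption chosen for illustration. The point is only that drop removes one code point, while a raw slice removes one code unit.

```d
import std.range;

void main()
{
    // "\u00E4" is 'ä', which UTF-8 encodes as two code units (two chars).
    string s = "\u00E4bc";

    // As ranges, narrow strings report neither slicing nor length.
    static assert(!hasSlicing!string);
    static assert(!hasLength!string);

    assert(s.length == 4);      // the array length counts code units
    assert(s.drop(1) == "bc");  // drop pops one whole code point
    assert(s[1 .. $] != "bc");  // a raw slice cuts through the middle of 'ä'
}
```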

> Yeah, well, as long as char is a unicode code unit, that's the way that it 
> goes.

They are not unicode units.

import std.stdio;

void main() {
  char a = 'ä';
  writeln(a); // outputs: \344
  writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char.
Thus 'char[]' is not what the name claims it is.

Unicode operations should be supported by a different class that
is really a lazy range of dchar implemented over an underlying char[], with
no length, index, or stride operator, and appropriate optimizations.
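
A rough sketch of what such a wrapper could look like follows. The name CodePoints and its layout are assumptions made for illustration, not an existing Phobos type, and the "appropriate optimizations" are left out. It relies only on std.utf.decode and std.utf.stride.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType, hasLength, isForwardRange;
import std.utf : decode, stride;

// Hypothetical type: a lazy range of dchar over UTF-8 code units.
struct CodePoints
{
    private const(char)[] data;   // underlying code units

    @property bool empty() { return data.length == 0; }

    // Decode the code point at the start of the buffer without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    // Skip as many code units as the leading code point occupies.
    void popFront() { data = data[stride(data, 0) .. $]; }

    @property CodePoints save() { return this; }
}

static assert(isForwardRange!CodePoints);
static assert(is(ElementType!CodePoints == dchar));
static assert(!hasLength!CodePoints);   // deliberately no length or indexing

void main()
{
    auto r = CodePoints("\u00E4bc");
    assert(r.front == '\u00E4');
    r.popFront();
    assert(equal(r, "bc"));
}
```

Because the type exposes no length and no opIndex, generic code cannot mistake code-unit counts for character counts.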

> In general, Phobos does a good job of using slicing and other 
> optimizations on strings when it can in spite of the fact that they're not 
> sliceable ranges, but there are cases where the fact that you _have_ to 
> process them to be able to find the nth code point means that you just can't 
> process them as efficiently as your typical array. That's life with a
> variable-length encoding, and that includes std.algorithm.copy. But that's an easy one
> to get around, since if you wanted to ignore unicode safety and just copy some 
> chunk of the string, you can always just slice it directly with no need for 
> copy.

Dealing with UTF-encoded strings is less efficient, but there are a number
of algorithms that can be optimized for UTF-encoded strings, like copying
or finding an ASCII char in a string. Unfortunately, there is no
practical way to do this with the current range API.
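
As an illustration of the ASCII case, here is a minimal sketch with a hypothetical helper, indexOfAscii. In UTF-8 every code unit of a multi-byte sequence is 0x80 or above, so a code unit equal to an ASCII character is always a real match and no decoding is needed. Note that it works on the array directly rather than through the range API.

```d
// Hypothetical helper: returns the code-unit index of the first occurrence
// of the ASCII character c in s, or -1 if it is absent.
ptrdiff_t indexOfAscii(const(char)[] s, char c)
{
    assert(c < 0x80, "c must be ASCII");
    foreach (i, char u; s)      // iterate over code units, no decoding
        if (u == c)
            return i;
    return -1;
}

void main()
{
    string s = "\u00E4bc";
    auto i = indexOfAscii(s, 'b');
    assert(i == 2);             // index in code units: 'ä' occupies 0 and 1
    assert(s[i .. $] == "bc");  // slicing at an ASCII match is always valid
    assert(indexOfAscii(s, 'z') == -1);
}
```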

About copy, it's not that easy to overcome the problem if you are using
a template, and that template happens to be instantiated for strings.
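
To show the kind of special-casing a template ends up needing, here is a sketch with a hypothetical wrapper, copyInto. It assumes both sides are narrow strings of the same code-unit width and copies them through std.string.representation. Everything else is forwarded to std.algorithm.copy unchanged.

```d
import std.algorithm.mutation : copy;
import std.range.primitives : ElementEncodingType;
import std.string : representation;
import std.traits : isNarrowString;

// Hypothetical wrapper around std.algorithm.copy; returns the unfilled
// remainder of the destination.
auto copyInto(Src, Dst)(Src src, Dst dst)
{
    static if (isNarrowString!Src && isNarrowString!Dst &&
               ElementEncodingType!Src.sizeof == ElementEncodingType!Dst.sizeof)
    {
        // Same code-unit width: copy the raw code units, no decoding.
        return copy(src.representation, dst.representation);
    }
    else
    {
        // Any other pair of ranges goes through the generic path.
        return copy(src, dst);
    }
}

void main()
{
    char[4] buf;
    auto rest = copyInto("\u00E4bc", buf[]);
    assert(rest.length == 0);
    assert(buf[] == "\u00E4bc");

    int[3] nums;
    copyInto([1, 2, 3], nums[]);
    assert(nums == [1, 2, 3]);
}
```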

>> * Please, could someone just add to the documentation of isSliceable that
>> narrow strings are an exception, as was recently done for
>> hasLength.
> 
> A good point.

The main point of my post actually.

-- 
Christophe