D's confusing strings (was Re: D on hackernews)

Wed Sep 21 12:10:37 PDT 2011

On Wednesday, September 21, 2011 11:20 Christophe Travert wrote:
> Jonathan M Davis , dans le message (digitalmars.D:144896), a écrit :
> > On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
> >> Timon Gehr , dans le message (digitalmars.D:144889), a écrit :
> >> > unicode natively. Yet the 'D strings are strange and confusing'
> >> > argument comes up quite often on the web.
> >> 
> >> Well, I think they are. The ptr+length stuff is amazing, but the
> >> behavior of strings in phobos is weird.
> >> 
> >> mini-quiz: what should std.range.drop(some_string, 1) do ?
> >> hint: what it actually does is not what the documentation of phobos
> >> suggests*...
> > 
> > What do you mean? It does exactly what it says that it does.
> 
> It does say it uses the slice operator if the range is sliceable, and
> the documentation of isSliceable fails to specify that a narrow string
> is not sliceable.

1. drop says nothing about slicing.
2. popFrontN (which drop calls) says that it slices for ranges that support
slicing. Strings do not count as such ranges unless they're arrays of dchar
(hasSlicing is false for arrays of char and wchar).

Yes, hasSlicing should probably be clearer about narrow strings, but that has 
nothing to do with drop.
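
To make the difference concrete, here is a minimal sketch. It assumes a
Phobos release in which narrow strings are auto-decoded as ranges of dchar,
and the sample string is just something I made up for illustration:

```d
import std.range;

void main()
{
    // As ranges, narrow strings are not sliceable; arrays of dchar are.
    static assert(!hasSlicing!string);
    static assert(!hasSlicing!wstring);
    static assert( hasSlicing!dstring);

    string s = "\u00E4bc"; // "äbc": U+00E4 takes two UTF-8 code units
    assert(s.length == 4); // the array length counts code units

    // drop pops one whole code point, so both code units of U+00E4 go.
    assert(s.drop(1) == "bc");

    // Slicing the array directly works in code units and would cut the
    // first character in half here.
    assert(s[1 .. $].length == 3);
}
```

So drop(some_string, 1) removes one code point, while some_string[1 .. $]
removes one code unit.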

> > Yeah, well, as long as char is a unicode code unit, that's the way that
> > it goes.
> 
> They are not unicode units.
> 
> import std.stdio;
>
> void main() {
> char a = 'ä';
> writeln(a); // outputs: \344
> writeln('ä'); // outputs: ä
> }
> 
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.
> 
> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented as an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.

The problem with

char a = 'ä';

is completely separate from char[]. That has to do with the fact that the 
compiler isn't properly dealing with narrowing conversions for an individual 
character. A char is _by definition_ a UTF-8 code unit. 

char a = 'ä';

shouldn't even be legal. It's a compiler bug. Most code which operates on
individual chars or wchars is buggy. dchar is what should be used for
individual characters. The code you give is buggy because it's trying to use
char as an individual character, and the compiler has a bug, so it doesn't
catch it.
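
As a rough sketch of what "char is a UTF-8 code unit" means in practice
(the variable names are only for illustration): the character from your
example needs two code units in a string, while a single dchar holds the
whole code point.

```d
import std.utf : encode;

void main()
{
    string s = "\u00E4"; // "ä", code point U+00E4

    // One code point, but two UTF-8 code units: 0xC3 0xA4.
    assert(s.length == 2);
    assert(cast(ubyte) s[0] == 0xC3);
    assert(cast(ubyte) s[1] == 0xA4);

    // A dchar holds the whole code point.
    dchar c = '\u00E4';

    // Encoding it as UTF-8 produces the same two code units.
    char[4] buf;
    immutable n = encode(buf, c);
    assert(n == 2);
    assert(buf[0 .. n] == s);
}
```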

> > In general, Phobos does a good job of using slicing and other
> > optimizations on strings when it can in spite of the fact that they're
> > not sliceable ranges, but there are cases where the fact that you _have_
> > to process them to be able to find the nth code point means that you
> > just can't process them as efficiently as your typical array. That's
> > life with a variable-length encoding, and that includes
> > std.algorithm.copy. But that's an easy one to get around, since if you
> > wanted to ignore unicode safety and just copy some chunk of the string,
> > you can always just slice it directly with no need for copy.
> 
> Dealing with UTF-encoded strings is less efficient, but there are a number
> of algorithms that can be optimized for UTF-encoded strings, like copying
> or finding an ASCII char in a string. Unfortunately, there is no
> practical way to do this with the current range API.
>
> About copy, it's not that easy to overcome the problem if you are using
> a template, and that template happens to be instantiated for strings.

In general, you _must_ treat strings as ranges of dchar if you want your code
to be correct with regard to code points. So, that's the default. If you know
that your particular algorithm can be optimized for strings without treating
them strictly as ranges of dchar and still deal with unicode correctly (e.g.
by slicing them, because you know the correct point to slice to), then you
special-case your template for narrow strings. Phobos does that in a number of
places. But you _have_ to special-case it because being able to do so depends
entirely on your algorithm. You can't treat strings that way in the general
case, or your code is not going to handle unicode correctly.
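
A rough sketch of that kind of special-casing follows. countAscii is a
hypothetical helper written for this post, not something in Phobos. The
narrow-string branch is only safe because the needle is ASCII, and an ASCII
code unit never occurs inside a multi-unit UTF-8 or UTF-16 sequence:

```d
import std.range.primitives : isInputRange;
import std.traits : isNarrowString;

// Hypothetical helper: counts occurrences of an ASCII character.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n = 0;
    static if (isNarrowString!R)
    {
        // Special case: index the array by code unit, with no decoding.
        foreach (i; 0 .. r.length)
            if (r[i] == needle)
                ++n;
    }
    else
    {
        // General case: walk the range element by element.
        foreach (e; r)
            if (e == needle)
                ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("\u00E4 b \u00E4 b", 'b') == 2);  // narrow string
    assert(countAscii("\u00E4 b \u00E4 b"d, 'b') == 2); // array of dchar
}
```

Whether such a shortcut is valid depends on the algorithm, which is why it
has to be written per function instead of being the default.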

No, the way that strings are handled in D is not perfect. But we're trying to
balance correctness and efficiency, and it does a good job of that. Using
strings as ranges of dchar is correct, and in a large number of cases, it is
as efficient as you're going to get (perhaps particular functions could be
better optimized, but the API can't be made more efficient). And in the cases
where you know you can safely get better efficiency out of strings by
special-casing them, that's what you do. It works. It works fairly well. And
no one has been able to come up with a solution that's definitely better.
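
For reference, this is roughly what "strings as ranges of dchar" looks like
from the range API's point of view. It's a small sketch, assuming a Phobos
release with auto-decoding of narrow strings, and the sample word is
arbitrary:

```d
import std.algorithm.comparison : equal;
import std.range;

void main()
{
    // Through the range API, a string yields dchars and has no length
    // or random access, but it can be traversed from both ends.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);
    static assert(isBidirectionalRange!string);

    string s = "h\u00E9llo";   // "héllo"
    assert(s.length == 6);     // array property: code units
    assert(s.walkLength == 5); // range view: code points

    // Range algorithms therefore keep code points intact.
    assert(s.retro.equal("oll\u00E9h"));
}
```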

If there's an issue, it's that the message on how to correctly handle strings
is not necessarily being communicated as well as it needs to be. In a few
places, the documentation could probably be improved, but what we probably
really need in order to help get the message across is more articles on ranges
and strings as ranges. I've actually partially written an article on that, but
due to some compiler bugs regarding std.container, I ended up temporarily
shelving it.

- Jonathan M Davis