D's confusing strings (was Re: D on hackernews)

Wed Sep 21 13:26:40 PDT 2011

On Wednesday, September 21, 2011 19:56:47 Christophe wrote:
> "Jonathan M Davis", in message (digitalmars.D:144922), wrote:
> > 1. drop says nothing about slicing.
> > 2. popFrontN (which drop calls) says that it slices for ranges that
> > support slicing. Strings do not unless they're arrays of dchar.
> > 
> > Yes, hasSlicing should probably be clearer about narrow strings, but
> > that has nothing to do with drop.
> 
> I never said there was a problem with drop.

Yes you did. You said:

"mini-quiz: what should std.range.drop(some_string, 1) do ?
hint: what it actually does is not what the documentation of phobos 
suggests*..."
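
For reference, here is a minimal sketch of what drop does with a narrow string in practice. It assumes a Phobos where std.range.drop exists and strings are treated as ranges of dchar; the sample string is made up purely for illustration.

```d
import std.exception : assertThrown;
import std.range : drop;
import std.utf : UTFException, validate;

void main()
{
    // "été": 3 code points, but 5 UTF-8 code units (each 'é' takes two).
    string s = "\u00E9t\u00E9";
    assert(s.length == 5);

    // drop pops one element of the range, i.e. one code point.
    string rest = s.drop(1);
    assert(rest == "t\u00E9");
    assert(rest.length == 3);

    // Slicing off one code unit instead cuts the first 'é' in half,
    // which leaves something that is no longer valid UTF-8.
    string raw = s[1 .. $];
    assert(raw.length == 4);
    assertThrown!UTFException(validate(raw));
}
```

So drop(some_string, 1) removes one code point, which is not the same thing as some_string[1 .. $] unless the first character happens to be ASCII.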

> After having read all of you, I have no problems with string being a
> lazy range of dchar. But I have a problem with immutable(char)[] being
> lazy range of dchar (ie not being a array), and I have a problem with
> string being immutable(char)[] (ie providing length opIndex and
> opSlice).

For efficiency, you need to be able to treat strings as arrays of code units for 
some algorithms. For correctness, you need to be able to treat them as ranges 
of code points (dchar) in the general case. You need both. The question is how 
to provide that. Strings as arrays came first (D1), whereas ranges came later. 
We _need_ to treat strings as ranges of dchar or they're essentially unusable 
in the general case. Operating on code units is almost always _wrong_. So, 
when we added the range functions, we special-cased them for strings so that 
strings are treated as ranges of dchar as they need to be. And in cases where 
you actually need to treat a string as an array of code units for efficiency, 
you special case the function for them, and you still get that. What other way 
would you do it?
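
To make the two views concrete, here is a rough sketch. The range traits are the ones from std.range; countAscii is a hypothetical helper invented for this example (not a Phobos function) which shows what special-casing narrow strings for efficiency could look like.

```d
import std.range : ElementType, hasLength, hasSlicing, isInputRange,
    isRandomAccessRange, walkLength;
import std.traits : isNarrowString;

// As a range, a narrow string has dchar elements and no random access.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);
// An array of dchar keeps the full array interface as a range.
static assert(isRandomAccessRange!dstring && hasSlicing!dstring);

/// Hypothetical helper: counts occurrences of an ASCII character.
size_t countAscii(R)(R r, char needle)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n = 0;
    static if (isNarrowString!R)
    {
        // Fast path: in UTF-8 and UTF-16 a code unit below 0x80 is always
        // a whole ASCII character, so no decoding is needed.
        foreach (i; 0 .. r.length)
            if (r[i] == needle)
                ++n;
    }
    else
    {
        // General path: walk the range element by element.
        foreach (dchar c; r)
            if (c == needle)
                ++n;
    }
    return n;
}

void main()
{
    string s = "h\u00E9llo";     // "héllo"
    assert(s.length == 6);       // code units (array view)
    assert(s.walkLength == 5);   // code points (range view)
    assert(countAscii(s, 'l') == 2);
    assert(countAscii("h\u00E9llo"d, 'l') == 2);
}
```

The caller sees one function which is correct for any range of characters, and the code unit shortcut stays an implementation detail.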

There _are_ some rough edges here - such as foreach defaulting to char for string 
when dchar is really what you should be iterating with - and there are times 
when you want to use a string with a range-based function and can't, because 
it needs a random-access range or one which is sliceable to do what it does, 
which can be annoying. But what else can you do there? You can't treat the 
string as a range of code units in that case. The result would be completely 
wrong. Imagine if sort worked on a char[]. You'd get an array of sorted code 
units, which would _not_ be code points, and which would be completely 
useless. So, treating a string as a range of code units makes no sense.
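
A small sketch of both points, assuming a Phobos where narrow strings are ranges of dchar. sortedCodePoints is a hypothetical helper written for this example, and the cast to ubyte[] is only there to force the code unit sort that the library otherwise refuses to do.

```d
import std.algorithm : sort;
import std.conv : to;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

/// Hypothetical helper: sorts by code point by decoding into a dchar[] first.
dchar[] sortedCodePoints(const(char)[] text)
{
    auto points = text.to!(dchar[]);
    sort(points);
    return points;
}

void main()
{
    string s = "b\u00E9a"; // "béa": 3 code points, 4 code units

    // foreach infers the code unit type unless told otherwise.
    size_t units, points;
    foreach (c; s)
    {
        static assert(is(typeof(c) : char));
        ++units;
    }
    foreach (dchar c; s)
        ++points;
    assert(units == 4 && points == 3);

    // sort needs random access, so it rejects a char[] outright.
    char[] narrow = s.dup;
    static assert(!__traits(compiles, sort(narrow)));

    // Forcing a sort of the code units scrambles the two bytes of 'é'.
    auto bytes = cast(ubyte[]) narrow.dup;
    sort(bytes);
    assertThrown!UTFException(validate(cast(char[]) bytes));

    // Sorting decoded code points gives a meaningful result.
    assert(sortedCodePoints(s) == "ab\u00E9"d);
}
```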

We could switch to having a struct of some kind which was a string, make it a 
range of dchar, and have it contain an array of char, wchar, or dchar 
internally. It would have to restrict its operations in exactly the same 
manner that the range functions for strings currently do, so the exact same 
algorithms would or wouldn't work with it. And then you'd need to provide 
access to the underlying array of code units so that algorithms special casing 
strings could operate on the array instead. Ultimately, it's pretty much the 
same thing, except now you have a wrapper struct. How does that buy you 
anything? The _only_ thing that it would buy you AFAIK is that foreach would 
then default to dchar instead of the code unit type. The basic problem still 
exists. You still need to special case strings for efficiency, and you still 
need to treat them as a range of dchar in the general case. It's an inherent 
issue with variable length encodings. You can't just magically make it go 
away.
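
For illustration, here is roughly what such a wrapper could look like. This is only a sketch: the name Text and the codeUnits property are made up for the example, and a real design would also have to cover wchar and dchar storage.

```d
import std.range : ElementType, hasSlicing, isForwardRange, walkLength;
import std.utf : decode, stride;

/// Hypothetical wrapper, named Text here purely for illustration.
struct Text
{
    private string data;

    this(string s) { data = s; }

    @property bool empty() const { return data.length == 0; }

    @property dchar front()
    {
        size_t index = 0;
        return decode(data, index); // decode the first code point
    }

    // Skip as many code units as the first code point occupies.
    void popFront() { data = data[stride(data, 0) .. $]; }

    @property Text save() { return this; }

    /// Escape hatch for algorithms that special-case the code units.
    @property string codeUnits() const { return data; }
}

static assert(isForwardRange!Text);
static assert(is(ElementType!Text == dchar));
static assert(!hasSlicing!Text); // same restriction as a narrow string today

void main()
{
    auto t = Text("h\u00E9llo");
    size_t points;
    foreach (c; t)
    {
        // The one visible gain: foreach now yields dchar by default.
        static assert(is(typeof(c) == dchar));
        ++points;
    }
    assert(points == 5);
    assert(t.walkLength == 5);
    assert(t.codeUnits.length == 6);
}
```

The range interface and its restrictions come out the same as for string, and any algorithm that wants the fast path still has to reach for the code units.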

If you have a better solution, please share it, but the fact that we want both 
efficiency and correctness binds us pretty thoroughly here.

- Jonathan M Davis