D's confusing strings (was Re: D on hackernews)
Jonathan M Davis
jmdavisProg at gmx.com
Wed Sep 21 13:26:40 PDT 2011
On Wednesday, September 21, 2011 19:56:47 Christophe wrote:
> "Jonathan M Davis", in message (digitalmars.D:144922), wrote:
> > 1. drop says nothing about slicing.
> > 2. popFrontN (which drop calls) says that it slices for ranges that
> > support slicing. Strings do not unless they're arrays of dchar.
> >
> > Yes, hasSlicing should probably be clearer about narrow strings, but
> > that has nothing to do with drop.
>
> I never said there was a problem with drop.
Yes, you did. You said:
"mini-quiz: what should std.range.drop(some_string, 1) do ?
hint: what it actually does is not what the documentation of phobos
suggests*..."
> After having read all of you, I have no problems with string being a
> lazy range of dchar. But I have a problem with immutable(char)[] being a
> lazy range of dchar (i.e. not being an array), and I have a problem with
> string being immutable(char)[] (i.e. providing length, opIndex, and
> opSlice).
For efficiency, you need to be able to treat strings as arrays of code units for
some algorithms. For correctness, you need to be able to treat them as ranges
of code points (dchar) in the general case. You need both. The question is how
to provide that. Strings as arrays came first (D1), whereas ranges came later.
We _need_ to treat strings as ranges of dchar or they're essentially unusable
in the general case. Operating on code units is almost always _wrong_. So,
when we added the range functions, we special-cased them for strings so that
strings are treated as ranges of dchar as they need to be. And in cases where
you actually need to treat a string as an array of code units for efficiency,
you special case the function for them, and you still get that. What other way
would you do it?
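To make the two views concrete, here is a minimal sketch, assuming a Phobos
where narrow strings are special-cased as ranges of dchar in the way described
above. The sample word and the use of std.utf.validate to show the damage done
by a raw slice are illustrative choices, not something taken from this thread.

```d
import std.exception : assertThrown;
import std.range;
import std.utf : UTFException, validate;

void main()
{
    string s = "élan"; // 'é' takes two UTF-8 code units

    // Range view: the element type is dchar, and the range traits
    // refuse length, slicing, and random access for narrow strings.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!hasSlicing!string);
    static assert(!isRandomAccessRange!string);

    // Array view: indexing still yields a code unit.
    static assert(is(typeof(s[0]) == immutable(char)));

    // Arrays of dchar have one code unit per code point, so they keep
    // slicing and random access as ranges.
    static assert(hasSlicing!dstring);
    static assert(isRandomAccessRange!dstring);

    assert(s.length == 5);     // code units
    assert(s.walkLength == 4); // code points

    // drop pops one code point; a raw slice cuts 'é' in half.
    assert(s.drop(1) == "lan");
    assertThrown!UTFException(validate(s[1 .. $]));
}
```

The last two lines are the drop case from the quiz: drop goes through
popFrontN, which decodes, while s[1 .. $] operates on code units and leaves a
stray continuation byte at the front.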
There _are_ some rough edges here - such as foreach defaulting to char for string
when dchar is really what you should be iterating with - and there are times
when you want to use a string with a range-based function and can't, because
it needs a random-access range or one which is sliceable to do what it does,
which can be annoying. But what else can you do there? You can't treat the
string as a range of code units in that case. The result would be completely
wrong. Imagine if sort worked on a char[]. You'd get an array of sorted code
units, which would _not_ be code points, and which would be completely
useless. So, treating a string as a range of code units makes no sense.
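A short sketch of both points, under the same assumption about how Phobos
treats narrow strings. The cast to ubyte[] is only there to force a sort over
the code units for the sake of the demonstration; it is not something you
would want in real code.

```d
import std.algorithm : sort;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    string s = "é"; // one code point, two UTF-8 code units

    size_t units, points;
    foreach (c; s)        // element type defaults to char: code units
        ++units;
    foreach (dchar c; s)  // asking for dchar makes foreach decode
        ++points;
    assert(units == 2);
    assert(points == 1);

    // sort needs a random-access range, which char[] is not.
    char[] buf = "zé".dup;
    static assert(!__traits(compiles, sort(buf)));

    // Forcing the issue by sorting the raw code units in place
    // leaves something that is no longer valid UTF-8.
    sort(cast(ubyte[]) buf);
    assertThrown!UTFException(validate(buf));
}
```

Here "zé" is the bytes 7A C3 A9; sorted, they become 7A A9 C3, which puts a
continuation byte before its lead byte.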
We could switch to having a struct of some kind which was a string, make it a
range of dchar, and have it contain an array of char, wchar, or dchar
internally. It would have to restrict its operations in exactly the same
manner that the range functions for strings currently do, so the exact same
algorithms would or wouldn't work with it. And then you'd need to provide
access to the underlying array of code units so that algorithms special casing
strings could operate on the array instead. Ultimately, it's pretty much the
same thing, except now you have a wrapper struct. How does that buy you
anything? The _only_ thing that it would buy you AFAIK is that foreach would
then default to dchar instead of the code unit type. The basic problem still
exists. You still need to special case strings for efficiency, and you still
need to treat them as a range of dchar in the general case. It's an inherent
issue with variable length encodings. You can't just magically make it go
away.
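For illustration, such a wrapper might look roughly like the following. The
names CodePointString and codeUnits are hypothetical; nothing like this exists
in Phobos as far as this post is concerned, and the sketch simply forwards to
the existing string range primitives.

```d
import std.range;

// Hypothetical wrapper: a range of dchar over UTF-8 code units.
struct CodePointString
{
    private string units; // underlying array of code units

    @property bool empty() { return units.empty; }
    @property dchar front() { return units.front; } // decodes one code point
    void popFront() { units.popFront(); }           // skips one code point
    @property CodePointString save() { return this; }

    // Escape hatch for algorithms that special-case strings.
    @property string codeUnits() { return units; }
}

// The wrapper ends up with the same restrictions that the range
// functions already impose on string.
static assert(isForwardRange!CodePointString);
static assert(is(ElementType!CodePointString == dchar));
static assert(!hasLength!CodePointString);
static assert(!hasSlicing!CodePointString);
static assert(!isRandomAccessRange!CodePointString);

void main()
{
    auto s = CodePointString("añb");
    assert(s.walkLength == 3);       // code points
    assert(s.codeUnits.length == 4); // code units: 'ñ' takes two
    assert(s.front == 'a');
}
```

The static asserts come out exactly as they do for string itself, and any
function that wants speed still has to go through codeUnits.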
If you have a better solution, please share it, but the fact that we want both
efficiency and correctness binds us pretty thoroughly here.
- Jonathan M Davis