D's confusing strings (was Re: D on hackernews)
Jonathan M Davis
jmdavisProg at gmx.com
Wed Sep 21 12:10:37 PDT 2011
On Wednesday, September 21, 2011 11:20 Christophe Travert wrote:
> Jonathan M Davis, in message (digitalmars.D:144896), wrote:
> > On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
> >> Timon Gehr, in message (digitalmars.D:144889), wrote:
> >> > unicode natively. Yet the 'D strings are strange and confusing'
> >> > argument comes up quite often on the web.
> >>
> >> Well, I think they are. The ptr+length stuff is amazing, but the
> >> behavior of strings in Phobos is weird.
> >>
> >> mini-quiz: what should std.range.drop(some_string, 1) do ?
> >> hint: what it actually does is not what the documentation of phobos
> >> suggests*...
> >
> > What do you mean? It does exactly what it says that it does.
>
> It does say it uses the slice operator if the range is sliceable, and
> the documentation for isSliceable fails to specify that a narrow string
> is not sliceable.
1. drop says nothing about slicing.
2. popFrontN (which drop calls) says that it slices for ranges that support
slicing. Strings do not unless they're arrays of dchar.
Yes, hasSlicing should probably be clearer about narrow strings, but that has
nothing to do with drop.
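
To make the distinction concrete, here is a minimal sketch, assuming a
reasonably recent Phobos. The sample string is mine, not anything from the
docs. It only shows that narrow strings are not sliceable ranges and that drop
therefore removes a whole code point rather than one code unit.

```d
import std.range : drop, hasSlicing;

void main()
{
    // Narrow strings (arrays of char or wchar) are not sliceable ranges.
    static assert(!hasSlicing!string);
    static assert(!hasSlicing!wstring);
    // Arrays of dchar are, since each element is a whole code point.
    static assert(hasSlicing!dstring);

    string s = "\u00E4xy";     // "äxy"; 'ä' takes two UTF-8 code units
    assert(s.length == 4);     // length counts code units, not code points

    // drop pops one element of the range, i.e. one decoded code point.
    assert(s.drop(1) == "xy");

    // Slicing the array directly works on code units and cuts 'ä' in half.
    assert(s[1 .. $] != "xy");
    assert(s[2 .. $] == "xy");
}
```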
> > Yeah, well, as long as char is a unicode code unit, that's the way that
> > it goes.
>
> They are not unicode units.
>
> import std.stdio;
>
> void main() {
> char a = 'ä';
> writeln(a); // outputs: \344
> writeln('ä'); // outputs: ä
> }
>
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.
>
> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented as an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.
The problem with
char a = 'ä';
is completely separate from char[]. That has to do with the fact that the
compiler isn't properly dealing with narrowing conversions for an individual
character. A char is _by definition_ a UTF-8 code unit.
char a = 'ä';
shouldn't even be legal. It's a compiler bug. Most code which operates on
individual chars or wchars is buggy. dchar is what should be used for
individual characters. The code you give is buggy because it's trying to use
char as an individual character, and the compiler has a bug, so it doesn't catch
it.
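
As a rough illustration of the code unit vs. code point distinction (a sketch
of my own, using U+00E4 'ä' as the sample character):

```d
void main()
{
    string s = "\u00E4";       // "ä": one code point

    // In UTF-8 that code point is encoded as two code units (chars).
    assert(s.length == 2);
    assert(s[0] == 0xC3 && s[1] == 0xA4);

    // A dchar holds the whole code point.
    dchar c = '\u00E4';
    assert(c == 0xE4);

    // Iterating as dchar decodes: one iteration for the one character.
    size_t points;
    foreach (dchar d; s)
    {
        assert(d == c);
        ++points;
    }
    assert(points == 1);

    // Iterating as char visits the individual code units instead.
    size_t units;
    foreach (char u; s)
        ++units;
    assert(units == 2);
}
```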
> > In general, Phobos does a good job of using slicing and other
> > optimizations on strings when it can in spite of the fact that they're
> > not sliceable ranges, but there are cases where the fact that you _have_
> > to process them to be able to find the nth code point means that you
> > just can't process them as efficiently as your typical array. That's
> > life with a variable-length encoding, and that includes
> > std.algorithm.copy. But that's an easy one to get around, since if you
> > wanted to ignore unicode safety and just copy some chunk of the string,
> > you can always just slice it directly with no need for copy.
>
> Dealing with UTF-encoded strings is less efficient, but there are a number
> of algorithms that can be optimized for UTF-encoded strings, like copying
> or finding an ascii char in a string. Unfortunately, there is no
> practical way to do this with the current range API.
>
> About copy, it's not that easy to overcome the problem if you are using
> a template, and that template happens to be instantiated for strings.
In general, you _must_ treat strings as ranges of dchar if you want your code
to be correct with regards to code points. So, that's the default. If you know
that your particular algorithm can be optimized for strings without treating
them strictly as ranges of dchar and still deal with unicode correctly (e.g.
by slicing them, because you know the correct point to slice to), then you
special-case your template for narrow strings. Phobos does that in a number of
places. But you _have_ to special case it because being able to do so depends
entirely on your algorithm. You can't treat strings that way in the general
case, or your code is not going to handle unicode correctly.
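
Here is a minimal sketch of that kind of special-casing, assuming a reasonably
recent Phobos. countAscii is a hypothetical helper that I made up for
illustration, not a Phobos function. The narrow-string branch is only valid
because an ASCII code unit never appears inside a multi-unit sequence in UTF-8
or UTF-16, which is exactly the sort of algorithm-specific knowledge that the
general case can't assume.

```d
import std.range.primitives : empty, front, isInputRange, popFront;
import std.traits : isNarrowString;

/// Hypothetical helper: counts occurrences of the ASCII character c in r.
size_t countAscii(R)(R r, char c)
if (isInputRange!R)
{
    assert(c < 0x80, "c must be an ASCII character");
    size_t n;
    static if (isNarrowString!R)
    {
        // Special case: scan the code units directly, with no decoding.
        // Safe only because c is ASCII.
        foreach (i; 0 .. r.length)
            if (r[i] == c)
                ++n;
    }
    else
    {
        // General case: walk the range element by element.
        for (; !r.empty; r.popFront())
            if (r.front == c)
                ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("a\u00E4a", 'a') == 2);  // string: special-cased
    assert(countAscii("a\u00E4a"w, 'a') == 2); // wstring: special-cased
    assert(countAscii("a\u00E4a"d, 'a') == 2); // dstring: general path
}
```

The caller sees the same function either way. Only the template knows that it
took the faster path for narrow strings.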
No, the way that strings are handled in D is not perfect. But we're trying to
balance correctness and efficiency, and it does a good job of that. Using
strings as ranges of dchar is correct, and in a large number of cases, it is
as efficient as you're going to get it (perhaps particular functions could be
better optimized, but the API can't be made more efficient). And in the cases
where you know you can safely get better efficiency out of strings by special
casing them, that's what you do. It works. It works fairly well. And no one
has been able to come up with a solution that's definitely better.
If there's an issue, it's that the message on how to correctly handle strings
is not necessarily being communicated as well as it needs to be. In a few
places, the documentation could probably be improved, but what we probably
really need in order to help get the message across is more articles on ranges
and strings as ranges. I've actually partially written an article on that, but
due to some compiler bugs regarding std.container, I ended up temporarily
shelving it.
- Jonathan M Davis