Beginner not getting "string"

dsimcha dsimcha at yahoo.com
Sun Aug 29 11:02:25 PDT 2010


== Quote from Nick (nick at example.com)'s article
> Reading Andrei's book and something seems amiss:
> 1. A char in D is a code *unit* not a code point. Considering that code
> units are generally used to encode in an encoding, I would have expected
> that the type for a code unit to be byte or something similar, as far
> from code points as possible. In my mind, Unicode characters, aka chars
> are code points.

Basically, the reason is that you can't have a regular array of code points
without maintaining some additional data structures.  Those can easily be built
on top of an array of code units, whereas you can't build an array of code units
on top of an array of code points, at least not efficiently.

> 2. Thus a string in D is an array of code *units*, although in Unicode a
> string is really an array of code points.

This is admittedly a wart.  However, when you use std.range, ElementType!(string)
== dchar, so iterating with range primitives does what you'd think it should.
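
As a rough sketch of the difference between the two views (the helper names
codeUnits and codePoints are made up for illustration, and this assumes a Phobos
where std.range exposes ElementType and walkLength and treats string as a range
of dchar):

```d
import std.range : ElementType, walkLength;

// Number of UTF-8 code units: the plain array length.
size_t codeUnits(string s)
{
    return s.length;
}

// Number of code points: walk the string as a range of dchar.
size_t codePoints(string s)
{
    return s.walkLength;
}

void main()
{
    // Seen through std.range, the element type of string is dchar, not char.
    static assert(is(ElementType!string == dchar));

    string s = "h\u00e9llo"; // U+00E9 takes two UTF-8 code units

    assert(codeUnits(s) == 6);
    assert(codePoints(s) == 5);

    // foreach with an explicit dchar loop variable decodes as it goes.
    size_t n;
    foreach (dchar c; s)
        ++n;
    assert(n == 5);
}
```

A foreach without the explicit dchar type would visit the six code units
instead, which is the behaviour the original question complains about.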

> 3. Iterating a string in D is wrong by default, iterating over code
> units instead of characters (code points). Even worse, the error does
> not appear until you put some non-ascii text in there.

See answer to (2).

> 4. All string-processing calls (like sort, toupper, split and such) are
> by default wrong on non-ascii strings. Wrong without any error, warning
> or anything.

If these don't work right for non-ASCII strings then it's a bug, not a design
issue.  Please file bug reports (1 per function).
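
A quick check of this kind is a reasonable starting point for such a report.
The sketch below assumes a Phobos version in which std.uni.toUpper accepts a
string; whether it held for the library as of this post is not something the
example claims.

```d
import std.uni : toUpper;

void main()
{
    // U+00E9 (lowercase e with acute) should map to U+00C9 (uppercase).
    assert("h\u00e9llo".toUpper == "H\u00c9LLO");
}
```

If an assertion like this fails for a given function, that single failing case
is what belongs in the bug report.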

> So I guess my question is why, in a language with the power and
> expressiveness of D, in our day and age, would one choose such an
> exposed, fragile implementation of string that ensures that the default
> code one writes for text manipulation is most likely wrong?
> I18N is one of the first things I judge a new language by and so far D
> is... puzzling.

Part of it was a silly design error that became too hard to change.  Most of it,
though, is to avoid abstraction inversion
(http://en.wikipedia.org/wiki/Abstraction_inversion) by providing access to the
lower-level aspects of Unicode strings.

