Beginner not getting "string"
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sun Aug 29 11:17:44 PDT 2010
On 08/29/2010 12:44 PM, Nick wrote:
> Reading Andrei's book and something seems amiss:
>
> 1. A char in D is a code *unit* not a code point. Considering that code
> units are generally used to encode in an encoding, I would have expected
> that the type for a code unit to be byte or something similar, as far
> from code points as possible. In my mind, Unicode characters, aka chars
> are code points.
(Background for others: code point == actual conceptual character, code
unit == the smallest unit of encoding (one byte for UTF-8, two bytes for
UTF-16, four bytes for UTF-32). In UTF-32 code units are chosen to be
equal to code points.)
Indeed, D's char is a UTF-8 code unit, and wchar is a UTF-16 code unit.
(dchar is at the same time a UTF-32 code unit and a Unicode code point.)
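
A minimal sketch of the distinction, assuming a current D compiler and
Phobos (the sample text and variable names are made up for
illustration). The same two code points occupy a different number of
code units in each encoding:

```d
import std.range : walkLength;

void main()
{
    // Two code points: U+00E9 (e with acute accent) and U+1D11E
    // (musical G clef), written as escapes to avoid any dependence
    // on the source file encoding.
    string  s8  = "\u00E9\U0001D11E";   // array of char  (UTF-8 code units)
    wstring s16 = "\u00E9\U0001D11E"w;  // array of wchar (UTF-16 code units)
    dstring s32 = "\u00E9\U0001D11E"d;  // array of dchar (UTF-32 code units)

    // .length counts code units, not code points.
    assert(s8.length  == 6);  // 2 + 4 UTF-8 code units
    assert(s16.length == 3);  // 1 + 2 UTF-16 code units (surrogate pair)
    assert(s32.length == 2);  // one code unit per code point

    // Walking the string as a range counts code points.
    assert(s8.walkLength  == 2);
    assert(s16.walkLength == 2);
}
```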
Making the type of a code unit byte would considerably weaken the
expressive power because a byte[] could be considered either untyped
data or UTF-encoded data, with no static means to differentiate between
the two. This would be largely obviated by making string an elaborate
type, but there are considerable advantages to having string be a
regular array type.
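
To illustrate the static distinction, here is a small sketch. The
function name describe is hypothetical and exists only for this
example; it is not part of any library. Because char and ubyte are
different types, overload resolution can tell text from raw data at
compile time:

```d
// Hypothetical overloads: one for untyped data, one for UTF-8 text.
string describe(const(ubyte)[] data) { return "raw bytes"; }
string describe(const(char)[] text)  { return "UTF-8 text"; }

void main()
{
    ubyte[] raw  = [0xC3, 0xA9];  // the two bytes that encode U+00E9 in UTF-8
    string  text = "\u00E9";

    // The compiler picks the overload from the static element type.
    assert(describe(raw)  == "raw bytes");
    assert(describe(text) == "UTF-8 text");

    // Reinterpreting text as bytes requires an explicit cast.
    assert(cast(const(ubyte)[]) text == raw);
}
```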
> 2. Thus a string in D is an array of code *units*, although in Unicode a
> string is really an array of code points.
In Unicode a string is generally a _sequence_ of code points. Because
UTF-8 and UTF-16 are variable-length encodings, it would be difficult to
emulate array semantics over code points on such representations.
> 3. Iterating a string in D is wrong by default, iterating over code
> units instead of characters (code points). Even worse, the error does
> not appear until you put some non-ascii text in there.
It's been discussed before that foreach (c; str) should set by default
the type of c to dchar. I agree. That being said, iterating a string
with the formal iteration mechanism defined by std.range is always
correct and moves one code point at a time.
So what I can advise is to use foreach (dchar c; str). Other than that,
everything should work properly.
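
A short sketch of the three ways of iterating mentioned above, assuming
a current Phobos in which the range primitives for narrow strings decode
to dchar (the sample string is made up for illustration):

```d
import std.range.primitives : empty, front, popFront;

void main()
{
    string s = "h\u00E9";  // 'h' followed by U+00E9: 3 UTF-8 code units

    size_t units, points, viaRange;

    // With no explicit type, c is a char, i.e. one code unit.
    foreach (c; s) ++units;

    // Asking for dchar makes foreach decode one code point per step.
    foreach (dchar c; s) ++points;

    // The range primitives also move one code point at a time.
    for (auto r = s; !r.empty; r.popFront())
    {
        dchar c = r.front;  // front yields a decoded code point
        ++viaRange;
    }

    assert(units == 3);
    assert(points == 2);
    assert(viaRange == 2);
}
```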
> 4. All string-processing calls (like sort, toupper, split and such) are
> by default wrong on non-ascii strings. Wrong without any error, warning
> or anything.
You'll be glad to hear that this assumption is false.
1. sort does not compile for char[] or wchar[]. The reason is that
char[] and wchar[] do not obey the random-access requirements.
2. All overloads of split work correctly with non-ASCII strings. If you
find anything that doesn't, that's a bug in the implementation, not in
the design. I also recommend you look up splitter in std.algorithm.
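
Both points can be checked with a small sketch, assuming a current
Phobos (the sample strings are made up for illustration). The first
assertion verifies at compile time that sort rejects char[]; the rest
show sort on dchar[] and splitter on a non-ASCII string:

```d
import std.algorithm : equal, sort, splitter;

void main()
{
    // sort requires a random-access range, which char[] is not.
    char[] narrow = "cab".dup;
    static assert(!__traits(compiles, sort(narrow)));

    // dchar[] is random-access over code points, so sort accepts it.
    dchar[] wide = "cab"d.dup;
    sort(wide);
    assert(wide == "abc"d);

    // splitter yields slices that never cut through an encoded code point.
    auto parts = "na\u00EFve caf\u00E9".splitter(' ');
    assert(parts.equal(["na\u00EFve", "caf\u00E9"]));
}
```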
> So I guess my question is why, in a language with the power and
> expressiveness of D, in our day and age, would one choose such an
> exposed, fragile implementation of string that ensures that the default
> code one writes for text manipulation is most likely wrong?
>
> I18N is one of the first things I judge a new language by and so far D
> is... puzzling.
>
> I don't know much about D so I am probably just not getting it but can
> you please point me to some rationale behind these string design decisions?
Support of UTF in D could be better but it definitely compares favorably
to that in many other languages (including all languages that I know).
The choice of array clarifies the representation and offers random access
to individual code units, which is sometimes necessary for efficient
manipulation. However, the formal range interface offers bidirectional
access to code points.
As I mentioned elsewhere, I could not find an edit distance
implementation in any language other than D that works directly on
UTF-encoded inputs. And it's not special-cased - the same implementation
works e.g. for lists of integers.
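
As a sketch, assuming the function meant here is levenshteinDistance
from std.algorithm in a current Phobos (the inputs are made up for
illustration):

```d
import std.algorithm : levenshteinDistance;

void main()
{
    // "cafe" vs "caf" + U+00E9: one substitution at the code point level.
    // A comparison of raw UTF-8 code units would report 2 instead,
    // because U+00E9 occupies two code units.
    assert(levenshteinDistance("cafe", "caf\u00E9") == 1);

    // The same generic function works on arrays of integers.
    assert(levenshteinDistance([1, 2, 3], [1, 3]) == 1);
}
```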
Andrei