Beginner not getting "string"

Sun Aug 29 11:17:44 PDT 2010

On 08/29/2010 12:44 PM, Nick wrote:
> Reading Andrei's book and something seems amiss:
>
> 1. A char in D is a code *unit* not a code point. Considering that code
> units are generally used to encode in an encoding, I would have expected
> that the type for a code unit to be byte or something similar, as far
> from code points as possible. In my mind, Unicode characters, aka chars
> are code points.

(Background for others: code point == actual conceptual character, code 
unit == the smallest unit of encoding (one byte for UTF-8, two bytes for 
UTF-16, four bytes for UTF-32). In UTF-32, code units are chosen to be 
equal to code points.)

Indeed, D's char is a UTF-8 code unit, and wchar is a UTF-16 code unit. 
(dchar is at the same time a UTF-32 code unit and a Unicode code point.)
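
A minimal sketch of the distinction, assuming a current D compiler and 
Phobos. The example string and the helper name codePoints are 
illustrative and not part of the original discussion:

```d
import std.range : walkLength;

// Hypothetical helper: number of code points in a UTF-8 string.
size_t codePoints(string s)
{
    // Phobos treats a string as a range of dchar, so this decodes as it walks.
    return s.walkLength;
}

void main()
{
    string  s = "\u00E9";   // U+00E9, e with acute accent
    wstring w = "\u00E9"w;
    dstring d = "\u00E9"d;

    // Sizes of the three code unit types, in bytes.
    static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

    assert(s.length == 2);       // two UTF-8 code units
    assert(w.length == 1);       // one UTF-16 code unit
    assert(d.length == 1);       // one UTF-32 code unit
    assert(codePoints(s) == 1);  // one code point in every encoding
}
```

Note that .length always counts code units, so it differs between the 
three string types for the same text.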

Making the type of a code unit byte would considerably weaken the 
expressive power because a byte[] could be considered either untyped 
data or UTF-encoded data, with no static means to differentiate between 
the two. This would be largely obviated by making string an elaborate 
type, but there are considerable advantages to having string be a 
regular array type.

> 2. Thus a string in D is an array of code *units*, although in Unicode a
> string is really an array of code points.

In Unicode a string is generally a _sequence_ of code points. Due to the 
variable-length encoding used by UTF-8 and UTF-16, it would be 
difficult to emulate array semantics on such representations.

> 3. Iterating a string in D is wrong by default, iterating over code
> units instead of characters (code points). Even worse, the error does
> not appear until you put some non-ascii text in there.

It's been discussed before that foreach (c; str) should by default give 
c the type dchar. I agree. That being said, iterating a string with the 
formal iteration mechanism defined by std.range is always correct and 
moves one code point at a time.

So what I can advise is to use foreach (dchar c; str). Other than that, 
everything should work properly.
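
To make the difference concrete, here is a small sketch, assuming a 
current D compiler and Phobos. The helper names countUnits, countPoints 
and countViaRange are hypothetical, chosen only for this illustration:

```d
import std.range : empty, front, popFront;

// One iteration per UTF-8 code unit.
size_t countUnits(string s)
{
    size_t n;
    foreach (char c; s) ++n;
    return n;
}

// foreach decodes on the fly: one iteration per code point.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;
    return n;
}

// The range primitives also advance one code point at a time.
size_t countViaRange(string s)
{
    size_t n;
    for (; !s.empty; s.popFront()) ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve";  // U+00EF takes two UTF-8 code units
    assert(countUnits(s) == 6);
    assert(countPoints(s) == 5);
    assert(countViaRange(s) == 5);
}
```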

> 4. All string-processing calls (like sort, toupper, split and such) are
> by default wrong on non-ascii strings. Wrong without any error, warning
> or anything.

You'll be glad to hear that this assumption is false.

1. sort does not compile for char[] or wchar[]. The reason is that 
char[] and wchar[] do not obey the random-access requirements.

2. All overloads of split work correctly with non-ASCII strings. If you 
find anything that doesn't, that's a bug in the implementation, not in 
the design. I also recommend you look up splitter in std.algorithm.
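
Both points can be checked with a short sketch. It assumes a current 
Phobos in which narrow strings are still decoded automatically; the 
helper splitOnComma and the sample inputs are made up for illustration:

```d
import std.algorithm : sort, splitter;
import std.array : array;
import std.range : isRandomAccessRange;

// Hypothetical helper: split on ',' and collect the pieces eagerly.
string[] splitOnComma(string s)
{
    return s.splitter(',').array;
}

void main()
{
    char[] cs = "hello".dup;

    // char[] is not a random-access range, so sort rejects it at compile time.
    static assert(!isRandomAccessRange!(char[]));
    static assert(!__traits(compiles, sort(cs)));

    // dchar[] is random-access, so sorting code points works.
    dchar[] ds = "hello"d.dup;
    sort(ds);
    assert(ds == "ehllo"d);

    // splitter returns slices cut at code point boundaries.
    assert(splitOnComma("\u03B1,\u03B2\u03B3") == ["\u03B1", "\u03B2\u03B3"]);
}
```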

> So I guess my question is why, in a language with the power and
> expressiveness of D, in our day and age, would one choose such an
> exposed, fragile implementation of string that ensures that the default
> code one writes for text manipulation is most likely wrong?
>
> I18N is one of the first things I judge a new language by and so far D
> is... puzzling.
>
> I don't know much about D so I am probably just not getting it but can
> you please point me to some rationale behind these string design decisions?

Support of UTF in D could be better but it definitely compares favorably 
to that in many other languages (including all languages that I know). 
The choice of array clarifies the representation and offers random 
access to individual code units, which is sometimes necessary for 
efficient manipulation. However, the formal range interface offers 
bidirectional access to code points.
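
The two views of the same string might look as follows. This is a 
sketch assuming a current Phobos, with an example string chosen here 
for illustration:

```d
import std.range : back, front, isBidirectionalRange, isRandomAccessRange;

void main()
{
    string s = "\u00E9t\u00E9";  // 3 code points, 5 UTF-8 code units

    // Range view: bidirectional over code points, not random-access.
    static assert(isBidirectionalRange!string);
    static assert(!isRandomAccessRange!string);
    assert(s.front == '\u00E9');
    assert(s.back == '\u00E9');

    // Array view: constant-time indexing by code unit.
    assert(s.length == 5);
    assert(s[2] == 't');
}
```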

As I mentioned elsewhere, I could not find an edit distance 
implementation for any language other than D that works directly on 
UTF-encoded inputs. And it's not special-cased - the same implementation 
works e.g. for lists of integers.
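
As an illustration, assuming the function meant here is 
levenshteinDistance from std.algorithm (the inputs below are made up):

```d
import std.algorithm : levenshteinDistance;

void main()
{
    // UTF-8 input: the distance is counted in code points, not bytes.
    assert(levenshteinDistance("caf\u00E9", "cafe") == 1);

    // The same template works on arrays of integers.
    assert(levenshteinDistance([1, 2, 3], [1, 3]) == 1);
}
```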

Andrei