D's confusing strings (was Re: D on hackernews)

Simen Kjaeraas simen.kjaras at gmail.com
Wed Sep 21 12:04:21 PDT 2011


On Wed, 21 Sep 2011 20:20:55 +0200, Christophe Travert  
<travert at phare.normalesup.org> wrote:


>> Yeah, well, as long as char is a unicode code unit, that's the way that
>> it goes.
>
> They are not unicode units.
>
> import std.stdio;
>
> void main() {
>   char a = 'ä';
>   writeln(a); // outputs: \344
>   writeln('ä'); // outputs: ä
> }
>
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.

Oh, it absolutely is. According to the Unicode Consortium, a code unit is
"The minimal bit combination that can represent a unit of encoded text
for processing or interchange. The Unicode Standard uses 8-bit code units
in the UTF-8 encoding form [...]".

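What you are thinking of is a code point. 'ä' is a single code point
(U+00E4), but in UTF-8 it is encoded as two code units (0xC3 0xA4), so it
cannot fit in one char.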
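
To make the distinction concrete, here is a small sketch. It assumes a
UTF-8 encoded source file and a reasonably recent Phobos; the variable
names are only for illustration.

```d
import std.stdio;
import std.range : walkLength;

void main()
{
    string s = "ä";                 // one code point, U+00E4

    writeln(s.length);              // 2: UTF-8 code units (chars)
    writeln(s.walkLength);          // 1: code points (dchars)

    // Iterating as char visits each code unit.
    foreach (char c; s)
        writefln("%02X", cast(uint) c);     // C3, then A4

    // Iterating as dchar decodes code points on the fly.
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);   // U+00E4
}
```

So char[] really is an array of code units; it is the code point that
needs more than one char.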


> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented as an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.

I can agree with this, but the benefits over what we already have are nigh
zilch.
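
For reference, a minimal sketch of what such a wrapper could look like.
CodePoints is a hypothetical name, not something in Phobos, and this
version does none of the optimizations mentioned above; it only shows a
dchar range over a char[] that exposes no length or indexing.

```d
import std.utf : decode, stride;

// Hypothetical wrapper (not part of Phobos): a lazy input range of
// dchar backed by UTF-8 code units, with no length or index operator.
struct CodePoints
{
    private const(char)[] data;

    @property bool empty()
    {
        return data.length == 0;
    }

    // Decode the code point at the start of the remaining data.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    // Skip as many code units as the leading code point occupies.
    void popFront()
    {
        data = data[stride(data, 0) .. $];
    }
}

void main()
{
    import std.stdio : writeln;

    size_t n = 0;
    foreach (dchar c; CodePoints("häh"))
        ++n;
    writeln(n);   // 3 code points, although "häh".length is 4
}
```

Compare this with a plain foreach (dchar c; s) over a string, which
already decodes code points without any wrapper.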


-- 
   Simen

