typeof(string.front) should be char

Piotr Szturmaj bncrbme at jadamspam.pl
Sat Mar 3 05:57:59 PST 2012


Jonathan M Davis wrote:
> On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
>> On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
>>   >  Hello,
>>   >
>>   >  For this code:
>>   >
>>   >  auto c = "test"c;
>>   >  auto w = "test"w;
>>   >  auto d = "test"d;
>>   >  pragma(msg, typeof(c.front));
>>   >  pragma(msg, typeof(w.front));
>>   >  pragma(msg, typeof(d.front));
>>   >
>>   >  compiler prints:
>>   >
>>   >  dchar
>>   >  dchar
>>   >  immutable(dchar)
>>   >
>>   >  IMO it should print this:
>>   >
>>   >  immutable(char)
>>   >  immutable(wchar)
>>   >  immutable(dchar)
>>   >
>>   >  Is it a bug?
>>
>> No, that's by design. When used as input ranges, slices of any
>> character type are exposed as ranges of dchar.
>
> Indeed.
>
> Strings are always treated as ranges of dchar, because it generally makes no
> sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A
> wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one
> of those which is guaranteed to be a code point is dchar, since in UTF-32, all
> code points are a single code unit. If you were to operate on individual chars
> or wchars, you'd be operating on pieces of characters rather than whole
> characters, which wreaks havoc with Unicode.
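
For instance, "é" is a single code point but two UTF-8 code units, so
iterating it char by char hands out half a character. A minimal check
(assuming std.array is imported for front):

import std.array;

void main()
{
    string s = "é";              // U+00E9: one code point...
    assert(s.length == 2);       // ...but two UTF-8 code units (chars)
    assert(s.front == '\u00E9'); // front decodes the whole code point to a dchar
}
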
>
> Now, technically speaking, a code point isn't necessarily a full character,
> since you can also combine code points (e.g. adding a subscript to a letter);
> a full character is what's called a grapheme. Unfortunately, at the moment,
> Phobos doesn't have a way to operate on graphemes, but operating on code
> points is _far_ more correct than operating on code units. It's also more
> efficient.
>
> Unfortunately, in order to code completely efficiently with Unicode, you have
> to understand quite a bit about it, which most programmers don't, but by
> operating on ranges of code points, Phobos manages to be correct in the
> majority of cases.

I know about Unicode, code units/points and their encoding.

> So, yes. It's very much on purpose that all strings are treated as ranges of
> dchar.

foreach lets you iterate any string by char, wchar, or dchar; a dchar
default is appropriate there, but why impose it on ranges?
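
For example, with foreach the element type is up to the caller:

void main()
{
    string s = "test";
    foreach (char c;  s) {} // iterates UTF-8 code units, no decoding
    foreach (wchar c; s) {} // transcodes to UTF-16 code units
    foreach (dchar c; s) {} // decodes to code points
}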

I was afraid it was on purpose, because it has some bad consequences: it
breaks genericity when dealing with ranges. Consider a custom range of char:

struct CharRange
{
     @property bool empty();
     @property char front();
     void popFront();
}
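
The discrepancy is easy to check (assuming std.range is imported for
ElementType):

import std.range : ElementType;

static assert(is(typeof(CharRange.init.front) == char));
static assert(is(ElementType!CharRange == char));
static assert(is(ElementType!string == dchar)); // narrow strings decode to dchar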

typeof(CharRange.front) and ElementType!CharRange both yield _char_,
while for a string they yield _dchar_. This discrepancy forces the range
writer to special-case strings. I'm currently trying to write a ByDchar
range:

template ByDchar(R)
      if (isInputRange!R && isSomeChar!(ElementType!R))
{
     alias ElementType!R E;
     static if (is(E == dchar))
         alias R ByDchar;
     else static if (is(E == char))
     {
         struct ByDchar
         {
             ...
         }
     }
     else static if (is(E == wchar))
     {
         ...
     }
}
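
For a plain input range of char like CharRange above, the char branch
would look roughly like this (a rough sketch that gathers one UTF-8
sequence into a small buffer and decodes it with std.utf; error handling
for truncated or invalid input is omitted):

struct ByDchar
{
    private R r;
    private dchar cur;
    private bool exhausted;

    this(R r)
    {
        this.r = r;
        exhausted = r.empty;
        if (!exhausted)
            decodeNext();
    }

    @property bool empty() { return exhausted; }
    @property dchar front() { return cur; }

    void popFront()
    {
        if (r.empty)
            exhausted = true;
        else
            decodeNext();
    }

    private void decodeNext()
    {
        import std.utf : decode, stride;
        // Copy one complete UTF-8 sequence, then decode it to a dchar.
        char[4] buf;
        buf[0] = r.front;
        r.popFront();
        immutable len = stride(buf[], 0);
        foreach (i; 1 .. len)
        {
            buf[i] = r.front;
            r.popFront();
        }
        size_t index = 0;
        cur = decode(buf[0 .. len], index);
    }
}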

The problem with that range is that when it takes a string type, it
aliases the result to that type itself, because ElementType!R yields
dchar. This is what I mean by "bad consequences": I just want to iterate
a string by _char_, not _dchar_.
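
Assuming the elided branches are filled in, the instantiation for a
string just collapses to the alias:

static assert(is(ByDchar!string == string)); // never reaches the char branch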

