typeof(string.front) should be char
Piotr Szturmaj
bncrbme at jadamspam.pl
Sat Mar 3 05:57:59 PST 2012
Jonathan M Davis wrote:
> On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
>> On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
>> > Hello,
>> >
>> > For this code:
>> >
>> > auto c = "test"c;
>> > auto w = "test"w;
>> > auto d = "test"d;
>> > pragma(msg, typeof(c.front));
>> > pragma(msg, typeof(w.front));
>> > pragma(msg, typeof(d.front));
>> >
>> > compiler prints:
>> >
>> > dchar
>> > dchar
>> > immutable(dchar)
>> >
>> > IMO it should print this:
>> >
>> > immutable(char)
>> > immutable(wchar)
>> > immutable(dchar)
>> >
>> > Is it a bug?
>>
>> No, that's by design. When used as InputRange ranges, slices of any
>> character type are exposed as ranges of dchar.
>
> Indeed.
>
> Strings are always treated as ranges of dchar, because it generally makes no
> sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A
> wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one
> of those which is guaranteed to be a code point is dchar, since in UTF-32, all
> code points are a single code unit. If you were to operate on individual chars
> or wchars, you'd be operating on pieces of characters rather than whole
> characters, which wreaks havoc with Unicode.
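For illustration, a non-ASCII character shows the code-unit/code-point difference directly (a minimal sketch; "é" is just an example character):

```d
import std.stdio;

void main()
{
    string s = "é"; // U+00E9: one code point, two UTF-8 code units
    writeln(s.length); // prints 2 -- .length counts chars (code units)
    foreach (dchar d; s)
        writeln(d);    // decoding yields a single code point: é
}
```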
>
> Now, technically speaking, a code point isn't necessarily a full character,
> since you can also combine code points (e.g. adding a subscript to a letter),
> and a full character is what's called a grapheme, and unfortunately, at the
> moment, Phobos doesn't have a way to operate on graphemes, but operating on
> code points is _far_ more correct than operating on code units. It's also more
> efficient.
>
> Unfortunately, in order to code completely efficiently with unicode, you have
> understand quite a bit about it, which most programmers don't, but by
> operating on ranges of code points, Phobos manages to be correct in the
> majority of cases.
I know about Unicode, code units/points and their encoding.
> So, yes. It's very much on purpose that all strings are treated as ranges of
> dchar.
Foreach already gives the opportunity to handle any string by char, wchar or dchar.
Defaulting to dchar is appropriate there, but why force it on ranges?
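That is, the explicit element type in foreach lets the programmer pick the iteration unit per loop; a minimal sketch:

```d
void main()
{
    string s = "test";
    foreach (char c; s)  {} // iterate by UTF-8 code unit
    foreach (wchar w; s) {} // transcoded to UTF-16 code units
    foreach (dchar d; s) {} // decoded to code points (the range default)
}
```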
I was afraid it was on purpose, because it has some bad consequences: it
breaks genericity when dealing with ranges. Consider a custom range of char:
struct CharRange
{
    @property bool empty();
    @property char front();
    void popFront();
}
typeof(CharRange.front) and ElementType!CharRange both yield _char_,
while for string they yield _dchar_. This discrepancy forces range
writers to special-case strings. I'm currently trying to write a
ByDchar range:
template ByDchar(R)
    if (isInputRange!R && isSomeChar!(ElementType!R))
{
    alias ElementType!R E;

    static if (is(E == dchar))
        alias R ByDchar;
    else static if (is(E == char))
    {
        struct ByDchar
        {
            ...
        }
    }
    else static if (is(E == wchar))
    {
        ...
    }
}
The problem with that range is that when it takes a string type, it
aliases ByDchar to the string type itself, because ElementType!R yields
dchar. This is why I'm talking about "bad consequences": I just want to
iterate a string by _char_, not _dchar_.
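The discrepancy can be checked at compile time; a minimal sketch (the stub bodies are assumptions added just so the type checks work, only the types matter):

```d
import std.range;

struct CharRange
{
    @property bool empty() { return true; } // stub body, assumed
    @property char front() { return 'a'; }  // stub body, assumed
    void popFront() {}
}

// A user-defined char range keeps its element type...
static assert(is(ElementType!CharRange == char));
// ...but a built-in string is special-cased to dchar.
static assert(is(ElementType!string == dchar));
```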
More information about the Digitalmars-d-learn
mailing list