standard ranges

Wed Jun 27 10:30:48 PDT 2012

On Wednesday, June 27, 2012 19:54:12 Gor Gyolchanyan wrote:
> On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > On Wednesday, June 27, 2012 19:47:41 Gor Gyolchanyan wrote:
> > > On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > > > On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
> > > > > I tested it out and the string literal without qualifiers counts as
> > > > > a dstring.
> > > > 
> > > > That depends entirely on what you assign it to.
> > > > writeln(typeof("hello").stringof) prints string, not dstring. So, the
> > > > literal by itself is a string by default.
> > > > 
> > > > - Jonathan M Davis
> > > 
> > > this is weird. I wrote a function, which transforms anything, which
> > > qualifies as isForwardRange into an implementation of ForwardRange. And
> > > the type inference of that function produced a ForwardRangeImpl!dchar
> > > when I passed it a string literal.
> > > 
> > > Although string and wstring also qualify as a forward range.
> > 
> > _All_ strings are considered to be ranges of dchar. That's why string and
> > wstring are not random access ranges and hasLength is false for them.
> > 
> > - Jonathan M Davis
> 
> So why is the type of a string literal _string_ by default? Isn't it
> confusing when dealing with ranges?

I don't see why having the literal be a string would make anything confusing. 
The fact that a string is considered a range of dchar rather than char could 
be, but I don't see why having a string literal be a dstring instead of a 
string would help with that. Besides, it's generally expected that you'll use 
string for strings unless you specifically need wstring or dstring for some 
reason.
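
As a rough illustration of the point about literals (a minimal sketch, not
anything from the compiler's documentation): the literal on its own is typed
as string, and it only becomes a wstring or dstring when the context or a
suffix asks for that.

```d
import std.stdio : writeln;

void main()
{
    // With no other context, a string literal is a string (immutable(char)[]).
    static assert(is(typeof("hello") == string));

    // The same literal converts implicitly when the target type asks for it.
    wstring w = "hello";
    dstring d = "hello";
    static assert(is(typeof(w) == immutable(wchar)[]));
    static assert(is(typeof(d) == immutable(dchar)[]));

    // A suffix fixes the type of the literal itself.
    static assert(is(typeof("hello"w) == wstring));
    static assert(is(typeof("hello"d) == dstring));

    writeln(typeof("hello").stringof); // prints "string"
}
```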

Regardless, ranges aren't really part of the language. They're a library 
artifact. The _only_ place that the language has anything to do with them is 
foreach, in which case

foreach(e; range)
{
 // code
}

becomes

for(auto _range = range; !_range.empty; _range.popFront())
{
 auto e = _range.front;
 // code
}

That's it. So, the fact that Phobos treats strings as ranges of dchar is 
completely separate from what the language is doing with string literals. 
foreach on strings doesn't iterate over dchars unless you specifically give 
dchar as the element type. You can get a string's length. You can use random 
access on it. You can slice it. But this falls apart _very_ quickly with 
general algorithms, because a string is an array of code _units_ rather than 
code points. So, if you iterate over char, you're iterating over pieces of 
characters rather than whole characters. So, Phobos' solution is to treat 
arrays of char and wchar as ranges of dchar rather than ranges of char and 
wchar, and they lose length, random access, and slicing as far as ranges are 
concerned (though algorithms can special case for them and use those abilities 
where appropriate, since they're still there - they just can't be used 
generically or you'd be operating on code units).
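
To make the split concrete, here is a small sketch of the two views of the
same string. It assumes Phobos' usual behavior of decoding narrow strings
when they are used as ranges; countForeach is a hypothetical helper written
just for this example.

```d
import std.range;

// Phobos' view: a string is a forward range of dchar, without length,
// random access, or slicing as far as the range traits are concerned.
static assert(isForwardRange!string);
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);
static assert(!hasSlicing!string);

// A dstring's elements are already whole code points, so it keeps them all.
static assert(hasLength!dstring);
static assert(isRandomAccessRange!dstring);

// Hypothetical helper: counts foreach iterations for a given element type.
size_t countForeach(T)(string s)
{
    size_t n;
    foreach (T c; s)
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // The language's view: an array of code units.
    assert(s.length == 6);
    assert(s[1 .. 3] == "\u00E9");
    assert(countForeach!char(s) == 6);

    // foreach only decodes when dchar is given as the element type.
    assert(countForeach!dchar(s) == 5);

    // The range primitives decode: front is a dchar, and walking the
    // range visits five code points.
    assert(s.front == 'h');
    assert(s.walkLength == 5);
}
```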

In some cases, you need to be able to treat strings as arrays of code units, 
while in others you need to treat them as arrays of code points. In order to 
use strings properly, you need to understand that. There's no way around it. 
It's life with Unicode. The library went the route of using code points for 
everything because it's more correct and less error-prone, whereas the 
language itself generally deals with code units. This does create a bit of 
schizophrenia when dealing with built-in stuff (such as foreach) and library 
stuff, but that's the way that it goes at this point.
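
As an example of what "less error-prone" means here, consider reversing a
string. This is only a sketch: reverseCodeUnits is a hypothetical helper that
treats each code unit as an independent element, which is what a generic
algorithm would do if strings were ranges of char.

```d
import std.algorithm : equal;
import std.exception : assertThrown;
import std.range : retro;
import std.utf : UTFException, validate;

// Hypothetical helper: reverses the string one code unit at a time.
char[] reverseCodeUnits(string s)
{
    auto result = new char[s.length];
    foreach (i, c; s) // no dchar requested, so this walks code units
        result[$ - 1 - i] = c;
    return result;
}

void main()
{
    string s = "h\u00E9"; // 'h' followed by a two-code-unit character

    // Reversing code units splits the multi-unit character apart, so the
    // result is no longer valid UTF-8.
    assertThrown!UTFException(validate(reverseCodeUnits(s)));

    // Treating the string as a range of dchar reverses whole code points.
    assert(equal(s.retro, "\u00E9h"));
}
```

For pure ASCII input both approaches agree, which is part of why code unit
bugs tend to go unnoticed until non-ASCII text shows up.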

If strings were a struct of some kind that defaulted to using code points but 
allowed you to use code units when necessary, then the situation could be 
improved, but no one has been able to come up with a satisfactory proposal to 
do that, and it would break so much code at this point to change what string 
was aliased to that it's unlikely to ever happen. Not to mention, it doesn't 
really fix the underlying problem of having to know and worry about code units 
vs code points. They're intrinsic to Unicode, and you can't really fix that. 
There's no way around it if you want to be able to efficiently operate on strings.

- Jonathan M Davis