standard ranges
Jonathan M Davis
jmdavisProg at gmx.com
Wed Jun 27 10:30:48 PDT 2012
On Wednesday, June 27, 2012 19:54:12 Gor Gyolchanyan wrote:
> On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > On Wednesday, June 27, 2012 19:47:41 Gor Gyolchanyan wrote:
> > > On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > > > On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
> > > > > I tested it out and the string literal without qualifiers counts as
> > > > > a dstring.
> > > >
> > > > That depends entirely on what you assign it to.
> > > > writeln(typeof("hello").stringof) prints string, not dstring. So, the
> > > > literal by itself is a string by default.
> > > >
> > > > - Jonathan M Davis
> > >
> > > this is weird. I wrote a function, which transforms anything, which
> > > qualifies as isForwardRange into an implementation of ForwardRange. And the
> > > type inference of that function produced a ForwardRangeImpl!dchar when I
> > > passed it a string literal.
> > >
> > > Although string and wstring also qualify as a forward range.
> >
> > _All_ strings are considered to be ranges of dchar. That's why string and
> > wstring are not random access ranges and hasLength is false for them.
> >
> > - Jonathan M Davis
>
> So why is the type of a string literal _string_ by default? Isn't it
> confusing when dealing with ranges?

I don't see why having the literal be a string would make anything confusing.
The fact that a string is considered a range of dchar rather than char could
be, but I don't see why having a string literal be a dstring instead of a
string would help with that. Besides, it's generally expected that you'll use
string for strings unless you specifically need wstring or dstring for some
reason.
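
To make the distinction concrete, here is a minimal sketch of the two separate facts: the type the language gives a literal, and the element type Phobos reports for it as a range. It assumes a current Phobos where ElementType lives in std.range.primitives; the variable names are only for illustration.

```d
import std.range.primitives : ElementType;

void main()
{
    // On its own, the literal is a string, i.e. immutable(char)[].
    static assert(is(typeof("hello") == string));

    // The same literal converts implicitly when the target type asks for it.
    wstring w = "hello";
    dstring d = "hello";
    static assert(is(typeof(w) == wstring));
    static assert(is(typeof(d) == dstring));

    // As ranges, all three string types have dchar as their element type.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementType!wstring == dchar));
    static assert(is(ElementType!dstring == dchar));
}
```

So a template that infers its type from the range element type will see dchar even though the argument itself is a plain string.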

Regardless, ranges aren't really part of the language. They're a library
artifact. The _only_ place that the language has anything to do with them is
foreach, in which case

    foreach(e; range)
    {
        // code
    }

becomes

    for(auto _range = range; !_range.empty; _range.popFront())
    {
        auto e = _range.front;
        // code
    }

That's it. So, the fact that Phobos treats strings as ranges of dchar is
completely separate from what the language is doing with string literals.
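
As a rough illustration of that lowering, the sketch below runs the same user-defined range through foreach and through the hand-written for loop and checks that both see the same elements. Countdown and the two helper functions are hypothetical names made up for this example, not anything from Phobos.

```d
// Countdown is a hypothetical type for illustration; it is not part of Phobos.
struct Countdown
{
    int current;

    // The three primitives that foreach relies on for an input range.
    @property bool empty() const { return current <= 0; }
    @property int front() const { return current; }
    void popFront() { --current; }
}

int[] viaForeach(Countdown range)
{
    int[] result;
    foreach (e; range)
        result ~= e;
    return result;
}

int[] viaLoweredLoop(Countdown range)
{
    int[] result;
    // Hand-written equivalent of the loop the compiler generates for foreach.
    for (auto r = range; !r.empty; r.popFront())
    {
        auto e = r.front;
        result ~= e;
    }
    return result;
}

void main()
{
    assert(viaForeach(Countdown(3)) == [3, 2, 1]);
    assert(viaLoweredLoop(Countdown(3)) == [3, 2, 1]);
}
```

Nothing here involves Phobos at all, which is the point: the language only needs empty, front, and popFront to exist on the type.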
foreach on strings doesn't iterate over dchars unless you specifically give
dchar as the element type. You can get a string's length. You can use random
access on it. You can slice it. But this falls apart _very_ quickly with
general algorithms, because a string is an array of code _units_ rather than
code points. So, if you iterate over char, you're iterating over pieces of
characters rather than whole characters. Phobos' solution is to treat
arrays of char and wchar as ranges of dchar rather than ranges of char and
wchar, so they lose length, random access, and slicing as far as ranges are
concerned (though algorithms can special-case them and use those abilities
where appropriate, since they're still there; they just can't be used
generically, or you'd be operating on code units).
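
The sketch below puts the two views side by side for one string with a non-ASCII character. It assumes a current Phobos in which strings are still auto-decoded by the range primitives; the variable names are just for illustration.

```d
import std.range;

void main()
{
    // "h\u00E9llo" is "héllo": 5 code points, 6 UTF-8 code units (é takes two).
    string s = "h\u00E9llo";

    // Language view: an array of char, with length, indexing, and slicing.
    assert(s.length == 6);
    assert(s[0] == 'h');
    assert(s[0 .. 1] == "h");

    // foreach defaults to code units; asking for dchar makes it decode.
    size_t units, points;
    foreach (c; s) ++units;         // c is a char, one code unit
    foreach (dchar c; s) ++points;  // c is a decoded code point
    assert(units == 6);
    assert(points == 5);

    // Phobos view: a forward range of dchar without length, random access,
    // or slicing.
    static assert(isForwardRange!string);
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasSlicing!string);

    // The range primitives decode, so walking the range counts code points.
    assert(s.front == 'h');
    assert(s.walkLength == 5);
}
```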

In some cases, you need to be able to treat strings as arrays of code units,
while in others you need to treat them as arrays of code points. In order to
use strings properly, you need to understand that. There's no way around it.
It's life with Unicode. The library went the route of using code points for
everything because it's more correct and less error-prone, whereas the
language itself generally deals with code units. This does create a bit of
schizophrenia when dealing with built-in stuff (such as foreach) and library
stuff, but that's the way that it goes at this point.
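
When you do need to pick one view explicitly, the library has helpers for each. The following is only a sketch, assuming std.string.representation and std.utf.count as they exist in current Phobos; which one is right depends on whether the task is about storage (code units) or about characters (code points).

```d
import std.string : representation;
import std.utf : count;

void main()
{
    string s = "h\u00E9llo";   // "héllo"

    // Code-unit view: representation exposes the raw UTF-8 code units
    // as immutable(ubyte)[], which is what you want for sizes and I/O.
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.length == 6);

    // Code-point view: count decodes the string and reports how many
    // code points it contains.
    assert(count(s) == 5);
}
```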

If strings were a struct of some kind that defaulted to using code points but
allowed you to use code units when necessary, then the situation could be
improved, but no one has been able to come up with a satisfactory proposal to
do that, and it would break so much code at this point to change what string
was aliased to that it's unlikely to ever happen. Not to mention, it doesn't
really fix the underlying problem of having to know and worry about code units
vs code points. They're intrinsic to Unicode, and you can't really fix that.
There's no way around it if you want to be able to efficiently operate on
strings.

- Jonathan M Davis