standard ranges

Wed Jun 27 14:28:58 PDT 2012

On Wednesday, June 27, 2012 17:11:56 Steven Schveighoffer wrote:
> On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr at gmx.ch> wrote:
> > On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
> >> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr at gmx.ch> wrote:
> >>> There is no reason for anyone to be confused about this endlessly. It
> >>> is simple to understand. Furthermore, think about the implications of a
> >>> library-defined string type: it just introduces the problem of what the
> >>> type of built-in string literals should be. This would cause endless
> >>> pain with type deduction, ifti, string mixins, ... A library-defined
> >>> string type cannot be a full string type. Pretending that it can has no
> >>> value.
> >> 
> >> Default type of the literal should be the library type.
> > 
> > Then it is not a library type, but a built-in type. Are you planning to
> > inject a dependency on Phobos into the compiler?
> 
> No, druntime, and include minimal utf support. We do the same thing with
> AssociativeArray.
> 
> >> If you want immutable(char)[], use "abc".codeunits or equivalent.
> > 
> > I really don't want to type .codeunits, but I want to use
> > immutable(char)[] everywhere. This 'library type' is just an interface
> > change that makes writing nice and efficient code a kludge.
> 
> When most string functions take strings, why would you want to use
> immutable(char)[] everywhere?
> 
> >> Of course, it should by default work as a zero-terminated char * for C
> >> compatibility.
> >> 
> >> The current situation is not simple to understand.
> > 
> > It is simple, even if not immediately obvious. It does not have to be
> > immediately obvious without explanation. It needs to be convenient.
> 
> Try sorting an array of ascii characters.

Cast it to ubyte[]. Problem solved. I honestly don't think that operating on 
code units like that should be encouraged at all, so if it's a bit hard to do, 
then that's a _good_ thing (but since all that's required is casting to 
ubyte[], it's still quite easy - you just have to tell the compiler that 
that's what you really want to do rather than it being the default behavior). 
The problem that we have is the inconsistencies between how the language 
treats strings and how the library does, not the fact that operating on char[] 
as if it were ASCII rather than UTF-8 requires some casting.
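
As a minimal sketch of the cast described above (the helper name sortAscii is 
hypothetical, and the input is assumed to contain only ASCII), sorting the 
code units of a char[] could look like this:

```d
import std.algorithm : sort;

// Hypothetical helper: sorts the code units of s in place.
// Assumes s holds only ASCII; on multi-byte UTF-8 this would scramble sequences.
char[] sortAscii(char[] s)
{
    // Reinterpret the same memory as raw bytes. ubyte[] is an ordinary
    // random-access range, so sort accepts it where it rejects char[].
    auto units = cast(ubyte[]) s;
    sort(units);
    return s;
}

void main()
{
    char[] s = "hello".dup;
    assert(sortAscii(s) == "ehllo");
}
```

The cast is the explicit "I really mean code units" signal; nothing is copied, 
so the original array ends up sorted.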

> >> Generic code that accepts arrays has to special-case narrow-width
> >> strings if you plan to use phobos with them in some cases. That is a
> >> horrible situation.
> > 
> > Generic code accepts ranges, not arrays. All necessary (or maybe
> > unnecessary, I don't know) special casing is already done for you in
> > Phobos. The _only_ thing that is problematic is the inconsistent
> > 'foreach' behaviour.
> 
> Plenty of generic code specializes on arrays.

You're stuck doing that regardless of how strings are represented. You have to 
operate on them as ranges of code points (or even graphemes) if you want 
correct string processing, but that's inefficient, so any code that cares about 
efficiency, and that can gain it by using knowledge of how Unicode works to 
operate on the code units directly, will need to special-case strings. Whether 
string is an array or a struct has zero effect on that. All that it affects is 
which code operates on it as an array of code units vs. as a range of code 
points.
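
To illustrate the kind of special-casing meant here, the following is a rough 
sketch only. The function countAscii is hypothetical and not part of Phobos; 
it relies on the fact that in UTF-8 and UTF-16 an ASCII value never occurs 
inside a multi-unit sequence, so narrow strings can be scanned without decoding:

```d
import std.range;                   // empty, front, popFront for arrays
import std.traits : isNarrowString;

// Hypothetical example: counts occurrences of an ASCII character c in r.
size_t countAscii(R)(R r, char c)
{
    assert(c < 0x80, "c must be ASCII");
    size_t n = 0;
    static if (isNarrowString!R)
    {
        // Fast path: plain foreach walks code units, so nothing is decoded.
        foreach (unit; r)
            if (unit == c) ++n;
    }
    else
    {
        // Generic path: go through the range primitives.
        for (; !r.empty; r.popFront())
            if (r.front == c) ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("h\u00E9llo w\u00F6rld", 'l') == 3);   // string, fast path
    assert(countAscii("h\u00E9llo w\u00F6rld"d, 'l') == 3);  // dstring, generic path
}
```

The same static if would be needed whether string were an array or a struct 
wrapping one; only the condition being tested would change.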

> >>> alias immutable(char)[] string is just fine.
> >> 
> >> That is technically fine, but if phobos wants to treat immutable(char)[]
> >> as something other than an array, it is not fine.
> >> 
> >> -Steve
> > 
> > Phobos does not treat immutable(char)[] as something other than an
> > array. It does not treat all arrays uniformly though.
> 
> It certainly does. An array by definition is a random-access range. It
> does not treat strings as random access ranges.

Well, now you're getting into a semantics argument. isRandomAccessRange defines 
what a random access range is. All arrays which aren't narrow strings qualify. 
Narrow strings do not. Yes, they do have random-access operations, but they 
aren't random-access ranges, because they're ranges of code points, not code 
units.
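
A few compile-time checks make the distinction concrete. This is a sketch 
that assumes a Phobos in which narrow strings are decoded by the range 
primitives, as described above:

```d
import std.range;   // isRandomAccessRange, ElementType

// Arrays that are not narrow strings qualify as random-access ranges.
static assert( isRandomAccessRange!(int[]));
static assert( isRandomAccessRange!dstring);

// Narrow strings do not, even though the language lets you index them.
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!wstring);

// Indexing yields a code unit; the range element type is a code point.
static assert(is(typeof("a"[0]) == immutable(char)));
static assert(is(ElementType!string == dchar));

void main() {}
```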

Yes, this makes it so that character arrays are treated inconsistently from 
other arrays, but the library is very consistent in how it handles them, 
because it _never_ deals with strings as being made of code units. If it's 
operating on them as arrays, then it takes unicode into account, and if it's 
operating on them as ranges, it treats them as ranges of code points. It 
_always_ makes sure that it's operating on code points. Plenty of code 
specializes on strings so that it can deal with the code units in an efficient 
manner rather than having to decode them all the time, but Phobos is 
completely consistent with regard to how it treats strings. The _only_ 
inconsistencies are between the language and the library - namely how foreach 
iterates on code units by default and the fact that while the language defines 
length, slicing, and random-access operations for strings, the library 
effectively does not consider strings to have them.
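
The two inconsistencies can be seen side by side in a small sketch (the 
helpers countUnits and countPoints are hypothetical names used only for 
illustration):

```d
import std.range;   // front, walkLength, hasLength, hasSlicing

// Plain foreach over a string visits code units.
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s) ++n;
    return n;
}

// Asking for dchar makes foreach decode and visit code points.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;
    return n;
}

// The language gives strings length and slicing; the library traits say no.
static assert(!hasLength!string);
static assert(!hasSlicing!string);

void main()
{
    string s = "\u00E9";              // one code point, two UTF-8 code units
    assert(s.length == 2);            // language: number of code units
    assert(walkLength(s) == 1);       // library: number of code points
    assert(countUnits(s) == 2);
    assert(countPoints(s) == 1);
    static assert(is(typeof(s[0]) == immutable(char)));
    static assert(is(typeof(s.front) == dchar));
}
```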

- Jonathan M Davis