Ranges

Jonathan M Davis jmdavisProg at gmx.com
Fri Mar 18 14:49:11 PDT 2011


On Friday, March 18, 2011 14:08:48 Peter Alexander wrote:
> On 18/03/11 5:53 PM, Jonathan M Davis wrote:
> > On Friday, March 18, 2011 03:32:35 spir wrote:
> >> On 03/18/2011 10:29 AM, Peter Alexander wrote:
> >>> On 13/03/11 12:05 AM, Jonathan M Davis wrote:
> >>>> So, when you're using a range of char[] or wchar[], you're really
> >>>> using a range of dchar. These ranges are bi-directional. They can't
> >>>> be sliced, and they can't be indexed (since doing so would likely be
> >>>> invalid). This generally works very well. It's exactly what you want
> >>>> in most cases. The problem is that that means that the range that
> >>>> you're iterating over is effectively of a different type than
> >>>> the original char[] or wchar[].
> >>> 
> >>> This has to be the worst language design decision /ever/.
> >>> 
> >>> You can't just mess around with fundamental principles like "the first
> >>> element in an array of T has type T" for the sake of a minor
> >>> convenience. How are we supposed to do generic programming if common
> >>> sense reasoning about types doesn't hold?
> >>> 
> >>> This is just std::vector<bool>  from C++ all over again. Can we not
> >>> learn from mistakes of the past?
> >> 
> >> I partially agree, but. Compare with a simple ASCII text: you could
> >> iterate over its chars (= codes = bytes), words, lines... Or according
> >> to schemes specific to your app (e.g. reverse order, every number in
> >> it, every word at the start of a line...). A piece of text is not only
> >> a stream of codes.
> >> 
> >> The problem is there is no good decision in the case of char[] or
> >> wchar[]. We would have to choose some "natural" sense of what it
> >> means to iterate over a text, but there is no such thing. What does it
> >> *mean*? What is the natural unit of a text?
> >> Bytes or words are code units which mean nothing. Code points (<->
> >> dchars) are not guaranteed to mean anything either (as shown by past
> >> discussion: one code point may be the base 'a' and the following one
> >> the combining '^', both in "â"). Code points do not represent
> >> "characters" in the common sense. So it is very clear that implicitly
> >> iterating over dchars is a wrong choice. But what else? I would rather
> >> get rid of wchar and dchar and deal with a plain stream of bytes
> >> supposed to represent UTF-8, until we get a good solution to operate
> >> at the level of "human" characters.
> > 
> > Iterating over dchars works in _most_ cases. Iterating over chars only
> > works for pure ASCII. The additional overhead for dealing with graphemes
> > instead of code points is almost certainly prohibitive, it _usually_
> > isn't necessary, and we don't have an actual grapheme solution yet. So,
> > treating char[] and wchar[] as if their elements were valid on their own
> > is _not_ going to work. Treating them along with dchar[] as ranges of
> > dchar _mostly_ works. We definitely should have a way to handle them as
> > ranges of graphemes for those who need to, but the code point vs
> > grapheme issue is nowhere near as critical as the code unit vs code
> > point issue.
> > 
> > I don't really want to get into the whole unicode discussion again. It
> > has been discussed quite a bit on the D list already. There is no
> > perfect solution. The current solution _mostly_ works, and, for the most
> > part IMHO, is the correct solution. We _do_ need a full-on grapheme
> > handling solution, but a lot of stuff doesn't need that and the overhead
> > for dealing with it would be prohibitive. The main problem with using
> > code points rather than graphemes is the lack of normalization, and a
> > _lot_ of string code can get by just fine without that.
> > 
> > So, we have a really good 90% solution and we still need a 100% solution,
> > but using the 100% all of the time would almost certainly not be
> > acceptable due to performance issues, and doing stuff by code unit
> > instead of code point would be _really_ bad. So, what we have is good
> > and will likely stay as is. We just need a proper grapheme solution for
> > those who need it.
> > 
> > - Jonathan M Davis
> > 
> > 
> > P.S. Unicode is just plain ugly.... :(
> 
> I must be missing something, because the solution seems obvious to me:
> 
> char[], wchar[], and dchar[] should be simple arrays like int[] with no
> unicode semantics.
> 
> string, wstring, and dstring should not be aliases to arrays, but
> instead should be separate types that behave the way char[], wchar[],
> and dchar[] do currently.
> 
> Is there any problem with this approach?

There has been a fair bit of debate about it in the past. No one has been able 
to come up with an alternate solution which is generally considered better than 
what we have.

char is defined to be a UTF-8 code unit. wchar is defined to be a UTF-16 code 
unit. dchar is defined to be a UTF-32 code unit (which is also guaranteed to be a 
code point). So, manipulating char[] and wchar[] as arrays of characters doesn't 
generally make any sense. They _aren't_ characters. They're code units. Having a 
range of char or wchar generally makes no sense.
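
For instance, a minimal sketch of the difference (the string literal is just an 
illustration): .length counts code units, while walking the string as a range 
counts decoded code points:

```d
import std.range : walkLength;

void main()
{
    string s = "âb"; // 'â' is one code point encoded as two UTF-8 code units

    // .length counts code units (the bytes of the UTF-8 encoding)
    assert(s.length == 3);

    // Walking s as a range decodes it, one dchar (code point) at a time
    assert(s.walkLength == 2);
}
```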

When you don't care about the contents of a string, treating it as an array is 
very useful. When you _do_ care, you need to treat it as a range of dchar - no 
matter which unicode encoding it uses. So, having arrays of code units which are 
treated as ranges of dchar makes a lot of sense.
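
A short sketch of what that means in terms of the range primitives (checks as 
reported by Phobos' std.range.primitives):

```d
import std.range.primitives;

void main()
{
    // As ranges, narrow strings decode: their element type is dchar
    static assert(is(ElementType!string == dchar));

    // Bidirectional, but no random access or slicing as a range
    static assert(isBidirectionalRange!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasSlicing!string);

    string s = "âb";
    assert(s.front == 'â'); // front decodes a full code point
    s.popFront();           // advances past both code units of 'â'
    assert(s == "b");
}
```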

We could have a wrapper type which wrapped arrays of char, wchar, or dchar and 
had the appropriate operations on them and was a range of dchar, but then you'd 
have to get at the underlying array for various stuff. So, whether it's a gain or 
loss is debatable. You have to special-case strings _regardless_. For some stuff, 
they need to be treated as arrays of code units, and for other stuff they need to 
be treated as ranges of code points.
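
Such a wrapper might look roughly like this (the type name and its API are 
purely illustrative, not an actual Phobos type), which also shows the 
"get at the underlying array" escape hatch:

```d
import std.range.primitives : isInputRange;
import std.utf : decodeFront;

/// Hypothetical wrapper: presents a string as a range of dchar while
/// keeping the raw code-unit array reachable via .data.
struct ByDchar
{
    string data; // the underlying array of code units

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        auto t = data;        // copy so front doesn't consume anything
        return t.decodeFront; // decode the leading code point
    }

    void popFront() { data.decodeFront(); } // drop one full code point
}

static assert(isInputRange!ByDchar);

void main()
{
    auto r = ByDchar("âb");
    assert(r.front == 'â');
    r.popFront();
    assert(r.front == 'b');
    assert(r.data == "b"); // the raw array is still there when needed
}
```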

As it stands, range-based functions treat char[], wchar[], etc. properly. 
Allowing char[] to be treated as a range of char would just cause bugs in most 
cases. Generally speaking, when someone tries to use char[] as an output range, 
it _shouldn't_ work: writing arbitrary char values into it can produce invalid 
UTF-8, so in almost all cases treating char[] as a range of char would just be 
buggy.

So, in almost all cases where treating char[] as a range of dchar causes 
problems, it's _preventing bugs_. The one glaring problem with the current 
scheme is foreach. It defaults to the element type of the array, so if you don't 
give dchar as the element type when iterating over a char[] or wchar[], then 
you're going to have bugs. There have been suggestions on how to fix that (such 
as warning when a foreach over a char[] or wchar[] doesn't specify the 
iteration type), but nothing has been implemented yet.
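
For example, the difference shows up directly in foreach (a minimal sketch):

```d
void main()
{
    string s = "âb"; // 'â' is two UTF-8 code units, 'b' is one

    // Default: iterates by code unit - three steps, and the code units
    // of 'â' are not valid characters on their own
    size_t units;
    foreach (char c; s) ++units;
    assert(units == 3);

    // Explicitly asking for dchar decodes - one step per code point
    size_t points;
    foreach (dchar c; s) ++points;
    assert(points == 2);
}
```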

Overall, using arrays for strings (as D has done since pretty much forever) 
works really well. It's just that char[] and wchar[] cannot be treated as 
ranges of char or wchar without asking for problems. On the whole, the current 
solution works quite well, and the operations that it disallows are _supposed_ 
to be disallowed.

- Jonathan M Davis


More information about the Digitalmars-d-learn mailing list