Ranges
Jonathan M Davis
jmdavisProg at gmx.com
Fri Mar 18 10:53:23 PDT 2011
On Friday, March 18, 2011 03:32:35 spir wrote:
> On 03/18/2011 10:29 AM, Peter Alexander wrote:
> > On 13/03/11 12:05 AM, Jonathan M Davis wrote:
> >> So, when you're using a range of char[] or wchar[], you're really using
> >> a range of dchar. These ranges are bi-directional. They can't be
> >> sliced, and they can't be indexed (since doing so would likely be
> >> invalid). This generally works very well. It's exactly what you want in
> >> most cases. The problem is that that means that the range that you're
> >> iterating over is effectively of a different type than
> >> the original char[] or wchar[].
> >
> > This has to be the worst language design decision /ever/.
> >
> > You can't just mess around with fundamental principles like "the first
> > element in an array of T has type T" for the sake of a minor
> > convenience. How are we supposed to do generic programming if common
> > sense reasoning about types doesn't hold?
> >
> > This is just std::vector<bool> from C++ all over again. Can we not learn
> > from mistakes of the past?
>
> I partially agree, but compare with simple ASCII text: you could iterate
> over its chars (= codes = bytes), its words, its lines... or according to
> schemes specific to your app (e.g. in reverse order, over every number in
> it, over every word at the start of a line...). A piece of text is not only
> a stream of codes.
>
> The problem is that there is no good decision in the case of char[] or
> wchar[]. We would have to choose some kind of "natural" sense of what it
> means to iterate over a text, but there is no such thing. What does it
> *mean*? What is the natural unit of a text?
> Bytes or words are code units, which mean nothing on their own. Code points
> (<-> dchars) are not guaranteed to mean anything either (as shown by past
> discussion: one code point may be the base 'a' and the next the combining
> '^', both in "â"). Code points do not represent "characters" in the common
> sense.
> So, it is very clear that implicitly iterating over dchars is a wrong
> choice. But what else? I would rather get rid of wchar and dchar and deal
> with a plain stream of bytes assumed to represent UTF-8, until we get a
> good solution for operating at the level of "human" characters.
Iterating over dchars works in _most_ cases. Iterating over chars only works for
pure ASCII. The additional overhead for dealing with graphemes instead of code
points is almost certainly prohibitive, it _usually_ isn't necessary, and we
don't have an actualy grapheme solution yet. So, treating char[] and wchar[] as
if their elements were valid on their own is _not_ going to work. Treating them
along with dchar[] as ranges of dchar _mostly_ works. We definitely should have a
way to handle them as ranges of graphemes for those who need to, but the code
point vs grapheme issue is nowhere near as critical as the code unit vs code
point issue.
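To make that concrete, here's a minimal sketch (nothing beyond std.range and
std.stdio; treat it as illustrative of the current behavior rather than as a
spec):

    import std.range;
    import std.stdio;

    void main()
    {
        string s = "résumé";        // UTF-8 code units under the hood

        // .length and indexing stay at the code unit (byte) level,
        // so s[1] is only the first byte of the two-byte 'é'.
        writeln(s.length);          // 8 code units, not 6 characters

        // The range primitives decode instead: front yields a dchar.
        static assert(is(typeof(s.front) == dchar));
        writeln(s.front);           // r

        // Iterating with dchar walks code points, not bytes.
        foreach (dchar c; s)
            write(c, ' ');          // r é s u m é
        writeln();

        // walkLength has to decode, so it's O(n) and gives 6, not 8.
        writeln(walkLength(s));     // 6
    }

The point is that the array view (length, indexing, slicing) and the range
view (front, popFront, etc.) of the very same char[] operate at different
levels.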
I don't really want to get into the whole unicode discussion again. It has been
discussed quite a bit on the D list already. There is no perfect solution. The
current solution _mostly_ works, and, for the most part IMHO, is the correct
solution. We _do_ need a full-on grapheme handling solution, but a lot of stuff
doesn't need that and the overhead for dealing with it would be prohibitive. The
main problem with using code points rather than graphemes is the lack of
normalization, and a _lot_ of string code can get by just fine without that.
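The "â" example above is exactly where that bites. Here's a small sketch of
the difference (it assumes grapheme and normalization support along the lines
of what std.uni could grow; byGrapheme and normalize aren't in Phobos as I
write this, so take it as a sketch of the idea rather than of current code):

    import std.range;
    import std.stdio;
    import std.uni;    // byGrapheme, normalize (assumed here)

    void main()
    {
        string composed   = "\u00E2";   // "â" as one precomposed code point
        string decomposed = "a\u0302";  // 'a' + combining circumflex:
                                        // two code points, one grapheme

        writeln(walkLength(composed));    // 1
        writeln(walkLength(decomposed));  // 2 (the code point view differs)

        writeln(composed == decomposed);  // false without normalization

        // A grapheme-level view and normalization close that gap:
        writeln(walkLength(decomposed.byGrapheme));   // 1
        writeln(normalize(decomposed) == composed);   // true (NFC by default)
    }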
So, we have a really good 90% solution, and we still need a 100% solution, but
using the 100% solution all of the time would almost certainly not be
acceptable due to performance issues, and doing stuff by code unit instead of
code point would be
_really_ bad. So, what we have is good and will likely stay as is. We just need
a proper grapheme solution for those who need it.
- Jonathan M Davis
P.S. Unicode is just plain ugly.... :(