Ranges

Peter Alexander peter.alexander.au at gmail.com
Fri Mar 18 14:08:48 PDT 2011


On 18/03/11 5:53 PM, Jonathan M Davis wrote:
> On Friday, March 18, 2011 03:32:35 spir wrote:
>> On 03/18/2011 10:29 AM, Peter Alexander wrote:
>>> On 13/03/11 12:05 AM, Jonathan M Davis wrote:
>>>> So, when you're using a range of char[] or wchar[], you're really using
>>>> a range of dchar. These ranges are bi-directional. They can't be
>>>> sliced, and they can't be indexed (since doing so would likely be
>>>> invalid). This generally works very well. It's exactly what you want in
>>>> most cases. The problem is that that means that the range that you're
>>>> iterating over is effectively of a different type than
>>>> the original char[] or wchar[].
>>>
>>> This has to be the worst language design decision /ever/.
>>>
>>> You can't just mess around with fundamental principles like "the first
>>> element in an array of T has type T" for the sake of a minor
>>> convenience. How are we supposed to do generic programming if common
>>> sense reasoning about types doesn't hold?
>>>
>>> This is just std::vector<bool> from C++ all over again. Can we not learn
>>> from mistakes of the past?
>>
>> I partially agree, but. Compare with a simple ASCII text: you could iterate
>> over its chars (=codes=bytes), words, lines... Or according to specific
>> schemes for your app (eg reverse order, every number in it, every word at
>> start of line...). A piece of text is not only a stream of codes.
>>
>> The problem is there is no good decision, in the case of char[] or wchar[].
>> We should have to choose a kind of "natural" sense of what it means to
>> iterate over a text, but there is no such thing. What does it *mean*? What is
>> the natural unit of a text?
>> Bytes or words are code units which mean nothing. Code points (<-> dchars)
>> are not guaranteed to mean anything either (as shown by past discussion:
>> a code point may be the base 'a', the following one be the combining '^',
>> both in "â"). Code points do not represent "characters" in the common sense.
>> So, it is very clear that implicitly iterating over dchars is a wrong
>> choice. But what else? I would rather get rid of wchar and dchar and deal
>> with a plain stream of bytes supposed to represent UTF-8. Until we get a good
>> solution to operate at the level of "human" characters.
>
> Iterating over dchars works in _most_ cases. Iterating over chars only works for
> pure ASCII. The additional overhead for dealing with graphemes instead of code
> points is almost certainly prohibitive, it _usually_ isn't necessary, and we
> don't have an actual grapheme solution yet. So, treating char[] and wchar[] as
> if their elements were valid on their own is _not_ going to work. Treating them
> along with dchar[] as ranges of dchar _mostly_ works. We definitely should have a
> way to handle them as ranges of graphemes for those who need to, but the code
> point vs grapheme issue is nowhere near as critical as the code unit vs code
> point issue.
>
> I don't really want to get into the whole unicode discussion again. It has been
> discussed quite a bit on the D list already. There is no perfect solution. The
> current solution _mostly_ works, and, for the most part IMHO, is the correct
> solution. We _do_ need a full-on grapheme handling solution, but a lot of stuff
> doesn't need that and the overhead for dealing with it would be prohibitive. The
> main problem with using code points rather than graphemes is the lack of
> normalization, and a _lot_ of string code can get by just fine without that.
>
> So, we have a really good 90% solution and we still need a 100% solution, but
> using the 100% all of the time would almost certainly not be acceptable due to
> performance issues, and doing stuff by code unit instead of code point would be
> _really_ bad. So, what we have is good and will likely stay as is. We just need
> a proper grapheme solution for those who need it.
>
> - Jonathan M Davis
>
>
> P.S. Unicode is just plain ugly.... :(

I must be missing something, because the solution seems obvious to me:

char[], wchar[], and dchar[] should be simple arrays like int[] with no
Unicode semantics.
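
For reference, a minimal sketch of the mismatch under discussion,
assuming Phobos still decodes narrow strings when they are used as
ranges. The sample string is made up for illustration; indexing gives a
code unit, while the range element type is a code point:

```d
import std.range : ElementType, isRandomAccessRange, walkLength;

void main()
{
    // Illustrative sample: U+00E9 ('é') occupies two UTF-8 code units.
    char[] s = "h\u00E9llo".dup;

    // Indexing the array yields a char (a code unit)...
    static assert(is(typeof(s[0]) == char));
    // ...but as a range, its element type is dchar (a code point).
    static assert(is(ElementType!(char[]) == dchar));
    // An int[] has no such split: element type and range type agree.
    static assert(is(ElementType!(int[]) == int));

    // int[] is a random-access range; char[] is not treated as one.
    static assert(isRandomAccessRange!(int[]));
    static assert(!isRandomAccessRange!(char[]));

    assert(s.length == 6);     // counted in code units
    assert(s.walkLength == 5); // counted in code points
}
```

With "simple arrays", the second static assert would report char, and
length and walkLength would agree, exactly as they do for int[].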

string, wstring, and dstring should not be aliases to arrays, but 
instead should be separate types that behave the way char[], wchar[], 
and dchar[] do currently.
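
A rough sketch of what such a separate type could look like. The name
Text and its codeUnits property are hypothetical, invented here for
illustration; only decode and stride from std.utf are existing library
calls. It is a forward range only, not a complete string type:

```d
import std.range : ElementType, isForwardRange, walkLength;
import std.utf : decode, stride;

// Hypothetical wrapper: iterates by code point, stores code units.
struct Text
{
    private immutable(char)[] data;

    @property bool empty() const { return data.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front() const
    {
        immutable(char)[] d = data;
        size_t i = 0;
        return decode(d, i);
    }

    // Skip all code units of the first code point.
    void popFront()
    {
        data = data[stride(data, 0) .. $];
    }

    @property Text save() { return this; }

    // Explicit access to the underlying plain array of code units.
    @property immutable(char)[] codeUnits() const { return data; }
}

void main()
{
    static assert(isForwardRange!Text);
    static assert(is(ElementType!Text == dchar));

    auto t = Text("h\u00E9llo");
    assert(t.codeUnits.length == 6); // code units in the raw array
    assert(t.walkLength == 5);       // code points seen through the range
}
```

The point of the split is that the Unicode behaviour lives in the type
name rather than in special cases for arrays of char.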

Is there any problem with this approach?

