VLERange: a range in between BidirectionalRange and RandomAccessRange

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Jan 13 20:23:10 PST 2011


On 1/13/11 7:09 PM, Michel Fortin wrote:
> On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> wrote:
>>>> Let's take a look:
>>>>
>>>> // Incorrect string code
>>>> void fun(string s) {
>>>>     foreach (i; 0 .. s.length) {
>>>>         writeln("The character in position ", i, " is ", s[i]);
>>>>     }
>>>> }
>>>>
>>>> // Incorrect string_t code
>>>> void fun(string_t!char s) {
>>>>     foreach (i; 0 .. s.codeUnits) {
>>>>         writeln("The character in position ", i, " is ", s[i]);
>>>>     }
>>>> }
>>>>
>>>> Both functions are incorrect, albeit in different ways. The only
>>>> improvement I'm seeing is that the user needs to write codeUnits
>>>> instead of length, which may make her think twice. Clearly, however,
>>>> copiously incorrect code can be written with the proposed interface
>>>> because it tries to hide the reality that underneath a variable-length
>>>> encoding is being used, but doesn't hide it completely (albeit for
>>>> good efficiency-related reasons).
>>>
>>> You might be looking at my previous version. The new version (recently
>>> posted) will throw an exception for that code if a multi-code-unit
>>> code-point is found.
>>
>> I was looking at your latest. It's code that compiles and runs, but
>> dynamically fails on some inputs. I agree that it's often better to
>> fail noisily instead of silently, but in a manner of speaking the
>> string-based code doesn't fail at all - it correctly iterates the code
>> units of a string. This may sometimes not be what the user expected;
>> most of the time they'd care about the code points.
>
> That's forgetting that most of the time people care about graphemes
> (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis 
wrote a library that according to him does grapheme-related stuff nobody 
else does. So apparently graphemes are not what people care about
(although it might be what they should care about).

>>> It also supports this:
>>>
>>> foreach (i, d; s)
>>> {
>>>     writeln("The character in position ", i, " is ", d);
>>> }
>>>
>>> where i is the index (might not be sequential)
>>
>> Well string supports that too, albeit with the nit that you need to
>> specify dchar.
>
> Except it breaks with combining characters. For instance, take the
> string "t̃", which is two code points -- 't' followed by combining tilde
> (U+0303) -- and you'll get the following output:
>
> The character in position 0 is t
> The character in position 1 is ̃
>
> (Note that the tilde becomes combined with the preceding space character.)
>
> The conception of character that normal people have does not match the
> notion of code points when combining characters enter the equation.

This might be a good time to see whether we need to address graphemes 
systematically. Could you please post a few links that would educate me 
and others in the mysteries of combining characters?
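
To make the three levels under discussion concrete, here is a minimal sketch 
using Michel's example string. It assumes a recent Phobos in which std.uni 
provides byGrapheme; the helper names codeUnitCount, codePointCount and 
graphemeCount are made up for illustration and are not part of any proposal 
in this thread.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers, named only for this illustration.

// Number of UTF-8 code units: what s.length reports for a string.
size_t codeUnitCount(string s) { return s.length; }

// Number of code points: range primitives on string decode to dchar.
size_t codePointCount(string s) { return s.walkLength; }

// Number of grapheme clusters (user-perceived characters).
// Assumes std.uni.byGrapheme is available in the installed Phobos.
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 't' followed by U+0303 COMBINING TILDE.
    string s = "t\u0303";

    writeln(codeUnitCount(s));   // 3: one byte for 't', two for U+0303
    writeln(codePointCount(s));  // 2: 't' and the combining tilde
    writeln(graphemeCount(s));   // 1: a single user-perceived character
}
```

The same string therefore has three different "lengths", which is the crux 
of the disagreement above about what a character in position i should mean.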


Thanks,

Andrei

