VLERange: a range in between BidirectionalRange and RandomAccessRange

Thu Jan 13 19:09:52 PST 2011

On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> wrote:
>>> Let's take a look:
>>> 
>>> // Incorrect string code
>>> void fun(string s) {
>>>     foreach (i; 0 .. s.length) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>> 
>>> // Incorrect string_t code
>>> void fun(string_t!char s) {
>>>     foreach (i; 0 .. s.codeUnits) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>> 
>>> Both functions are incorrect, albeit in different ways. The only
>>> improvement I'm seeing is that the user needs to write codeUnits
>>> instead of length, which may make her think twice. Clearly, however,
>>> copiously incorrect code can be written with the proposed interface
>>> because it tries to hide the reality that underneath a variable-length
>>> encoding is being used, but doesn't hide it completely (albeit for
>>> good efficiency-related reasons).
>> 
>> You might be looking at my previous version. The new version (recently
>> posted) will throw an exception for that code if a multi-code-unit
>> code-point is found.
> 
> I was looking at your latest. It's code that compiles and runs, but 
> dynamically fails on some inputs. I agree that it's often better to 
> fail noisily instead of silently, but in a manner of speaking the 
> string-based code doesn't fail at all - it correctly iterates the code 
> units of a string. This may sometimes not be what the user expected; 
> most of the time they'd care about the code points.

That overlooks the fact that, most of the time, people care about 
graphemes (user-perceived characters), not code points.

>> It also supports this:
>> 
>> foreach(i, d; s)
>> {
>>     writeln("The character in position ", i, " is ", d);
>> }
>> 
>> where i is the index (might not be sequential)
> 
> Well string supports that too, albeit with the nit that you need to 
> specify dchar.

Except it breaks with combining characters. For instance, take the 
string "t̃", which is two code points -- 't' followed by combining 
tilde (U+0303) -- and you'll get the following output:

	The character in position 0 is t
	The character in position 1 is ̃

(Note that in the second line the tilde, printed on its own, combines 
with the space character that precedes it.)

The notion of a character that normal people have does not match the 
notion of a code point once combining characters enter the equation.
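
To make the three levels concrete, here is a minimal sketch that counts 
code units, code points and graphemes for the same string. It assumes a 
Phobos release that ships std.uni.byGrapheme, which is not something 
discussed in this thread; the helper names codePoints and graphemes are 
made up for illustration.

```d
import std.range : walkLength;
import std.stdio : writefln, writeln;
import std.uni : byGrapheme;

// Hypothetical helper: number of code points (string ranges decode to dchar).
size_t codePoints(string s)
{
    return s.walkLength;
}

// Hypothetical helper: number of user-perceived characters.
// Assumes std.uni.byGrapheme is available in the Phobos being used.
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 't' followed by U+0303 COMBINING TILDE: one character to a reader.
    string s = "t\u0303";

    writeln(s.length);      // UTF-8 code units: 3
    writeln(codePoints(s)); // code points: 2
    writeln(graphemes(s));  // graphemes: 1

    // Decoding foreach: i is the code unit offset where each code point starts.
    foreach (i, dchar d; s)
        writefln("offset %s: U+%04X", i, cast(uint) d);
    // prints "offset 0: U+0074" then "offset 1: U+0303"
}
```

Only the grapheme count agrees with what a reader would call the number 
of characters in that string.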

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/