VLERange: a range in between BidirectionalRange and RandomAccessRange

Tue Jan 18 05:17:45 PST 2011

On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 1/17/11 9:48 PM, Michel Fortin wrote:
>> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin at michelf.com>
>> said:
>> 
>>> More seriously, you have four choices:
>>> 
>>> 1. code unit
>>> 2. code point
>>> 3. grapheme
>>> 4. require the client to state explicitly which kind of 'character' he
>>> wants; 'character' being an overloaded word, it's reasonable to ask
>>> for disambiguation.
>> 
>> This makes me think of what I did with my XML parser after you made code
>> points the element type for strings. Basically, the parser now uses
>> 'front' and 'popFront' whenever it needs to get the next code point, but
>> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
>> had to add) when testing for or skipping an ASCII character is
>> sufficient. This way I avoid a lot of unnecessary decoding of code points.
>> 
>> For this to work, the same range must let you skip either a unit or a
>> code point. If I were using a separate range with a call to toDchar or
>> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
>> have helped much because the new range would essentially become a new
>> slice independent of the original, so you can't interleave "I want to
>> advance by one unit" with "I want to advance by one code point".
>> 
>> So perhaps the best interface for strings would be to provide multiple
>> range-like interfaces that you can use at the level you want.
>> 
>> I'm not sure if this is a good idea, but I thought I should at least
>> share my experience.
> 
> Very insightful. Thanks for sharing. Code it up and make a solid proposal!

What I use right now is shown below. I'm not sure what a good name for 
it would be, though. The expectation is that frontUnit gives me either 
an ASCII char or, if the front isn't ASCII, a code unit outside the 
ASCII range.

The abstraction doesn't seem very 'solid' to me, in the sense that I 
can't see how it'd apply to ranges other than strings. So it's only 
useful for strings (the character array kind), and only as a workaround 
since you made ElementType!(char[]) a dchar. Any range returning char, 
wchar, or dchar could map frontUnit to front and popFrontUnit to 
popFront to keep things working, but for those ranges the optimization 
is rather pointless. I don't really have an idea where to go from here.

char frontUnit(string input) {
	assert(input.length > 0);
	return input[0];
}
wchar frontUnit(wstring input) {
	assert(input.length > 0);
	return input[0];
}
dchar frontUnit(dstring input) {
	assert(input.length > 0);
	return input[0];
}

void popFrontUnit(ref string input) {
	assert(input.length > 0);
	input = input[1..$];
}
void popFrontUnit(ref wstring input) {
	assert(input.length > 0);
	input = input[1..$];
}
void popFrontUnit(ref dstring input) {
	assert(input.length > 0);
	input = input[1..$];
}
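
The fallback mentioned above, for ranges that already return char, 
wchar, or dchar, could be sketched roughly as follows. This is only an 
illustration, not something the parser uses: the template constraints 
are my assumption, and it assumes a current Phobos where front, 
popFront, and ElementType live in std.range.primitives.

```d
import std.range.primitives : ElementType, front, isInputRange, popFront;
import std.traits : isSomeChar, isSomeString;

// Hypothetical generic fallback: for a non-string range of characters,
// a "unit" is simply one element, so forward to front and popFront.
auto frontUnit(R)(R input)
	if (isInputRange!R && !isSomeString!R && isSomeChar!(ElementType!R))
{
	return input.front;
}

void popFrontUnit(R)(ref R input)
	if (isInputRange!R && !isSomeString!R && isSomeChar!(ElementType!R))
{
	input.popFront();
}

unittest {
	import std.range : only;
	auto r = only('a', 'b'); // a non-string range whose elements are char
	assert(r.frontUnit == 'a');
	r.popFrontUnit();
	assert(r.front == 'b');
}
```

Nothing is gained here in terms of speed, since such a range does no 
decoding in the first place; the fallback only keeps generic code 
compiling.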

version (unittest) {
	import std.string : front, popFront;
}

unittest {
	string test = "été";
	assert(test.length == 5);

	string test2 = test;
	assert(test2.front == 'é');
	test2.popFront();
	assert(test2.length == 3); // removed "é" which is two UTF-8 code units

	string test3 = test;
	assert(test3.frontUnit == "é"c[0]);
	test3.popFrontUnit();
	assert(test3.length == 4); // removed first half of "é", which is one UTF-8 code unit
}
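
To show the interleaving on the same string that the XML parser relies 
on, here is a minimal sketch. The function firstNonSpace is a made-up 
example, not code from the parser, and the two helpers are repeated so 
the snippet compiles on its own. It assumes a current Phobos where 
front and popFront for strings come from std.range.primitives.

```d
import std.range.primitives : front, popFront; // decoding primitives for strings

// Repeated from the string overloads above so this compiles standalone.
char frontUnit(string input) {
	assert(input.length > 0);
	return input[0];
}
void popFrontUnit(ref string input) {
	assert(input.length > 0);
	input = input[1 .. $];
}

// Hypothetical example: skip ASCII spaces one code unit at a time,
// then decode and consume exactly one code point from the same string.
dchar firstNonSpace(ref string input) {
	while (input.length > 0 && input.frontUnit == ' ')
		input.popFrontUnit(); // ASCII test, no decoding needed
	assert(input.length > 0);
	dchar c = input.front; // full UTF-8 decode happens only here
	input.popFront();
	return c;
}

unittest {
	string s = "  été";
	assert(firstNonSpace(s) == 'é');
	assert(s == "té"); // two spaces and one two-unit code point consumed
}
```

Because both kinds of advance act on the same slice, there is no 
separate wrapper range whose position would have to be synchronized 
with the original.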

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/