VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 07:24:42 PST 2011

On 2011-01-15 09:09:17 -0500, foobar <foo at bar.com> said:

> Lutger Blijdestijn Wrote:
> 
>> Michel Fortin wrote:
>> 
>>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
>>> <lutger.blijdestijn at gmail.com> said:
>> ...
>>>> 
>>>> Is it still possible to solve this problem or are we stuck with
>>>> specialized string algorithms? Would it work if VleRange of string was a
>>>> bidirectional range with string slices of graphemes as the ElementType
>>>> and indexing with code units? Often used string algorithms could be
>>>> specialized for performance, but if not, generic algorithms would still
>>>> work.
>>> 
>>> I have my idea.
>>> 
>>> I think it'd be a good idea to improve upon Andrei's first idea --
>>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
>>> elements -- by changing the element type to be the same as the string.
>>> For instance, iterating on a char[] would give you slices of char[],
>>> each having one grapheme.
>>> 
>> ...
>> 
>> Yes, this is exactly what I meant, but you are much clearer. I hope this can
>> be made to work!
>> 
> 
> My two cents are against this kind of design.
> The "correct" approach IMO is a 'universal text' type which is a 
> _container_ of said text. This type would provide ranges for the 
> various abstraction levels. E.g.
> text.codeUnits to iterate by codeUnits

Nothing prevents that in the design I proposed. Andrei's design already 
implements "str".byDchar() that would work for code points. I'd suggest 
changing the API to by!char(), by!wchar(), and by!dchar() for when you 
deal with whatever kind of code unit or code point you want. This would 
be mostly symmetric to what you can already do with foreach:

	foreach (char c; "hello") {}
	foreach (wchar c; "hello") {}
	foreach (dchar c; "hello") {}
// same as:
	foreach (c; "hello".by!char()) {}
	foreach (c; "hello".by!wchar()) {}
	foreach (c; "hello".by!dchar()) {}
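
A minimal sketch of how such a by!T could be spelled, assuming a Phobos 
release that ships std.utf.byUTF. The name `by` is hypothetical (it is 
not a Phobos function) and only forwards to byUTF here:

```d
import std.range : walkLength;
import std.utf : byUTF;

// Hypothetical helper: `by` is not part of Phobos. It only forwards to
// std.utf.byUTF, which lazily re-encodes the text to the requested type.
auto by(C, R)(R text)
{
    return text.byUTF!C;
}

void main()
{
    // U+00E9 followed by U+1D11E (a code point outside the BMP)
    string s = "\u00E9\U0001D11E";

    assert(s.by!char.walkLength == 6);  // UTF-8 code units: 2 + 4
    assert(s.by!wchar.walkLength == 3); // UTF-16 code units: 1 + 2
    assert(s.by!dchar.walkLength == 2); // code points

    foreach (c; s.by!wchar()) {}        // same shape as the loops above
}
```

The three counts differ because the same text is being viewed at three 
different abstraction levels, which is the point of having one by!T 
entry point.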

> Here's a (perhaps contrived) example:
> Let's say I want to find the combining marks in some text.
> 
> For instance, Hebrew uses combining marks for vowels (among other 
> things) and they are optional in the language (There's a "full" form 
> with vowels and a "missing" form without them).
> I have a Hebrew text in the "full" form and I want to strip it and 
> convert it to the "missing" form.
> 
> How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code 
points in a decomposed form (NFD), then you use std.algorithm.filter on 
it:

	// original string
	auto str = "...";

	// create normalized decomposed string as a lazy range of dchar (NFD)
	auto decomposed = decompose(str);

	// filter to remove your favorite combining code point (use the hex code you want)
	auto filtered = filter!"a != 0xFABA"(decomposed);

	// turn it back in composed form (NFC), optional
	auto recomposed = compose(filtered);

	// convert back to a string (could also be wstring or dstring)
	string result = array(recomposed.by!char());

This last line is the one doing all the work. Everything before it just 
chains ranges together to do on-the-fly decomposition, filtering, and 
recomposition; the last line walks that chain of ranges to fill the array.
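
For comparison, here is a rough sketch of the same pipeline using 
functions that exist in std.uni in later Phobos releases. Two things are 
assumptions on my part: the helper name stripCombiningMarks is made up, 
and the filter drops every code point with a non-zero canonical 
combining class instead of one specific mark. Note also that 
std.uni.normalize works eagerly on arrays, so this is not the lazy chain 
described above:

```d
import std.algorithm : filter;
import std.array : array;
import std.conv : to;
import std.uni : combiningClass, NFC, NFD, normalize;

// Hypothetical helper name. Removes every code point whose canonical
// combining class is non-zero, which is broader than one specific mark.
string stripCombiningMarks(string str)
{
    // std.uni.normalize is eager: it returns a string, not a lazy range
    string decomposed = normalize!NFD(str);

    // iterating a string yields dchar code points; keep base characters only
    dchar[] kept = decomposed.filter!(c => combiningClass(c) == 0).array;

    // recompose (NFC), then encode back to UTF-8
    return normalize!NFC(kept).to!string;
}

void main()
{
    // precomposed e-acute decomposes to e + U+0301, and the mark is dropped
    assert(stripCombiningMarks("\u00E9") == "e");

    // Hebrew shin followed by the vowel point qamats (U+05B8)
    assert(stripCombiningMarks("\u05E9\u05B8") == "\u05E9");
}
```

Filtering on the combining class is a blunt tool: it also strips accents 
from Latin text, so a real converter would test against the set of marks 
it actually wants to remove.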

A more naive implementation not taking advantage of code points but 
instead using a replacement table would also work:

	string str = "...";
	string result;
	string[string] replacements = ["é":"e"]; // change this for what you want
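	// with the proposed design, each element is a string slice holding one grapheme
	foreach (grapheme; str) {
		// `in` yields a pointer to the value, or null if the key is absent
		auto replacement = grapheme in replacements;
		if (replacement)
			result ~= *replacement;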
		else
			result ~= grapheme;
	}
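
Since foreach over a string does not yield graphemes today, here is a 
sketch of the same loop written against std.uni.graphemeStride from 
later Phobos releases, which reports how many code units the grapheme at 
a given index occupies. The function name replaceGraphemes is 
hypothetical; the slicing is what the proposed element type would hand 
you directly:

```d
import std.uni : graphemeStride;

// Hypothetical helper name. Walks the string one grapheme at a time and
// substitutes any grapheme found in the replacement table.
string replaceGraphemes(string str, string[string] replacements)
{
    string result;
    for (size_t i = 0; i < str.length; )
    {
        // number of code units in the grapheme starting at index i
        immutable n = graphemeStride(str, i);
        string grapheme = str[i .. i + n];

        // `in` yields a pointer to the value, or null if the key is absent
        if (auto replacement = grapheme in replacements)
            result ~= *replacement;
        else
            result ~= grapheme;

        i += n;
    }
    return result;
}

void main()
{
    // precomposed e-acute is a single-code-point grapheme
    assert(replaceGraphemes("caf\u00E9", ["\u00E9": "e"]) == "cafe");

    // e + combining acute is one grapheme spanning three code units
    assert(replaceGraphemes("cafe\u0301", ["e\u0301": "e"]) == "cafe");
}
```

The table lookup is by exact code unit sequence, so the precomposed and 
decomposed spellings need separate entries unless the input is 
normalized first.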

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/