VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 07:59:52 PST 2011

Michel Fortin Wrote:

> On 2011-01-15 09:09:17 -0500, foobar <foo at bar.com> said:
> 
> > Lutger Blijdestijn Wrote:
> > 
> >> Michel Fortin wrote:
> >> 
> >>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> >>> <lutger.blijdestijn at gmail.com> said:
> >> ...
> >>>> 
> >>>> Is it still possible to solve this problem or are we stuck with
> >>>> specialized string algorithms? Would it work if VleRange of string was a
> >>>> bidirectional range with string slices of graphemes as the ElementType
> >>>> and indexing with code units? Often used string algorithms could be
> >>>> specialized for performance, but if not, generic algorithms would still
> >>>> work.
> >>> 
> >>> I have my idea.
> >>> 
> >>> I think it'd be a good idea to improve upon Andrei's first idea --
> >>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> >>> elements -- by changing the element type to be the same as the string.
> >>> For instance, iterating on a char[] would give you slices of char[],
> >>> each having one grapheme.
> >>> 
> >> ...
> >> 
> >> Yes, this is exactly what I meant, but you are much clearer. I hope this can
> >> be made to work!
> >> 
> > 
> > My two cents are against this kind of design.
> > The "correct" approach IMO is a 'universal text' type which is a 
> > _container_ of said text. This type would provide ranges for the 
> > various abstraction levels. E.g.
> > text.codeUnits to iterate by codeUnits
> 
> Nothing prevents that in the design I proposed. Andrei's design already 
> implements "str".byDchar() that would work for code points. I'd suggest 
> changing the API to by!char(), by!wchar(), and by!dchar() for when you 
> deal with whatever kind of code unit or code point you want. This would 
> be mostly symmetric to what you can already do with foreach:
> 
> 	foreach (char c; "hello") {}
> 	foreach (wchar c; "hello") {}
> 	foreach (dchar c; "hello") {}
> // same as:
> 	foreach (c; "hello".by!char()) {}
> 	foreach (c; "hello".by!wchar()) {}
> 	foreach (c; "hello".by!dchar()) {}
> 
> 
> > Here's a (perhaps contrived) example:
> > Let's say I want to find the combining marks in some text.
> > 
> > For instance, Hebrew uses combining marks for vowels (among other 
> > things) and they are optional in the language (There's a "full" form 
> > with vowels and a "missing" form without them).
> > I have a Hebrew text in the "full" form and I want to strip it and 
> > convert it to the "missing" form.
> > 
> > How would I accomplish this with your design?
> 
> All you need is a range that takes a string as input and gives you code 
> points in a decomposed form (NFD), then you use std.algorithm.filter on 
> it:
> 
> 	// original string
> 	auto str = "...";
> 
> 	// create normalized decomposed string as a lazy range of dchar (NFD)
> 	auto decomposed = decompose(str);
> 
> 	// filter to remove your favorite combining code point
> 	// (use the hex code you want)
> 	auto filtered = filter!"a != 0xFABA"(decomposed);
> 
> 	// turn it back in composed form (NFC), optional
> 	auto recomposed = compose(filtered);
> 
> 	// convert back to a string (could also be wstring or dstring)
> 	string result = array(recomposed.by!char());
> 
> This last line is the one doing everything. All the rest just chain 
> ranges together for doing on-the-fly decomposition, filtering, and 
> recomposition; the last line uses that chain of ranges to fill the array.
> 
> A more naive implementation not taking advantage of code points but 
> instead using a replacement table would also work:
> 
> 	string str = "...";
> 	string result;
> 	string[string] replacements = ["é":"e"]; // change this for what you want
> 	foreach (grapheme; str) {
> 		auto replacement = grapheme in replacements;
> 		if (replacement)
> 			result ~= *replacement; // `in` yields a pointer to the value
> 		else
> 			result ~= grapheme;
> 	}
> 	
> 
> -- 
> Michel Fortin
> michel.fortin at michelf.com
> http://michelf.com/
> 

Ok, I guess I missed the "byDchar()" method. 
I envisioned the same algorithm looking like this:

// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove your favorite combining code point
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);

The difference is that the string type is distinct from the intermediate code point ranges (this happens in your design too, albeit in a way that is less obvious to the user). There is string-specific code either way. Why not encapsulate it in a string type instead of forcing the user to use complex APIs with templates everywhere?
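
To make the comparison concrete, here is a minimal sketch of what such an encapsulating type could look like. Everything named here is an assumption for illustration: the type name `Text`, its members `codeUnits`, `codePoints`, `graphemes` and `fromCodePoints`, and the helper `stripCombiningMarks` are hypothetical and not part of any proposal in this thread. The sketch also assumes a Phobos that ships `std.uni.normalize`, `std.uni.combiningClass`, `std.uni.byGrapheme`, `std.utf.byCodeUnit` and `std.utf.byDchar`. Note that `normalize` is eager and returns a new string, unlike the lazy `decompose`/`compose` ranges discussed above, and that the filter drops every combining mark rather than one chosen code point.

```d
import std.algorithm : filter;
import std.array : array;
import std.uni : NFC, NFD, byGrapheme, combiningClass, normalize;
import std.utf : byCodeUnit, byDchar, toUTF8;

// Hypothetical container type; the name and members are illustrative only.
struct Text
{
    private string data; // stored internally as UTF-8

    this(string s) { data = s; }

    // Build a Text from any finite range of code points (dchar).
    static Text fromCodePoints(R)(R codePoints)
    {
        return Text(toUTF8(array(codePoints)));
    }

    // One range per abstraction level.
    auto codeUnits()  { return data.byCodeUnit; }  // range of char
    auto codePoints() { return data.byDchar; }     // range of dchar
    auto graphemes()  { return data.byGrapheme; }  // range of Grapheme

    string toString() { return data; }
}

// Decompose, drop every code point with a non-zero canonical combining
// class (i.e. every combining mark), then recompose.
Text stripCombiningMarks(Text text)
{
    // Eager NFD decomposition; returns a new string.
    auto decomposed = Text(normalize!NFD(text.toString()));

    // Lazy filter over the code point range.
    auto filtered = decomposed.codePoints().filter!(c => combiningClass(c) == 0);

    // Collect into a Text, then recompose to NFC (optional).
    return Text(normalize!NFC(Text.fromCodePoints(filtered).toString()));
}

unittest
{
    // U+00E9 decomposes to 'e' + U+0301; the mark is dropped.
    assert(stripCombiningMarks(Text("\u00E9")).toString() == "e");

    // Hebrew "shalom" with vowel points becomes the unpointed form.
    assert(stripCombiningMarks(Text("\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD")).toString()
        == "\u05E9\u05DC\u05D5\u05DD");
}
```

The user only names the abstraction level wanted (`codePoints()` here), and the conversion back to a stored string happens inside the type instead of through a separate `by!char()` step at the call site.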