VLERange: a range in between BidirectionalRange and RandomAccessRange
foobar
foo at bar.com
Sat Jan 15 07:59:52 PST 2011
Michel Fortin Wrote:
> On 2011-01-15 09:09:17 -0500, foobar <foo at bar.com> said:
>
> > Lutger Blijdestijn Wrote:
> >
> >> Michel Fortin wrote:
> >>
> >>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> >>> <lutger.blijdestijn at gmail.com> said:
> >> ...
> >>>>
> >>>> Is it still possible to solve this problem or are we stuck with
> >>>> specialized string algorithms? Would it work if VleRange of string was a
> >>>> bidirectional range with string slices of graphemes as the ElementType
> >>>> and indexing with code units? Often used string algorithms could be
> >>>> specialized for performance, but if not, generic algorithms would still
> >>>> work.
> >>>
> >>> I have my idea.
> >>>
> >>> I think it'd be a good idea to improve upon Andrei's first idea --
> >>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> >>> elements -- by changing the element type to be the same as the string.
> >>> For instance, iterating on a char[] would give you slices of char[],
> >>> each having one grapheme.
> >>>
> >> ...
> >>
> >> Yes, this is exactly what I meant, but you are much clearer. I hope this can
> >> be made to work!
> >>
> >
> > My two cents are against this kind of design.
> > The "correct" approach IMO is a 'universal text' type which is a
> > _container_ of said text. This type would provide ranges for the
> > various abstraction levels. E.g.
> > text.codeUnits to iterate by codeUnits
>
> Nothing prevents that in the design I proposed. Andrei's design already
> implements "str".byDchar() that would work for code points. I'd suggest
> changing the API to by!char(), by!wchar(), and by!dchar() for when you
> deal with whatever kind of code unit or code point you want. This would
> be mostly symmetric to what you can already do with foreach:
>
> foreach (char c; "hello") {}
> foreach (wchar c; "hello") {}
> foreach (dchar c; "hello") {}
> // same as:
> foreach (c; "hello".by!char()) {}
> foreach (c; "hello".by!wchar()) {}
> foreach (c; "hello".by!dchar()) {}
>
>
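For comparison, here is a minimal sketch of the same three views, assuming a recent Phobos in which std.utf provides byChar, byWchar and byDchar. The by!T() spelling above is a proposal in this thread, not an existing API, and countUnits is a hypothetical helper added only for illustration.

```d
import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar;

/// Hypothetical helper: counts the elements seen when iterating `s` as
/// UTF-8 code units, UTF-16 code units, and code points, in that order.
size_t[3] countUnits(string s)
{
    size_t[3] counts;
    counts[0] = s.byChar.walkLength;  // UTF-8 code units
    counts[1] = s.byWchar.walkLength; // UTF-16 code units
    counts[2] = s.byDchar.walkLength; // code points
    return counts;
}

void main()
{
    // "héllo": U+00E9 takes two UTF-8 code units, but only one UTF-16
    // code unit and one code point.
    auto n = countUnits("h\u00E9llo");
    assert(n[0] == 6);
    assert(n[1] == 5);
    assert(n[2] == 5);
}
```

The three ranges are lazy, so nothing is transcoded until the range is actually walked.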
> > Here's a (perhaps contrived) example:
> > Let's say I want to find the combining marks in some text.
> >
> > For instance, Hebrew uses combining marks for vowels (among other
> > things) and they are optional in the language (There's a "full" form
> > with vowels and a "missing" form without them).
> > I have a Hebrew text in the "full" form and I want to strip it and
> > convert it to the "missing" form.
> >
> > How would I accomplish this with your design?
>
> All you need is a range that takes a string as input and gives you code
> points in a decomposed form (NFD), then you use std.algorithm.filter on
> it:
>
> // original string
> auto str = "...";
>
> // create normalized decomposed string as a lazy range of dchar (NFD)
> auto decomposed = decompose(str);
>
> // filter to remove your favorite combining code point (use the hex
> // code you want)
> auto filtered = filter!"a != 0xFABA"(decomposed);
>
> // turn it back in composed form (NFC), optional
> auto recomposed = compose(filtered);
>
> // convert back to a string (could also be wstring or dstring)
> string result = array(recomposed.by!char());
>
> This last line is the one doing everything. All the rest just chain
> ranges together for doing on-the-fly decomposition, filtering, and
> recomposition; the last line uses that chain of ranges to fill the array.
>
> A more naive implementation not taking advantage of code points but
> instead using a replacement table would also work:
>
> string str = "...";
> string result;
> string[string] replacements = ["é":"e"]; // change this for what you want
> foreach (grapheme; str) {
>     auto replacement = grapheme in replacements;
>     if (replacement)
>         result ~= *replacement; // `in` yields a pointer to the value
>     else
>         result ~= grapheme;
> }
>
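The decompose/filter/recompose pipeline quoted above can be approximated with what std.uni offers in a recent Phobos. This is only a sketch under assumptions: stripCombiningMarks is a hypothetical name, normalization here is eager (std.uni.normalize returns a new array) rather than the lazy decompose()/compose() ranges proposed in the thread, and "combining mark" is taken to mean any code point with a non-zero canonical combining class. For Hebrew that covers the vowel points, but it also removes cantillation marks.

```d
import std.algorithm : filter;
import std.array : array;
import std.uni : NFC, NFD, combiningClass, normalize;
import std.utf : toUTF8;

/// Hypothetical helper: drops every code point whose canonical
/// combining class is non-zero.
string stripCombiningMarks(string text)
{
    // Eager NFD decomposition; the thread's decompose() would be lazy.
    string decomposed = normalize!NFD(text);

    // Range primitives on a string yield dchar, so the predicate
    // sees whole code points rather than UTF-8 code units.
    dchar[] kept = decomposed.filter!(c => combiningClass(c) == 0).array;

    // Re-encode as UTF-8 and recompose to NFC.
    return normalize!NFC(toUTF8(kept));
}

void main()
{
    // shin + qamats + shin dot, lamed, vav + holam, final mem
    assert(stripCombiningMarks("\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD")
        == "\u05E9\u05DC\u05D5\u05DD");

    // U+00E9 decomposes to 'e' + U+0301; the accent is then dropped.
    assert(stripCombiningMarks("\u00E9") == "e");
}
```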
>
> --
> Michel Fortin
> michel.fortin at michelf.com
> http://michelf.com/
>
Ok, I guess I missed the "byDchar()" method.
I envisioned the same algorithm looking like this:

// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove your favorite combining code point
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);

The difference is that the string type is distinct from the intermediate code point ranges. (This is also true in your design, though less obviously to the user.) Either way there is string-specific code. Why not encapsulate it in a string type instead of forcing the user through complex APIs with templates everywhere?
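
To make the container idea concrete, here is a rough sketch, assuming a recent Phobos with std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme. The Text type and its member names are hypothetical and exist only for illustration; nothing like it is part of Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical container: owns the text and hands out one range
/// per abstraction level.
struct Text
{
    private string data; // stored as UTF-8 (an assumption of this sketch)

    /// UTF-8 code units, no decoding.
    auto codeUnits() { return data.byCodeUnit; }

    /// Code points, decoded lazily.
    auto codePoints() { return data.byDchar; }

    /// Grapheme clusters (user-perceived characters).
    auto graphemes() { return data.byGrapheme; }
}

void main()
{
    // 'e' + U+0301 (combining acute accent), then 'x'
    auto t = Text("e\u0301x");

    assert(t.codeUnits.walkLength == 4);  // 1 + 2 + 1 bytes
    assert(t.codePoints.walkLength == 3); // e, U+0301, x
    assert(t.graphemes.walkLength == 2);  // "e + accent" and "x"
}
```

The caller picks the abstraction level by name and never touches the encoding, which is the encapsulation I am arguing for.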