VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Sat Jan 15 07:24:42 PST 2011
On 2011-01-15 09:09:17 -0500, foobar <foo at bar.com> said:
> Lutger Blijdestijn Wrote:
>
>> Michel Fortin wrote:
>>
>>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
>>> <lutger.blijdestijn at gmail.com> said:
>> ...
>>>>
>>>> Is it still possible to solve this problem or are we stuck with
>>>> specialized string algorithms? Would it work if VleRange of string was a
>>>> bidirectional range with string slices of graphemes as the ElementType
>>>> and indexing with code units? Often used string algorithms could be
>>>> specialized for performance, but if not, generic algorithms would still
>>>> work.
>>>
>>> I have my idea.
>>>
>>> I think it'd be a good idea to improve upon Andrei's first idea --
>>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
>>> elements -- by changing the element type to be the same as the string.
>>> For instance, iterating on a char[] would give you slices of char[],
>>> each having one grapheme.
>>>
>> ...
>>
>> Yes, this is exactly what I meant, but you are much clearer. I hope this can
>> be made to work!
>>
>
> My two cents are against this kind of design.
> The "correct" approach IMO is a 'universal text' type which is a
> _container_ of said text. This type would provide ranges for the
> various abstraction levels. E.g.
> text.codeUnits to iterate by codeUnits
Nothing prevents that in the design I proposed. Andrei's design already
implements "str".byDchar() that would work for code points. I'd suggest
changing the API to by!char(), by!wchar(), and by!dchar() for when you
deal with whatever kind of code unit or code point you want. This would
be mostly symmetric to what you can already do with foreach:
foreach (char c; "hello") {}
foreach (wchar c; "hello") {}
foreach (dchar c; "hello") {}
// same as:
foreach (c; "hello".by!char()) {}
foreach (c; "hello".by!wchar()) {}
foreach (c; "hello".by!dchar()) {}
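To make the symmetry concrete, here is a minimal sketch. The by!T() API above is only a proposal; as a stand-in, the sketch assumes std.utf.byUTF from present-day Phobos, which re-encodes a string lazily to the requested character type. The helper name countAs is hypothetical, made up for illustration.

```d
import std.range : walkLength;
import std.utf : byUTF;

// Hypothetical helper: number of elements seen when `s` is iterated
// as code units (char, wchar) or code points (dchar) of type C.
// byUTF!C stands in for the proposed by!C().
size_t countAs(C)(string s)
{
    return s.byUTF!C.walkLength;
}

unittest
{
    string s = "h\u00E9llo";          // precomposed e-acute, 2 bytes in UTF-8
    assert(countAs!char(s) == 6);     // UTF-8 code units
    assert(countAs!wchar(s) == 5);    // UTF-16 code units
    assert(countAs!dchar(s) == 5);    // code points
}
```

The same string gives a different element count depending on the level you ask for, which is why the level should be explicit in the call.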
> Here's a (perhaps contrived) example:
> Let's say I want to find the combining marks in some text.
>
> For instance, Hebrew uses combining marks for vowels (among other
> things) and they are optional in the language (There's a "full" form
> with vowels and a "missing" form without them).
> I have a Hebrew text in the "full" form and I want to strip it and
> convert it to the "missing" form.
>
> How would I accomplish this with your design?
All you need is a range that takes a string as input and gives you code
points in decomposed form (NFD); then you use std.algorithm.filter on
it:
// original string
auto str = "...";
// create normalized decomposed string as a lazy range of dchar (NFD)
auto decomposed = decompose(str);
// filter to remove your favorite combining code point (use the hex code you want)
auto filtered = filter!"a != 0xFABA"(decomposed);
// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);
// convert back to a string (could also be wstring or dstring)
string result = array(recomposed.by!char());
This last line is the one doing everything. All the rest just chain
ranges together for doing on-the-fly decomposition, filtering, and
recomposition; the last line uses that chain of ranges to fill the array.
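For comparison, here is a rough sketch of the same pipeline. The decompose(), compose() and by!char() calls above are hypothetical; the sketch assumes std.uni.normalize and std.uni.isMark from present-day Phobos instead. Note two differences: normalize works eagerly on arrays rather than as a lazy range, and the filter drops every combining mark instead of one chosen code point. The function name stripMarks is made up for illustration.

```d
import std.algorithm : filter;
import std.array : array;
import std.conv : to;
import std.uni : isMark, normalize, NFC, NFD;

// Hypothetical helper: removes all combining marks from `input`.
string stripMarks(string input)
{
    // decompose so that marks become separate code points (NFD)
    auto decomposed = normalize!NFD(input);
    // iterating a string in filter yields dchar code points
    dchar[] filtered = decomposed.filter!(c => !isMark(c)).array;
    // recompose (NFC) and convert back to UTF-8
    return normalize!NFC(filtered).to!string;
}

unittest
{
    // Hebrew "shalom" with vowel points, then without
    assert(stripMarks("\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD")
        == "\u05E9\u05DC\u05D5\u05DD");
}
```

Because the text is decomposed first, a precomposed character such as U+00E9 also loses its accent, so the filter would need narrowing if only Hebrew points should go.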
A more naive implementation not taking advantage of code points but
instead using a replacement table would also work:
string str = "...";
string result;
string[string] replacements = ["é":"e"]; // change this for what you want
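// assumes the proposed design, where iterating a string yields grapheme slices
foreach (grapheme; str) {
    // `in` returns a pointer to the value, or null if the key is absent
    auto replacement = grapheme in replacements;
    if (replacement)
        result ~= *replacement;
    else
        result ~= grapheme;
}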
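Since a plain foreach over a string does not yield grapheme slices today, here is a sketch of how the loop above could be approximated. It assumes std.uni.graphemeStride from present-day Phobos to find where each grapheme ends; the function name replaceGraphemes is hypothetical.

```d
import std.uni : graphemeStride;

// Hypothetical helper: walks `input` one grapheme at a time and
// substitutes every grapheme that has an entry in `table`.
string replaceGraphemes(string input, string[string] table)
{
    string result;
    size_t i = 0;
    while (i < input.length)
    {
        // length in code units of the grapheme starting at index i
        immutable n = graphemeStride(input, i);
        auto grapheme = input[i .. i + n];   // a slice, no copy
        if (auto replacement = grapheme in table)
            result ~= *replacement;
        else
            result ~= grapheme;
        i += n;
    }
    return result;
}

unittest
{
    assert(replaceGraphemes("caf\u00E9", ["\u00E9": "e"]) == "cafe");
}
```

The table is matched on exact code unit sequences, so precomposed and decomposed spellings of the same grapheme need separate entries unless the input is normalized first.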
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/