VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer
schveiguy at yahoo.com
Sat Jan 15 08:59:04 PST 2011
On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin
<michel.fortin at michelf.com> wrote:
> On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"
> <schveiguy at yahoo.com> said:
>
>> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir at gmail.com> wrote:
>>
>>> The point is not playing like that with Unicode flexibility. Rather
>>> that composite characters are just normal thingies in most languages
>>> of the world. Actually, on this point, English is a rare exception
>>> (discarding letters imported from foreign languages like French 'à');
>>> to the point of being, I guess, the only western language without
>>> any diacritic.
>> Is it common to have multiple modifiers on a single character?
>
> Not to my knowledge. But I rarely deal with non-Latin texts; there are
> probably some scripts out there that take advantage of this.
>
>
>> The problem I see with using decomposed canonical form for strings is
>> that we would have to return a dchar[] for each 'element', which
>> severely complicates code that, for instance, only expects to handle
>> English.
>
> Actually, returning a sliced char[] or wchar[] could also be valid.
> User-perceived characters are basically a substring of one or more code
> points. I'm not sure it complicates the semantics of the language that
> much -- what's complicated about writing str.front == "a" instead of
> str.front == 'a'? -- although it probably would complicate the generated
> code and make it a little slower.
Hm... this pushes the normalization outside the type, and into the
algorithms (such as find). I was hoping to avoid that. I think I can
come up with an algorithm that normalizes into canonical form as it
iterates. It just might return part of a grapheme if the grapheme cannot
be composed.
I do think that we could make a byGrapheme member to aid in this:
// grapheme is a substring that contains one composed grapheme.
foreach(grapheme; s.byGrapheme)
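
To make the "substring per grapheme" idea concrete, here is a minimal sketch. It assumes graphemeStride from present-day Phobos std.uni (nothing like it existed in Phobos when this was written), and graphemeSlices is a hypothetical helper name, not a proposed API. It does no normalization; it only shows that each element can be a slice of the original string.

```d
import std.uni : graphemeStride;

/// Hypothetical helper: splits s into grapheme clusters, each returned
/// as a slice of the original string (no copying, no normalization).
string[] graphemeSlices(string s)
{
    string[] result;
    size_t i = 0;
    while (i < s.length)
    {
        // Number of code units occupied by the grapheme starting at i.
        immutable n = graphemeStride(s, i);
        result ~= s[i .. i + n];
        i += n;
    }
    return result;
}

void main()
{
    // "exposé" spelled with 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    auto parts = graphemeSlices("expose\u0301");
    assert(parts.length == 6);
    assert(parts[5] == "e\u0301"); // one grapheme, two code points
}
```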
>
> In the case of NSString in Cocoa, you can only access the 'characters'
> in their UTF-16 form. But everything from comparison to search for
> substring is done using graphemes. It's like they implemented
> specialized Unicode-aware algorithms for these functions. There's no
> genericness about how it handles graphemes.
>
> I'm not sure yet about what would be the right approach for D.
I hope we can use generic versions, so the type itself handles the
conversions. That makes any algorithm using the string range correct.
>> I was hoping to lazily transform a string into its composed canonical
>> form, allowing the (hopefully rare) exception when a composed character
>> does not exist. My thinking was that this at least gives a useful
>> string representation for 90% of usages, leaving the remaining 10% of
>> usages to find a more complex representation (like your Text type).
>> If we only get like 20% or 30% there by making dchar the element type,
>> then we haven't made it useful enough.
>> Either way, we need a string type that can be compared canonically
>> for things like searches or opEquals.
>
> I wonder if normalized string comparison shouldn't be built directly in
> the char[] wchar[] and dchar[] types instead.
No, in my vision of how strings should be typed, char[] is an array, not a
string. It should be treated like an array of code-units, where two forms
that create the same grapheme are considered different.
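
A small sketch of the distinction, assuming normalize and NFC from present-day Phobos std.uni; canonicallyEqual is a hypothetical name used only for illustration. Plain array comparison sees two different sequences of code units, while a comparison done after normalization treats them as the same text.

```d
import std.uni : normalize, NFC;

/// Hypothetical helper: compares two strings after bringing both
/// into composed canonical form (NFC).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "\u00E9";    // precomposed 'é'
    string decomposed = "e\u0301"; // 'e' + combining acute accent

    // As arrays of code units, the two forms are different.
    assert(composed != decomposed);

    // Compared canonically, they are the same grapheme.
    assert(canonicallyEqual(composed, decomposed));
}
```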
> Also bring the idea above that iterating on a string would yield
> graphemes as char[] and this code would work perfectly irrespective of
> whether you used combining characters:
>
> foreach (grapheme; "exposé") {
>     if (grapheme == "é")
>         break;
> }
>
> I think a good standard to evaluate our handling of Unicode is to see
> how easy it is to do things the right way. In the above, foreach would
> slice the string grapheme by grapheme, and the == operator would perform
> a normalized comparison. While it works correctly, it's probably not the
> most efficient way to do things, however.
I think this is a good alternative, but I'd rather not impose this on
people like myself who deal mostly with English. I think this should be
possible to do with wrapper types or intermediate ranges which have
graphemes as elements (per my suggestion above).
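
Roughly what I have in mind for such a wrapper, as a sketch only. ByNormalizedGrapheme and containsGrapheme are hypothetical names, and the sketch leans on graphemeStride and normalize from present-day Phobos std.uni. The wrapper is a plain input range whose elements are graphemes in composed form, so the loop that uses it needs no Unicode knowledge of its own.

```d
import std.uni : graphemeStride, normalize, NFC;

/// Hypothetical wrapper: an input range over a string whose elements
/// are graphemes, each returned as an NFC-normalized string.
struct ByNormalizedGrapheme
{
    private string rest;

    @property bool empty() { return rest.length == 0; }

    @property string front()
    {
        // Slice out the first grapheme, then compose it if possible.
        return normalize!NFC(rest[0 .. graphemeStride(rest, 0)]);
    }

    void popFront()
    {
        rest = rest[graphemeStride(rest, 0) .. $];
    }
}

/// Hypothetical helper: true if s contains the grapheme target,
/// whichever canonical form either string uses.
bool containsGrapheme(string s, string target)
{
    immutable wanted = normalize!NFC(target);
    foreach (grapheme; ByNormalizedGrapheme(s))
    {
        if (grapheme == wanted)
            return true;
    }
    return false;
}

void main()
{
    // Decomposed haystack, precomposed needle, and the reverse.
    assert(containsGrapheme("expose\u0301", "\u00E9"));
    assert(containsGrapheme("expos\u00E9", "e\u0301"));
    // A bare 'e' is a different grapheme from 'é'.
    assert(!containsGrapheme("pos\u00E9", "e"));
}
```

Code that only cares about English never touches the wrapper and keeps working on char[] directly.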
Does this sound reasonable?
-Steve