VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Sat Jan 15 10:32:10 PST 2011
On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer"
<schveiguy at yahoo.com> said:
> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin
> <michel.fortin at michelf.com> wrote:
>
>> Actually, returning a sliced char[] or wchar[] could also be valid.
>> User-perceived characters are basically a substring of one or more code
>> points. I'm not sure it complicates that much the semantics of the
>> language -- what's complicated about writing str.front == "a" instead
>> of str.front == 'a'? -- although it probably would complicate the
>> generated code and make it a little slower.
>
> Hm... this pushes the normalization outside the type, and into the
> algorithms (such as find).
>
> I was hoping to avoid that.
Not really. It pushes the normalization to the string comparison
operator, as explained later.
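To make that concrete, here is a minimal sketch of a comparison that normalizes both operands before comparing them. It leans on `normalize` and `NFC` from `std.uni` in present-day Phobos, which is an assumption on my part since they are newer than this message. `sameText` is a hypothetical helper name, not an existing operator or library function.

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compares two strings by canonical equivalence
// instead of code unit by code unit.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "expos\u00E9";  // é as the single code point U+00E9
    string combining   = "expose\u0301"; // e followed by U+0301 (combining acute)

    // Plain array comparison sees two different code unit sequences.
    assert(precomposed != combining);

    // Normalized comparison treats them as the same text.
    assert(sameText(precomposed, combining));
}
```

Built into `==` for char[], wchar[] and dchar[], this is roughly what the comparison would do behind the scenes; a real implementation would presumably avoid allocating two normalized copies.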
> I think I can come up with an algorithm that normalizes into canonical
> form as it iterates. It just might return part of a grapheme if the
> grapheme cannot be composed.
The problem with normalizing while iterating is that you lose
information about which code points actually make up the grapheme. If
you wanted to count the number of graphemes containing a particular
code point, you've lost that information.
Moreover, if all you want is to count the number of graphemes,
normalizing each character is a waste of time.
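Both points can be illustrated with a short sketch. It assumes `byGrapheme`, `Grapheme` and `normalize` from `std.uni` as found in present-day Phobos (they are newer than this message); `countGraphemes` and `graphemesContaining` are hypothetical helper names used only for illustration.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: counts graphemes by segmentation alone,
// with no normalization involved.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

// Hypothetical helper: counts the graphemes that contain code point cp.
size_t graphemesContaining(string s, dchar cp)
{
    size_t hits;
    foreach (g; s.byGrapheme)
    {
        foreach (i; 0 .. g.length)
        {
            if (g[i] == cp)
            {
                ++hits;
                break; // count each grapheme at most once
            }
        }
    }
    return hits;
}

void main()
{
    dchar acute = 0x0301;          // combining acute accent
    string text = "expose\u0301";  // final é written as e + U+0301

    // Counting graphemes works on the raw text.
    assert(countGraphemes(text) == 6);

    // The combining accent is visible in the original text...
    assert(graphemesContaining(text, acute) == 1);

    // ...but composing to NFC folds it into U+00E9, so it can no longer be found.
    assert(graphemesContaining(normalize!NFC(text), acute) == 0);
}
```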
I suggested in another post that we implement ranges that decompose
and recompose a string into its normalized form on the fly. That's
basically the same thing you suggest, but it'd have to be explicit
to avoid the problem above.
>> I wonder if normalized string comparison shouldn't be built directly in
>> the char[] wchar[] and dchar[] types instead.
>
> No, in my vision of how strings should be typed, char[] is an array,
> not a string. It should be treated like an array of code-units, where
> two forms that create the same grapheme are considered different.
Well, I agree there's a need for that sometimes. But if what you want is
just a dumb array of code units, why not use ubyte[], ushort[] and
uint[] instead?
It seems to me that the whole point of having a different type for
char[], wchar[], and dchar[] is that you know they are Unicode strings
and can treat them as such. And if you treat them as Unicode strings,
then perhaps the runtime and the compiler should too, for consistency's
sake.
>> Also bring the idea above that iterating on a string would yield
>> graphemes as char[] and this code would work perfectly irrespective of
>> whether you used combining characters:
>>
>> foreach (grapheme; "exposé") {
>> if (grapheme == "é")
>> break;
>> }
>>
>> I think a good standard to evaluate our handling of Unicode is to see
>> how easy it is to do things the right way. In the above, foreach would
>> slice the string grapheme by grapheme, and the == operator would
>> perform a normalized comparison. While it works correctly, it's
>> probably not the most efficient way to do thing however.
>
> I think this is a good alternative, but I'd rather not impose this on
> people like myself who deal mostly with English.
I'm not suggesting we impose it, just that we make it the default. If
you want to iterate by dchar, wchar, or char, just write:
    foreach (dchar c; "exposé") {}
    foreach (wchar c; "exposé") {}
    foreach (char c; "exposé") {}
    // or
    foreach (dchar c; "exposé".by!dchar()) {}
    foreach (wchar c; "exposé".by!wchar()) {}
    foreach (char c; "exposé".by!char()) {}
and it'll work. But the default would be a slice containing the
grapheme, because this is the right way to represent a Unicode
character.
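As a rough sketch of what "a slice containing the grapheme" could look like in library code, the following walks a string with `graphemeStride` from `std.uni` in present-day Phobos (an assumption, since it is newer than this message) and collects one slice per grapheme. `graphemeSlices` is a hypothetical helper name; the proposal above would have foreach do this by default instead.

```d
import std.uni : graphemeStride, normalize, NFC;

// Hypothetical helper: returns slices of s, one per grapheme.
// No characters are copied; each element points into the original string.
string[] graphemeSlices(string s)
{
    string[] slices;
    for (size_t i = 0; i < s.length; )
    {
        // Length, in code units, of the grapheme starting at index i.
        immutable n = graphemeStride(s, i);
        slices ~= s[i .. i + n];
        i += n;
    }
    return slices;
}

void main()
{
    auto parts = graphemeSlices("expose\u0301");

    assert(parts.length == 6);

    // The last slice keeps both code points of the decomposed é.
    assert(parts[5] == "e\u0301");

    // A normalized comparison still matches the precomposed form.
    assert(normalize!NFC(parts[5]) == "\u00E9");
}
```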
> I think this should be possible to do with wrapper types or
> intermediate ranges which have graphemes as elements (per my
> suggestion above).
I think it should be the reverse. If you want your code to break when
it encounters multi-code-point graphemes then it's your choice, but you
should have to make your choice explicit. The default should be to
handle strings correctly.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/