VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer
schveiguy at yahoo.com
Sat Jan 15 12:20:08 PST 2011
On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin
<michel.fortin at michelf.com> wrote:
> On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer"
> <schveiguy at yahoo.com> said:
>
>> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin
>> <michel.fortin at michelf.com> wrote:
>>
>>> Actually, returning a sliced char[] or wchar[] could also be valid.
>>> User-perceived characters are basically a substring of one or more
>>> code points. I'm not sure it complicates the semantics of the
>>> language that much -- what's complicated about writing str.front == "a"
>>> instead of str.front == 'a'? -- although it probably would complicate
>>> the generated code and make it a little slower.
>> Hm... this pushes the normalization outside the type, and into the
>> algorithms (such as find).
>> I was hoping to avoid that.
>
> Not really. It pushes the normalization to the string comparison
> operator, as explained later.
>
>
>> I think I can come up with an algorithm that normalizes into canonical
>> form as it iterates. It just might return part of a grapheme if the
>> grapheme cannot be composed.
>
> The problem with normalization while iterating is that you lose
> information about which code points actually make up the grapheme. If
> you wanted to count the number of graphemes containing a particular
> code point, you've lost that information.
Are these common requirements? I thought users mostly care about
graphemes, not code points. Asking in the dark here, since I have next to
zero experience with Unicode strings.
>
> Moreover, if all you want is to count the number of graphemes,
> normalizing the characters is a waste of time.
This is true. I can see this being a common need.
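In fact, counting graphemes can be done without normalizing (or even
allocating) anything. A sketch of what I mean -- graphemeStride here is
the boundary helper found in today's std.uni, standing in for whatever
boundary test Phobos ends up providing:

    import std.uni : graphemeStride;

    // Counts user-perceived characters without allocating or
    // normalizing anything; graphemeStride only finds each boundary.
    size_t countGraphemes(const(char)[] s)
    {
        size_t n = 0;
        for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
            ++n;
        return n;
    }

    unittest
    {
        // "exposé" with a combining acute: 6 graphemes, 7 code points.
        assert(countGraphemes("expose\u0301") == 6);
    }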
>
> I suggested in another post that we implement ranges for decomposing
> and recomposing a string on the fly into its normalized form. That's
> basically the same thing you suggest, but it'd have to be explicit to
> avoid the problem above.
OK, I see your point.
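To check my understanding, an explicit on-the-fly decomposing range
might look roughly like this. All names are hypothetical:
decomposeCanonical stands in for a real Unicode database lookup, and
real NFD would also decompose recursively and reorder combining marks
by combining class, both of which this sketch omits:

    import std.range;

    // Hypothetical stand-in for a Unicode database lookup; it only
    // knows the decomposition of U+00E9 ('é') for this example.
    dstring decomposeCanonical(dchar c)
    {
        return c == '\u00E9' ? "e\u0301"d : [c].idup;
    }

    // Lazily yields the canonical decomposition of each code point,
    // leaving the underlying string untouched -- callers opt in
    // explicitly, so no information about the original is lost.
    struct ByNFD(R) if (isInputRange!R && is(ElementType!R : dchar))
    {
        private R src;
        private dstring pending;

        this(R r)
        {
            src = r;
            if (!src.empty)
                pending = decomposeCanonical(src.front);
        }

        @property bool empty() const { return pending.length == 0; }
        @property dchar front() const { return pending[0]; }

        void popFront()
        {
            pending = pending[1 .. $];
            if (pending.length == 0)
            {
                src.popFront();
                if (!src.empty)
                    pending = decomposeCanonical(src.front);
            }
        }
    }

    auto byNFD(R)(R r) { return ByNFD!R(r); }

    unittest
    {
        import std.algorithm : equal;
        // 'é' splits into 'e' + U+0301 while iterating.
        assert(byNFD("caf\u00E9"d).equal("cafe\u0301"d));
    }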
>
>
>>> I wonder if normalized string comparison shouldn't be built directly
>>> in the char[], wchar[], and dchar[] types instead.
>> No, in my vision of how strings should be typed, char[] is an array,
>> not a string. It should be treated like an array of code-units, where
>> two forms that create the same grapheme are considered different.
>
> Well, I agree there's a need for that sometime. But if what you want is
> just a dumb array of code units, why not use ubyte[], ushort[] and
> uint[] instead?
Because ubyte[], ushort[], and uint[] do not say that their data is
Unicode text. The point is, if I want to write a function that takes
UTF-8, a ubyte[] parameter opens it up to any data, not just UTF-8 data.
But if we have a method of iterating code units as you specify below,
then I think we are OK.
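For example (function names invented for illustration), the parameter
type alone tells you what the function may assume about its input:

    // Accepts anything; the signature promises nothing about encoding.
    size_t countBytes(const(ubyte)[] raw) { return raw.length; }

    // Documents that the argument is UTF-8 text, not arbitrary bytes.
    size_t countCodeUnits(const(char)[] utf8) { return utf8.length; }

    unittest
    {
        ubyte[] blob = [0xFF, 0xFE, 0x00];     // not valid UTF-8
        assert(countBytes(blob) == 3);
        // countCodeUnits(blob);               // error: ubyte[] is not char[]
        assert(countCodeUnits("exposé") == 7); // 'é' is two UTF-8 code units
    }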
> It seems to me that the whole point of having a different type for
> char[], wchar[], and dchar[] is that you know they are Unicode strings
> and can treat them as such. And if you treat them as Unicode strings,
> then perhaps the runtime and the compiler should too, for consistency's
> sake.
I'd agree with you, but then there's that pesky [] after it indicating
it's an array. For consistency's sake, I'd say the compiler should treat
T[] as an array of T's.
>>> Also bring the idea above that iterating on a string would yield
>>> graphemes as char[] and this code would work perfectly irrespective
>>> of whether you used combining characters:
>>>     foreach (grapheme; "exposé") {
>>>         if (grapheme == "é")
>>>             break;
>>>     }
>>> I think a good standard to evaluate our handling of Unicode is to
>>> see how easy it is to do things the right way. In the above, foreach
>>> would slice the string grapheme by grapheme, and the == operator
>>> would perform a normalized comparison. While it works correctly, it's
>>> probably not the most efficient way to do things, however.
>> I think this is a good alternative, but I'd rather not impose this on
>> people like myself who deal mostly with English.
>
> I'm not suggesting we impose it, just that we make it the default. If
> you want to iterate by dchar, wchar, or char, just write:
>
> foreach (dchar c; "exposé") {}
> foreach (wchar c; "exposé") {}
> foreach (char c; "exposé") {}
> // or
> foreach (dchar c; "exposé".by!dchar()) {}
> foreach (wchar c; "exposé".by!wchar()) {}
> foreach (char c; "exposé".by!char()) {}
>
> and it'll work. But the default would be a slice containing the
> grapheme, because this is the right way to represent a Unicode character.
I think this is a good idea. I was previously nervous about it, but I'm
not sure it makes a huge difference. Returning a char[] is certainly less
work than normalizing a grapheme into one or more code points and then
returning them. All it takes is to detect all the code points within the
grapheme. Normalization can be done if needed, but it would probably have
to output another char[], since a normalized grapheme can occupy more
than one dchar.
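In code, the front I'm picturing is just a slice. A sketch, again
borrowing graphemeStride from today's std.uni as a stand-in for the
real boundary test, with no normalization unless the caller asks:

    import std.uni : graphemeStride;

    struct ByGraphemeSlice
    {
        private const(char)[] s;

        @property bool empty() const { return s.length == 0; }

        // front is a slice covering one whole grapheme: one or more
        // code points, returned without normalizing anything.
        @property const(char)[] front() const
        {
            return s[0 .. graphemeStride(s, 0)];
        }

        void popFront() { s = s[graphemeStride(s, 0) .. $]; }
    }

    unittest
    {
        auto r = ByGraphemeSlice("e\u0301tat"); // "état", decomposed
        assert(r.front == "e\u0301");  // one grapheme, two code points
        r.popFront();
        assert(r.front == "t");
    }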
What if I modified my proposed string_t type to return T[] as its element
type, as you say, and string literals were typed as string_t!(whatever)?
In addition, the restrictions I imposed on slicing through a code point
would instead be imposed on slicing through a grapheme. That is, it would
be illegal to substring a string_t in a way that slices through a
grapheme (and, by deduction, through a code point).
Actually, we would need a grapheme to be its own type, because comparing
two char[]'s that don't contain equivalent bits and having them compare
equal violates the expectation that char[] is an array.
So the string_t!char would return a grapheme_t!char (names to be
discussed) as its element type.
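Roughly the following shape -- every name is hypothetical, and I'm
leaning on today's std.uni (normalize, NFD, graphemeStride) purely for
illustration:

    import std.uni;   // graphemeStride, normalize, NFD (today's Phobos)

    // grapheme_t wraps a slice of code units and defines normalized
    // equality, so the "same bits" expectation of plain char[] arrays
    // is never violated.
    struct grapheme_t(T)
    {
        const(T)[] codeUnits;   // one grapheme: one or more code points

        bool opEquals(const grapheme_t rhs) const
        {
            // "\u00E9" and "e\u0301" compare equal here.
            return normalize!NFD(codeUnits) == normalize!NFD(rhs.codeUnits);
        }
    }

    // string_t!T is a range of grapheme_t!T.  Slicing through the
    // middle of a grapheme would be rejected (enforcement elided).
    struct string_t(T)
    {
        const(T)[] data;

        @property bool empty() const { return data.length == 0; }

        @property grapheme_t!T front() const
        {
            return grapheme_t!T(data[0 .. graphemeStride(data, 0)]);
        }

        void popFront() { data = data[graphemeStride(data, 0) .. $]; }
    }

    unittest
    {
        auto s = string_t!char("expose\u0301"); // 'é' as e + combining acute
        bool found = false;
        foreach (g; s)
            if (g == grapheme_t!char("\u00E9")) // precomposed 'é'
            {
                found = true;
                break;
            }
        assert(found); // matched the decomposed form via normalization
    }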
>
>
>> I think this should be possible to do with wrapper types or
>> intermediate ranges which have graphemes as elements (per my
>> suggestion above).
>
> I think it should be the reverse. If you want your code to break when it
> encounters multi-code-point graphemes then it's your choice, but you
> should have to make your choice explicit. The default should be to
> handle strings correctly.
You are probably right.
-Steve