VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer
schveiguy at yahoo.com
Sat Jan 15 12:35:21 PST 2011
On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin
<michel.fortin at michelf.com> wrote:
> On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer"
> <schveiguy at yahoo.com> said:
>
>> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn
>> <lutger.blijdestijn at gmail.com> wrote:
>>
>>> Steven Schveighoffer wrote:
>>> ...
>>>>> I think a good standard to evaluate our handling of Unicode is to see
>>>>> how easy it is to do things the right way. In the above, foreach would
>>>>> slice the string grapheme by grapheme, and the == operator would
>>>>> perform a normalized comparison. While it works correctly, it's
>>>>> probably not the most efficient way to do things, however.
>>>> I think this is a good alternative, but I'd rather not impose this on
>>>> people like myself who deal mostly with English. I think this should be
>>>> possible to do with wrapper types or intermediate ranges which have
>>>> graphemes as elements (per my suggestion above).
>>>> Does this sound reasonable?
>>>> -Steve
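
As a concrete illustration of the code point vs. grapheme distinction being
discussed above, here is a minimal sketch using byGrapheme, the adapter
std.uni eventually gained (it postdates this thread, so take it purely as
an illustration):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l";               // "noël" with the diaeresis as a combining mark
    assert(s.length == 6);                 // six UTF-8 code units
    size_t codePoints;
    foreach (dchar c; s)                   // today's foreach decodes code points
        ++codePoints;
    assert(codePoints == 5);               // the combining mark is its own code point
    assert(s.byGrapheme.walkLength == 4);  // a grapheme-level range sees n, o, ë, l
}

A wrapper whose elements are graphemes, as suggested above, would make the
same loop see four elements instead of five.
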
>>> If it's a matter of choosing which is the 'default' range, I'd think
>>> proper unicode handling is more reasonable than catering for English /
>>> ASCII only. Especially since this is already the case in phobos string
>>> algorithms.
>> English and (if I understand correctly) most other languages. Any
>> language which can be built from composable graphemes would work. And in
>> fact, ones that use some graphemes that cannot be composed will also
>> work to some degree (for example, opEquals).
>> What I'm proposing (or think I'm proposing) is not exactly catering to
>> English and ASCII; what I'm proposing is simply not catering to more
>> complex languages such as Hebrew and Arabic. What I'm trying to find is
>> a middle ground where most languages work, and the code is simple and
>> efficient, with possibilities to jump down to lower levels for
>> performance (i.e. switch to char[] when you know ASCII is all you are
>> using) or jump up to full unicode when necessary.
>
> Why don't we build a compiler with an optimizer that generates correct
> code *almost* all of the time? If you are worried about it not producing
> correct code for a given function, you can just add
> "pragma(correct_code)" in front of that function to disable the risky
> optimizations. No harm done, right?
>
> One thing I see very often, often on US web sites but also elsewhere, is
> that if you enter a name with an accented letter in a form (say Émilie),
> very often the accented letter gets changed to another semi-random
> character later in the process. Why? Because somewhere in the process
> lies an encoding mismatch that no one thought about and no one tested
> for. At the very least, the form should have rejected those unexpected
> characters and shown an error when it could.
>
> Now, with proper Unicode handling up to the code point level, this kind
> of problem probably won't happen as often because the whole stack works
> with UTF encodings. But are you going to validate all of your inputs to
> make sure they have no combining code point?
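
To give a sense of what that validation would involve, here is a rough
sketch built on std.uni's combiningClass (the canonical combining class
lookup in present-day Phobos); hasCombiningMarks is a made-up helper name:

import std.uni : combiningClass;

// Illustrative only: true if the string contains any code point with a
// non-zero canonical combining class.
bool hasCombiningMarks(string s)
{
    foreach (dchar c; s)
        if (combiningClass(c) != 0)
            return true;
    return false;
}

void main()
{
    assert(!hasCombiningMarks("Emilie"));
    assert(hasCombiningMarks("E\u0301milie"));  // 'E' + combining acute accent
}
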
>
> Don't assume that because you're in the United States no one will try to
> enter characters where you don't expect them. People love to play with
> Unicode symbols for fun, putting them in their name, signature, or even
> domain names (✪df.ws). Just wait until they discover they can combine
> them. ☺̰̎! There is also a variety of combining mathematical symbols
> with no pre-combined form, such as ≸. Writing in Arabic, Hebrew,
> Korean, or some other foreign language isn't a prerequisite for using
> combining characters.
>
>
>> Essentially, we would have three levels of types:
>> char[], wchar[], dchar[] -- Considered to be arrays in every way.
>> string_t!T (string, wstring, dstring) -- Specialized string types that
>> do normalization to dchars, but do not handle all graphemes perfectly.
>> Works with any algorithm that deals with bidirectional ranges. This is
>> the default string type, and the type for string literals. Represented
>> internally by a single char[], wchar[] or dchar[] array.
>> * utfstring_t!T -- specialized string to deal with full unicode, which
>> may perform worse than string_t, but supports everything unicode
>> supports. May require a battery of specialized algorithms.
>> * - name up for discussion
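
To make the shape of the three-level scheme above concrete, here is a very
rough sketch -- every name is hypothetical, nothing below is actual Phobos,
and normalization and grapheme handling are deliberately left out:

import std.utf : decode;

// Hypothetical middle layer: a thin view over a single char[]/wchar[]/
// dchar[] array that yields dchars (code points) as its elements.
struct string_t(C)
    if (is(C == char) || is(C == wchar) || is(C == dchar))
{
    C[] data;                       // the single underlying array

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);     // decode one code point; no normalization
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);
        data = data[i .. $];        // drop the code units just consumed
    }
}

void main()
{
    auto s = string_t!char("héllo".dup);
    size_t n;
    foreach (dchar c; s)            // iterates code points via the wrapper
        ++n;
    assert(n == 5);
}

The utfstring_t layer described above would sit on top of something like
this, adding the grapheme-aware and normalization-aware behaviour discussed
earlier in the thread.
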
>> Also note that phobos currently does *no* normalization, as far as I can
>> tell, for things like opEquals. Two char[]'s that represent equivalent
>> strings, but not in the same way, will compare as !=.
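
A small example of the point just quoted: at the code point level the two
spellings below are simply different sequences, so nothing short of
normalization (or a grapheme-aware comparison) will treat them as equal.
normalize is the present-day std.uni function and did not exist at the time
of this thread:

import std.uni : normalize;

void main()
{
    string precomposed = "\u00E9";     // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    string combined    = "e\u0301";    // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Built-in comparison works on code units, with no normalization, so
    // two spellings of the same text compare unequal.
    assert(precomposed != combined);

    // Normalizing both sides (NFC by default) makes the comparison agree
    // with what a reader sees.
    assert(normalize(precomposed) == normalize(combined));
}
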
>
> Basically, you're suggesting that the default way should be to handle
> Unicode *almost* right. And then, if you want to handle things *really*
> right, you need to be explicit about it by using "utfstring_t"? I
> understand your motivation, but it sounds backward to me.
You make very good points. I concede that using dchar as the element type
is not correct for unicode strings.
-Steve