VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Sat Jan 15 12:35:21 PST 2011


On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" <schveiguy at yahoo.com> said:
>
>> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn at gmail.com> wrote:
>>
>>> Steven Schveighoffer wrote:
>>>  ...
>>>>> I think a good standard to evaluate our handling of Unicode is to
>>>>> see how easy it is to do things the right way. In the above,
>>>>> foreach would slice the string grapheme by grapheme, and the ==
>>>>> operator would perform a normalized comparison. While it works
>>>>> correctly, it's probably not the most efficient way to do things,
>>>>> however.
>>>> I think this is a good alternative, but I'd rather not impose this
>>>> on people like myself who deal mostly with English. I think this
>>>> should be possible to do with wrapper types or intermediate ranges
>>>> which have graphemes as elements (per my suggestion above).
>>>>
>>>> Does this sound reasonable?
>>>>
>>>> -Steve
>>> If it's a matter of choosing which is the 'default' range, I'd think
>>> proper Unicode handling is more reasonable than catering for English
>>> / ASCII only. Especially since this is already the case in Phobos
>>> string algorithms.
>>  English and (if I understand correctly) most other languages. Any
>> language which can be built from composable graphemes would work. And
>> in fact, ones that use some graphemes that cannot be composed will
>> also work to some degree (for example, opEquals).
>>
>> What I'm proposing (or think I'm proposing) is not exactly catering
>> to English and ASCII; it's simply not catering to more complex
>> languages such as Hebrew and Arabic. What I'm trying to find is a
>> middle ground where most languages work, and the code is simple and
>> efficient, with possibilities to jump down to lower levels for
>> performance (i.e. switch to char[] when you know ASCII is all you are
>> using) or jump up to full Unicode when necessary.
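
To illustrate the level I mean: plain D already iterates strings at the
code-point level, and a decomposed grapheme shows up as more than one
element. A minimal sketch, using only standard Phobos:

    void main()
    {
        import std.stdio;
        // "é" spelled as 'e' + combining acute (U+0301): one grapheme,
        // two code points. foreach with a dchar loop variable decodes
        // the UTF-8 on the fly.
        foreach (dchar c; "e\u0301")
            writefln("U+%04X", cast(uint) c); // U+0065, then U+0301
    }
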
>
> Why don't we build a compiler with an optimizer that generates correct
> code *almost* all of the time? If you are worried about it not
> producing correct code for a given function, you can just add
> "pragma(correct_code)" in front of that function to disable the risky
> optimizations. No harm done, right?
>
> One thing I see very often, often on US web sites but also elsewhere,
> is that if you enter a name with an accented letter in a form (say
> Émilie), very often the accented letter gets changed to another
> semi-random character later in the process. Why? Because somewhere in
> the process lies an encoding mismatch that no one thought about and no
> one tested for. At the very least, the form should have rejected those
> unexpected characters and shown an error when it could.
>
> Now, with proper Unicode handling up to the code point level, this
> kind of problem probably won't happen as often because the whole stack
> works with UTF encodings. But are you going to validate all of your
> inputs to make sure they have no combining code point?
>
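Such a validation can at least be sketched. A rough check, covering only
the common combining diacritical marks block (U+0300-U+036F); full
coverage would need the Unicode character database, so hasCombiningMark
here is an approximation, not a real validator:

    // True if the string contains a code point from the combining
    // diacritical marks block. Only an approximation: combining marks
    // also exist outside U+0300-U+036F.
    bool hasCombiningMark(string s)
    {
        foreach (dchar c; s)
            if (c >= 0x0300 && c <= 0x036F)
                return true;
        return false;
    }

    void main()
    {
        assert(!hasCombiningMark("\u00E9")); // precomposed é
        assert(hasCombiningMark("e\u0301")); // 'e' + combining acute
    }
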
> Don't assume that because you're in the United States no one will try
> to enter characters where you don't expect them. People love to play
> with Unicode symbols for fun, putting them in their name, signature,
> or even domain names (✪df.ws). Just wait until they discover they can
> combine them. ☺̰̎! There is also a variety of combining mathematical
> symbols with no pre-combined form, such as ≸. Writing in Arabic,
> Hebrew, Korean, or some other foreign language isn't a prerequisite
> for using combining characters.
>
>
>> Essentially, we would have three levels of types:
>>
>> char[], wchar[], dchar[] -- Considered to be arrays in every way.
>>
>> string_t!T (string, wstring, dstring) -- Specialized string types
>> that do normalization to dchars, but do not handle all graphemes
>> perfectly. Works with any algorithm that deals with bidirectional
>> ranges. This is the default string type, and the type for string
>> literals. Represented internally by a single char[], wchar[] or
>> dchar[] array.
>>
>> * utfstring_t!T -- Specialized string to deal with full Unicode,
>> which may perform worse than string_t, but supports everything
>> Unicode supports. May require a battery of specialized algorithms.
>>
>> * - name up for discussion
>>
>> Also note that Phobos currently does *no* normalization, as far as I
>> can tell, for things like opEquals. Two char[]'s that represent
>> equivalent strings, but not in the same way, will compare as !=.
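
That lack of normalization is easy to demonstrate; a minimal example,
comparing two encodings of the same text:

    void main()
    {
        string precomposed = "\u00E9";  // "é" as one code point, U+00E9
        string decomposed  = "e\u0301"; // "é" as 'e' + combining acute
        // Array comparison looks at code units, not equivalent text,
        // so the two spellings of "é" compare unequal.
        assert(precomposed != decomposed);
    }
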
>
> Basically, you're suggesting that the default way should be to handle
> Unicode *almost* right. And then, if you want to handle things
> *really* right, you need to be explicit about it by using
> "utfstring_t"? I understand your motivation, but it sounds backward to
> me.

You make very good points.  I concede that using dchar as the element
type is not correct for Unicode strings.
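
Michel's ☺̰̎ example makes the problem concrete. A minimal sketch in
plain D, showing that a single user-perceived character can span
several dchars, so no dchar-element range can yield it as one element:

    void main()
    {
        // One user-perceived character: U+263A (☺) followed by two
        // combining marks, U+0330 and U+030E.
        string s = "\u263A\u0330\u030E";
        size_t n = 0;
        foreach (dchar c; s) // decodes UTF-8 into code points
            ++n;
        assert(n == 3);      // three dchars for one grapheme
    }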

-Steve

