VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Sat Jan 15 12:31:23 PST 2011
On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer"
<schveiguy at yahoo.com> said:
> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn
> <lutger.blijdestijn at gmail.com> wrote:
>
>> Steven Schveighoffer wrote:
>>
>> ...
>>>> I think a good standard to evaluate our handling of Unicode is to see
>>>> how easy it is to do things the right way. In the above, foreach would
>>>> slice the string grapheme by grapheme, and the == operator would perform
>>>> a normalized comparison. While it works correctly, it's probably not the
>>>> most efficient way to do things, however.
>>>
>>> I think this is a good alternative, but I'd rather not impose this on
>>> people like myself who deal mostly with English. I think this should be
>>> possible to do with wrapper types or intermediate ranges which have
>>> graphemes as elements (per my suggestion above).
>>>
>>> Does this sound reasonable?
>>>
>>> -Steve
>>
>> If it's a matter of choosing which is the 'default' range, I'd think proper
>> Unicode handling is more reasonable than catering to English / ASCII only,
>> especially since this is already the case in Phobos string algorithms.
>
> English and (if I understand correctly) most other languages. Any
> language which can be built from composable graphemes would work. And
> in fact, ones that use some graphemes that cannot be composed will
> also work to some degree (for example, opEquals).
>
> What I'm proposing (or think I'm proposing) is not exactly catering to
> English and ASCII; rather, it is simply not catering to more complex
> languages such as Hebrew and Arabic. What I'm trying to find
> is a middle ground where most languages work, and the code is simple
> and efficient, with possibilities to jump down to lower levels for
> performance (i.e. switch to char[] when you know ASCII is all you are
> using) or jump up to full unicode when necessary.
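Grapheme-by-grapheme slicing of the kind mentioned at the top of the
quote can be sketched with std.uni.byGrapheme and std.utf.byDchar,
assuming a Phobos that provides them; the example string and the counts
in the comments are only illustrative:

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byDchar;
    import std.stdio : writeln;

    void main()
    {
        // "noël" with ë spelled as 'e' + U+0308 (combining diaeresis):
        // 6 UTF-8 code units, 5 code points, 4 user-perceived characters.
        string s = "noe\u0308l";

        writeln(s.length);                 // 6 code units
        writeln(s.byDchar.walkLength);     // 5 code points
        writeln(s.byGrapheme.walkLength);  // 4 graphemes
    }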
Why don't we build a compiler with an optimizer that generates correct
code *almost* all of the time? If you are worried about it not
producing correct code for a given function, you can just add
"pragma(correct_code)" in front of that function to disable the risky
optimizations. No harm done, right?
One thing I see very often, on US web sites but also elsewhere, is that
if you enter a name with an accented letter into a form (say Émilie),
the accented letter often gets changed to some semi-random character
later in the process. Why? Because somewhere in the process lies an
encoding mismatch that no one thought about and no one tested for. At
the very least, the form should have rejected the unexpected characters
and shown an error while it still could.
Now, with proper Unicode handling up to the code point level, this kind
of problem probably won't happen as often because the whole stack works
with UTF encodings. But are you going to validate all of your inputs to
make sure they contain no combining code points?
Don't assume that because you're in the United States no one will try
to enter characters where you don't expect them. People love to play
with Unicode symbols for fun, putting them in their name, signature, or
even domain names (✪df.ws). Just wait until they discover they can
combine them. ☺̰̎! There is also a variety of combining mathematical
symbols with no pre-combined form, such as ≸. Writing in Arabic,
Hebrew, Korean, or some other foreign language isn't a prerequisite for
using combining characters.
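To make the concern concrete, here is a minimal sketch, assuming a
Phobos that provides std.uni.normalize; the name Émilie and both
spellings of it are just illustrations:

    import std.uni : normalize;
    import std.stdio : writeln;

    void main()
    {
        string precomposed = "\u00C9milie";   // "Émilie", É as one code point
        string decomposed  = "E\u0301milie";  // "Émilie", E + combining acute

        // Code-point-level comparison sees two different strings,
        // even though they are canonically equivalent.
        writeln(precomposed == decomposed);                       // false

        // Only after normalization (NFC by default) do they compare equal.
        writeln(normalize(precomposed) == normalize(decomposed)); // true
    }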
> Essentially, we would have three levels of types:
>
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that
> do normalization to dchars, but do not handle all graphemes perfectly.
> Works with any algorithm that deals with bidirectional ranges. This
> is the default string type, and the type for string literals.
> Represented internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- specialized string to deal with full Unicode, which
> may perform worse than string_t, but supports everything Unicode
> supports. May require a battery of specialized algorithms.
>
> * - name up for discussion
>
> Also note that Phobos currently does *no* normalization for things like
> opEquals, as far as I can tell. Two char[]'s that represent equivalent
> strings, but with different code point sequences, will compare as !=.
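For concreteness, here is one possible sketch of such a wrapper. The
name string_t comes from the quoted proposal; the normalized opEquals
and every other detail below are assumptions, not anything that exists
in Phobos:

    import std.uni : normalize;

    // A thin wrapper over an array of code units whose equality is
    // defined on normalized code points, so canonically equivalent
    // spellings compare equal.
    struct string_t(Char)
    {
        immutable(Char)[] data;

        bool opEquals(const string_t rhs) const
        {
            // NFC-normalize both sides before comparing.
            return normalize(data) == normalize(rhs.data);
        }
    }

    unittest
    {
        auto a = string_t!char("\u00C9milie");   // precomposed É
        auto b = string_t!char("E\u0301milie");  // E + combining acute
        assert(a == b);                          // normalized comparison
        assert(a.data != b.data);                // the raw arrays still differ
    }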
Basically, you're suggesting that the default should be to handle
Unicode *almost* right, and then, if you want to handle things *really*
right, you need to be explicit about it by using "utfstring_t"? I
understand your motivation, but it sounds backward to me.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/