VLERange: a range in between BidirectionalRange and

Sat Jan 15 11:51:47 PST 2011

On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo at bar.com> wrote:

> Steven Schveighoffer Wrote:

>>
>> English and (if I understand correctly) most other languages.  Any
>> language which can be built from composable graphemes would work.  And
>> in fact, ones that use some graphemes that cannot be composed will also
>> work to some degree (for example, opEquals).
>>
>> What I'm proposing (or think I'm proposing) is not exactly catering to
>> English and ASCII, what I'm proposing is simply not catering to more
>> complex languages such as Hebrew and Arabic.  What I'm trying to find
>> is a middle ground where most languages work, and the code is simple and
>> efficient, with possibilities to jump down to lower levels for
>> performance (i.e. switch to char[] when you know ASCII is all you are
>> using) or jump up to full unicode when necessary.
>>
>> Essentially, we would have three levels of types:
>>
>> char[], wchar[], dchar[] -- Considered to be arrays in every way.
>> string_t!T (string, wstring, dstring) -- Specialized string types that
>> do normalization to dchars, but do not handle perfectly all graphemes.
>> Works with any algorithm that deals with bidirectional ranges.  This is
>> the default string type, and the type for string literals.  Represented
>> internally by a single char[], wchar[] or dchar[] array.
>> * utfstring_t!T -- specialized string to deal with full unicode, which
>> may perform worse than string_t, but supports everything unicode
>> supports.  May require a battery of specialized algorithms.
>>
>> * - name up for discussion
>>
>> Also note that phobos currently does *no* normalization as far as I can
>> tell for things like opEquals.  Two char[]'s that represent equivalent
>> strings, but not in the same way, will compare as !=.
>>
>> -Steve
>
> The above compromise provides zero benefit. The proposed default type  
> string_t is incorrect and will cause bugs. I prefer the standard lib to  
> not provide normalization at all and force me to use a 3rd party lib  
> rather than provide an incomplete implementation that will give me a  
> false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on
this; I'm not well-versed in unicode, or even in languages that require
unicode.  The clear benefit I see is that a string type which normalizes
to canonical code points can be used in any algorithm without that
algorithm having to be unicode-aware, for *most languages*.  At least,
that is how I see it.  I'm looking at it as a code-reuse proposition.
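
To make the point concrete, here is a minimal sketch of what I mean by
comparing after normalization.  It assumes a Phobos that provides
std.uni.normalize (newer releases do); equalNormalized is a hypothetical
helper name made up for illustration, not an existing or proposed API.

```d
import std.uni;

/// Hypothetical helper: compare two strings after NFC normalization.
bool equalNormalized(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// Run with: rdmd -unittest -main thisfile.d
unittest
{
    string precomposed = "caf\u00E9";   // 'é' as a single code point (U+00E9)
    string decomposed  = "cafe\u0301";  // 'e' followed by combining acute (U+0301)

    // Plain array comparison sees different code units.
    assert(precomposed != decomposed);

    // Once both sides are normalized, they compare equal.
    assert(equalNormalized(precomposed, decomposed));
}
```

A string type that normalized on construction would get the second
behavior from plain ==, which is the code-reuse benefit I'm after.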

It's like calendars.  There are quite a few different calendars in  
different cultures.  But most people use a Gregorian calendar.  So we have  
three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party  
library
b) Use a complicated calendar system where Gregorian calendars are treated  
with equal respect to all other calendars, and none is the default.
c) Use a Gregorian calendar by default, but include the other calendars as  
a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

Can you show how normalization causes subtle bugs?

> Moreover, even if you ignore Hebrew as a tiny insignificant minority  
> you cannot do the same for Arabic which has over one *billion* people  
> that use that language.

I hope that the medium type works 'good enough' for those languages, with  
the high level type needed for advanced usages.  At a minimum, comparison  
and substring should work for all languages.

> I firmly believe that in accordance with D's principle that the default  
> behavior should be the correct & safe option, D should have the full  
> unicode type (utfstring_t above) as the default.
>
> You need only a subset of the functionality because you only use  
> English? For the same reason, you don't want the Unicode overhead? Use  
> an ASCII type instead. In the same vein, a geneticist should use a DNA  
> sequence type and not Unicode text.

Or French, or Spanish, or German, etc...

Look, even the lowest level is valid unicode, but if you want to start  
extracting individual graphemes, you need more machinery.  In 99% of  
cases, I'd think you want to use strings as strings, not as sequences of  
graphemes, or code-units.
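
As a rough illustration of the three views of the same text, the sketch
below counts code units, code points, and graphemes.  It assumes a Phobos
with std.uni.byGrapheme; the three function names are hypothetical and
exist only for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (the raw array length).
size_t codeUnits(string s) { return s.length; }

/// Number of code points (the string iterated as a range of dchar).
size_t codePoints(string s) { return s.walkLength; }

/// Number of graphemes (the "extra machinery" level).
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

// Run with: rdmd -unittest -main thisfile.d
unittest
{
    string s = "e\u0301";        // 'e' plus combining acute accent
    assert(codeUnits(s)  == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(codePoints(s) == 2);  // two code points
    assert(graphemes(s)  == 1);  // one user-perceived character
}
```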

-Steve