VLERange: a range in between BidirectionalRange and RandomAccessRange
foobar
foo at bar.com
Sat Jan 15 12:46:11 PST 2011
Steven Schveighoffer Wrote:
> On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo at bar.com> wrote:
>
> > Steven Schveighoffer Wrote:
>
> >>
> >> English and (if I understand correctly) most other languages. Any
> >> language which can be built from composable graphemes would work. And
> >> in
> >> fact, ones that use some graphemes that cannot be composed will also
> >> work
> >> to some degree (for example, opEquals).
> >>
> >> What I'm proposing (or think I'm proposing) is not exactly catering to
> >> English and ASCII, what I'm proposing is simply not catering to more
> >> complex languages such as Hebrew and Arabic. What I'm trying to find
> >> is a
> >> middle ground where most languages work, and the code is simple and
> >> efficient, with possibilities to jump down to lower levels for
> >> performance
> >> (i.e. switch to char[] when you know ASCII is all you are using) or jump
> >> up to full unicode when necessary.
> >>
> >> Essentially, we would have three levels of types:
> >>
> >> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> >> string_t!T (string, wstring, dstring) -- Specialized string types that
> >> do
> >> normalization to dchars, but do not handle perfectly all graphemes.
> >> Works
> >> with any algorithm that deals with bidirectional ranges. This is the
> >> default string type, and the type for string literals. Represented
> >> internally by a single char[], wchar[] or dchar[] array.
> >> * utfstring_t!T -- specialized string to deal with full unicode, which
> >> may
> >> perform worse than string_t, but supports everything unicode supports.
> >> May require a battery of specialized algorithms.
> >>
> >> * - name up for discussion
> >>
> >> Also note that phobos currently does *no* normalization as far as I can
> >> tell for things like opEquals. Two char[]'s that represent equivalent
> >> strings, but not in the same way, will compare as !=.
> >>
> >> -Steve
> >
> > The above compromise provides zero benefit. The proposed default type
> > string_t is incorrect and will cause bugs. I prefer the standard lib to
> > not provide normalization at all and force me to use a 3rd party lib
> > rather than provide an incomplete implementation that will give me a
> > false sense of correctness and cause very subtle and hard to find bugs.
>
> I feel like you might be exaggerating, but maybe I'm completely wrong on
> this, I'm not well-versed in unicode, or even languages that require
> unicode. The clear benefit I see is that with a string type which
> normalizes to canonical code points, you can use this in any algorithm
> without having it be unicode-aware for *most languages*. At least, that
> is how I see it. I'm looking at it as a code-reuse proposition.
>
> It's like calendars. There are quite a few different calendars in
> different cultures. But most people use a Gregorian calendar. So we have
> three options:
>
> a) Use a Gregorian calendar, and leave the other calendars to a 3rd party
> library
> b) Use a complicated calendar system where Gregorian calendars are treated
> with equal respect to all other calendars, none are the default.
> c) Use a Gregorian calendar by default, but include the other calendars as
> a separate module for those who wish to use them.
>
> I'm looking at my proposal as more of a c) solution.
>
The calendar example is a very good one. What you're saying is equivalent to saying that most people use the Gregorian calendar, but for efficiency and other reasons you don't want to implement Feb 29th.
> Can you show how normalization causes subtle bugs?
>
That was already shown by Michel and Spir: the equality operator is incorrect in the presence of diacritics (the exposé example). Your solution makes this far worse, since it reduces the bug to far fewer cases and makes the problem far less obvious.
One programmer would test with exposé, which will work, and another would test with, say, Hebrew, which will *not* work, and unless that programmer is a Unicode expert (which is very unlikely) he is left scratching his head.
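To make this concrete, here is a minimal sketch of the bug as it stands today (the literals are just one way of spelling the two forms):

import std.stdio;

void main()
{
    // Two canonically equivalent spellings of "exposé":
    string precomposed = "expos\u00E9";  // é as the single code point U+00E9
    string decomposed  = "expose\u0301"; // 'e' plus U+0301 COMBINING ACUTE ACCENT

    // Array equality compares code units with no normalization,
    // so the two equivalent strings compare unequal.
    writeln(precomposed == decomposed);  // prints: false
}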
> > Moreover, even if you ignore Hebrew as a tiny insignificant minority,
> > you cannot do the same for Arabic which has over one *billion* people
> > that use that language.
>
> I hope that the medium type works 'good enough' for those languages, with
> the high level type needed for advanced usages. At a minimum, comparison
> and substring should work for all languages.
>
As I explained above, 'good enough' in this case is far worse, because it masks the problem. Also, if you want comparison to work in all languages, including Hebrew/Arabic, then it simply isn't good enough.
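Substring is the same story. A rough sketch of what goes wrong at the code-point level (the Hebrew word is only an illustration):

import std.stdio;

void main()
{
    // Hebrew "shalom" with vowel points: each letter-plus-marks cluster
    // is one grapheme but several code points, and unlike é there are
    // no precomposed forms to normalize to.
    dstring word = "שָׁלוֹם"d;

    // Slicing at the code-point level yields only the base letter,
    // silently stripping its combining marks.
    writeln(word[0 .. 1]); // ש without the marks of שָׁ
}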
> > I firmly believe that in accordance with D's principle that the default
> > behavior should be the correct & safe option, D should have the full
> > unicode type (utfstring_t above) as the default.
> >
> > You need only a subset of the functionality because you only use
> > English? For the same reason, you don't want the Unicode overhead? Use
> > an ASCII type instead. In the same vein, a geneticist should use a DNA
> > sequence type and not Unicode text.
>
> Or French, or Spanish, or German, etc...
>
> Look, even the lowest level is valid unicode, but if you want to start
> extracting individual graphemes, you need more machinery. In 99% of
> cases, I'd think you want to use strings as strings, not as sequences of
> graphemes, or code-units.
>
> -Steve
I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make it absolutely clear to the non-expert programmer that D simply does NOT support e.g. normalization.
As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.
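For instance, a rough sketch of what such a loud error could look like (RawString is a hypothetical name; the point is refusing to compile rather than silently comparing code units):

// Hypothetical wrapper: equality is disabled outright, so code-unit
// comparison can never masquerade as text comparison.
struct RawString
{
    immutable(char)[] data;

    // Using == on RawString is a compile-time error.
    @disable bool opEquals(const RawString rhs) const;
}

void main()
{
    auto a = RawString("expos\u00E9");
    auto b = RawString("expose\u0301");
    // auto eq = (a == b); // Error: opEquals is annotated with @disable
}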