VLERange: a range in between BidirectionalRange and RandomAccessRange

foobar foo at bar.com
Sat Jan 15 10:21:12 PST 2011


Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
> <lutger.blijdestijn at gmail.com> wrote:
> 
> > Steven Schveighoffer wrote:
> >
> > ...
> > >>> I think a good standard to evaluate our handling of Unicode is to see
> > >>> how easy it is to do things the right way. In the above, foreach would
> > >>> slice the string grapheme by grapheme, and the == operator would
> > >>> perform a normalized comparison. While it works correctly, it's
> > >>> probably not the most efficient way to do things, however.
> >>
> >> I think this is a good alternative, but I'd rather not impose this on
> >> people like myself who deal mostly with English.  I think this should be
> >> possible to do with wrapper types or intermediate ranges which have
> >> graphemes as elements (per my suggestion above).
> >>
> >> Does this sound reasonable?
> >>
> >> -Steve
> >
> > If it's a matter of choosing which is the 'default' range, I'd think
> > proper Unicode handling is more reasonable than catering for
> > English/ASCII only, especially since this is already the case in the
> > Phobos string algorithms.
> 
> English and (if I understand correctly) most other languages.  Any  
> language which can be built from composable graphemes would work.  And in  
> fact, ones that use some graphemes that cannot be composed will also work  
> to some degree (for example, opEquals).
> 
> What I'm proposing (or think I'm proposing) is not exactly catering to
> English and ASCII; it's simply not catering to more complex languages
> such as Hebrew and Arabic.  What I'm trying to find is a middle ground
> where most languages work, and the code is simple and efficient, with
> the possibility of jumping down to a lower level for performance
> (e.g. switching to char[] when you know ASCII is all you are using) or
> up to full Unicode when necessary.
> 
> Essentially, we would have three levels of types:
> 
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that
> do normalization to dchars, but do not handle all graphemes perfectly.
> Works with any algorithm that deals with bidirectional ranges.  This is
> the default string type, and the type for string literals.  Represented
> internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- Specialized string type that deals with full
> Unicode; it may perform worse than string_t, but it supports everything
> Unicode supports.  May require a battery of specialized algorithms.
> 
> * - name up for discussion
> 
> Also note that Phobos currently does *no* normalization, as far as I
> can tell, for things like opEquals.  Two char[]'s that represent
> equivalent strings encoded in different ways will compare as !=.
> 
> -Steve
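
For concreteness, the middle level proposed above is essentially a
code-point (dchar) range over a char[] array. A rough sketch of its
shape (string_t is Steve's placeholder name; I'm using std.utf's
decodeFront and omitting the actual normalization step for brevity):

import std.utf : decodeFront;

// Sketch of the proposed middle level: an input range of dchars over
// a char[] array (normalization omitted; illustrative only).
struct string_t
{
    immutable(char)[] data;

    @property bool empty() const { return data.length == 0; }

    @property dchar front() const
    {
        auto tmp = data;
        return decodeFront(tmp);  // decode one code point; 'data' untouched
    }

    void popFront()
    {
        auto tmp = data;
        decodeFront(tmp);  // advance past one code point
        data = tmp;
    }
}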

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I would prefer the standard library to provide no normalization at all and force me to use a third-party library, rather than provide an incomplete implementation that gives a false sense of correctness and causes very subtle, hard-to-find bugs.
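
To make the failure mode concrete, here is a minimal sketch. It assumes
the normalize and byGrapheme facilities of std.uni (substitute whatever
normalization and grapheme iteration you have available); the underlying
Unicode facts do not depend on the library:

import std.range : walkLength;
import std.uni : byGrapheme, normalize;

void main()
{
    // Two spellings of "e-acute": precomposed U+00E9 vs. 'e' followed by
    // U+0301 COMBINING ACUTE ACCENT. Without normalization (today's
    // opEquals), they compare unequal despite being canonically
    // equivalent.
    string pre = "\u00E9";
    string dec = "e\u0301";
    assert(pre != dec);                        // raw code-unit comparison
    assert(normalize(pre) == normalize(dec));  // equal after NFC

    // A grapheme with *no* precomposed form: 'q' + combining dot above +
    // combining dot below. Normalizing to dchars cannot collapse it into
    // one code point, so a dchar-level string_t still splits this single
    // user-visible character into three elements.
    string q = "q\u0307\u0323";
    assert(q.walkLength == 3);             // three code points
    assert(q.byGrapheme.walkLength == 1);  // one grapheme
}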

Moreover, even if you dismiss Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, a language used by hundreds of millions of people.

I firmly believe that, in accordance with D's principle that the default behavior should be the correct and safe option, D should have the full Unicode type (utfstring_t above) as the default.

Do you need only a subset of the functionality because you only use English, and for the same reason don't want the Unicode overhead? Then use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type rather than Unicode text.
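
As a sketch of what such an opt-out could look like (the name
AsciiString and its design are purely illustrative, not a proposal for
Phobos):

// Hypothetical opt-out type; name and design are illustrative only.
// Validates 7-bit content once at construction, then indexes and
// compares by byte, with none of the decoding or normalization cost
// of a full Unicode type.
struct AsciiString
{
    private immutable(ubyte)[] data;

    this(string s)
    {
        foreach (b; cast(immutable(ubyte)[]) s)
            assert(b < 0x80, "non-ASCII byte in AsciiString");
        data = cast(immutable(ubyte)[]) s;
    }

    @property size_t length() const { return data.length; }

    char opIndex(size_t i) const { return cast(char) data[i]; }

    bool opEquals(const AsciiString rhs) const
    {
        return data == rhs.data;  // byte equality is correct for ASCII
    }
}

Indexing and equality then cost exactly one byte comparison per element,
and anything that actually needs Unicode semantics visibly doesn't fit
the type.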

