VLERange: a range in between BidirectionalRange and
foo at bar.com
Sat Jan 15 10:21:12 PST 2011
Steven Schveighoffer Wrote:
> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn
> <lutger.blijdestijn at gmail.com> wrote:
> > Steven Schveighoffer wrote:
> >
> > ...
> >>> I think a good standard to evaluate our handling of Unicode is to see
> >>> how easy it is to do things the right way. In the above, foreach would
> >>> slice the string grapheme by grapheme, and the == operator would
> >>> perform a normalized comparison. While it works correctly, it's
> >>> probably not the most efficient way to do things, however.
> >>
> >> I think this is a good alternative, but I'd rather not impose this on
> >> people like myself who deal mostly with English. I think this should be
> >> possible to do with wrapper types or intermediate ranges which have
> >> graphemes as elements (per my suggestion above).
> >>
> >> Does this sound reasonable?
> >>
> >> -Steve
> >
> > If it's a matter of choosing which is the 'default' range, I'd think
> > proper Unicode handling is more reasonable than catering for English /
> > ASCII only. Especially since this is already the case in Phobos string
> > algorithms.
> English and (if I understand correctly) most other languages. Any
> language which can be built from composable graphemes would work. And in
> fact, ones that use some graphemes that cannot be composed will also work
> to some degree (for example, opEquals).
> What I'm proposing (or think I'm proposing) is not exactly catering to
> English and ASCII, what I'm proposing is simply not catering to more
> complex languages such as Hebrew and Arabic. What I'm trying to find is a
> middle ground where most languages work, and the code is simple and
> efficient, with possibilities to jump down to lower levels for performance
> (i.e. switch to char[] when you know ASCII is all you are using) or jump
> up to full unicode when necessary.
> Essentially, we would have three levels of types:
>
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
>
> string_t!T (string, wstring, dstring) -- Specialized string types that do
> normalization to dchars, but do not handle perfectly all graphemes. Works
> with any algorithm that deals with bidirectional ranges. This is the
> default string type, and the type for string literals. Represented
> internally by a single char[], wchar[] or dchar[] array.
>
> * utfstring_t!T -- specialized string to deal with full unicode, which may
> perform worse than string_t, but supports everything unicode supports.
> May require a battery of specialized algorithms.
>
> * - name up for discussion
> Also note that phobos currently does *no* normalization as far as I can
> tell for things like opEquals. Two char[]'s that represent equivalent
> strings, but not in the same way, will compare as !=.
> -Steve
The above compromise provides zero benefit. The proposed default type, string_t, is incorrect and will cause bugs. I would prefer that the standard library provide no normalization at all and force me to use a third-party library, rather than provide an incomplete implementation that gives me a false sense of correctness and causes very subtle, hard-to-find bugs.
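To make the failure mode concrete, here is a minimal sketch in Python (not D or Phobos; chosen only because its standard unicodedata module keeps the example short). The helper names canonically_equal and starts_with_combining_mark are hypothetical, made up for this illustration. The sketch assumes that "normalization to dchars" means something like NFC followed by code-point-by-code-point processing, which is my reading of the string_t proposal and not a confirmed design.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def starts_with_combining_mark(s: str) -> bool:
    """True if the first code point is a combining mark (hypothetical helper)."""
    return bool(s) and unicodedata.combining(s[0]) != 0


# Latin: "é" has a precomposed form, so NFC folds both spellings together.
precomposed = "\u00e9"    # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
plain_compare = precomposed == decomposed              # raw code point comparison
normalized_compare = canonically_equal(precomposed, decomposed)

# Arabic: BEH + FATHA has no precomposed form, so NFC still leaves two code
# points for what a reader perceives as a single character.
beh_fatha = unicodedata.normalize("NFC", "\u0628\u064e")
code_points = len(beh_fatha)

# Reversing code point by code point detaches the mark from its base letter.
reversed_by_code_point = beh_fatha[::-1]
mark_is_orphaned = starts_with_combining_mark(reversed_by_code_point)
```

The Latin pair is the case a normalizing, code-point-based type would handle. The Arabic pair is the kind of input where, under the assumption above, an algorithm written against code points would silently produce malformed text.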
Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which is used by over one *billion* people.
I firmly believe that, in accordance with D's principle that the default behavior should be the correct and safe option, D should have the full Unicode type (utfstring_t above) as the default.
You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Then use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.
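As a rough sketch of what I mean by an ASCII type (again in Python for brevity, not a proposed D or Phobos API; AsciiText, char_at and is_ascii_text are hypothetical names): the type refuses non-ASCII input at construction, so everything after that point can safely treat one byte as one character.

```python
class AsciiText:
    """Hypothetical ASCII-only text type: rejects non-ASCII input up front."""

    def __init__(self, text: str) -> None:
        try:
            # One byte per character is guaranteed once this succeeds.
            self._data = text.encode("ascii")
        except UnicodeEncodeError as exc:
            raise ValueError(f"not ASCII: {text!r}") from exc

    def __len__(self) -> int:
        # Byte count equals character count for ASCII.
        return len(self._data)

    def char_at(self, index: int) -> str:
        # Constant-time indexing, no decoding step needed.
        return chr(self._data[index])

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, AsciiText):
            return NotImplemented
        # Plain byte comparison is correct: ASCII needs no normalization.
        return self._data == other._data


def is_ascii_text(text: str) -> bool:
    """Report whether text can be stored in an AsciiText (hypothetical helper)."""
    try:
        AsciiText(text)
    except ValueError:
        return False
    return True
```

The point of the design is that the restriction is explicit in the type: code that opts into it gets the cheap operations, and text that does not fit is rejected loudly instead of being handled incorrectly.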