Proposal for fixing dchar ranges

Wed Mar 19 07:42:37 PDT 2014

On Tue, 18 Mar 2014 23:18:16 +0400,
Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:

> On 18-Mar-2014 10:21, Marco Leise wrote:
> > The Unicode standard is too complex for general purpose
> > algorithms to do useful things on D strings. We don't see that
> > however, since our writing systems are sufficiently well
> > supported.
> 
> > As an inspiration I'll leave a string here that contains
> > combined characters in Korean
> > (http://decodeunicode.org/hangul_syllables)
> > and Latin as well as full width characters that span 2
> > characters in e.g. Latin, Greek or Cyrillic scripts
> > (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):
> >
> > Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊
> >
> > (I used the "unfonts" package for the Hangul part)
> >
> > What I want to say is that for correct Unicode handling we
> > should either use existing libraries or get a feeling for
> > what the Unicode standard provides, then form use cases out of it.
> 
> There is ICU and very few other things, like support in OSX frameworks 
> (NSString). Industry in general kinda sucks on this point but 
> desperately wants to improve.
>
> > For example when we talk about the length of a string we are
> > actually talking about 4 different things:
> >
> >    - number of code units
> >    - number of code points
> >    - number of user perceived characters
> >    - display width using a monospace font
> >
> > The same distinction applies for slicing, depending on use case.
> >
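
The four notions of length listed above can be made concrete with a
small sketch. It uses Python's standard unicodedata module purely for
illustration, not D. The function names are made up for this example.
The "perceived characters" count is only a crude approximation (it
skips combining marks and does not implement real grapheme cluster
segmentation), and the width estimate simply counts East Asian Wide
and Fullwidth characters as two columns.

```python
import unicodedata

def code_units(s: str) -> int:
    """UTF-8 code units, i.e. what .length reports for a D string."""
    return len(s.encode("utf-8"))

def code_points(s: str) -> int:
    """Number of Unicode code points."""
    return len(s)

def perceived_chars(s: str) -> int:
    """Crude approximation: count every code point that is not a mark."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

def display_width(s: str) -> int:
    """Rough monospace width: wide/fullwidth = 2 columns, marks = 0."""
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # non-spacing and enclosing marks occupy no column
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

# Fullwidth 'F' (U+FF26), then 'u' followed by a combining diaeresis.
sample = "\uFF26u\u0308"

print(code_units(sample), code_points(sample),
      perceived_chars(sample), display_width(sample))  # 6 3 2 3
```

One three-code-point string yields three different numbers across the
four measures, which is why "length" and "slice" need a stated use
case.
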
> > Related:
> >    - What normalization do D strings use? Both Linux and
> >      MacOS X use UTF-8, but the binary representation of non-ASCII
> >      file names is different.
> 
> There is no single normalization to fix on.
> D programs may be written for Linux only, for Mac-only or for both.

Normalization forms C and D (NFC and NFD) are the non-lossy
ones and, as far as I understand, canonically equivalent. So I
agree.

> IMO we should just provide ways to normalize strings.
> (std.uni has 'normalize' for starters).

I wonder whether anyone will actually read up on normalization
before touching Unicode strings. I didn't, Andrei didn't, and
so on...
So I expect strA == strB to be common enough, just like floatA
== floatB was until the news spread. Since == is supposed to
compare for equivalence, could we hide all those details in
an opaque string type and offer correct comparison functions?
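
To illustrate the pitfall, here is a minimal sketch using Python's
standard unicodedata module (not D, and not a proposal for the actual
API). The helper name canonically_equal is made up for this example.
It normalizes both sides to NFC before comparing, which is roughly
what a "correct comparison function" would have to do.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e4"     # 'ä' as a single precomposed code point
decomposed = "a\u0308"  # 'a' followed by a combining diaeresis

print(composed == decomposed)                   # False
print(canonically_equal(composed, decomposed))  # True

# The two spellings also differ at the byte level in UTF-8:
print(composed.encode("utf-8"))    # b'\xc3\xa4'
print(decomposed.encode("utf-8"))  # b'a\xcc\x88'
```

Both strings render as the same character, yet a plain == sees two
different sequences.
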

> >    - How do we handle sorting strings?
> 
> Unicode collation algorithm and provide ways to tweak the default one.

I wish I hadn't looked at the UCA. Jeeeez...
But yeah, that's the way to go.
Big frameworks like Java added a Collator class with predefined
constants for several languages. That's too much work for us.
But the API doesn't need to preclude adding those.

> > The subject matter is complex, but not difficult (as in rocket science).
> > If we really want to find a solution, we should form an expert group
> > and stop talking until we read the latest Unicode specs.
> 
> Well, I did. You seem motivated, would you like to join the group?

Yes, I'd like to see a Unicode 6.x approved stamp on D.
I didn't know that you already wrote all the simple algorithms
for 2.064. Those would have been my candidates to work on, too.
Is there anything that can be implemented in a day or two? :)

> > They are a
> > moving target. Don't expect to ever be "done" with full Unicode
> > support in D.
> 
> The 6.x standard line seems pretty stable to me. There is a point in
> providing support that is worth approaching. After that ROI is dropping
> steadily as the amount of work to specialize for each specific culture
> rises. At some point we can only talk about opening up ways to specialize.
> 
> D (or any library for that matter) won't ever have all possible
> tinkering that the Unicode standard permits. So I expect D to be "done"
> with Unicode one day simply by reaching a point of having all universally
> applicable stuff (and stated defaults) plus having a toolbox to craft
> your own versions of algorithms. This is the goal of the new std.uni.

Sorting strings is a very basic feature but, as I have now
learned, also a highly complex one. I expected some kind of
tables for download that would suffice, but the rules are
pretty detailed. E.g. in German phonebook order, ä/ö/ü sort
the same as ae/oe/ue.
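
As a toy illustration of that single rule only (this is not the UCA,
and the names PHONEBOOK and phonebook_key are made up), a sort key in
Python could expand the umlauts before comparing:

```python
import unicodedata

# Hypothetical tailoring table covering only the rule mentioned above.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def phonebook_key(name: str) -> str:
    """Build a sort key: compose, fold case, then expand umlauts."""
    composed = unicodedata.normalize("NFC", name)
    return composed.casefold().translate(PHONEBOOK)

names = ["Müller", "Mueller", "Muster", "Mucke"]

print(sorted(names))
# ['Mucke', 'Mueller', 'Muster', 'Müller']  (plain code point order)

print(sorted(names, key=phonebook_key))
# ['Mucke', 'Müller', 'Mueller', 'Muster']  (umlauts expanded)
```

Real collation needs multiple comparison levels and per-language
tailoring, which is exactly where the detailed rules come in.
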

-- 
Marco