Proposal for fixing dchar ranges

Wed Mar 19 15:40:08 PDT 2014

On Thu, 20 Mar 2014 01:55:08 +0400,
Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:

> Well, turns out the Unicode standard ties equivalence to normalization 
> forms. In other words unless both your strings are normalized the same 
> way there is really no point in trying to compare them.
> 
> As for opaque type - we could have say String!NFC and String!NFD or 
> some-such. It would then make sure the normalization is the right one.

I was thinking of the slow route instead, where normalized and
unnormalized strings can coexist and still be compared. No
NFD or NFC types, just UTF-8 strings.
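
To illustrate what such a comparison would have to do, here is
a minimal sketch. It is written in Python rather than D purely
for illustration, and `equivalent` is a made-up name, not an
existing or proposed std.uni function. Both operands are
normalized to NFD on the fly; the stored strings are never
changed.

```python
import unicodedata


def equivalent(a, b):
    """Compare two strings for canonical equivalence.

    Both sides are normalized to NFD only for the comparison;
    the inputs themselves are left untouched.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = "\u00e4"     # LATIN SMALL LETTER A WITH DIAERESIS, one code point
decomposed = "a\u0308"  # 'a' followed by COMBINING DIAERESIS, two code points

# A plain code point comparison treats them as different strings ...
assert composed != decomposed
# ... while the normalizing comparison treats them as equal.
assert equivalent(composed, decomposed)
# The original byte sequences still differ, e.g. for use as file names.
assert composed.encode("utf-8") != decomposed.encode("utf-8")
```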

Pros:
+ Nobody has to learn about normalization to use strings
  correctly, and few people ever learn about it anyway.
+ Strings don't need to be normalized. Any modification of the
  data is risky, e.g. when the string is later fed back to its
  source. Think of a file name on a file system where a
  different normalization means a different file.

Cons:
- Comparisons for already normalized strings are unnecessarily
  slow. Maybe the normalization form (NFC, NFD, mixed) could be
  stored alongside the string.
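
A rough sketch of that last idea, again in Python and with
hypothetical names (`detect_form`, `TaggedString`, `same_text`)
that are only assumptions for illustration. The form is detected
once when the string is wrapped. If both operands carry the same
non-mixed tag, a plain comparison is enough; otherwise both are
normalized for the comparison.

```python
import unicodedata


def detect_form(s):
    """Return 'NFC', 'NFD' or 'mixed' for the given string.

    Text that is in both forms (e.g. plain ASCII) is reported as 'NFC'.
    """
    if unicodedata.normalize("NFC", s) == s:
        return "NFC"
    if unicodedata.normalize("NFD", s) == s:
        return "NFD"
    return "mixed"


class TaggedString:
    """Hypothetical wrapper: the unmodified text plus its detected form."""

    def __init__(self, text):
        self.text = text
        self.form = detect_form(text)


def same_text(a, b):
    """Canonical-equivalence comparison with a fast path."""
    if a.form == b.form and a.form != "mixed":
        # Same normalization form on both sides: compare directly.
        return a.text == b.text
    # Slow path: normalize both sides just for this comparison.
    return (unicodedata.normalize("NFD", a.text)
            == unicodedata.normalize("NFD", b.text))
```

The detection itself costs about as much as one normalization,
so this only pays off when a string is compared more than once.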

> Cool, consider yourself enlisted :)
> I reckon word and line breaking algorithms are a piece of cake compared to 
> UCA. Given the power toys of CodepointSet and toTrie it shouldn't be 
> that hard to come up with a prototype. Then we just move precomputed 
> versions of related tries to std/internal/ and that's it, ready for 
> public consumption.

Would a typical use case be to find the previous/next boundary
given a code unit index? E.g. the cursor sits on a word and
you want to jump to the start or end of it. Just iterating over
the words and lines might not be too useful.
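
The kind of interface I have in mind, as a sketch only: the
function name `word_bounds` is hypothetical, the index is a
code point index rather than a code unit index, and the "word"
rule is a crude stand-in (a run of alphanumerics or '_'), not
the real Unicode word breaking rules. It does not even treat
combining marks as part of a word, which the real algorithm
would have to do.

```python
def word_bounds(text, index):
    """Return (start, end) of the word touching position `index`.

    Simplified rule, NOT the Unicode word break algorithm:
    a word is a maximal run of alphanumeric characters or '_'.
    """
    def is_word(ch):
        return ch.isalnum() or ch == "_"

    # Walk left until the character before `start` is not part of a word.
    start = index
    while start > 0 and is_word(text[start - 1]):
        start -= 1

    # Walk right until the character at `end` is not part of a word.
    end = index
    while end < len(text) and is_word(text[end]):
        end += 1

    return start, end


line = "jump to the start"
start, end = word_bounds(line, 14)   # cursor inside "start"
assert line[start:end] == "start"
```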

> >> D (or any library for that matter) won't ever have all possible
> >> tinkering that Unicode standard permits. So I expect D to be "done" with
> >> Unicode one day simply by reaching a point of having all universally
> >> applicable stuff (and stated defaults) plus having a toolbox to craft
> >> your own versions of algorithms. This is the goal of new std.uni.
> >
> > Sorting strings is a very basic feature but, as I have now
> > learned, also a highly complex one. I expected some kind of
> > downloadable tables that would suffice, but the rules are
> > pretty detailed.
> > E.g. in German phonebook order, ä/ö/ü has the same order as
> > ae/oe/ue.
> 
> This is tailoring, an awful thing that makes cultural differences what 
> they are in Unicode ;)
> 
> What we need first and foremost is a DUCET-based version (Default 
> Unicode Collation Element Table).

Of course.
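
Just to make the phonebook example from above concrete, here is
a toy sketch in Python. It is not UCA and has nothing to do
with DUCET; `phonebook_key` is a made-up helper that only
covers the ä/ö/ü expansion and ignores everything else a real
tailoring would handle.

```python
import unicodedata

# Expansions used by German phonebook ordering for the umlauts.
_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def phonebook_key(name):
    """Toy sort key: ä/ö/ü sort like ae/oe/ue, case is ignored."""
    # Compose first so that 'u' + COMBINING DIAERESIS is seen as 'ü'.
    composed = unicodedata.normalize("NFC", name).lower()
    return "".join(_EXPANSIONS.get(ch, ch) for ch in composed)


names = ["Muller", "Mütze", "Mueller"]
ordered = sorted(names, key=phonebook_key)
assert ordered == ["Mueller", "Mütze", "Muller"]
```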

-- 
Marco