Proposal for fixing dchar ranges

Wed Mar 19 14:55:08 PDT 2014

On 19-Mar-2014 18:42, Marco Leise wrote:
> On Tue, 18 Mar 2014 23:18:16 +0400,
> Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
>
>>> Related:
>>>     - What normalization do D strings use. Both Linux and
>>>       MacOS X use UTF-8, but the binary representation of non-ASCII
>>>       file names is different.
>>
>> There is no single normalization to fix on.
>> D programs may be written for Linux only, for Mac-only or for both.
>
> Normalizations C and D are the non lossy ones and as far as I
> understood equivalent. So I agree.
>

Right, the KC & KD ones are really all about fuzzy matching and searching.
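
To make the difference concrete, here is a minimal sketch. It uses Python's standard unicodedata module purely as an illustration (an assumption of this example, not D's std.uni). The sample characters are my own picks: the canonical form NFC leaves a ligature alone, while the compatibility form NFKC folds it into plain letters, which is lossy but handy for matching.

```python
import unicodedata

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "\u00b2"   # SUPERSCRIPT TWO

# Canonical normalization keeps the ligature as it is.
nfc_ligature = unicodedata.normalize("NFC", ligature)

# Compatibility normalization replaces both characters with plain ones.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_ligature == ligature)   # True
print(nfkc_ligature == "fi")      # True
print(nfkc_superscript == "2")    # True
```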

>> IMO we should just provide ways to normalize strings.
>> (std.uni has 'normalize' for starters).
>
> I wondered if anyone will actually read up on normalization
> prior to touching Unicode strings. I didn't, Andrei didn't and
> so on...
> So I expect strA == strB to be common enough, just like floatA
> == floatB until the news spread.

If that's any comfort, other languages are even worse here. In C++ you 
are hopeless without ICU.

> Since == is supposed to
> compare for equivalence, could we hide all those details in
> an opaque string type and offer correct comparison functions?

Well, it turns out the Unicode standard ties equivalence to normalization 
forms. In other words, unless both your strings are normalized the same 
way, there is really no point in trying to compare them.
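
A minimal sketch of what that means in practice, again in Python with the standard unicodedata module as a stand-in (the helper name `equivalent` is made up for this example). Two spellings of the same character compare unequal code point by code point, but equal once both sides are brought to the same form.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def equivalent(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Plain comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

print(raw_equal)                                        # False
print(equivalent(precomposed, decomposed))              # True
print(equivalent(precomposed, decomposed, form="NFD"))  # True
```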

As for an opaque type - we could have, say, String!NFC and String!NFD or 
some such. The type would then guarantee that its contents are in the 
stated normalization form.
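
A rough sketch of the idea, written in Python only so that it can be run as is. The class names NormalizedString, NFCString and NFDString are hypothetical and are not an existing or proposed std.uni API. The wrapper normalizes once on construction, so equality afterwards is a plain comparison. In D the mismatch between forms would be a compile-time type error; here it is approximated with a TypeError at run time.

```python
import unicodedata

class NormalizedString:
    """Hypothetical wrapper that keeps its text in one normalization form."""
    form = "NFC"

    def __init__(self, text):
        # Normalize once, up front; later comparisons need no extra work.
        self.text = unicodedata.normalize(self.form, text)

    def __eq__(self, other):
        if not isinstance(other, NormalizedString):
            return NotImplemented
        if other.form != self.form:
            # Stand-in for what would be a type mismatch in D.
            raise TypeError("cannot compare %s with %s" % (self.form, other.form))
        return self.text == other.text

    def __hash__(self):
        return hash((self.form, self.text))

class NFCString(NormalizedString):
    form = "NFC"

class NFDString(NormalizedString):
    form = "NFD"

print(NFCString("\u00e9") == NFCString("e\u0301"))   # True
print(NFDString("\u00e9").text == "e\u0301")         # True
```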

>>>     - How do we handle sorting strings?
>>
>> Unicode collation algorithm and provide ways to tweak the default one.
>
> I wish I didn't look at the UCA. Jeeeez...
> But yeah, that's the way to go.

Needless to say, I had a nice jaw-dropping moment when I realized what 
an elephant I had missed in our std.uni (somewhere in the middle of the 
work).

> Big frameworks like Java added a Collate class with predefined
> constants for several languages. That's too much work for us.
> But the API doesn't need to preclude adding those.

Indeed, some kind of Collator is in order. On the use side of things it's 
simply a functor that compares strings. The fact that it's full of 
tables and the like is well hidden. The only thing on top of that is 
caching preprocessed strings (sort keys), which may be useful for 
databases and string indexes.
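
To show the shape of such an interface, here is a toy sketch in Python. It is emphatically not the UCA: the class name ToyCollator and its three-level key (base letters, then accents, then case) are assumptions made for this example, and the weights come from NFD decomposition rather than from any collation table. The point is only that the caller sees a comparison functor plus a precomputable sort key.

```python
import unicodedata
from functools import cmp_to_key

class ToyCollator:
    """Toy comparison functor; a real one would consult collation tables."""

    def sort_key(self, text):
        # Split into base characters and combining marks via NFD.
        decomposed = unicodedata.normalize("NFD", text)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        marks = "".join(c for c in decomposed if unicodedata.combining(c))
        # Level 1: letters ignoring case, level 2: accents, level 3: case.
        return (base.casefold(), marks, base)

    def __call__(self, a, b):
        # Three-way comparison: negative, zero or positive.
        ka, kb = self.sort_key(a), self.sort_key(b)
        return (ka > kb) - (ka < kb)

collate = ToyCollator()
words = ["zebra", "\u00c4pfel", "apple", "Zoo"]

# Used as a functor ...
by_functor = sorted(words, key=cmp_to_key(collate))
# ... or with keys computed once and cached, as an index might do.
cached_keys = {w: collate.sort_key(w) for w in words}
by_cached_key = sorted(words, key=cached_keys.__getitem__)

print(by_functor)
print(by_cached_key)
```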

>>> The topic matter is complex, but not difficult (as in rocket science).
>>> If we really want to find a solution, we should form an expert group
>>> and stop talking until we read the latest Unicode specs.
>>
>> Well, I did. You seem motivated, would you like to join the group?
>
> Yes, I'd like to see a Unicode 6.x approved stamp on D.
> I didn't know that you already wrote all the simple algorithms
> for 2.064. Those would have been my candidates to work on, too.
> Is there anything that can be implemented in a day or two? :)
>

Cool, consider yourself enlisted :)
I reckon the word and line breaking algorithms are a piece of cake 
compared to UCA. Given the power toys of CodepointSet and toTrie it 
shouldn't be that hard to come up with a prototype. Then we just move 
precomputed versions of the related tries to std/internal/ and that's 
it, ready for public consumption.

>> D (or any library for that matter) won't ever have all possible
>> tinkering that Unicode standard permits. So I expect D to be "done" with
>> Unicode one day simply by reaching a point of having all universally
>> applicable stuff (and stated defaults) plus having a toolbox to craft
>> your own versions of algorithms. This is the goal of new std.uni.
>
> Sorting strings is a very basic feature, but as I learned now
> also highly complex.  I expected some kind of tables for
> download that would suffice, but the rules are pretty detailed.
> E.g. in German phonebook order, ä/ö/ü has the same order as
> ae/oe/ue.

This is tailoring, an awful thing that makes cultural differences what 
they are in Unicode ;)
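
A tiny illustration of what a tailoring does, using only the ä/ö/ü rule from the quoted example. This is a Python sketch with made-up names (PHONEBOOK_TAILORING, phonebook_key). A real tailoring adjusts collation weights rather than rewriting the text, so take it as a picture of the effect, not of the mechanism.

```python
import unicodedata

# Only the rule quoted above: umlauts sort like the two-letter spellings.
PHONEBOOK_TAILORING = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}

def phonebook_key(text):
    """Build a sort key in which umlauts are replaced by ae/oe/ue."""
    # NFC first, so a decomposed umlaut is seen as one character.
    folded = unicodedata.normalize("NFC", text).casefold()
    return "".join(PHONEBOOK_TAILORING.get(c, c) for c in folded)

names = ["M\u00fcller", "Mueller", "Mutter", "Muster"]

by_code_point = sorted(names)
by_phonebook = sorted(names, key=phonebook_key)

print(by_code_point)   # Müller ends up last
print(by_phonebook)    # Müller and Mueller sort together
```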

What we need first and foremost is a DUCET-based version (DUCET being 
the Default Unicode Collation Element Table).

-- 
Dmitry Olshansky