Proposal for fixing dchar ranges
Marco Leise
Marco.Leise at gmx.de
Wed Mar 19 15:40:08 PDT 2014
On Thu, 20 Mar 2014 01:55:08 +0400,
Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
> Well, it turns out the Unicode standard ties equivalence to
> normalization forms. In other words, unless both your strings are
> normalized the same way, there is really no point in trying to
> compare them.
>
> As for an opaque type - we could have, say, String!NFC and
> String!NFD or some such. It would then make sure the normalization
> is the right one.
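Something like this thin wrapper is how I read the String!NFC /
String!NFD idea - just a sketch around std.uni.normalize, the type
and its interface are made up, not actual Phobos API:

import std.uni : normalize, NormalizationForm, NFC;

// Hypothetical wrapper that keeps its payload in one normalization
// form, so equality is a plain code unit comparison. Not Phobos API.
struct String(NormalizationForm form)
{
    private string data;

    this(const(char)[] raw)
    {
        // Normalize once on construction; comparisons stay cheap.
        data = normalize!form(raw);
    }

    bool opEquals(const String rhs) const
    {
        return data == rhs.data;
    }
}

unittest
{
    // Precomposed "é" vs. "e" + combining acute accent compare equal
    // once both are forced into NFC.
    assert(String!NFC("\u00E9") == String!NFC("e\u0301"));
}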
And I thought of going the slow route where normalized and
unnormalized strings can coexist and be compared - no NFD or
NFC, just UTF-8 strings. (A comparison sketch follows the pros
and cons below.)
Pros:
+ Learning about normalization isn't needed to use strings
correctly - and few people ever learn it.
+ Strings don't need to be normalized. Every modification to
the data is bad, e.g. when the string is fed back to the
source. Think of a file name on a file system where a
different normalization means a different file.
Cons:
- Comparisons of already normalized strings are unnecessarily
slow. Maybe the normalization form (NFC, NFD, mixed) could be
stored alongside the string.
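Here is a minimal sketch of that slow route: normalize both sides
on the fly whenever the strings are not already byte-identical.
canonicallyEqual is just a made-up helper name; std.uni.normalize
is assumed to do the real work.

import std.uni : normalize, NFD;

// Compare two possibly differently normalized UTF-8 strings.
bool canonicallyEqual(string a, string b)
{
    // Fast path: byte-identical strings are always canonically equal.
    if (a == b)
        return true;
    // Slow path: bring both sides into the same form and compare.
    return normalize!NFD(a) == normalize!NFD(b);
}

unittest
{
    // Precomposed "é" vs. "e" + combining acute accent.
    assert(canonicallyEqual("\u00E9", "e\u0301"));
}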
> Cool, consider yourself enlisted :)
> I reckon word and line breaking algorithms are a piece of cake compared to
> UCA. Given the power toys of CodepointSet and toTrie it shouldn't be
> that hard to come up with a prototype. Then we just move precomputed
> versions of related tries to std/internal/ and that's it, ready for
> public consumption.
Would a typical use case be to find the previous/next boundary
given a code unit index? E.g. the cursor sits on a word and
you want to jump to its start or end. Just iterating over the
words and lines might not be that useful on its own.
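Something along these lines is what I have in mind. It is only a
crude sketch that abuses the Alphabetic set as a stand-in for real
UAX #29 word properties, to show the CodepointSet-driven lookup;
wordStart is a made-up name and idx is assumed to sit on a code
point boundary.

import std.uni : unicode;
import std.utf : decode, strideBack;

// Walk backwards from a code unit index to the start of the
// surrounding run of "word" characters.
size_t wordStart(string s, size_t idx)
{
    auto isWord = unicode.Alphabetic;
    while (idx > 0)
    {
        // Index of the code point directly before idx.
        const size_t prev = idx - strideBack(s, idx);
        size_t tmp = prev;
        const dchar c = decode(s, tmp);
        if (!isWord[c])
            break;
        idx = prev;
    }
    return idx;
}

unittest
{
    auto s = "Hello Wörld";
    // From the end of "Wörld" we land on its first code unit.
    assert(wordStart(s, s.length) == 6);
}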
> >> D (or any library for that matter) won't ever have all possible
> >> tinkering that the Unicode standard permits. So I expect D to be "done" with
> >> Unicode one day simply by reaching a point of having all universally
> >> applicable stuff (and stated defaults) plus having a toolbox to craft
> >> your own versions of algorithms. This is the goal of the new std.uni.
> >
> > Sorting strings is a very basic feature, but as I have now
> > learned, also highly complex. I expected some kind of tables for
> > download that would suffice, but the rules are pretty detailed.
> > E.g. in German phonebook order, ä/ö/ü sort the same as
> > ae/oe/ue.
>
> This is tailoring, an awful thing that makes cultural differences what
> they are in Unicode ;)
>
> What we need first and foremost is a DUCET-based version (Default
> Unicode Collation Element Table).
Of course.
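For illustration, the phonebook tailoring boils down to folding the
umlauts into their two-letter forms before comparing - a naive
sketch, nothing like real DUCET collation keys:

import std.algorithm : cmp;
import std.array : replace;

// Fold German umlauts (and ß) the way phonebook order treats them.
string phonebookKey(string s)
{
    return s.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")
            .replace("Ä", "Ae").replace("Ö", "Oe").replace("Ü", "Ue")
            .replace("ß", "ss");
}

// Negative, zero or positive, like a sort comparison.
int phonebookCmp(string a, string b)
{
    return cmp(phonebookKey(a), phonebookKey(b));
}

unittest
{
    // "Müller" sorts as if it were written "Mueller".
    assert(phonebookCmp("Müller", "Mueller") == 0);
}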
--
Marco