Proposal for fixing dchar ranges

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Mar 18 12:18:16 PDT 2014


18-Mar-2014 10:21, Marco Leise wrote:
> The Unicode standard is too complex for general purpose
> algorithms to do useful things on D strings. We don't see that
> however, since our writing systems are sufficiently well
> supported.

> As an inspiration I'll leave a string here that contains
> combined characters in Korean
> (http://decodeunicode.org/hangul_syllables)
> and Latin as well as full width characters that span 2
> characters in e.g. Latin, Greek or Cyrillic scripts
> (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):
>
> Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊
>
> (I used the "unfonts" package for the Hangul part)
>
> What I want to say is that for correct Unicode handling we
> should either use existing libraries or get a feeling for
> what the Unicode standard provides, then form use cases out of it.

There is ICU and very few other things, like the support in OS X frameworks
(NSString). The industry in general kinda sucks on this point but
desperately wants to improve.

>
> For example when we talk about the length of a string we are
> actually talking about 4 different things:
>
>    - number of code units
>    - number of code points
>    - number of user perceived characters
>    - display width using a monospace font
>
> The same distinction applies for slicing, depending on use case.
>
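
To make the four notions of "length" concrete, here is a rough sketch in
Python (not D), using only the standard unicodedata module. The function
name measure is made up for this illustration. The grapheme and width
counts are deliberate approximations: proper segmentation follows UAX #29
and proper width handling needs more than the East Asian Width property.

```python
import unicodedata

def measure(s):
    """Return (UTF-8 code units, code points, approx. graphemes, approx. width)."""
    code_units = len(s.encode("utf-8"))   # UTF-8 code units, i.e. bytes
    code_points = len(s)                  # Python str is indexed by code point
    # Approximation: treat nonspacing/enclosing marks as part of the
    # preceding character. Real grapheme clusters (UAX #29) cover more cases.
    marks = sum(1 for c in s if unicodedata.category(c) in ("Mn", "Me"))
    graphemes = code_points - marks
    # Approximation: marks take 0 columns, Wide/Fullwidth take 2, the rest 1.
    width = 0
    for c in s:
        if unicodedata.category(c) in ("Mn", "Me"):
            continue
        width += 2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
    return code_units, code_points, graphemes, width

# 'e' + COMBINING ACUTE ACCENT + FULLWIDTH 'A' + FULLWIDTH 'B'
sample = "e\u0301\uFF21\uFF22"
print(measure(sample))  # (9, 4, 3, 5)
```

Four different answers for one short string, which is why "length" and
slicing need to say which unit they mean.
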
> Related:
>    - What normalization do D strings use. Both Linux and
>      MacOS X use UTF-8, but the binary representation of non-ASCII
>      file names is different.

There is no single normalization form to settle on.
D programs may be written for Linux only, for Mac only, or for both.

IMO we should just provide ways to normalize strings
(std.uni has 'normalize' for starters).
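
As an illustration of why the choice is left to the program, here is a
small Python sketch (Python's unicodedata.normalize stands in for a
library-provided normalize; the helper name same_text is made up here).
The same visible file name can arrive precomposed or decomposed, and a
byte-wise comparison only works after both sides are brought to one form.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"     # 'é' as one code point (what NFC produces)
decomposed = "cafe\u0301"  # 'e' + combining acute (what NFD produces)

print(composed == decomposed)             # False: different code points
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))    # 5 6: different bytes on disk
print(same_text(composed, decomposed))    # True once both are normalized
```

Which form to normalize to is the caller's decision; the library only
needs to make each form easy to get.
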

>    - How do we handle sorting strings?

Implement the Unicode Collation Algorithm and provide ways to tweak the default one.
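
To show the shape of "a default plus a way to tweak it", here is a Python
sketch. Note that crude_sort_key is a made-up stand-in and is NOT the
Unicode Collation Algorithm: it only strips nonspacing marks and folds
case, whereas real collation uses weighted collation elements and
per-locale tailoring. It is just enough to show how plain code point
order goes wrong and how a pluggable key fixes it.

```python
import unicodedata

def crude_sort_key(s):
    """Toy sort key (not UCA): ignore accents and case, then break ties."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
    # Primary level: base letters, case-folded. Tie-breaker: original string.
    return (base.casefold(), s)

words = ["zebra", "Zulu", "éclair", "apple", "Eagle"]

# Default sort compares code points: uppercase first, accented letters last.
print(sorted(words))
# ['Eagle', 'Zulu', 'apple', 'zebra', 'éclair']

# The same sort with a caller-supplied key.
print(sorted(words, key=crude_sort_key))
# ['apple', 'Eagle', 'éclair', 'zebra', 'Zulu']
```

The sort itself does not change; only the key does, and that is the part
a program should be able to swap out.
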

> The topic matter is complex, but not difficult (as in rocket science).
> If we really want to find a solution, we should form an expert group
> and stop talking until we read the latest Unicode specs.

Well, I did. You seem motivated, would you like to join the group?

> They are a
> moving target. Don't expect to ever be "done" with full Unicode
> support in D.

The 6.x standard line seems pretty stable to me. There is a point in
providing support that is worth reaching. After that, ROI drops
steadily as the amount of work to specialize for each specific culture
rises. At some point we can only talk about opening up ways to specialize.

D (or any library for that matter) won't ever have every possible
tweak that the Unicode standard permits. So I expect D to be "done" with
Unicode one day simply by reaching the point of having all the universally
applicable stuff (with stated defaults) plus a toolbox to craft
your own versions of the algorithms. This is the goal of the new std.uni.


-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list