Proposal for fixing dchar ranges
Marco Leise
Marco.Leise at gmx.de
Mon Mar 17 23:21:49 PDT 2014
The Unicode standard is too complex for general purpose
algorithms to do useful things on D strings. We don't notice
that, however, because our own writing systems are sufficiently
well supported.
As an inspiration I'll leave a string here that contains
characters composed of several code points in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin, as well as fullwidth characters, each of which takes up
the width of two characters in e.g. Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):
Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊
(I used the "unfonts" package for the Hangul part)
What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feeling for
what the Unicode standard provides, and then derive use cases from it.
For example when we talk about the length of a string we are
actually talking about 4 different things:
- number of code units
- number of code points
- number of user perceived characters
- display width using a monospace font
The same distinction applies to slicing, depending on the use case.
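
To make the first three notions of length concrete, here is a minimal
sketch using Phobos. The helper names codeUnits, codePoints and
graphemes are made up for this illustration; byDchar, byGrapheme and
walkLength are the existing Phobos functions. As far as I know Phobos
has no function for the fourth notion (display width), so that one
appears only as a comment.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

// Hypothetical helpers, named for this example only.

// UTF-8 code units: what .length reports for a string.
size_t codeUnits(string s) { return s.length; }

// Code points: decode the UTF-8 and count the resulting dchars.
size_t codePoints(string s) { return s.byDchar.walkLength; }

// User-perceived characters, approximated by grapheme clusters.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'c' + U+0339 + U+030A, modelled on the last character of the
    // sample string above.
    string ring = "c\u0339\u030A";
    assert(codeUnits(ring) == 5);   // 1 + 2 + 2 bytes
    assert(codePoints(ring) == 3);
    assert(graphemes(ring) == 1);

    // U+FF26 FULLWIDTH LATIN CAPITAL LETTER F
    string wide = "\uFF26";
    assert(codeUnits(wide) == 3);
    assert(codePoints(wide) == 1);
    assert(graphemes(wide) == 1);
    // A monospace terminal would typically draw it two cells wide;
    // that number is not computed here.
}
```

Slicing by any of these units needs the matching iteration; ring[0 .. 1]
cuts by code units and leaves the two combining marks behind.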
Related:
- What normalization do D strings use? Both Linux and
Mac OS X use UTF-8, but the binary representation of non-ASCII
file names is different.
- How do we handle sorting strings?
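
Both points can be illustrated with a short sketch. It assumes the file
name difference comes down to composed versus decomposed forms of the
same text, which is the usual explanation but is not spelled out above.
The helper sameText is hypothetical; normalize, NFC and NFD are from
std.uni, and sort is from std.algorithm.

```d
import std.algorithm : sort;
import std.uni : normalize, NFC, NFD;

// Hypothetical helper: compare two strings after bringing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "\u00E9";   // e with acute as one code point
    string decomposed = "e\u0301";  // 'e' + COMBINING ACUTE ACCENT

    // Same text to a reader, different bytes in memory.
    assert(composed != decomposed);
    assert(composed.length == 2 && decomposed.length == 3);

    assert(normalize!NFD(composed) == decomposed);
    assert(sameText(composed, decomposed));

    // Plain sort compares code units, not a language's collation
    // rules, so 'Z' sorts before 'a' and the accented word goes last.
    string[] words = ["apple", "\u00E9clair", "Zebra"];
    sort(words);
    assert(words == ["Zebra", "apple", "\u00E9clair"]);
}
```

D strings carry no normalization guarantee of their own, so a comparison
like sameText has to normalize explicitly. Locale-aware ordering would
need a collation algorithm on top, which this sketch does not attempt.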
The subject matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we have read the latest Unicode specs. They are a
moving target. Don't expect to ever be "done" with full Unicode
support in D.
--
Marco