Proposal for fixing dchar ranges

Marco Leise Marco.Leise at gmx.de
Mon Mar 17 23:21:49 PDT 2014


The Unicode standard is too complex for general purpose
algorithms to do useful things on D strings. We tend not to
notice that, however, since our own writing systems are
sufficiently well supported.

As an inspiration I'll leave a string here that contains
combining characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin, as well as fullwidth characters that take up the
width of 2 characters in e.g. Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the "unfonts" package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feeling for
what the Unicode standard provides, then form use cases out of it.

For example when we talk about the length of a string we are
actually talking about 4 different things:

  - number of code units
  - number of code points
  - number of user perceived characters
  - display width using a monospace font

The same distinction applies to slicing, depending on the use case.
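
To make the four notions of "length" concrete, here is a rough
sketch. It is written in Python rather than D, purely because its
standard library is enough to show the idea. All function names are
made up for this illustration. The "user perceived characters" and
"display width" functions are crude approximations: real grapheme
segmentation (UAX #29) and real terminal width rules are more
involved, and this sketch does not handle conjoining Hangul jamo.

```python
import unicodedata


def utf8_code_units(s):
    # Code units in UTF-8 are bytes.
    return len(s.encode("utf-8"))


def utf16_code_units(s):
    # Code units in UTF-16 are 16-bit words (2 bytes each).
    return len(s.encode("utf-16-le")) // 2


def code_points(s):
    # Python strings are sequences of code points.
    return len(s)


def is_mark(c):
    # General categories Mn, Mc and Me are combining marks.
    return unicodedata.category(c).startswith("M")


def approx_user_chars(s):
    # ASSUMPTION: approximate user perceived characters by not
    # counting combining marks. This is not full UAX #29.
    return sum(1 for c in s if not is_mark(c))


def approx_display_width(s):
    # ASSUMPTION: marks take 0 columns, East Asian Wide/Fullwidth
    # code points take 2 columns, everything else takes 1.
    width = 0
    for c in s:
        if is_mark(c):
            continue
        width += 2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
    return width


# Fullwidth F, then "a" with a combining arrow below, then "b".
sample = "\uFF26a\u0362b"
```

For this sample all five functions return different or only
coincidentally equal numbers, which is the whole point: "length"
has no single answer until the use case is stated.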

Related:
  - What normalization do D strings use? Both Linux and
    Mac OS X use UTF-8, but the binary representation of non-ASCII
    file names is different.
  - How do we handle sorting strings?
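
On the normalization point, a small sketch of why the same file
name can have two different UTF-8 byte sequences. It is again in
Python for illustration only, and the helper name same_name is
hypothetical. The assumption here is that one system hands us the
precomposed form (NFC) and the other the decomposed form (NFD).

```python
import unicodedata

# The same visible name in two normalization forms.
composed = unicodedata.normalize("NFC", "caf\u00e9.txt")
decomposed = unicodedata.normalize("NFD", "caf\u00e9.txt")

composed_bytes = composed.encode("utf-8")      # e-acute as one code point
decomposed_bytes = decomposed.encode("utf-8")  # "e" plus combining acute


def same_name(a, b):
    # Hypothetical helper: compare names after bringing both to NFC.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A plain byte or code point comparison says the two names differ,
while same_name(composed, decomposed) treats them as equal. Which
of the two answers is "right" again depends on the use case.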

The subject matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we have read the latest Unicode specs. They are a
moving target. Don't expect to ever be "done" with full Unicode
support in D.

-- 
Marco


