The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Thu Jun 2 23:32:34 PDT 2016


On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 04:36 PM, tsbockman wrote:
> > Your examples will pass or fail depending on how (and whether) the
> > 'ö' grapheme is normalized.
> 
> And that's fine. Want graphemes, .byGrapheme wags its tail in that
> corner.  Otherwise, you work on code points which is a completely
> meaningful way to go about things. What's not meaningful is the random
> results you get from operating on code units.
> 
> > They only ever succeed because 'ö' happens to be one of the
> > privileged graphemes that *can* be (but often isn't!) represented as
> > a single code point. Many other graphemes have no such
> > representation.
> 
> Then there's no dchar for them so no problem to start with.
> 
> s.find(c) ----> "Find code unit c in string s"
[...]

This is a ridiculous argument.  We might as well say, "there's no
single-byte UTF-8 sequence that can represent Ш, so that's no problem to
start with" -- since we can just define it away by saying s.find(c) ==
"find byte c in string s", and thereby justify using ASCII as our
standard string representation.

The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in
the general case.  It is adequate for a subset of characters -- just
like ASCII is also adequate for a subset of characters.  If you only
need to work with ASCII, it suffices to work with ubyte[]. Similarly, if
your work is restricted to only languages without combining diacritics,
then a range of dchar suffices. But a range of dchar is NOT good enough
in the general case, and arguing that it is only makes you look like a
fool.
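
To make the three levels concrete, here is a minimal sketch in D that
counts code units, code points, and graphemes for the same strings. The
helper names (codeUnits, codePoints, graphemes) are made up for
illustration; walkLength and byGrapheme are the Phobos calls. The
"n + U+0308" pair is used as the example because, as far as I know, it
has no precomposed code point.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes); hypothetical helper.
size_t codeUnits(string s) { return s.length; }

/// Number of code points (dchars), counted via auto-decoding; hypothetical helper.
size_t codePoints(string s) { return s.walkLength; }

/// Number of graphemes as segmented by std.uni; hypothetical helper.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // ASCII: all three levels agree.
    assert(codeUnits("a") == 1 && codePoints("a") == 1 && graphemes("a") == 1);

    // Ш (U+0428): two UTF-8 code units, but a single dchar.
    assert(codeUnits("\u0428") == 2);
    assert(codePoints("\u0428") == 1);
    assert(graphemes("\u0428") == 1);

    // n + combining diaeresis (U+0308): one character to the reader,
    // but two dchars, so it does not fit in a single dchar.
    assert(codeUnits("n\u0308") == 3);
    assert(codePoints("n\u0308") == 2);
    assert(graphemes("n\u0308") == 1);
}
```

The last case is the one a range of dchar gets wrong: it splits what the
user sees as one character into two elements.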

Appealing to normalization doesn't change anything either, since only a
subset of base character + diacritic combinations will normalize to a
single code point. If the string has a base character + diacritic
combination that has no precomposed code point, it will NOT fit in a
dchar. (And keep in mind that the notion of diacritic is still very
Euro-centric. In Korean, for example, a single character is composed of
multiple parts, each of which occupies 1 code point. While some
precomposed combinations do exist, they don't cover all of the
possibilities, so normalization won't help you there.)
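
A rough sketch of what normalization does and does not buy you, using
std.uni.normalize with NFC. The helper codePointsAfterNFC is a
hypothetical name for illustration, and the specific sequences are my
own picks: "n + U+0308" and the archaic Hangul jamo sequence
U+1112 U+119E U+11AB are assumed here to have no precomposed form.

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

/// Code point count after NFC normalization; hypothetical helper.
size_t codePointsAfterNFC(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    // o + combining diaeresis composes to the precomposed ö (U+00F6).
    assert(normalize!NFC("o\u0308") == "\u00F6");
    assert(codePointsAfterNFC("o\u0308") == 1);

    // n + combining diaeresis has nothing to compose to: still two dchars.
    assert(codePointsAfterNFC("n\u0308") == 2);

    // Modern Hangul jamo compose to one precomposed syllable (U+D55C).
    assert(normalize!NFC("\u1112\u1161\u11AB") == "\uD55C");

    // Same shape with an archaic vowel (U+119E): no precomposed
    // syllable, so NFC leaves three code points.
    assert(codePointsAfterNFC("\u1112\u119E\u11AB") == 3);
}
```

So normalizing first only helps for the combinations that happen to have
a precomposed code point; for the rest you still need more than one
dchar per character.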


T

-- 
Frank disagreement binds closer than feigned agreement.

