Questions about Unicode, particularly Japanese

Tue Jun 8 14:56:51 PDT 2010

On 2010-06-08 15:27:10 -0400, "Nick Sabalausky" <a at a.a> said:

> So, my questions:
> 
> 1. Am I correct in all of that?

Yes. Note that combining characters exist for a variety of glyphs. For
instance, there is a "combining acute accent" (U+0301) that can follow
an "e", so you could write "é" with two code points if you wanted,
instead of the single "pre-combined" code point (U+00E9).
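
To make the two spellings of "é" concrete, here is a small sketch in
Python (chosen only for illustration; it is not the language under
discussion). It uses the standard unicodedata module, and the variable
names are mine.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single pre-combined code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Same rendered glyph, different number of code points.
print(len(precomposed), len(combining))

# A plain comparison looks at code points, so the two differ.
print(precomposed == combining)

# NFC composes where a pre-combined form exists; NFD decomposes.
print(unicodedata.normalize("NFC", combining) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == combining)
```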

> 2. Is there a proper way to encode that modifier character by itself? For
> instance, if you wanted to write "Japanese has a (the modifier by itself
> here) that changes a sound".

Sometimes there is a separate (non-combining) character for that. For
instance, the acute accent also exists as a standalone, non-combining
character (U+00B4). Failing that, perhaps you can put the combining
character after a no-break space?
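
For the Japanese mark in the question, a rough way to check is to ask
the Unicode database. The sketch below is Python, for illustration only.
U+309B as the standalone counterpart of U+3099 is my own addition, not
something stated in the question, so verify it against the code charts.

```python
import unicodedata

combining_mark = "\u3099"  # the combining mark from the question
spacing_mark = "\u309b"    # assumed standalone (spacing) counterpart

for ch in (combining_mark, spacing_mark):
    # combining() returns the canonical combining class;
    # 0 means the character does not combine with its predecessor.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# The fallback suggested above: a no-break space carrying the mark.
fallback = "\u00a0" + combining_mark
print(len(fallback))
```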

> 3. A text editor, for instance, is intended to treat something like (U+305D,
> U+3099) as a single character, right?

Yes. The two-code-point sequence and the pre-combined U+305E are
equivalent, and both normalize to the same thing.

> 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
> compare as equal?

Well, it depends on what you're trying to do. If you're searching for
"é" in a text editor, it should match both the pre-combined and the
combining version. In your own code, a literal comparison is sometimes
what you want: if you're replacing U+305D U+3099 with U+305E, then
obviously you search by code point.

I think the proper way to do this is to perform Unicode normalization 
on both strings before comparing code points.
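
A minimal sketch of that comparison, in Python for illustration (the
helper name canonically_equal is made up, and NFC is just one of the
canonical forms you could pick):

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

zo_precomposed = "\u305e"       # pre-combined form from question 4
zo_combining = "\u305d\u3099"   # base kana + combining voiced sound mark

print(zo_precomposed == zo_combining)                  # literal comparison
print(canonically_equal(zo_precomposed, zo_combining)) # normalized comparison
```

The literal comparison is still available when you really do want to
tell the two encodings apart.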

> 5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?

Probably not. But again, in some cases a literal code-point search
might be exactly what you want.

It'd be interesting if someone wrote a Unicode normalizer in the form
of a range for Phobos 2. That way you could compare two strings by
comparing the code points coming out of the normalizer ranges, without
having to create a normalized copy of either string.
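
To give an idea of what such a lazy normalizer could look like, here is
a sketch using Python generators as a stand-in for ranges. This is not
a Phobos API; lazy_nfd and lazy_equal are hypothetical names, and I
picked NFD because it can be produced incrementally with only a small
buffer of pending combining marks.

```python
import unicodedata
from itertools import zip_longest

def lazy_nfd(text):
    """Yield the NFD form of text one code point at a time."""
    pending = []  # combining marks waiting to be put in canonical order
    for ch in text:
        # Fully decompose this one character.
        for cp in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(cp) == 0:
                # A starter ends the current run of marks. Emit the run
                # ordered by combining class; sorted() is stable, so
                # marks of equal class keep their original order.
                yield from sorted(pending, key=unicodedata.combining)
                pending.clear()
                yield cp
            else:
                pending.append(cp)
    yield from sorted(pending, key=unicodedata.combining)

def lazy_equal(a, b):
    """Compare two strings through their normalizers, stopping early."""
    sentinel = object()  # marks the shorter stream running out
    pairs = zip_longest(lazy_nfd(a), lazy_nfd(b), fillvalue=sentinel)
    return all(x == y for x, y in pairs)

print(lazy_equal("\u305e", "\u305d\u3099"))
```

Neither string is ever copied in normalized form, and the comparison
stops at the first code point that differs.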

> 6. Are there other languages with similar things for which the answers to #3
> and #4 are different? (And if so, how does Phobos/Tango handle it?)

Not all combinations have a pre-combined form, so you can't always
convert them to a single code point. Apart from that, whenever a
pre-combined form exists, the two should be treated as equivalent.
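
A quick illustration in Python (for demonstration only). The example
assumes that "q" with an acute accent has no pre-combined code point,
which is my own pick and not something from the question.

```python
import unicodedata

with_precombined = "e\u0301"     # has a pre-combined form (U+00E9)
without_precombined = "q\u0301"  # assumed to have no pre-combined form

# NFC composes only where a pre-combined code point exists.
composed_e = unicodedata.normalize("NFC", with_precombined)
composed_q = unicodedata.normalize("NFC", without_precombined)

print(len(composed_e), len(composed_q))
```

So code that wants "one code point per character" cannot rely on
normalization alone.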

> 7. I assume Unicode doesn't have any provisions for Furigana, right? I
> assume that would be outside the scope of Unicode, but I thought I'd ask.

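I'm pretty sure furigana is out of scope as far as plain text goes. It
is usually handled with markup (the HTML ruby element, for instance)
rather than in the character encoding itself.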

Reference:
<http://en.wikipedia.org/wiki/Combining_character>
<http://en.wikipedia.org/wiki/Unicode_normalization>

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/