Questions about Unicode, particularly Japanese

Nick Sabalausky a at a.a
Tue Jun 8 12:27:10 PDT 2010


The "Wide character support in D" thread got me to question and double-check 
some of my assumptions about Unicode. From double-checking the UTF-8 
encoding, and looking at the charts at ( http://www.unicode.org/charts/ ), I 
realized that Japanese, Chinese and Korean characters are almost entirely 
(if not entirely) 3 bytes in UTF-8. For some reason I had been under the 
impression that the Japanese -kanas and at least a few of the Chinese 
characters were 2 bytes in UTF-8. Turns out that's not the case. I thought 
I'd share that in case anyone else didn't know. Also, FWIW, Cyrillic (e.g., 
Russian, AIUI) and Greek appear to be primarily, if not entirely, 2 bytes 
in UTF-8.
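
For anyone who wants to check the byte counts themselves, here is a minimal 
Python sketch. The sample characters and the helper name utf8_len are my own 
picks for illustration, not anything from the charts. It relies only on the 
UTF-8 rule that code points U+0080 through U+07FF take 2 bytes and U+0800 
through U+FFFF take 3 bytes.

```python
# Count the UTF-8 bytes needed for a few sample code points.
# The samples are arbitrary illustrative choices, one per script.
samples = {
    "hiragana so (U+305D)": "\u305d",
    "katakana ka (U+30AB)": "\u30ab",
    "CJK ideograph (U+65E5)": "\u65e5",
    "Hangul syllable (U+D55C)": "\ud55c",
    "Cyrillic zhe (U+0416)": "\u0416",
    "Greek lambda (U+03BB)": "\u03bb",
    "Latin a (U+0061)": "a",
}


def utf8_len(ch):
    """Return how many bytes the string ch occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for name, ch in samples.items():
    print(name, utf8_len(ch))
```

The kana, CJK and Hangul samples come out at 3 bytes each, the Cyrillic and 
Greek samples at 2, and the Latin letter at 1.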

But then I noticed something on the charts for the Japanese -kanas (e.g., 
http://www.unicode.org/charts/PDF/U3040.pdf ). Umm, first of all, for those 
unfamiliar with Japanese: There are two phonetic alphabets, hiragana and 
katakana (in addition to the Chinese characters), and they're based more on 
syllables than the individual sounds of western-style letters. Also, some of 
the sounds are formed by adding a modifier to a symbol for a similar sound. 
For instance: そ (U+305D, hiragana "so") is the sound "so", and to make "zo" 
you add what looks like a double-quote to it: ぞ (U+305E, hiragana "zo") (You 
may need to increase your font size to see it well). That same modifier 
converts most of the "s"'s to "z"'s, or any of the "h"'s to "b"'s, etc. And 
there's also another modifier that converts the "h"'s to "p"'s (looks like a 
little circle).

The thing is, there also appear to be Unicode code points for these 
modifiers by themselves (U+3099 and U+309A). Maybe I'm understanding it 
wrong, but according to page 3 in the document I linked to above, it looks 
like these are intended to be used in conjunction with the regular letters 
in order to modify them. So, it seems that there are two valid ways to 
encode a single character like ぞ ("zo"): Either (U+305E) or (U+305D, 
U+3099).
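
To make the two encodings concrete, here is a small Python sketch using the 
standard unicodedata module. The variable names are just mine. It shows that 
the two sequences are different code points (and different byte lengths), 
that U+3099 is flagged as a combining mark, and that Unicode normalization 
converts between the two forms.

```python
import unicodedata

precomposed = "\u305e"       # HIRAGANA LETTER ZO as one code point
decomposed = "\u305d\u3099"  # HIRAGANA LETTER SO + combining voiced sound mark

# A plain code-point comparison sees two different strings.
print(precomposed == decomposed)

# The single code point takes 3 UTF-8 bytes, the two-code-point form takes 6.
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))

# A nonzero canonical combining class marks U+3099 as a combining character.
print(unicodedata.combining("\u3099"))

# NFD splits the precomposed letter; NFC joins the pair back together.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```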

I think these are what people call "combining characters", but every 
explanation of Unicode I've ever seen that actually mentions such things 
just hand-waves it away with "oh, yeah, and then there's something 
called 'combining characters' that can complicate things", and that's all 
they ever say.

So, my questions:

1. Am I correct in all of that?

2. Is there a proper way to encode that modifier character by itself? For 
instance, if you wanted to write "Japanese has a (the modifier by itself 
here) that changes a sound".

3. A text editor, for instance, is intended to treat something like (U+305D, 
U+3099) as a single character, right?

4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to 
compare as equal?

5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?

6. Are there other languages with similar things for which the answers to #3 
and #4 are different? (And if so, how does Phobos/Tango handle it?)

7. I assume Unicode doesn't have any provisions for Furigana, right? I 
assume that would be outside the scope of Unicode, but I thought I'd ask.
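
For reference on #3 and #4, this is roughly what I'd expect "treat them as 
the same" to look like in code. It's only a Python sketch under my own 
assumptions: canonically_equal and rough_char_count are hypothetical helper 
names, comparing after NFC normalization is one common approach rather than 
the only one, and skipping combining marks is a crude stand-in for real 
grapheme segmentation. It says nothing about what Phobos or Tango actually 
do.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def rough_char_count(s):
    """Count code points that are not combining marks.

    This is only an approximation of user-perceived characters; it is not
    a full grapheme cluster algorithm.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


zo_single = "\u305e"      # precomposed "zo"
zo_pair = "\u305d\u3099"  # "so" + combining voiced sound mark

print(zo_single == zo_pair)                   # raw comparison
print(canonically_equal(zo_single, zo_pair))  # comparison after normalization
print(len(zo_pair), rough_char_count(zo_pair))
```

With these definitions the raw comparison is False, the normalized 
comparison is True, and the two-code-point form counts as one character.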



