Questions about Unicode, particularly Japanese

Matti Niemenmaa see_signature at for.real.address
Tue Jun 8 12:42:52 PDT 2010


On 2010-06-08 22:27, Nick Sabalausky wrote:
<snip>
>
> 1. Am I correct in all of that?

Yes. In particular, the fact that most CJK characters take three bytes 
in UTF-8 but only two in UTF-16 is an often-cited reason to use UTF-16 
instead of UTF-8.
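
To make the size difference concrete, here is a rough sketch in Python (not D, purely because its standard library makes this easy to check). The helper name encoded_sizes and the sample characters are my own picks for illustration:

```python
# Compare the encoded size of single characters in UTF-8 and UTF-16.
samples = ["\u305d", "\u6f22"]  # a hiragana letter and a kanji, both in the BMP

def encoded_sizes(ch):
    """Return (bytes in UTF-8, bytes in UTF-16) for the string ch."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} bytes in UTF-8, {utf16_len} in UTF-16")
```

Note that ASCII goes the other way (one byte in UTF-8, two in UTF-16), so which encoding is smaller depends on the mix of text.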

> 2. Is there a proper way to encode that modifier character by itself? For
> instance, if you wanted to write "Japanese has a (the modifier by itself
> here) that changes a sound".

You can combine the combining form (U+3099) with a space, but yes: that 
mark, called the dakuten or voicing mark, can be encoded by itself as 
U+309B.
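
If you want to check the relationship between the two forms yourself, something like the following should do, again in Python and only as a sketch. The constant names are mine; the unicodedata calls are from the standard library:

```python
import unicodedata

SPACING_MARK = "\u309b"    # standalone (spacing) voiced sound mark
COMBINING_MARK = "\u3099"  # combining form, attaches to the preceding kana

# Print the official character names of both forms.
print(unicodedata.name(SPACING_MARK))
print(unicodedata.name(COMBINING_MARK))

# The spacing form has a compatibility decomposition to
# SPACE followed by the combining form.
print(unicodedata.decomposition(SPACING_MARK))
compat = unicodedata.normalize("NFKD", SPACING_MARK)
print([f"U+{ord(c):04X}" for c in compat])
```

So "space plus combining mark" and U+309B are compatibility equivalents, not canonical ones: NFC/NFD leave U+309B alone, while NFKC/NFKD replace it.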

I recommend http://rishida.net/scripts/uniview/ for searching through 
Unicode.

> 3. A text editor, for instance, is intended to treat something like (U+305D,
> U+3099) as a single character, right?

Yes, I'd say so. I suppose it could allow for removing only the modifier 
(or the modified), but that doesn't seem like it should be the default 
behaviour.
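
As a very simplified illustration of what "a single character" means for an editor, the sketch below glues combining marks onto the character before them. It is only an approximation: a real editor would presumably follow the grapheme cluster rules of UAX #29, and simple_clusters is a made-up name, not anything from Phobos or Tango.

```python
import unicodedata

def simple_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified approximation: only characters with a non-zero canonical
    combining class are attached to the preceding character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

text = "\u305d\u3099\u3046"  # U+305D + U+3099, followed by U+3046
print(len(text))                   # 3 code points
print(len(simple_clusters(text)))  # 2 user-perceived characters
```

Cursor movement and deletion would then step over whole clusters by default, with removal of just the mark as an optional extra.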

> 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
> compare as equal?

Yes. You might want to read about equivalence and normalization in Unicode:

http://en.wikipedia.org/wiki/Unicode_equivalence
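
A minimal sketch of such a comparison, in Python since it ships with normalization in its standard library. The function name canonically_equal is my own; normalizing both sides to NFD instead of NFC would work just as well:

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u305e"       # single code point
decomposed = "\u305d\u3099"  # base kana + combining voiced sound mark

print(precomposed == decomposed)                   # False: code points differ
print(canonically_equal(precomposed, decomposed))  # True: canonically equivalent
```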

> 5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?

AFAIK, neither supports normalization of any kind.

> 6. Are there other languages with similar things for which the answers to #3
> and #4 are different? (And if so, how does Phobos/Tango handle it?)

As for programming languages, Factor has pretty good support for Unicode:

http://docs.factorcode.org/content/article-unicode.html

> 7. I assume Unicode doesn't have any provisions for Furigana, right? I
> assume that would be outside the scope of Unicode, but I thought I'd ask.

There's:

U+FFF9  INTERLINEAR ANNOTATION ANCHOR
U+FFFA  INTERLINEAR ANNOTATION SEPARATOR
U+FFFB  INTERLINEAR ANNOTATION TERMINATOR

But it's usually recommended to use some kind of ruby markup instead. See:

http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode
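
For what it's worth, the three characters are laid out as anchor, base text, separator, annotation, terminator. Below is a rough Python sketch of building such a string and of stripping the annotation again for plain-text display. The helper names annotate and strip_annotations are hypothetical, and nesting or multiple annotations are not handled:

```python
ANCHOR = "\ufff9"      # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\ufffa"   # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\ufffb"  # INTERLINEAR ANNOTATION TERMINATOR

def annotate(base, reading):
    """Wrap base text and its annotation in the three control characters."""
    return f"{ANCHOR}{base}{SEPARATOR}{reading}{TERMINATOR}"

def strip_annotations(text):
    """Keep the base text; drop the annotation and the control characters."""
    out = []
    in_annotation = False
    for ch in text:
        if ch == ANCHOR:
            continue
        if ch == SEPARATOR:
            in_annotation = True
        elif ch == TERMINATOR:
            in_annotation = False
        elif not in_annotation:
            out.append(ch)
    return "".join(out)

# Kanji U+6F22 U+5B57 annotated with its hiragana reading.
marked = annotate("\u6f22\u5b57", "\u304b\u3093\u3058")
print(strip_annotations(marked) == "\u6f22\u5b57")
```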

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi


More information about the Digitalmars-d mailing list